Data Set Fact Sheets

Sierra’s Data Set (Housing Prices Prediction)

  • This data set can be used to predict housing prices from factors such as house area, number of bedrooms, furnishings, and nearness to the main road.
  • This data set has strong multicollinearity among its features. Reducing it would remove some of the complexity that arises during modeling.
  • This data set can be improved by adding location information, such as city and state, for the specific areas in the United States that the records correspond to.
  • Several ML options can be used to build predictive models.
  • Methodology and a support group for questions and comments are provided.
  • Ratings: Readability 9/10, Standardization 10/10, Structure 8/10, Usability 9/10
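
The multicollinearity concern noted above can be checked before any model is built. The following is a minimal sketch, not the data set author's method: the column names (`area`, `bedrooms`, `stories`), the values, and the 0.8 threshold are all assumptions made for illustration. It flags pairs of features whose Pearson correlation is strong enough to suggest redundancy.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every feature pair with |r| > threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Illustrative values only; these are not taken from the housing data set.
features = {
    "area": [1000, 1500, 2000, 2500, 3000],
    "bedrooms": [2, 3, 4, 5, 6],
    "stories": [1, 3, 1, 2, 1],
}

print(correlated_pairs(features))  # [('area', 'bedrooms', 1.0)]
```

A flagged pair is a candidate for dropping one feature or combining the two; which choice is appropriate depends on the modeling goal.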

Emanuel’s Data Set (Tech Careers)

  • There appears to be a correlation between educational attainment and salary, as individuals with higher education levels tend to have higher salaries. 
  • Some individuals with similar educational backgrounds and occupations have notably different salaries, suggesting that factors other than education and occupation may influence salary disparities. 
  • The spreadsheet’s usefulness depends on its intended purpose and context. It contains information about individuals, such as salaries, locations, occupations, and education levels. On a scale of 1 to 10 for ease of understanding, it rates about a 6, as the data has clear column headers and values.
  • However, there is room for improvement in terms of clarity and organization, such as consistent formatting of the salary column and potentially adding more context.
  • Unfortunately, there is no documentation available to assist readers in understanding the dataset better. Providing a data dictionary explaining column meanings and abbreviations would be helpful.
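
The inconsistent salary formatting mentioned above could be normalized with a small parser. This is a sketch under stated assumptions: the input formats shown (`$85,000`, `85k`, `85000 USD`) are hypothetical examples, since the actual formats in the spreadsheet are not documented here, and `parse_salary` is a name invented for illustration.

```python
import re


def parse_salary(raw):
    """Convert a salary string to a float, or return None if it cannot be parsed."""
    if raw is None:
        return None
    # Drop thousands separators and dollar signs, then normalize case and spacing.
    text = str(raw).lower().replace(",", "").replace("$", "").strip()
    # Accept a number, an optional "k" suffix, and an optional "usd" label.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(k)?(?:\s*usd)?", text)
    if not match:
        return None
    value = float(match.group(1))
    if match.group(2):
        value *= 1000  # "85k" means 85,000
    return value


# Hypothetical raw values in mixed formats.
raw_salaries = ["$85,000", "85k", "85000 USD", "n/a"]
print([parse_salary(s) for s in raw_salaries])
```

Values that come back as `None` are worth reviewing by hand rather than guessing at, and the accepted formats belong in the data dictionary recommended above.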

Jadin’s Data Set (Employee Information)

  • My Overall Rating: 6.5/10 
  • Structure: 7/10 – The overall structure of the data set is good, but the headers are capitalized inconsistently and should be standardized. For example, the [first Name] header should be changed to [First Name], and [employed Since] should be changed to [Employed Since].
  • Standardization: 3/10 – This data set has various inconsistencies and conflicting data points. Examples include missing data points, incorrect standardization, inconsistent formatting, and an overall lack of structure.
  • Readability: 8/10 – The data set as a whole has many inconsistencies and would need cleaning and standardization before you could properly work in it. Despite that, its format and structure are organized so that, once it has been cleaned and formatted properly, upkeep would be simple to manage with minimal experience handling data sets.
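
The header fix suggested above can be applied to every column at once. The sketch below assumes the only problems are stray whitespace and lowercase first letters; `standardize_header` is a hypothetical helper, and the third header in the list is an invented example.

```python
def standardize_header(header):
    """Trim extra whitespace and capitalize the first letter of each word."""
    words = header.strip().split()
    # Only the first letter is changed, so acronyms such as "ID" stay intact.
    return " ".join(word[:1].upper() + word[1:] for word in words)


headers = ["first Name", "employed Since", " Last Name "]
standardized = [standardize_header(h) for h in headers]
print(standardized)  # ['First Name', 'Employed Since', 'Last Name']
```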

Selam’s Data Set (Sales)

  • A clean dataset is essential for accurate analysis, efficient work, effective reporting, data integrity, time savings, and better decision-making. It forms the foundation for reliable insights and informed actions.
  • The mistakes found are a typo, inconsistent capitalization, and an incorrect monetary value.
  • How uncleaned data affects the usefulness of the data set:
    • Uncleaned data can result in inaccuracies, inefficiency, and disruption of lookup operations.  
    • It slows down data entry and cleaning processes, impacting data consistency and sorting/filtering. 
    • Calculations and formulas are also adversely affected.
    • Overall, data integrity is compromised.  
  • How we would clean the data set:
    • Inspect the data for errors, ensure consistent formatting, standardize data types, and perform necessary transformations, including unit and currency conversions and text cleaning.
    • Test the data for accuracy and integrity.
    • Integrate data from various sources as needed.
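
The cleaning steps above can be sketched for a single record as follows. The field names (`product`, `price`), the sample rows, and the rule that a price must be positive are assumptions made for illustration; they are not taken from the sales data set. The sketch covers formatting, type standardization, and a basic integrity test. It does not correct spelling typos, which still need manual review or a reference list of valid values.

```python
def clean_row(row):
    """Return a cleaned copy of one sales record and a list of problems found."""
    problems = []
    cleaned = {}

    # Text cleaning: collapse extra whitespace and apply consistent capitalization.
    cleaned["product"] = " ".join(row["product"].split()).title()

    # Standardize the data type: strip currency symbols and separators, then convert.
    raw_price = str(row["price"]).replace("$", "").replace(",", "").strip()
    try:
        cleaned["price"] = float(raw_price)
    except ValueError:
        cleaned["price"] = None
        problems.append(f"price is not numeric: {row['price']!r}")

    # Integrity test (assumed rule): a sale price should be positive.
    if cleaned["price"] is not None and cleaned["price"] <= 0:
        problems.append(f"price out of range: {cleaned['price']}")

    return cleaned, problems


# Hypothetical records: one fixable, one with a letter O typed in place of a zero.
rows = [
    {"product": "  wireless  mouse", "price": "$1,250.00"},
    {"product": "KEYBOARD", "price": "12O.50"},
]
for row in rows:
    print(clean_row(row))
```

Returning the problems alongside the cleaned record keeps questionable values visible for review instead of silently dropping or altering them.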