Source: https://drive.google.com/file/d/1efATrW7wpnBsTycfj29xohyOkkR0VFD1/view
Data Wrangling
Data Science
The scientific process of extracting value from data
Data Vocabulary/Fluency
Variables/Features
Variables are stored in columns on spreadsheets.
Observations
Observations are stored in rows on spreadsheets.
Tidy Data
- Each variable you measure should be in a single column.
- Every observation of a variable should be in a different row.
- There should be one table for each type of data.
- If you have multiple tables, they should include a column in each with the same column label that allows them to be joined.
Tidy data `==` rectangular data
Benefits of Tidy Data
- Consistent data structure
- Foster tool development
- Require only a small set of tools to be learned
- Allow for datasets to be combined (much like in an RDB!)
Good Spreadsheets
Rules for Tidy Spreadsheets
- Be consistent
- Choose good names for things
- Write dates as YYYY-MM-DD (ISO-8601)
- No empty cells
- Put just one thing in a cell
- Don’t use font color or highlighting as data
- Save the data as plain text files (i.e. CSV)
Be consistent
- Keep exactly the same variable names across spreadsheets.
- Pick a way to code your variables and stick to it.
Choose good names for things
- Avoid extra spaces
- Use underscores not spaces
- Choose meaningful names
Write dates as YYYY-MM-DD
Use ‘ISO 8601’ standard
No empty cells
Put just one thing in a cell
Don't use font color or highlighting as data
Common Problems with Messy Data Sets
- Column headers are values but should be variable names
- A single column has multiple variables
- Variables have been entered in both rows and columns
- Multiple “types” of data are in the same spreadsheet
- A single observation is stored across multiple spreadsheets
Tidy data is also relevant for unstructured data
Common data wrangling tasks (and verbs)
- subset dataset
- filter: filter rows
- select: select columns
- change order
- arrange: reorder rows
- reorder: reorder columns
- add a column: mutate
- summarize data:
- group by: group by other variables
- summarize: reduce multiple values down to a single value