Source: https://drive.google.com/file/d/1Om8FZiWzg8Ok3UoUpGIA87bPbVrKX7ou/view
Descriptive & Exploratory Data Analyses
Variability Tells you how spread out values are
Range: highest score - lowest score
Interquartile Range (IQR): 75th percentile - 25th percentile
Variance: measures how close the values in a distribution are to the middle of the middle of the distribution
- average squared difference of the scores from the mean
Standard deviation (SD): square-root of the variance
Why Central Tendency doesn’t tell the whole story
Anscomb's Quartet
The same summary statistics can be for wildly different data
Exploratory
The goal is to find unknown relationships between the variables you have measured in your data set. Exploratory analysis is open ended and designed to verify expected or find unexpected relationships between measurements.
Why Exploratory Data Analysis
- Summarize data - basic stats
- Understand basic properties
- Discover patterns
- Suggest modeling strategies
- Check assumptions (sanity checks)
- Communicate results (present the data)
and if you don’t, you’ll regret it
How to EDA
- Look for missing values
- Look for outlier values
- Calculate numerical summaries
- Generate plots to explore relationships
- Use tables to explore relationships
- Search for patterns (i.e. linear vs nonlinear)
- If necessary, transform variables
EDA Approaches to Get a Feel for the Data
Univariate
understanding one variable
i.e. histogram, densityplot, barplot
Bivariate
understanding relationship between 2 variables
i.e. boxplot, scatterplot, grouped barplot
Multivariate
projecting high-D data into lower-D space
i.e. PCA, ICA, clustering
Types of Variables within EDA
Explanatory (independent) variable
It explains variations in the response variable, this is the variable a research would change
Response (dependent) variable
This is either a predicted value, or a value explained by the explanatory variable. This is the measured outcome in an experiment.
General approach to using zip codes
- Map zip codes to latitude and longitude
- Count how many people fall into each zip code
- Plot each place on a map
When NOT to do EDA
- To identify samples that you can remove from your study after you’ve already analyzed all of your data
- After running a statistical test and seeing that your p-value is 0.054
- After completing an analysis and getting an answer you don’t like
- To improve the correlation between two variables
EDA is NOT a tool to get your data analysis to give you the results you want.
EDA (John Tukey)
Focuses on:
- understanding the data’s underlying structure
- develop intuition about the data set
- consider how the data were collected (to aid in cleaning)
- decide how to further investigate with more formal statistical methods