Source: https://drive.google.com/file/d/14-xuR8p5mXivM52VjypE2J2AixCDuf02/view
p-value
The probability of getting the observed results (or results more extreme) by chance alone, i.e. assuming there is no real effect (the null hypothesis is true)
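One way to make this definition concrete is a permutation test: shuffle the group labels many times and count how often chance alone produces a difference at least as large as the observed one. The sketch below is only an illustration; the measurements, the function name `permutation_p_value`, and the choice of 10,000 shuffles are all assumptions made for the example.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the group labels are meaningless
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Small tolerance guards against floating-point rounding.
        if abs(mean_a - mean_b) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Made-up measurements for two groups.
control = [5.1, 4.9, 5.0, 5.2, 4.8]
treated = [6.0, 6.2, 5.9, 6.1, 6.3]
p_value = permutation_p_value(control, treated)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value here means that random relabelling rarely produces a gap as large as the one observed.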
Regression
Does change in one variable mean change in another?
e.g. simple regression, multiple regression
Linear Regression
The best-fitting line is a model of the data
- can be used to determine whether a change in one variable is related to the change in the other variable
- the magnitude of the relationship is measured by the slope of the line (referred to as the model’s effect size)
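A minimal sketch of fitting the best-fitting line by ordinary least squares, using only the standard library. The data (hours studied vs. exam score) and the helper name `fit_line` are invented for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
print(f"score = {intercept:.1f} + {slope:.1f} * hours")
```

The slope is the estimated change in score per extra hour studied, which is the effect size of this simple model.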
Models
Mathematical equations generated to represent the real life situation
“All models are wrong, but some are useful.” - George Box
Correlation
Measures the strength of the linear relationship between two variables
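The sketch below computes the Pearson correlation coefficient by hand. The three small data sets are made up; the last one shows that a perfect but curved relationship can still have a correlation of zero, because correlation only measures the linear part.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
r_rising = pearson_r(xs, [1, 3, 5, 7, 9])    # perfect rising line
r_falling = pearson_r(xs, [9, 7, 5, 3, 1])   # perfect falling line
r_curved = pearson_r(xs, [4, 1, 0, 1, 4])    # y = x**2, not linear

print(r_rising, r_falling, r_curved)
```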
Effect Size
Can be estimated using the slope of the line
Standard Error
The closer the points are to the regression line, the less uncertain we are in our estimate.
The p-value takes into account both the effect size and the standard error
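To illustrate how the spread of points around the line drives uncertainty, the sketch below fits two made-up data sets that have the same slope but different scatter, and reports the standard error of each slope. The ratio slope / standard error is the t statistic; in practice the p-value is read from a t distribution with n - 2 degrees of freedom, which is not computed here because the standard library has no t distribution.

```python
import math

def slope_and_standard_error(xs, ys):
    """Least-squares slope and the standard error of that slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, standard_error

xs = [1, 2, 3, 4, 5]
tight_ys = [2.1, 3.8, 6.1, 8.0, 10.0]  # points close to the line y = 2x
noisy_ys = [3.0, 2.0, 7.0, 8.0, 10.0]  # same slope, more scatter

slope_tight, se_tight = slope_and_standard_error(xs, tight_ys)
slope_noisy, se_noisy = slope_and_standard_error(xs, noisy_ys)
t_tight = slope_tight / se_tight
t_noisy = slope_noisy / se_noisy

print(f"tight: slope={slope_tight:.2f} SE={se_tight:.3f} t={t_tight:.1f}")
print(f"noisy: slope={slope_noisy:.2f} SE={se_noisy:.3f} t={t_noisy:.1f}")
```

Both fits have the same effect size, but the noisier data gives a larger standard error and therefore weaker evidence.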
Assumptions of linear regression
- Linear relationship
- Multivariate normality
- No multicollinearity
- No autocorrelation
- Homoscedasticity
Multicollinearity
Linear regression assumes no multicollinearity. Multicollinearity occurs when the independent variables (in multiple linear regression) are too highly correlated with each other.
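One common diagnostic, not named in these notes, is the variance inflation factor (VIF). For a model with exactly two predictors it reduces to 1 / (1 - r^2), where r is the correlation between them. The sketch below uses made-up predictor values; the informal "VIF above 10 is a warning sign" threshold is a rule of thumb, not a fixed rule.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

x1 = [1, 2, 3, 4, 5, 6]
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # roughly 2 * x1
x2_unrelated = [3, 1, 4, 1, 5, 2]                # little relation to x1

vif_collinear = vif_two_predictors(x1, x2_collinear)
vif_unrelated = vif_two_predictors(x1, x2_unrelated)
print(f"collinear VIF: {vif_collinear:.1f}, unrelated VIF: {vif_unrelated:.2f}")
```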
Autocorrelation
Autocorrelation occurs when the observations are not independent of one another (e.g. stock prices, where today's price depends on yesterday's)
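A quick check for this is the lag-1 autocorrelation: how strongly each observation is related to the one just before it. The sketch below uses two made-up series; values near 0 suggest independent observations, while values far from 0 suggest autocorrelation.

```python
def lag1_autocorrelation(series):
    """Correlation between each value and the value immediately before it."""
    n = len(series)
    mean = sum(series) / n
    deviations = [v - mean for v in series]
    numerator = sum(deviations[i] * deviations[i + 1] for i in range(n - 1))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator

trending = [1, 2, 3, 4, 5, 6, 7, 8]         # drifts steadily upward, like a price
alternating = [1, -1, 1, -1, 1, -1, 1, -1]  # flips sign every step

r_trending = lag1_autocorrelation(trending)
r_alternating = lag1_autocorrelation(alternating)
print(r_trending, r_alternating)
```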
Homoscedasticity
The spread of the points around the line of best fit is roughly the same along the whole length of the line
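A rough, informal check (an assumption of this sketch, not a formal test) is to fit the line, split the data into a low-x half and a high-x half, and compare the average squared residual in each half. A ratio near 1 is consistent with constant spread; a ratio far from 1 suggests the spread changes along the line. The data is made up.

```python
def residual_spread_ratio(xs, ys):
    """Mean squared residual in the high-x half divided by the low-x half."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sort by x so the halves correspond to the left and right of the plot.
    pairs = sorted(zip(xs, ys))
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    half = n // 2
    low = sum(r * r for r in residuals[:half]) / half
    high = sum(r * r for r in residuals[half:]) / (n - half)
    return high / low

xs = [1, 2, 3, 4, 5, 6, 7, 8]
even_ys = [3, 3, 7, 7, 11, 11, 15, 15]                      # constant scatter
fan_ys = [2.25, 3.5, 6.75, 7.0, 11.25, 10.5, 15.75, 14.0]   # scatter grows with x

ratio_even = residual_spread_ratio(xs, even_ys)
ratio_fan = residual_spread_ratio(xs, fan_ys)
print(f"even: {ratio_even:.2f}, fan-shaped: {ratio_fan:.2f}")
```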
p-hacking
Many forms of p-hacking
- Using a subset of data
- Not adjusting for, or not reporting, multiple testing
- Trying different tests with the same hypothesis
- Experimenting with your data during model fitting
- Inclusionary/Exclusionary protocols for data, i.e. outliers, definitions e.g. “college students”, or “developing nations”, etc.
- Optional stopping of data collection based on results thus far
- Changing your alpha values on the fly
- Rounding your p-values arbitrarily, e.g. 0.0558 → 0.05
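The multiple-testing item can be made concrete with a little arithmetic. Assuming the tests are independent and no real effect exists, the chance of at least one false positive across m tests at level alpha is 1 - (1 - alpha)^m. The sketch below also shows the Bonferroni correction (testing each hypothesis at alpha / m), which is one standard adjustment among several; the choice of 20 tests is arbitrary.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests

alpha = 0.05
n_tests = 20

uncorrected = family_wise_error_rate(alpha, n_tests)
per_test_alpha = bonferroni_threshold(alpha, n_tests)
corrected = family_wise_error_rate(per_test_alpha, n_tests)

print(f"uncorrected: {uncorrected:.3f}")
print(f"per-test alpha: {per_test_alpha}, corrected: {corrected:.3f}")
```

With 20 unadjusted tests, a "significant" result somewhere is more likely than not, even when nothing is going on.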
Confounding
You can plan ahead to avoid confounding and/or include confounders in your models to account for their role on the outcome variable.
Ignoring confounders will lead you to draw incorrect conclusions from your analyses.
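The simulation below sketches both points. Temperature (the confounder) drives both ice cream sales and sunburns, while ice cream has no effect on sunburns at all. A naive regression still finds a clear slope; adjusting for temperature removes it. The scenario, variable names, and coefficients are all invented, and the adjustment is done by regressing both variables on the confounder and then regressing the residuals on each other, which gives the same slope as including the confounder in a multiple regression.

```python
import random

def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(xs, ys):
    """What is left of ys after removing the linear effect of xs."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

rng = random.Random(42)
n = 2000
temperature = [rng.gauss(0, 1) for _ in range(n)]        # the confounder
ice_cream = [t + rng.gauss(0, 1) for t in temperature]   # caused by temperature
sunburns = [2 * t + rng.gauss(0, 1) for t in temperature]  # also caused by temperature

naive_slope, _ = fit_line(ice_cream, sunburns)
adjusted_slope, _ = fit_line(
    residuals(temperature, ice_cream), residuals(temperature, sunburns)
)
print(f"naive slope: {naive_slope:.2f}, adjusted slope: {adjusted_slope:.2f}")
```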
Machine Learning
Predictive Machine Learning
Apply machine learning techniques to data you have currently to generate a model that will be able to make a prediction on future data
What is machine learning?
“Machine Learning (ML) is a fascinating field of artificial intelligence (AI) research and practice where we investigate how computer agents can improve their perception, cognition, and action with experience. Machine learning is about machines improving from data, knowledge, experience, and interaction.”
- Manuela Veloso, Head of ML at Carnegie Mellon
Software Engineering vs. ML Systems
Software engineering: Data/Input + Program → Computation → Output/Result
ML systems: Data/Input + Output/Result → Computation → Program
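A toy contrast between the two flows, using Celsius to Fahrenheit conversion as an assumed example. In the first half a person writes the rule; in the second half only input/output pairs are given and the rule (a slope and an intercept) is recovered from them by least squares.

```python
# Software engineering: data + hand-written program -> output.
def celsius_to_fahrenheit(celsius):
    return 1.8 * celsius + 32

# ML system: data + known outputs -> program (here, a fitted line).
def learn_linear_rule(inputs, outputs):
    """Fit output = slope * input + intercept by least squares."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    sxx = sum((x - mean_x) ** 2 for x in inputs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

celsius_examples = [0, 10, 20, 30, 40]
fahrenheit_examples = [32, 50, 68, 86, 104]

learned_slope, learned_intercept = learn_linear_rule(
    celsius_examples, fahrenheit_examples
)

def learned_program(celsius):
    return learned_slope * celsius + learned_intercept

print(celsius_to_fahrenheit(25), learned_program(25))
```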
“ML is the field of study that gives computers the ability to learn without being explicitly programmed.” - Arthur Samuel (1959)
Machine learning approaches use existing data to make predictions about future data
Three Main Machine Learning Generalizations
Supervised Learning
- Labeled data
- Make predictions
- Classification or Regression!
Unsupervised Learning
- Unlabeled data
- Find structure
- Reduce dimensions
Reinforcement Learning
- Learn a set of actions
- Reward feedback system
- Agent explores a world
Two Types of Supervised Learning
Predicting a Continuous Value
Predicting a Class
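The two types differ only in what the label is. The sketch below uses a 1-nearest-neighbour rule, chosen here purely because it is short, on made-up housing data: the same training inputs are paired once with a continuous label (price) and once with a class label (type of home).

```python
def nearest_index(train_x, query):
    """Index of the training example closest to the query."""
    return min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))

def predict(train_x, train_labels, query):
    """1-nearest-neighbour prediction: copy the label of the closest example."""
    return train_labels[nearest_index(train_x, query)]

# Made-up training data: floor area in square metres.
areas = [50, 80, 120, 200]
prices = [150, 220, 310, 500]                        # continuous label
kinds = ["apartment", "apartment", "house", "house"]  # class label

predicted_price = predict(areas, prices, 110)  # regression
predicted_kind = predict(areas, kinds, 110)    # classification
print(predicted_price, predicted_kind)
```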
What is Labeled Data?
Labeled images vs. unlabeled images
Labeled column vs. unlabeled column
Why does the label matter?
Supervised Approach
- Train on labels
- Predict labels
Unsupervised Approach
- Predict groups
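As a sketch of predicting groups without labels, the code below runs a tiny two-cluster k-means on one-dimensional, made-up values. No labels are supplied; the algorithm only sees the numbers. Starting the two centres at the minimum and maximum value is a simplification chosen for this example.

```python
def two_means_1d(values, n_iterations=10):
    """Split values into two groups by repeatedly moving two centres."""
    centers = [min(values), max(values)]  # simple deterministic start
    groups = [0] * len(values)
    for _ in range(n_iterations):
        # Assign each value to its nearest centre.
        groups = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Move each centre to the mean of its members.
        for k in (0, 1):
            members = [v for v, g in zip(values, groups) if g == k]
            if members:
                centers[k] = sum(members) / len(members)
    return groups, centers

values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
groups, centers = two_means_1d(values)
print(groups, centers)
```

The output is a group number per observation, not a named label; deciding what each group means is left to the analyst.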
Machine learning in the real world
- Classification: Spam Filter
- Classification: Image Recognition
- Regression: Electricity Demand
Simple ML System
- Historical data
- Training data + code
- Train a model
- Win!
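The steps above can be sketched end to end with an electricity-demand example. The temperatures and demand figures are made up, the model is a plain least-squares line, and the last two observations are held back to stand in for "future" data so the prediction can be checked.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# 1. Historical data (made up): daily temperature and electricity demand.
temperatures = [20, 22, 25, 27, 30, 32, 35, 37]
demand = [200, 212, 224, 236, 251, 259, 276, 284]

# 2. Training data: keep the last two days aside as unseen "future" data.
train_x, test_x = temperatures[:6], temperatures[6:]
train_y, test_y = demand[:6], demand[6:]

# 3. Train a model.
slope, intercept = fit_line(train_x, train_y)

# 4. Predict on the held-out days and measure the error.
predictions = [intercept + slope * x for x in test_x]
mean_abs_error = sum(abs(p - y) for p, y in zip(predictions, test_y)) / len(test_y)

print(f"predictions: {[round(p, 1) for p in predictions]}")
print(f"mean absolute error: {mean_abs_error:.2f}")
```

Checking the model on data it was not trained on is what shows whether it is likely to work on future data.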