Data Type - Tabular
- Feature Table
- One of the most popular data types
- Features (attributes, dimensions, variables)
- A data field, representing a characteristic
Feature Types
- Nominal: categories, states, or “names of things”
- Binary: Nominal attribute with only 2 states (0 and 1)
- Symmetric binary: both outcomes equally important
- Asymmetric binary: outcomes not equally important
- Ordinal: values have a meaningful order (ranking) but magnitude between successive values is not known
- Numeric: quantitative (integer or real-valued)
- Interval-scaled
- Measured on a scale of equal-sized units
- Values have order
- No true zero-point
- Ratio-scaled
- Inherent zero-point
- Values can be compared as multiples of one another (e.g. 10 K is twice as high as 5 K)
| | Categorical | | Quantitative | |
| --- | --- | --- | --- | --- |
| Level | Nominal | Ordinal | Interval | Ratio |
| Defining Features | Distinct Categories | Ordered Categories | Meaningful Distances | Absolute Zero |
| Operations | | | | |
Discrete vs. Continuous
- Discrete: has only a finite or countably infinite set of values
- Continuous: has real numbers as attribute values
Other datatypes
- Relational data
- Transactional records
- Text data
- Networks/Graphs, aka information networks
- Sequential data
- Spatial/Image data
Important Characteristics of Structured Data
- Dimensionality (Curse of Dimensionality)
- Sparsity (Only presence counts)
- Resolution (Patterns depend on the scale)
- Distribution (Centrality and dispersion)
Mean, Median, and Mode
- Mean (sample vs. population)
- Weighted arithmetic mean
- Trimmed mean
- Median
- Estimated by interpolation (for grouped data)
- Mode
- Unimodal
- Multi-modal
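The measures of central tendency listed above can be sketched with Python's standard library. The `salaries` list is hypothetical data, and `weighted_mean` and `trimmed_mean` are illustrative helper names, not functions from any library.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical sample with one extreme value (110).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(salaries)        # pulled upward by the extreme value
median = statistics.median(salaries)    # middle of the sorted values
modes = statistics.multimode(salaries)  # all most-frequent values (multi-modal here)
trimmed = trimmed_mean(salaries, 0.1)   # drops the lowest and highest value
```

The trimmed mean sits between the mean and the median here, which is why it is sometimes preferred when a few extreme values are present.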
Symmetric vs. Skewed Data
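- Positively skewed (long right tail): typically mean > median > mode
- Symmetric: mean, median, and mode coincide (for a unimodal distribution)
- Negatively skewed (long left tail): typically mean < median < mode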
Properties of Normal Distribution
- Z-score: z = (x - μ) / σ, the number of standard deviations a value x lies from the mean μ
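A minimal sketch of standardizing a list of values, assuming the population mean and population standard deviation are estimated from the data itself. The function name `z_scores` is illustrative.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / std for each x, using the population std."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumes the values are not all identical
    return [(x - mu) / sigma for x in values]

# Hypothetical data with mean 5 and population standard deviation 2.
scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
```

A z-score of 2.0 for the last value means it lies two standard deviations above the mean.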
Variance and Standard Deviation
- Variance
- Can be computed incrementally by keeping track of sum and sum of squares
- Standard deviation: s (or σ) is the square root of the variance s² (or σ²)
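The incremental computation mentioned above can be sketched as follows: only the count, the running sum, and the running sum of squares are stored. The class name `RunningVariance` is hypothetical. This direct formula can lose precision when the values are large relative to their spread; Welford's algorithm is a more numerically stable alternative.

```python
import math

class RunningVariance:
    """Tracks count, sum, and sum of squares to compute variance in one pass."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def population_variance(self):
        # sigma^2 = E[x^2] - (E[x])^2
        mean = self.total / self.n
        return self.total_sq / self.n - mean * mean

    def sample_variance(self):
        # s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1)
        return (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)

    def population_std(self):
        return math.sqrt(self.population_variance())

rv = RunningVariance()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # hypothetical stream of values
    rv.add(value)
```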
Quantiles and Outliers
- Quartiles: Q1 (25th percentile) and Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 - Q1
- Five number summary: min, Q1, median, Q3, max
- Boxplot: data is represented with a box
- Outliers: points beyond a specified threshold, usually more than 1.5 * IQR below Q1 or above Q3, plotted individually
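The five-number summary and the 1.5 * IQR rule can be sketched with the standard library. Several conventions exist for computing quartiles; this sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, and other tools may give slightly different Q1 and Q3. The function names and the data are illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical data with one extreme value
summary = five_number_summary(data)
outliers = iqr_outliers(data)
```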
Histogram
- Histogram: graph display of tabulated frequencies, shown as bars
Scatter Plot and Correlations
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair is treated as coordinates and plotted on a plane
Supervised vs. Unsupervised
- Supervised tasks have one or more target variables
- Learning with labels and data
- Goal: predict some value
- e.g. classification, regression, ranking
- Unsupervised tasks are more like data analysis
- Learning with data only
- Goal: find some underlying patterns/groupings
- e.g. clustering, frequent patterns, dimension reduction, …
Classification vs. Regression
- Classification and Regression are both supervised tasks
- The target variables in classification are discrete
- The target variables in regression are continuous
Classification vs. Clustering
- Class labels have fixed, predefined identities, while cluster labels are arbitrary and exchangeable
Cross Validation - k folds
- Partition the data randomly into k subsets (e.g. k = 5); each subset serves once as the validation set while the model is trained on the remaining k - 1
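A minimal sketch of generating k-fold splits over sample indices, assuming a simple random shuffle and no stratification. The function name `k_fold_indices` and the `seed` parameter are illustrative.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # random partition
    folds = [indices[i::k] for i in range(k)]    # k roughly equal subsets
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(10, k=5))
```

The evaluation score is then typically averaged over the k validation sets.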
Train/Dev/Test
- k-fold cross validation can be too expensive (e.g. for LLMs), so a single fixed train/dev/test split is used instead
Time Series Splitting
- Time series split vs. Blocking time series split
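The notes do not define the two schemes, so the following is a sketch under one common reading: a time series split trains on an expanding window of everything before the test block, while a blocking split gives each fold its own contiguous block, divided into an earlier training part and a later test part. The function names and the `train_fraction` parameter are hypothetical.

```python
def time_series_split(n_samples, n_splits):
    """Expanding window: train on all samples before the test block."""
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * block))
        test = list(range(i * block, min((i + 1) * block, n_samples)))
        yield train, test

def blocked_time_series_split(n_samples, n_splits, train_fraction=0.8):
    """Each fold uses its own contiguous block: earlier part trains, later part tests."""
    block = n_samples // n_splits
    for i in range(n_splits):
        start = i * block
        stop = start + block
        cut = start + int(block * train_fraction)
        yield list(range(start, cut)), list(range(cut, stop))

expanding = list(time_series_split(12, 3))
blocked = list(blocked_time_series_split(12, 3))
```

In both schemes the test indices always come after the training indices, so the model is never evaluated on data from before its training period.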
TP, TN, FP, FN
- True Positive
- True Negative
- False Positive
- False Negative
- Confusion Matrix: Actual values vs. Predicted values
Accuracy, Precision, Recall, F1, …
- Accuracy: (TP+TN)/(TP+FP+FN+TN)
- Precision: TP/(TP+FP)
- Recall: TP/(TP+FN)
- F1: harmonic mean of precision and recall
- 2 * Prec * Rec / (Prec + Rec)
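The formulas above translate directly into code. This sketch assumes binary labels where 1 is the positive class and 0 the negative class, and it returns 0.0 when a denominator is zero, which is one convention among several. The function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)  # hypothetical counts
```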
Minkowski distance (Lp norm)
- A popular distance measure between two l-dimensional objects, parameterized by an order p
- Properties
- d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positivity)
- Symmetry: d(i, j) = d(j, i)
- Triangle inequality: d(i, j) <= d(i, k) + d(k, j)
- A distance that satisfies these properties is a metric
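For reference, the Minkowski distance of order p between two objects is the p-th root of the sum of the p-th powers of the absolute coordinate differences; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. A minimal sketch, assuming objects are given as equal-length sequences of numbers and p >= 1 (the range in which the triangle inequality holds):

```python
def minkowski_distance(x, y, p):
    """(sum_f |x_f - y_f|^p)^(1/p) for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be >= 1 for the distance to be a metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

manhattan = minkowski_distance([0, 0], [3, 4], p=1)
euclidean = minkowski_distance([0, 0], [3, 4], p=2)
```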
MAE, MSE, RMSE
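These are the usual regression error measures: mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE), the square root of MSE. A minimal sketch; the function name and the example values are illustrative.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for paired true and predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)

mae, mse, rmse = regression_errors([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
```

Because MSE squares each error, it penalizes large errors more heavily than MAE; RMSE brings the result back to the units of the target variable.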
Ranking Measurements
- Precision-Recall Curve
- AUC
- MAP
- nDCG
- MRR
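Three of the measures above can be sketched for a single ranked list given as relevance values in rank order. The assumptions here: average precision divides by the number of relevant items that appear in the list, DCG uses linear gain with a log2 rank discount (some formulations use 2^rel - 1 as the gain), and MAP would be the mean of `average_precision` over queries. All function names are illustrative.

```python
import math

def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings):
    """MRR: average reciprocal rank over several queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

def average_precision(relevance):
    """Mean of precision@k taken at each rank k holding a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def dcg(relevance):
    """Discounted cumulative gain with linear gain."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance, start=1))

def ndcg(relevance):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal else 0.0
```

For example, `ndcg([3, 2, 1])` is 1.0 because the list is already in ideal order, while any other ordering of the same relevance values scores lower.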