Data Type - Tabular

Feature Table
- One of the most popular data types
Features (attributes, dimensions, variables)
- A data field, representing a characteristic

Feature Types

Nominal: categories, states, or “names of things”
Binary: Nominal attribute with only 2 states (0 and 1)
- Symmetric binary: both outcomes equally important
- Asymmetric binary: outcomes not equally important
Ordinal: values have a meaningful order (ranking) but magnitude between successive values is not known
Numeric: quantitative (integer or real-valued)
- Interval-scaled
  - Measured on a scale of equal-sized units
  - Values have order
  - No true zero-point
- Ratio-scaled
  - Inherent zero-point
  - We can speak of values as being an order of magnitude larger than the unit of measurement

| | Categorical | | Quantitative | |
| --- | --- | --- | --- | --- |
| Level | Nominal | Ordinal | Interval | Ratio |
| Defining Features | Distinct Categories | Ordered Categories | Meaningful Distances | Absolute Zero |
| Operations | =, ≠ | <, > | +, − | ×, ÷ |

Discrete vs. Continuous

Discrete: has only a finite or countably infinite set of values
Continuous: has real numbers as attribute values

Other datatypes

Relational data
Transactional records
Text data
Networks/Graphs, aka information networks
Sequential data
Spatial/Image data

Important Characteristics of Structured Data

Dimensionality (Curse of Dimensionality)
Sparsity (Only presence counts)
Resolution (Patterns depend on the scale)
Distribution (Centrality and dispersion)

Mean, Median, and Mode

Mean (sample vs. population)
Weighted arithmetic mean
Trimmed mean
Median
Estimated by interpolation (for grouped data)
Mode
Unimodal
Multi-modal
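
The measures of central tendency listed above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not a reference implementation: the helper names `weighted_mean` and `trimmed_mean` and the `salaries` data are made up for the example, and the trimmed mean here assumes the same proportion is cut from each end.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut per side
    return mean(ordered[k:len(ordered) - k])

# Hypothetical data with one extreme value (110).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))               # pulled upward by the extreme value
print(median(salaries))             # middle of the sorted values
print(multimode(salaries))          # two modes, so the data is multi-modal
print(trimmed_mean(salaries, 0.1))  # drops 30 and 110 before averaging
```

The trimmed mean is lower than the plain mean here because the single large value no longer contributes.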

Symmetric vs. Skewed Data

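Positively skewed (long right tail): typically mean > median > mode
Symmetric: mean, median, and mode coincide (for unimodal data)
Negatively skewed (long left tail): typically mean < median < mode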

Properties of Normal Distribution

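Z-score: $z = (x - \mu) / \sigma$, the distance of a value $x$ from the mean $\mu$ in units of the standard deviation $\sigma$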
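
As a small sketch of z-scoring, assuming the usual definition (value minus mean, divided by standard deviation) and using the population standard deviation; the function name `z_scores` and the data are illustrative only.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population std 2
print(z_scores(data))
```

A z-score of 2 means the value lies two standard deviations above the mean.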

Variance and Standard Deviation

Variance
- Can be computed incrementally by keeping track of sum and sum of squares
Standard deviation: $s$ (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$)
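
The incremental computation mentioned above can be sketched as follows. The class name `RunningVariance` is hypothetical; the sketch relies on the identity variance = E[x²] − (E[x])², which is simple but can lose precision on large or nearly constant data (Welford's algorithm is a common, more stable alternative).

```python
from math import sqrt

class RunningVariance:
    """Tracks count, sum, and sum of squares so variance needs one pass."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def population_variance(self):
        m = self.total / self.n
        return self.total_sq / self.n - m * m

    def sample_variance(self):
        # Divides by n - 1 instead of n.
        return (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)

rv = RunningVariance()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    rv.add(x)

print(rv.population_variance())        # 4.0
print(sqrt(rv.population_variance()))  # 2.0
```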

Quantiles and Outliers

Quartiles: Q1 (25th percentile) and Q3 (75th percentile)
Inter-quartile range: IQR = Q3 - Q1
Five number summary: min, Q1, median, Q3, max
Boxplot: data is represented with a box
Outliers: points beyond a specified outlier threshold, plotted individually; usually more than 1.5 * IQR above Q3 or below Q1
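
A minimal sketch of the five-number summary and the 1.5 * IQR rule. Quartile values depend on the interpolation convention; this example assumes the `"inclusive"` method of `statistics.quantiles`, and the function names and data are illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(five_number_summary(data))
print(iqr_outliers(data))  # only 100 falls outside the fences
```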

Histogram

Histogram: graph display of tabulated frequencies, shown as bars
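
The tabulation behind a histogram can be sketched as below, assuming equal-width bins over the observed range (other binning schemes exist). The function name `histogram` is illustrative.

```python
def histogram(values, num_bins):
    """Count values in num_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate case: all values identical, put them in the first bin.
        return [len(values)] + [0] * (num_bins - 1)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

counts = histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], num_bins=4)
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")  # crude text rendering of the bars
```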

Scatter Plot and Correlations

Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair is treated as coordinates and plotted on a plane
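
A scatter plot is often paired with a correlation coefficient. The sketch below assumes Pearson's coefficient, one common choice that the notes do not name explicitly; `pearson_r` is an illustrative name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly increasing line
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly decreasing line
```

Values near +1 or −1 indicate a strong linear relationship; values near 0 indicate little linear relationship, although a nonlinear pattern may still be visible in the plot.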

Supervised vs. Unsupervised

Supervised tasks have one or more target variables
- Learning with labels and data
- Goal: predict some value
- e.g. classification, regression, ranking
Unsupervised tasks are more like data analysis
- Learning with data only
- Goal: find some underlying patterns/groupings
- e.g. clustering, frequent patterns, dimension reduction, …

Classification vs. Regression

Classification and Regression are both supervised tasks
The target variables in classification are discrete
The target variables in regression are continuous

Classification vs. Clustering

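Class labels have fixed identities (each class means something specific), while cluster labels are exchangeable: renaming the clusters does not change the result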

Cross Validation - k folds

Partition the data randomly into k subsets (e.g. k = 5); each subset serves once as the validation set while the model is trained on the remaining k - 1
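
A minimal sketch of k-fold splitting over row indices. The generator name `k_fold_indices` and the `seed` parameter are assumptions for the example; libraries such as scikit-learn provide their own splitters.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists for each of k random folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

for train, val in k_fold_indices(n=10, k=5):
    print("train:", sorted(train), "val:", sorted(val))
```

Each index appears in exactly one validation fold, so every example is used for validation once and for training k - 1 times.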

Train/Dev/Test

k-fold cross-validation can be too expensive when each training run is costly (e.g. LLMs), so a single fixed train/dev/test split is used instead

Time Series Splitting

Time series split vs. Blocking time series split
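
The notes do not define the two schemes, so the sketch below rests on a common interpretation: a time series split trains on an expanding window of all earlier data, while a blocking split cuts the series into disjoint blocks and splits each block into an earlier training part and a later validation part. Function names and the `train_fraction` parameter are hypothetical.

```python
def time_series_splits(n, n_splits):
    """Expanding window: validate on fold i, train on everything before it."""
    fold = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * fold))
        val = list(range(i * fold, (i + 1) * fold))
        yield train, val

def blocked_time_series_splits(n, n_splits, train_fraction=0.8):
    """Disjoint blocks: each block is split into train (early) and val (late)."""
    block = n // n_splits
    for i in range(n_splits):
        start = i * block
        cut = start + int(block * train_fraction)
        yield list(range(start, cut)), list(range(cut, start + block))

for train, val in time_series_splits(12, 3):
    print("expanding:", train, val)
for train, val in blocked_time_series_splits(12, 3, train_fraction=0.75):
    print("blocked:  ", train, val)
```

In both schemes the validation indices always come after the training indices, so the model is never evaluated on data older than what it was trained on.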

TP, TN, FP, FN

True Positive
True Negative
False Positive
False Negative
Confusion Matrix: Actual values vs. Predicted values

Accuracy, Precision, Recall, F1, …

Accuracy: (TP+TN)/(TP+FP+FN+TN)
Precision: TP/(TP+FP)
Recall: TP/(TP+FN)
F1: Harmonic Mean between Prec and Rec
- 2 * Prec * Rec / (Prec + Rec)
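
The formulas above translate directly into code. This sketch assumes binary labels with `1` as the positive class and returns 0.0 when a denominator is zero (a convention, not the only one); the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(classification_metrics(*confusion_counts(y_true, y_pred)))
```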

Minkowski distance ($L_p$ norm)

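A popular distance measure between two l-dimensional objects i and j, parameterized by the order p (p ≥ 1):
$d(i, j) = \left( \sum_{f=1}^{l} |x_{if} - x_{jf}|^p \right)^{1/p}$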
Properties
- Positivity: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
- Symmetry: d(i, j) = d(j, i)
- Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
A distance that satisfies these properties is a metric
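
A short sketch of the Minkowski distance, assuming the usual definition (sum of absolute coordinate differences raised to the power p, then the p-th root). With p = 1 it gives the Manhattan distance and with p = 2 the Euclidean distance; the function name is illustrative.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b = (1, 2), (3, 5)
print(minkowski(a, b, 1))  # Manhattan: |1-3| + |2-5| = 5
print(minkowski(a, b, 2))  # Euclidean: sqrt(4 + 9)
```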

MAE, MSE, RMSE
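
These three regression errors are listed without definitions, so the sketch below assumes the standard ones: mean absolute error, mean squared error, and the square root of the latter. Function names and data are illustrative.

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return sqrt(mse(y_true, y_pred))

y_true = [3, 5, 2]
y_pred = [2, 5, 4]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because errors are squared before averaging, MSE and RMSE penalize large errors more heavily than MAE does.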

Ranking Measurements

Precision-Recall Curve
AUC
MAP
nDCG
MRR
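
Some of these ranking measures can be sketched for a single query as follows. The definitions are assumptions, since the notes only list the names: average precision is normalized by the number of relevant items that appear in the ranked list, and DCG uses the `gain / log2(rank + 1)` form. MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over all queries.

```python
from math import log2

def average_precision(relevance):
    """AP for one ranked list of 0/1 relevance flags (best rank first)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0

def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1 / rank
    return 0.0

def dcg(gains):
    """Discounted cumulative gain of graded relevance scores."""
    return sum(g / log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

queries = [[1, 0, 1, 0], [0, 0, 1]]  # hypothetical relevance lists
print(sum(average_precision(q) for q in queries) / len(queries))  # MAP
print(sum(reciprocal_rank(q) for q in queries) / len(queries))    # MRR
print(ndcg([3, 2, 1]), ndcg([0, 1]))
```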

DSC 148 Lecture 1