Data Science Versus Statistics

Statisticians see the data science movement as something that overshadows what statistics has been doing for a long time.
Others see data science as not needing statistics at all, or even as making statistics irrelevant.

The "Big Data" Meme

  • History: Big data has been around for over 200 years (US Census)
  • Science: statisticians developed sampling and sufficiency precisely to cope with large datasets
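
A minimal sketch of the two ideas named above, on synthetic data. Everything here is an assumption made for illustration: the Gaussian data, the chunk size, the 1% sample, and the helper names `sufficient_stats`, `combine`, and `mean_and_variance`. It shows that a large dataset can be reduced to three running totals (sufficiency) or to a small random subset (sampling) and still yield the mean.

```python
import random
import statistics

def sufficient_stats(chunk):
    """Return (count, sum, sum of squares) for one chunk of numbers."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def combine(a, b):
    """Merge the summaries of two chunks without revisiting the raw data."""
    return tuple(x + y for x, y in zip(a, b))

def mean_and_variance(stats):
    """Recover the mean and the population variance from the summary."""
    n, s, ss = stats
    mean = s / n
    return mean, ss / n - mean * mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Synthetic stand-in for a "large" dataset (assumed Gaussian).
data = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Sufficiency: process the data chunk by chunk, keeping only three numbers.
summary = (0, 0.0, 0.0)
for start in range(0, len(data), 10_000):
    summary = combine(summary, sufficient_stats(data[start:start + 10_000]))
mean_full, var_full = mean_and_variance(summary)

# Sampling: estimate the mean from a 1% simple random sample.
sample = rng.sample(data, 1_000)
mean_sample = statistics.fmean(sample)
```

The chunked summary reproduces the full-data mean and variance up to floating-point error, while the sample gives an estimate whose error shrinks as the sample grows.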

The "Skills" Meme

Big data skills are not skills for solving the real problems; rather, they are coping skills for dealing with the artifacts of large-scale cluster computing.
Under the constraints of cluster computing, the range of easily constructible algorithms has shrunk compared to the single-processor model.

The "Jobs" Meme

Data scientists need years of hands-on experience after a master's degree before they add significant value to their employers.
At any given employer, the databases, software, and workflow are already set in stone.

The Future of Data Analysis, 1962

Data science is a complex scientific field.
Data science is larger than statistics.

Code and Data Analysis

The quantitative programming language S was developed in the mid-1970s.
R, created in the 1990s, is currently the dominant programming language in academic statistics.
Scripts codify computations into workflows at a much higher, abstracted level than traditional programming languages. These workflows can then be shared with others much more easily.
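
As a rough illustration of a computation codified as a script, the sketch below chains load, clean, and summarize steps into one function that can be rerun or handed to someone else. It is written in Python rather than S or R, and the inline CSV, the column names, and the function names are all hypothetical.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would be read from a shared file.
RAW_CSV = """group,value
a,1.0
a,3.0
b,10.0
b,
b,14.0
"""

def load(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Drop rows with a missing value and convert the rest to floats."""
    return [(r["group"], float(r["value"])) for r in rows if r["value"]]

def summarize(pairs):
    """Compute the mean value for each group."""
    groups = {}
    for group, value in pairs:
        groups.setdefault(group, []).append(value)
    return {g: statistics.fmean(v) for g, v in groups.items()}

def run_workflow(text):
    """The whole analysis as one repeatable, shareable step."""
    return summarize(clean(load(text)))

if __name__ == "__main__":
    print(run_workflow(RAW_CSV))
```

Because every step lives in the script, a collaborator can reproduce the result by running the same file on the same data.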

Breiman's Two Cultures, 2001

Statistical work has one of two goals: to predict from the data or to infer from it (the two cultures).
The inference culture accounts (or accounted) for about 98% of academic statisticians; it assumes that a true model generates the data.
Prediction is the goal of the machine learning culture.
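
The contrast can be sketched on one synthetic dataset. This is an illustration under stated assumptions, not a method from the notes: the data come from an assumed linear model, the inference side reports a slope with an approximate 95% interval, and the prediction side compares two rules purely by held-out error. The names `fit_line`, `knn_predict`, and `mse` are hypothetical helpers.

```python
import math
import random

rng = random.Random(1)
# Synthetic data from an assumed "true" model: y = 2x + 1 + Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

def fit_line(x, y):
    """Least-squares line with the standard error of the slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    se_slope = math.sqrt(resid / (n - 2) / sxx)
    return slope, intercept, se_slope

# Inference: estimate a parameter of the assumed model, with uncertainty.
slope, intercept, se_slope = fit_line(train_x, train_y)
ci = (slope - 1.96 * se_slope, slope + 1.96 * se_slope)  # approx. 95% interval

# Prediction: judge any rule only by its error on held-out data.
def knn_predict(x0, k=5):
    """Average the responses of the k nearest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(rule):
    """Mean squared error of a prediction rule on the held-out points."""
    return sum((rule(x) - y) ** 2 for x, y in zip(test_x, test_y)) / len(test_x)

mse_line = mse(lambda x: intercept + slope * x)
mse_knn = mse(knn_predict)
```

The interval `ci` is only meaningful if the linear model is taken to be true, whereas `mse_line` and `mse_knn` can be compared without believing in any model.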

The Predictive Culture’s Secret Sauce

The Common Task Framework (CTF)

  1. A publicly available training set in which each observation has a list of features and a class label
  2. A set of competitors who are trying to infer a class prediction rule from the data
  3. A referee that runs the prediction rule against a testing dataset and automatically reports the prediction score of the rule
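
The three ingredients above can be sketched as a toy competition. All specifics are assumptions chosen for illustration: the synthetic two-feature data, the two competitor strategies, accuracy as the score, and the names `make_example`, `competitor_threshold`, `competitor_majority`, and `referee`.

```python
import random

rng = random.Random(42)

def make_example():
    """Synthetic observation: two features and a binary class label."""
    label = rng.randint(0, 1)
    features = (rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0))
    return features, label

# 1. The publicly available training set.
training_set = [make_example() for _ in range(500)]
# The testing set is held by the referee; competitors never see it.
_hidden_test_set = [make_example() for _ in range(500)]

# 2. Competitors infer a class prediction rule from the training data.
def competitor_threshold(train):
    """Threshold feature 0 midway between the two class means."""
    means = []
    for c in (0, 1):
        vals = [f[0] for f, y in train if y == c]
        means.append(sum(vals) / len(vals))
    cut = (means[0] + means[1]) / 2
    return lambda features: int(features[0] > cut)

def competitor_majority(train):
    """Always predict the most common training label."""
    majority = int(sum(y for _, y in train) * 2 >= len(train))
    return lambda features: majority

# 3. The referee runs each rule on the testing set and reports its score.
def referee(rule):
    correct = sum(rule(f) == y for f, y in _hidden_test_set)
    return correct / len(_hidden_test_set)

leaderboard = {
    name: referee(build(training_set))
    for name, build in [
        ("threshold", competitor_threshold),
        ("majority", competitor_majority),
    ]
}
```

Keeping the test labels inside the referee is the point of the design: every competitor is scored the same way, and no one can tune a rule against the testing data directly.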

Experience with CTF

  1. Error rates decline each year, approaching an asymptote that depends on task and data quality
  2. Progress comes from many small improvements
  3. Shared data plays a crucial role
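
One simple way to picture the first observation is a curve whose gap to the asymptote shrinks by a fixed fraction every year. That functional form is an assumption of this sketch, not something the notes specify, and the numbers (40% starting error, 5% asymptote, 20% yearly reduction) are invented for illustration.

```python
def projected_error(initial, asymptote, yearly_reduction, years):
    """Error rate after `years`, assuming the gap to the asymptote
    shrinks by the fraction `yearly_reduction` each year."""
    gap = initial - asymptote
    return asymptote + gap * (1.0 - yearly_reduction) ** years

# Hypothetical task: 40% error at the start, 5% floor, 20% of the gap closed per year.
trajectory = [projected_error(0.40, 0.05, 0.20, t) for t in range(11)]
```

Under this assumed model the error keeps falling but never drops below the asymptote, which is consistent with progress arriving as many small improvements.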

Data Scientist

A person who is better at statistics than any software engineer and better at software engineering than any statistician.

Six Divisions of Greater Data Science

  1. Data Gathering, Preparation, and Exploration
  2. Data Representation and Transformation
  3. Computing with Data
  4. Data Modeling
  5. Data Visualization and Presentation
  6. Science about Data Science