Programming

Why Learn to Program? (as a data scientist)

  • Manually editing data sucks
    • Increased error rate
    • Time consuming
  • Allows the computer to do the work for you
    • Routine tasks
    • Simulations, permutations, etc.
  • Makes analyses reproducible
    • Particularly important as things increase in complexity
  • It teaches persistence
  • It can improve your life
    • Data aren’t going anywhere

Programming Workflow

Think-Try-Read-Refactor

  1. Think about what it is you want to do
  2. Try it out
  3. If you succeed, awesome
  4. If not, continue on
  5. Read to try to figure out the answer
  6. Refactor your code to improve and learn

Why are R and Python the languages data scientists use?

Easy to learn: clear, simple syntax and structure
Rapid development & prototyping: get more done with less, and in less time
Huge support: cross-platform compatible (OS X, Windows, Linux/Unix). Tons of libraries for analytics, visualization, machine learning, data wrangling…

Python vs. R

R

  • Statistical programming language
    • 1st functional programming
    • 2nd object oriented
  • Most commonly used
    • Data wrangling
      • tidyverse
    • Data visualization
      • ggplot2
    • Statistical analysis
    • Machine learning
      • caret/tidymodels
    • By people with a statistical background
      Python
  • General programming language
    • 1st object oriented
    • 2nd functional programming
  • Most commonly used
    • Data Wrangling
      • Pandas
      • NumPy
    • Machine Learning
      • Scikit-learn
      • TensorFlow, PyTorch
    • Deploying models in production
    • By people with CS background

Data Science is constantly changing

This makes data science fun and challenging.
Start somewhere: Either Python or R are good places to start
Technology changes: You’ll learn more than one language in your career
Good data scientists adapt quickly

CS programming vs. DS programming

Computer Science

Computer science is the study of the theory and practice of how computers work.
Typical job duties:

  • Testing, documenting, and debugging code
  • Creating or modifying software and mobile apps
  • Designing components of an application and integrating them into a larger overall product
  • Collaborating with a team of programmers to build and optimize code

Data Science

Data science is the scientific process of extracting value from data.
Typical job duties:

  • Collecting, “cleaning”, and organizing data sets
  • Building data models
  • Asking and answering questions with large scale data analysis
  • Creating data visualizations and presenting findings to stakeholders
    Data is the emphasis: programming skills should focus on working with data

Data Scientists

  • excel at getting the data they need (SQL)
  • explore & clean the data (R, Python)
  • are very comfortable working with tabular data (R, Python)
  • prioritize reproducibility
  • program to accomplish a task/answer a question

Version Control

How to work collaboratively and track your code-centric projects

  • Enables multiple people to simultaneously work on a single project
  • Each person edits their own copy of the files and chooses when to share those changes with the rest of the team
  • Thus, temporary or partial edits by one person do not interfere with another person’s work

What is version control?

A way to manage the evolution of a set of files
When using a version control system, you have one copy of each file and the version control system tracks the changes that have occurred over time

git & GitHub

git: the version control system
GitHub is the home where your git-based projects live on the Internet

Why version control with git and GitHub?

  • Collaboration
    • Each person is making changes locally (on their computer)
  • Returning to a safe state
    • You’re free and safe to try things out locally. You’ll only send changes to the repo when you’re at a stable point
  • Exposure for your work
    • Your public GitHub repos are your coding social media
  • Tracking others’ work
    • As a social platform, you can see others’ work too

GitHub

A remote host, which stores files geographically distant from any files on your computer.

GitHub repo

Contains all the files and folders for your project.

Clone

The action of getting the files and folders from a remote repository onto your laptop for the first time.

Staging

The process of telling git which files you want to keep track of using add.
git add file: stages specified file (or folder)
git add .: stages new and modified files
git add -u: stages modified and deleted files
git add -A: stages new, modified, and deleted files
git add *.csv: stages any files with .csv extension
git add *: stages everything

Commit

A snapshot or your files at a certain point, which tracks who, what, and when.
You can make commits more informative by adding a commit message: git commit -m "changed colors for animal icons"

Push

The action of uploading changes back to the remote for others to access.

Pull

The action of downloading changes from the remote (GitHub).

Definition Recap

repo

set of files and folders for a project

remote

where the repo lives

clone

get the repo from the remote for the first time

add

specify which files you want to stage (add to repo)

commit

snapshot of your files at a point in time

pull

get new commits to the repo from the remote

push

send your new commits to the remote