Source: https://drive.google.com/file/d/1uny6shIYQWmO5dLrvwjO8nuXRXnQFqKy/view

Getting Data

Obtaining Data

  • Buy it
  • Source it internally from your data
  • Collect it externally from your users
  • Freely download it from the web
  • Request it from an API (paid/open source)
  • Scrape it from a website
  • Steal it (intentionally or unintentionally)

Places to find data

  • Data is Plural
  • Kaggle
  • data.gov
  • data.world
  • census.gov
  • us-cities.survey.okfn.org
  • data.fivethirtyeight.com
  • ucsd.libguides.com/data-statistics/home
  • mkhosla-ucsd.github.io/cogs9/final-group-project/example-datasets

API

An API (application programming interface) is a set of rules for computer-to-computer interaction

Accessing an API

  1. Choose method
  2. Build URL
  3. Get Authorization

API Requests: HTTP Methods

  • GET: read
  • POST: create new resources
  • PUT: update/replace
  • PATCH: partial update/modify
  • DELETE: delete

JSON (JavaScript Object Notation)

The most common format in which APIs return data
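
As a small sketch of what handling JSON looks like, Python's built-in `json` module turns JSON text into dictionaries and lists, and back again. The response body below is made up for illustration and does not come from any real API.

```python
import json

# A made-up response body, shaped like what an API might return.
response_text = (
    '{"dataset": "example", "year": 2020, '
    '"rows": [{"city": "Springfield", "population": 12345}]}'
)

# JSON text -> Python objects (dicts, lists, strings, numbers).
record = json.loads(response_text)
first_row = record["rows"][0]
print(first_row["city"], first_row["population"])

# Python objects -> JSON text.
round_trip = json.dumps(record)
```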

How to get API access

  1. Apply on the developer website of the API you want to access
  2. Create an OAuth application
  3. Generate a token

Web Scraping

Basic Idea of Web Scraping

Websites → Web scraping → Data
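
To make the idea concrete, here is a minimal sketch that pulls structured data (link targets and link text) out of a page's HTML using Python's standard `html.parser` module. The HTML snippet and the class name `LinkCollector` are invented for this example. A real scraper would first download the page, and many people use third-party parsing libraries instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) for every <a> tag that has an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link we are tracking.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


# Made-up page content standing in for a downloaded website.
html = """
<ul>
  <li><a href="/data/a.csv">Dataset A</a></li>
  <li><a href="/data/b.csv">Dataset B</a></li>
</ul>
"""

parser = LinkCollector()
parser.feed(html)
print(parser.links)
```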

Don't Reinvent the Wheel

  1. APIs
  2. SQL queries
  3. Download button
  4. Web scraping

Ethical Concerns

API Developer Agreement and Terms of Use

Data Use Restrictions

Open Source

  • Freely usable for commercial/private use
  • Not usable for commercial projects
  • Cannot repackage for reproduction

Not Open Source

  • Usable for commercial/private practices
  • Cannot release/make available

Lots of gray areas!

Mild-Mannered Web Scraping

  • Are you allowed to access the data? Is it public?
  • Did you read the terms of service (TOS)? Do they exist? (Contact the website if unsure.)
  • Have you made yourself known? (e.g. put info about who you are in the request header)
  • Are you limiting your requests? (Scraping in off hours, pausing between requests)
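
A rough sketch of what these habits can look like in code, using only Python's standard library: check the site's robots.txt rules, identify yourself in the `User-Agent` header, and pause between requests. The robots.txt content, the URLs, the contact address, the 2-second delay, and the helper name `polite_requests` are all assumptions for illustration. A robots.txt file is not a substitute for reading the site's terms of service.

```python
import time
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical identification string: say who you are and how to reach you.
USER_AGENT = "example-course-scraper (contact: you@example.com)"
DELAY_SECONDS = 2.0  # assumed pause between requests

# Made-up robots.txt rules; a real scraper would load the site's own file.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])


def polite_requests(urls, delay=DELAY_SECONDS, sleep=time.sleep):
    """Yield Request objects for allowed URLs, pausing between them."""
    first = True
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # skip pages the site asks crawlers not to visit
        if not first:
            sleep(delay)  # limit the request rate
        first = False
        # Make yourself known via the User-Agent header.
        yield Request(url, headers={"User-Agent": USER_AGENT})


urls = [
    "https://example.com/data/a.csv",
    "https://example.com/private/x",
    "https://example.com/data/b.csv",
]
# delay=0 keeps this demonstration instant; nothing is actually downloaded.
for req in polite_requests(urls, delay=0):
    print(req.full_url)
```

Each yielded request could then be sent with `urllib.request.urlopen`. The `sleep` parameter exists only so the pause can be swapped out when experimenting.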

End User License Agreements (EULAs)

Considerations and Questions to Ask

  • Where does the data come from?
  • What are the usage restrictions?
  • Does this data contain potentially sensitive information?
  • If I publish results or a project using this data, do I need to provide attribution or citations?

Important Industry Questions

  • What data resources do we have?
  • How is data moved between the above systems? What are the pipelines?
  • Who has access to what data in this organization? What are the data fiefdoms? Who controls the data systems and ACLs (access control lists)?
  • Who is/are the data champions? How do they communicate with everyone?
  • What is the process for collecting/storing new data from external resources?
  • What are the current data pipelines and what data is being used?