Source: https://drive.google.com/file/d/1uny6shIYQWmO5dLrvwjO8nuXRXnQFqKy/view
Getting Data
Obtaining Data
- Buy it
- Source it internally from your data
- Collect it externally from your users
- Freely download it from the web
- Request it from an API (paid/open source)
- Scrape it from a website
- Steal it (intentionally or unintentionally)
Places to find data
- Data is Plural
- Kaggle
- data.gov
- data.world
- census.gov
- us-cities.survey.okfn.org
- data.fivethirtyeight.com
- ucsd.libguides.com/data-statistics/home
- mkhosla-ucsd.github.io/cogs9/final-group-project/example-datasets
API
Rules for computer-to-computer interaction
Accessing an API
- Choose method
- Build URL
- Get Authorization
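The three steps above can be sketched with Python's standard library. This is a minimal illustration, not any particular API's required flow: the endpoint `https://api.example.com/v1/datasets`, the query parameters, the helper name `build_api_request`, and the bearer-token scheme are all assumptions made for the example. The request is only assembled, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_api_request(method, base_url, params=None, token=None):
    """Assemble (but do not send) an HTTP request for an API call."""
    # Step 1: choose the method (GET, POST, ...).
    method = method.upper()

    # Step 2: build the URL, encoding any query parameters safely.
    url = base_url
    if params:
        url = f"{base_url}?{urlencode(params)}"

    # Step 3: attach authorization. A bearer token header is assumed here;
    # real APIs may use a different scheme.
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"

    return Request(url, headers=headers, method=method)


req = build_api_request(
    "get",
    "https://api.example.com/v1/datasets",  # hypothetical endpoint
    params={"q": "census", "limit": 10},    # hypothetical parameters
    token="YOUR_TOKEN_HERE",                # placeholder, not a real credential
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually send it. Keeping the build step separate makes it easy to inspect the URL and headers first.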
API Requests: HTTP Methods
- get: read
- post: create new resources
- put: update/replace
- patch: partial update/modify
- delete: delete
JSON (JavaScript Object Notation)
The most common format in which APIs return data
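As a small illustration, a JSON response body can be turned into ordinary Python dictionaries and lists with the standard `json` module. The response text below is made up for the example; the field names (`results`, `name`, `rows`) are assumptions, not any real API's schema.

```python
import json

# A made-up response body standing in for what an API might return.
response_text = """
{
  "count": 2,
  "results": [
    {"name": "example-a", "rows": 120},
    {"name": "example-b", "rows": 80}
  ]
}
"""

# json.loads converts JSON objects to dicts and JSON arrays to lists.
data = json.loads(response_text)

names = [item["name"] for item in data["results"]]
total_rows = sum(item["rows"] for item in data["results"])
print(names, total_rows)
```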
How to get API access
- Apply on the developer website of the API you want to access
- Create an OAuth application
- Generate a token
Web Scraping
Basic Idea of Web Scraping
Websites → Web Scraping → Data
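A rough sketch of that pipeline, using only Python's built-in `html.parser`: a page's HTML goes in, and structured records come out. The HTML snippet and the class name `TableCellParser` are invented for the example. Real pages are messier, and third-party parsers are often used instead.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # row currently being built, if any
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up HTML standing in for a downloaded web page.
html = """
<table>
  <tr><th>name</th><th>rows</th></tr>
  <tr><td>example-a</td><td>120</td></tr>
  <tr><td>example-b</td><td>80</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(html)
parser.close()

# First row is treated as the header; the rest become records.
header, *records = parser.rows
scraped = [dict(zip(header, record)) for record in records]
print(scraped)
```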
Don't Reinvent the Wheel
- APIs
- SQL queries
- Download button
- Web scraping
Ethical Concerns
API Developer Agreement and Terms of Use
Data Use Restrictions
Open Source
- Freely usable for commercial/private use
- Not usable for commercial projects
- Cannot repackage for reproduction
Not Open Source
- Usable for commercial/private practices
- Cannot release/make available
Lots of gray areas!
Mild-Mannered Web Scraping
Are you allowed to access the data? Is it public?
Did you read the Terms of Service (TOS)? Do they exist? (Contact the website if unsure.)
Have you made yourself known? (e.g., put info about who you are in the request header)
Are you limiting your requests? (Scraping in off hours, pausing between requests)
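The checklist above could look roughly like this in code. This is a sketch under stated assumptions: the bot name and contact address in `USER_AGENT`, the sample robots.txt, the two-second interval, and the helper names (`allowed_by_robots`, `identified_request`, `Throttle`) are all hypothetical. Checking robots.txt is one common courtesy signal; it does not replace reading the site's TOS.

```python
import time
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical identity string: says who you are and how to reach you.
USER_AGENT = "class-project-bot/0.1 (contact: student@example.edu)"

# Made-up robots.txt content standing in for a site's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits fetching url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


def identified_request(url):
    """Build a request that announces who is making it."""
    return Request(url, headers={"User-Agent": USER_AGENT})


class Throttle:
    """Enforce a minimum pause between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


throttle = Throttle(min_interval=2.0)  # assumed pause; pick what the site tolerates
for url in ["https://example.com/public/a.html",
            "https://example.com/private/b.html"]:
    if allowed_by_robots(ROBOTS_TXT, USER_AGENT, url):
        req = identified_request(url)
        print("would fetch:", req.full_url)  # call throttle.wait() before each real fetch
    else:
        print("skipping (disallowed):", url)
```

In a real scraper, `throttle.wait()` would be called immediately before each network request so that consecutive fetches are spaced out.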
End User License Agreements (EULAs)
Considerations and Questions to Ask
- Where does the data come from?
- What are the usage restrictions?
- Does this data contain potentially sensitive information?
- If I publish results or a project using this data, do I need to provide attribution or citations?
Important Industry Questions
- What data resources do we have?
- How is data moved between the above systems? What are the pipelines?
- Who has access to what data in this organization? What are the data fiefdoms? Who controls the data systems and ACLs (access control lists)?
- Who is/are the data champions? How do they communicate with everyone?
- What is the process for collecting/storing new data from external resources?
- What are the current data pipelines and what data is being used?