Source: https://drive.google.com/file/d/10O8pMwSy9209mP4a9pCctCYIVXa8oGpL/view?usp=sharing
Text Analysis
To determine structure, insight, and relationships within and between textual data, e.g. articles, tweets, books, music, web page content, code.
Approaches To: Text Analysis
- Sentiment Analysis
- TF-IDF
- Visualizations - Word Clouds & Networks
- Much more!
Sentiment Analysis
Programmatically infer the emotional content of text
Sentiment Lexicon
Dataset containing words classified by their sentiment
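A minimal sketch of lexicon-based scoring, assuming a tiny made-up lexicon (`TOY_LEXICON` and `sentiment_score` are hypothetical names, not from any published lexicon or library). Each word is looked up in the lexicon and the scores are summed; real lexicons are far larger and may use categories or graded scores instead of +1/-1.

```python
import re

# Hypothetical toy lexicon: word -> sentiment score (+1 positive, -1 negative).
TOY_LEXICON = {
    "good": 1, "great": 1, "happy": 1,
    "bad": -1, "awful": -1, "sad": -1,
}

def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the words in text; unknown words score 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)

# One positive word and two negative words give a net score below zero.
print(sentiment_score("The food was great but the service was awful and sad"))
```

A positive total suggests positive sentiment, a negative total suggests negative sentiment, and words missing from the lexicon contribute nothing.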
When doing sentiment analysis…
Token
A meaningful unit of text
- what you use for analysis
- tokenization takes a corpus of text and splits it into tokens (words, bigrams, etc.)
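Tokenization into words and bigrams could look roughly like the sketch below. The function names and the regular expression are illustrative assumptions; real tokenizers handle punctuation, contractions, and Unicode more carefully.

```python
import re

def tokenize_words(text):
    """Lowercase the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n=2):
    """Join each run of n consecutive tokens into one n-gram token (n=2 gives bigrams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize_words("Text analysis finds structure in text.")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

The same corpus yields different tokens depending on the unit chosen: six word tokens here, but five bigram tokens.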
Stop words
Words not helpful for analysis
- extremely common words such as “the”, “of”, “to”
- are typically removed from analysis
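Removing stop words amounts to filtering tokens against a list. The sketch below assumes a very small hand-written list (`TOY_STOP_WORDS` is hypothetical); in practice a curated stop word list from a text analysis library would be used.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
TOY_STOP_WORDS = {"the", "of", "to", "a", "and", "in", "is"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Return the tokens that are not in the stop word list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]

kept = remove_stop_words(["The", "structure", "of", "the", "text"])
print(kept)
```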
TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF)
TF-IDF measures how distinctive a word is to a document. It weighs the number of times the word appears in that document against the number of documents in the collection that contain the word.
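Term $t$ within document $d$:
$w_{t,d} = \mathrm{tf}_{t,d} \times \log\left(\frac{N}{\mathrm{df}_t}\right)$
$\mathrm{tf}_{t,d}$: frequency of $t$ in $d$
$\mathrm{df}_t$: number of documents containing $t$
$N$: total number of documents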
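The three quantities listed above (term frequency within a document, number of documents containing the term, and total number of documents) can be combined as term frequency times the log of total documents over document frequency. The sketch below assumes raw counts for term frequency and the natural logarithm; other variants (normalized counts, smoothing, different log bases) are common, and the function name `tf_idf` is illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute a TF-IDF weight for every term in every tokenized document.

    documents: list of token lists. Returns one {term: weight} dict per document.
    Assumes raw counts for term frequency and the natural log for the IDF part.
    """
    n_docs = len(documents)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        term_freq = Counter(doc)  # frequency of each term in this document
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "ate", "fish", "fish"],
]
weights = tf_idf(docs)
print(weights[2])
```

A term that appears in every document gets a weight of zero, since log(N/N) = 0, while a term concentrated in a single document scores highest there.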
Word clouds display words at sizes proportional to their frequency within the textual dataset
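The sizing rule behind a word cloud can be sketched without any plotting library: count each word, then scale its font size by its share of the highest count. The function name `word_sizes` and the `max_font_size` parameter are assumptions for illustration; actual word cloud tools also handle layout, colour, and minimum sizes.

```python
from collections import Counter

def word_sizes(tokens, max_font_size=48):
    """Map each word to a font size proportional to its frequency.

    The most frequent word gets max_font_size; the others scale linearly.
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    top_count = max(counts.values())
    return {word: max_font_size * count / top_count for word, count in counts.items()}

sizes = word_sizes(["data", "data", "text", "data", "text", "cloud"])
print(sizes)
```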
What if you don’t want to remove words from their context?