Source: https://drive.google.com/file/d/10O8pMwSy9209mP4a9pCctCYIVXa8oGpL/view?usp=sharing

Text Analysis

To determine structure, insights, and relationships within and between textual data, e.g. articles, tweets, books, music, web page content, and code.

Approaches to Text Analysis

  1. Sentiment Analysis
  2. TF-IDF
  3. Visualizations - Word Clouds & Networks
    • Much more!

Sentiment Analysis

Programmatically infer the emotional content of text

Sentiment Lexicon

Dataset containing words classified by their sentiment
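
As a rough illustration of how a lexicon can be used, the Python sketch below scores a sentence by summing the sentiment values of its words. The tiny `LEXICON` dictionary and the `sentiment_score` function are hypothetical, invented for this example; real sentiment lexicons are far larger and may use categories or graded scores instead of +1/-1.

```python
# Hypothetical toy lexicon: word -> sentiment value (+1 positive, -1 negative).
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "sad": -1, "awful": -1}

def sentiment_score(text: str) -> int:
    """Sum the lexicon values of the words in `text`; unknown words count as 0."""
    words = text.lower().split()
    # Strip simple trailing/leading punctuation so "awful," still matches "awful".
    return sum(LEXICON.get(word.strip(".,!?"), 0) for word in words)

sentence = "The food was great but the service was awful, really awful."
print(sentiment_score(sentence))  # one positive word, two negative words
```

A score above zero suggests mostly positive wording and a score below zero mostly negative wording, under the assumptions of this toy lexicon.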

When doing sentiment analysis…

Token

A meaningful unit of text

  • what you use for analysis
  • tokenization takes a corpus of text and splits it into tokens (words, bigrams, etc.)
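
A minimal Python sketch of tokenization, assuming a simple regular-expression definition of a "word". The helper names `tokenize_words` and `bigrams` are made up for this example; dedicated text-analysis libraries handle punctuation, contractions, and other languages more carefully.

```python
import re

def tokenize_words(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bigrams(tokens: list[str]) -> list[tuple[str, str]]:
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

corpus = "The cat sat on the mat."
words = tokenize_words(corpus)
print(words)           # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(bigrams(words))  # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```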

Stop words

Words not helpful for analysis

  • extremely common words such as “the”, “of”, “to”
  • are typically removed from analysis
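
Removing stop words can be sketched as a simple filter over the tokens. The `STOP_WORDS` set below is a deliberately tiny, hypothetical list and `remove_stop_words` is an illustrative helper; in practice a much longer, curated list would be used.

```python
# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "of", "to", "a", "and", "in", "on"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = ["the", "history", "of", "the", "city", "to", "remember"]
print(remove_stop_words(tokens))  # ['history', 'city', 'remember']
```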

TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF is a measure of the originality of a word in a document. It is obtained by comparing the number of times the word appears in that document with the number of documents the word appears in.

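For a term $t$ within a document $d$:

$$w_{t,d} = \mathrm{tf}_{t,d} \times \log\left(\frac{N}{\mathrm{df}_t}\right)$$

$\mathrm{tf}_{t,d}$: frequency of $t$ in $d$
$\mathrm{df}_t$: number of documents containing $t$
$N$: total number of documents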
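
The quantities above (term frequency, number of documents containing the term, and total number of documents) can be combined in a short Python sketch. It assumes the common weighting of raw term count times the natural log of N divided by document frequency; other variants (normalized counts, smoothing, different log bases) are also used. The `tf_idf` function and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: tf-idf weight} dictionary per tokenized document."""
    n_docs = len(documents)
    # df[t]: number of documents that contain term t at least once.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)  # tf[t]: raw count of term t in this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "ran"],
]
scores = tf_idf(docs)
print(scores[2])
```

In this toy corpus "the" appears in every document, so its weight is zero everywhere, while "ran" appears twice in a single document and receives the highest weight there.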

Word clouds display words at a size proportional to their frequency within the textual dataset
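
The sizing idea behind a word cloud can be sketched without any plotting: count each word, then scale a font size by its share of the top count. The `font_sizes` function and its `max_size` parameter are hypothetical and only illustrate the proportionality; actual word cloud tools also handle layout, colors, and their own scaling rules.

```python
from collections import Counter

def font_sizes(tokens: list[str], max_size: float = 48.0) -> dict[str, float]:
    """Map each word to a font size proportional to its frequency."""
    counts = Counter(tokens)
    if not counts:
        return {}
    top = max(counts.values())  # the most frequent word gets max_size
    return {word: max_size * n / top for word, n in counts.items()}

tokens = ["data", "text", "data", "word", "data", "text"]
print(font_sizes(tokens))  # {'data': 48.0, 'text': 32.0, 'word': 16.0}
```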

What if you don’t want to remove words from their context?
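
One possible way to keep words in context, assuming bigrams are used as tokens, is to count adjacent word pairs and treat each pair as a weighted edge in a word network. This is only a sketch: the `bigram_edges` helper and the sample sentence are hypothetical, and drawing the network would need a separate graph library.

```python
from collections import Counter

def bigram_edges(tokens: list[str]) -> Counter:
    """Count how often each ordered pair of adjacent words occurs."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "new york is not new jersey but new york is close".split()
edges = bigram_edges(tokens)

# Each (word, next_word) pair is an edge; its count is the edge weight.
for (first, second), weight in edges.items():
    print(f"{first} -> {second}: {weight}")
```

Here "new" on its own is ambiguous, but the pairs "new york" and "new jersey" preserve the context that single-word counts would lose.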