Source: https://drive.google.com/file/d/1YmR9Na-so1Y_m3yRA0jNqZbIqvleg3HA/view
Types of Text Analysis
Structured (supervised)
- Specify features/labels using linguistic theory
- Train classifiers on explicitly labelled examples to learn text features (e.g. a spam classifier or a sentiment classifier)
Distributional (unsupervised/semi-supervised)
- Learn features in an unsupervised way
- Know a word by the company it keeps
- Learn to predict and cluster language based on these features
How does distributional learning work?
What is the last word in each sentence?
- Roses are red and violets are blue.
- I like my coffee with cream and sugar.
- Can you put cream cheese on my bagel?
What does the word dax mean in this context?
- A new dax costs less than you might expect.
- You can fit five people in a dax.
- The dax made a right at the junction and got stuck in traffic.
The distributional hypothesis
“You shall know a word by the company it keeps.” - J.R. Firth (1957)
“The meaning of a word is determined by how it is used.” - Ludwig Wittgenstein (1953)
Computational models take advantage of these patterns
From data like this:
- There is a lovely house on that street.
- My house is on that road.
- I drove my car down the road.
- Did you park on the street?
- There are some kids playing soccer on the road.
Models can learn:
- “House” and “street” are more likely to appear in the same sentences.
- “House” and “road” appear in the same sentences.
- “Street” and “road” follow the words “the” or “that”.
- “Street” and “road” are similar words.
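As a minimal sketch of how such patterns can be extracted, the snippet below counts sentence-level co-occurrences over the five example sentences above (plain Python counters are an illustrative choice, not a method from the slides):

```python
from collections import Counter
from itertools import combinations

sentences = [
    "there is a lovely house on that street",
    "my house is on that road",
    "i drove my car down the road",
    "did you park on the street",
    "there are some kids playing soccer on the road",
]

# Count how often each pair of words appears in the same sentence.
pair_counts = Counter()
for sentence in sentences:
    words = set(sentence.split())  # one count per sentence, ignoring repeats
    for pair in combinations(sorted(words), 2):
        pair_counts[pair] += 1

print(pair_counts[("house", "street")])  # 1
print(pair_counts[("house", "road")])    # 1
print(pair_counts[("road", "the")])      # 2
```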
Early computational models: Latent Semantic Analysis
- Bag of words: represent each document as a vector of word counts, ignoring order
- Dimensionality reduction: compress the document-word count matrix into a small number of latent dimensions
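A minimal sketch of the LSA pipeline, assuming scikit-learn is available (the toy corpus is invented for illustration): build a bag-of-words count matrix, then compress it with truncated SVD.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "there is a lovely house on that street",
    "my house is on that road",
    "i drove my car down the road",
    "did you park on the street",
]

# Bag of words: one row per document, one column per vocabulary word.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# Dimensionality reduction: truncated SVD projects the count matrix
# onto a small number of latent dimensions ("topics").
svd = TruncatedSVD(n_components=2)
doc_vectors = svd.fit_transform(counts)
print(doc_vectors.shape)  # (4, 2): each document is now a 2-d vector
```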
Bag of words drawbacks
- No representation of word order
- Man bites dog = dog bites man
- Counts depend on document length
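The missing word order is easy to demonstrate: in the sketch below (again using scikit-learn's CountVectorizer as one bag-of-words implementation), "man bites dog" and "dog bites man" receive identical vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(["man bites dog", "dog bites man"]).toarray()

# Same words, same counts, so bag of words cannot tell the sentences apart.
print((vectors[0] == vectors[1]).all())  # True
```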
Skip-Gram Models: Word2Vec
Skip-gram models train a shallow neural network to predict a word's surrounding context words from the word itself.
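A minimal training sketch, assuming the gensim library (the corpus and hyperparameters are illustrative, not from the slides):

```python
from gensim.models import Word2Vec

sentences = [
    "there is a lovely house on that street".split(),
    "my house is on that road".split(),
    "i drove my car down the road".split(),
    "did you park on the street".split(),
]

# sg=1 selects the skip-gram architecture: the network is trained to
# predict each word's surrounding context words from the word itself.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv.similarity("street", "road"))
```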
Vector spaces
Word vectors are points in a high-dimensional vector space; words with similar meanings end up close together.
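Closeness in the vector space is typically measured with cosine similarity; a minimal sketch with made-up three-dimensional vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between u and v: 1.0 means same direction,
    # 0.0 means orthogonal (unrelated), -1.0 means opposite.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

street = np.array([0.8, 0.1, 0.3])
road = np.array([0.7, 0.2, 0.4])
print(cosine_similarity(street, road))
```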
Large Language Models
Principles
- Similar to Word2Vec
- Neural network trained to predict the next word
- Contextualised representations (see the sketch after this list)
- That dog has a loud bark.
- That tree has a brown bark.
- Learns to pay attention to the right words.
- Much larger (billions of parameters vs. millions for Word2Vec)
- Bigger models tend to perform better
- Few-shot learning from a handful of examples in the prompt, rather than training on many labelled examples
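As a sketch of contextualised representations, the snippet below compares the vector for "bark" in the two sentences above, assuming the Hugging Face transformers library and the bert-base-uncased model (BERT is an illustrative stand-in, not necessarily the model the slides discuss):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bark_vector(sentence):
    # Return the contextual embedding of the token "bark" in the sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bark")]

dog = bark_vector("That dog has a loud bark.")
tree = bark_vector("That tree has a brown bark.")

# Unlike Word2Vec, the same word gets different vectors in different contexts.
print(torch.cosine_similarity(dog.unsqueeze(0), tree.unsqueeze(0)).item())
```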
Using LLMs to do Data Science
- Writing code
- Critiquing code
- Learning about new paradigms
- Writing documentation
- Cleaning messy data
- Generating data
- The future (automating your own job?)