An N-gram is a sequence of N tokens (or words).
Using these n-grams and the probabilities of certain words occurring in certain sequences can improve the predictions of autocompletion systems, for example.
Similarly, we can use NLP and n-grams to train voice-based personal assistant bots. For example, using a 3-gram or trigram training model, a bot will be able to understand the difference between sentences such as “what’s the temperature?” and “set the temperature.”
More on N-Grams and Probabilities
Examples
Consider the sentence: “I love reading blogs about data science on Towards Science”
1-gram (Unigram)
“I”, “love”, “reading”, “blogs”, “about”, “data”, “science”, “on”, “Towards”, “Science”
2-gram (Bigram)
“I love”, “love reading”, or “Towards Science”
3-gram (Trigram)
“I love reading”, “about data science” or “on Towards Science”
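As a quick illustration, here is a minimal Python sketch that extracts these unigrams, bigrams and trigrams from the example sentence (the `ngrams` helper is our own; libraries such as NLTK provide an equivalent):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love reading blogs about data science on Towards Science"
tokens = sentence.split()

print(ngrams(tokens, 1))  # unigrams: ('I',), ('love',), ('reading',), ...
print(ngrams(tokens, 2))  # bigrams:  ('I', 'love'), ('love', 'reading'), ...
print(ngrams(tokens, 3))  # trigrams: ('I', 'love', 'reading'), ...
```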
Smoothing
Any finite corpus is missing many valid n-grams, so an unsmoothed model assigns them zero probability; smoothing techniques reserve some probability mass for these unseen n-grams.
Laplace Smoothing
- AKA Add-One Smoothing
- Method
  - Increment the frequency count of each word (n-gram) by 1
  - Normalise by adding the vocabulary size V to the denominator when dividing
- General Formula (bigram case, with V = vocabulary size)
  - P(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V)
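A minimal sketch of add-one smoothing for a bigram model, using a toy corpus of our own (the function and variable names are illustrative, not from the article):

```python
from collections import Counter

# Toy corpus (illustrative only)
corpus = [
    "I love reading blogs about data science".split(),
    "I love data".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)
V = len(unigram_counts)  # vocabulary size

def laplace_bigram_prob(prev_word, word):
    """P(word | prev_word): every bigram count is incremented by 1 and the
    vocabulary size V is added to the denominator. The unigram count stands in
    for the history count (sentence boundaries are ignored for simplicity)."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

print(laplace_bigram_prob("I", "love"))      # seen bigram
print(laplace_bigram_prob("love", "blogs"))  # unseen bigram still gets probability > 0
```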
Add-K Smoothing
- Extension of Add-One Smoothing: add a fractional pseudo-count k (e.g. 0.05) instead of 1
- Formula (bigram case)
  - P(w_i | w_{i-1}) = (C(w_{i-1} w_i) + k) / (C(w_{i-1}) + kV)
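Add-k is the same idea with the 1 replaced by a smaller pseudo-count. Continuing with the counts from the add-one sketch above (the value of k and the function name are our own choices):

```python
def add_k_bigram_prob(prev_word, word, k=0.05):
    # Same counts as the add-one sketch; k is a tunable hyperparameter.
    return (bigram_counts[(prev_word, word)] + k) / (unigram_counts[prev_word] + k * V)

print(add_k_bigram_prob("love", "blogs", k=1.0))   # k = 1 reproduces add-one smoothing
print(add_k_bigram_prob("love", "blogs", k=0.05))  # smaller k gives less mass to unseen bigrams
```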
Linear Interpolation
- Smooth by combining the probabilities from the current and lower-order n-gram models according to assigned weights
- The weights are learned on held-out data (the held-out method)
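A minimal sketch of linear interpolation for a trigram model; the individual model estimates and the lambda weights below are illustrative placeholders (in practice the weights are tuned on held-out data and must sum to 1):

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """Weighted mix of trigram, bigram and unigram estimates."""
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9, "interpolation weights must sum to 1"
    return l3 * p_trigram + l2 * p_bigram + l1 * p_unigram

# Even when the trigram estimate is zero (unseen trigram),
# the lower-order models still contribute probability mass.
print(interpolated_prob(p_trigram=0.0, p_bigram=0.12, p_unigram=0.02))
```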
Stupid Backoff
- Used in Web-Scale Models
  - For Retrieval Purposes
- Fall back to the lower-order n-gram model probabilities (scaled by a fixed back-off factor) when the current n-gram doesn’t exist in the counts
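A minimal sketch of stupid backoff over a hand-built table of n-gram counts (the toy counts are our own; 0.4 is the back-off factor commonly quoted for the method):

```python
def stupid_backoff_score(ngram, counts, alpha=0.4):
    """Relative frequency of the highest-order n-gram that has been seen,
    multiplied by alpha once per back-off step. Scores are not true probabilities."""
    if len(ngram) == 1:
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get(ngram, 0) / total
    history = ngram[:-1]
    if counts.get(ngram, 0) > 0 and counts.get(history, 0) > 0:
        return counts[ngram] / counts[history]
    return alpha * stupid_backoff_score(ngram[1:], counts, alpha)

counts = {
    ("I",): 2, ("love",): 2, ("data",): 2, ("science",): 1,
    ("I", "love"): 2, ("love", "data"): 1,
    ("I", "love", "data"): 1,
}
print(stupid_backoff_score(("I", "love", "data"), counts))       # trigram seen: 1/2
print(stupid_backoff_score(("love", "data", "science"), counts)) # backs off to the unigram "science"
```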
Good-Turing
- Relate the probability of unseen words to the probability of the rarest words in our corpus
- Formula
  - For Unknown Words: P(unseen) = N_1 / N
    - where N_1 is the combined frequency count of the classes with the lowest frequency (those seen exactly once)
  - For Known Words: P(class seen c times) = c* / N, with c* = (c + 1) · N_{c+1} / N_c
    - where N = Total Class Instances and N_c is the number of classes with frequency count c
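A minimal sketch of Good-Turing estimation over a toy token list (the example tokens are our own; N_c is the number of word types seen exactly c times):

```python
from collections import Counter

tokens = "I love reading blogs about data science I love data data".split()
word_counts = Counter(tokens)
freq_of_freq = Counter(word_counts.values())  # N_c: how many types occur exactly c times
N = sum(word_counts.values())                 # total class instances

# Probability mass reserved for unseen words: N_1 / N
p_unseen = freq_of_freq[1] / N

def good_turing_prob(word):
    """Adjusted probability c*/N with c* = (c + 1) * N_{c+1} / N_c.
    Falls back to the raw estimate when no types have count c + 1 (a common practical tweak)."""
    c = word_counts[word]
    if c == 0:
        return p_unseen
    if freq_of_freq[c + 1] == 0:
        return c / N
    c_star = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
    return c_star / N

print(p_unseen)                    # mass for unseen words
print(good_turing_prob("I"))       # seen twice: adjusted count (2 + 1) * N_3 / N_2
print(good_turing_prob("robots"))  # unseen word
```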
Cons
- No long-term dependency
  - An n-gram model only conditions on the previous n − 1 words, which limits the context available when predicting the next word
- Generating text with an n-gram model is usually not a good idea
  - n-gram models tend to overfit and reproduce the same text they saw in the training data, so there is little creativity in the output