An N-gram is a sequence of N tokens (or words).
Using these n-grams and the probabilities of certain words occurring in certain sequences can improve the predictions of autocompletion systems, for example.
Similarly, we can use NLP and n-grams to train voice-based personal assistant bots. For example, using a 3-gram or trigram training model, a bot will be able to understand the difference between sentences such as “what’s the temperature?” and “set the temperature.”
More on N-Grams and Probabilities
Examples
Consider the sentence: “I love reading blogs about data science on Towards Science”
1-gram (Unigram)
“I”, “love”, “reading”, “blogs”, “about”, “data”, “science”, “on”, “Towards”, “Science”
2-gram (Bigram)
“I love”, “love reading”, or “Towards Science”
3-gram (Trigram)
“I love reading”, “about data science” or “on Towards Science”
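As a quick illustration, here is a minimal Python sketch that extracts these unigrams, bigrams and trigrams from the example sentence (the `ngrams` helper is our own; libraries such as NLTK provide an equivalent):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love reading blogs about data science on Towards Science"
tokens = sentence.split()

print(ngrams(tokens, 1))  # unigrams: ('I',), ('love',), ('reading',), ...
print(ngrams(tokens, 2))  # bigrams:  ('I', 'love'), ('love', 'reading'), ...
print(ngrams(tokens, 3))  # trigrams: ('I', 'love', 'reading'), ...
```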
Smoothing
Any finite corpus is missing many valid n-grams, so an unsmoothed model assigns them zero probability; smoothing techniques reserve some probability mass for these unseen n-grams.
Laplace Smoothing
- AKA Add-One Smoothing
- Method
  - Increment the frequency count of each word (n-gram) by 1
  - Normalise by adding the vocabulary size V to the denominator when dividing
- General Formula (bigram case, with V = vocabulary size)
  - P(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V)
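A minimal sketch of add-one smoothing for a bigram model, using a toy corpus of our own (the function and variable names are illustrative, not from the article):

```python
from collections import Counter

# Toy corpus (illustrative only)
corpus = [
    "I love reading blogs about data science".split(),
    "I love data".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)
V = len(unigram_counts)  # vocabulary size

def laplace_bigram_prob(prev_word, word):
    """P(word | prev_word): every bigram count is incremented by 1 and the
    vocabulary size V is added to the denominator. The unigram count stands in
    for the history count (sentence boundaries are ignored for simplicity)."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

print(laplace_bigram_prob("I", "love"))      # seen bigram
print(laplace_bigram_prob("love", "blogs"))  # unseen bigram still gets probability > 0
```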
Add-K Smoothing
- Extension of Add-One Smoothing: add a fractional pseudo-count k (e.g. 0.05) instead of 1
- Formula (bigram case)
  - P(w_i | w_{i-1}) = (C(w_{i-1} w_i) + k) / (C(w_{i-1}) + kV)
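Add-k is the same idea with the 1 replaced by a smaller pseudo-count. Continuing with the counts from the add-one sketch above (the value of k and the function name are our own choices):

```python
def add_k_bigram_prob(prev_word, word, k=0.05):
    # Same counts as the add-one sketch; k is a tunable hyperparameter.
    return (bigram_counts[(prev_word, word)] + k) / (unigram_counts[prev_word] + k * V)

print(add_k_bigram_prob("love", "blogs", k=1.0))   # k = 1 reproduces add-one smoothing
print(add_k_bigram_prob("love", "blogs", k=0.05))  # smaller k gives less mass to unseen bigrams
```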
Linear Interpolation
- Smooth by combining the probabilities from the current and lower-order n-gram models according to assigned weights
- The weights are learned on held-out data (the held-out method)
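A minimal sketch of linear interpolation for a trigram model; the individual model estimates and the lambda weights below are illustrative placeholders (in practice the weights are tuned on held-out data and must sum to 1):

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """Weighted mix of trigram, bigram and unigram estimates."""
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9, "interpolation weights must sum to 1"
    return l3 * p_trigram + l2 * p_bigram + l1 * p_unigram

# Even when the trigram estimate is zero (unseen trigram),
# the lower-order models still contribute probability mass.
print(interpolated_prob(p_trigram=0.0, p_bigram=0.12, p_unigram=0.02))
```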
Stupid Backoff
- Used in Web-Scale Models
  - For Retrieval Purposes
- Fall back to the lower-order n-gram model probabilities (scaled by a fixed back-off factor) when the current n-gram doesn’t exist in the counts
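A minimal sketch of stupid backoff over a hand-built table of n-gram counts (the toy counts are our own; 0.4 is the back-off factor commonly quoted for the method):

```python
def stupid_backoff_score(ngram, counts, alpha=0.4):
    """Relative frequency of the highest-order n-gram that has been seen,
    multiplied by alpha once per back-off step. Scores are not true probabilities."""
    if len(ngram) == 1:
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get(ngram, 0) / total
    history = ngram[:-1]
    if counts.get(ngram, 0) > 0 and counts.get(history, 0) > 0:
        return counts[ngram] / counts[history]
    return alpha * stupid_backoff_score(ngram[1:], counts, alpha)

counts = {
    ("I",): 2, ("love",): 2, ("data",): 2, ("science",): 1,
    ("I", "love"): 2, ("love", "data"): 1,
    ("I", "love", "data"): 1,
}
print(stupid_backoff_score(("I", "love", "data"), counts))       # trigram seen: 1/2
print(stupid_backoff_score(("love", "data", "science"), counts)) # backs off to the unigram "science"
```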
Good-Turing
- Relate the probability of unseen words to the probability of the rarest words in our corpus
- Formula
  - For Unknown Words: P(unseen) = N_1 / N
    - where N_1 is the combined frequency count of the classes with the lowest frequency (those seen exactly once)
  - For Known Words: P(class seen c times) = c* / N, with c* = (c + 1) · N_{c+1} / N_c
    - where N = Total Class Instances and N_c is the number of classes with frequency count c
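A minimal sketch of Good-Turing estimation over a toy token list (the example tokens are our own; N_c is the number of word types seen exactly c times):

```python
from collections import Counter

tokens = "I love reading blogs about data science I love data data".split()
word_counts = Counter(tokens)
freq_of_freq = Counter(word_counts.values())  # N_c: how many types occur exactly c times
N = sum(word_counts.values())                 # total class instances

# Probability mass reserved for unseen words: N_1 / N
p_unseen = freq_of_freq[1] / N

def good_turing_prob(word):
    """Adjusted probability c*/N with c* = (c + 1) * N_{c+1} / N_c.
    Falls back to the raw estimate when no types have count c + 1 (a common practical tweak)."""
    c = word_counts[word]
    if c == 0:
        return p_unseen
    if freq_of_freq[c + 1] == 0:
        return c / N
    c_star = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
    return c_star / N

print(p_unseen)                    # mass for unseen words
print(good_turing_prob("I"))       # seen twice: adjusted count (2 + 1) * N_3 / N_2
print(good_turing_prob("robots"))  # unseen word
```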
Cons
- No long-term dependency
  - An n-gram model only conditions on the previous n − 1 words, which limits the context available when predicting the next word
- Generating text with an n-gram model is usually not a good idea
  - n-gram models tend to overfit and reproduce the same text they saw in the training data, so there is little creativity in the output