An N-gram is a sequence of N tokens (or words).

Using these n-grams and the probabilities of certain words occurring in certain sequences can improve the predictions of autocompletion systems, for example.

Similarly, we can use NLP and n-grams to train voice-based personal assistant bots. For example, using a trigram (3-gram) model, a bot can learn to distinguish between sentences such as “what’s the temperature?” and “set the temperature.”

More on N-Grams and Probabilities

Examples

Consider the sentence: “I love reading blogs about data science on Towards Science”

1-gram (Unigram)

“I”, “love”, “reading”, “blogs”, “about”, “data”, “science”, “on”, “Towards”, “Science”

2-gram (Bigram)

“I love”, “love reading”, or “Towards Science”

3-gram (Trigram)

“I love reading”, “about data science” or “on Towards Science”
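
To make this concrete, here is a minimal Python sketch that extracts these n-grams (the function name and the naive whitespace tokenisation are illustrative choices, not from the original text; a real pipeline would use a proper tokeniser):

```python
def ngrams(sentence, n):
    """Return the list of n-grams (joined as strings) for a sentence."""
    tokens = sentence.split()  # naive whitespace tokenisation
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love reading blogs about data science on Towards Science"
print(ngrams(sentence, 1))  # unigrams: ['I', 'love', 'reading', ...]
print(ngrams(sentence, 2))  # bigrams:  ['I love', 'love reading', ...]
print(ngrams(sentence, 3))  # trigrams: ['I love reading', ...]
```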

Smoothing

Laplace Smoothing

  • AKA Add One Smoothing

  • Method

    • Increment the frequency count of each word (n-gram) by 1
    • Normalise by adding the vocabulary size V to the denominator when dividing
  • General Formula

    P_{Laplace}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}

    where C(·) is a raw frequency count and V is the vocabulary size

  • Add K Smoothing

    • Extension of Add One Smoothing: instead of adding 1, add a fractional count k

    • Formula

      P_{Add-k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + k}{C(w_{i-1}) + kV}

      (a toy implementation of both follows below)
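
As a rough sketch (the toy corpus, counts, and function name below are illustrative, not from the original text), add-k smoothing for bigram probabilities can be written as follows, with k = 1 recovering Laplace (add-one) smoothing:

```python
from collections import Counter

def smoothed_bigram_prob(w_prev, w, bigrams, unigrams, vocab_size, k=1.0):
    """P(w | w_prev) with add-k smoothing; k = 1 gives Laplace smoothing."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

tokens = "I love reading blogs about data science on Towards Science".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

# A seen bigram keeps most of its probability mass...
print(smoothed_bigram_prob("I", "love", bigrams, unigrams, V))
# ...while an unseen bigram gets a small non-zero probability instead of 0.
print(smoothed_bigram_prob("I", "science", bigrams, unigrams, V))
```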

Linear Interpolation

  • Smooth according to the weights assigned to the current and lower-order n-gram models, e.g. for a trigram model:

    P(w_i \mid w_{i-2} w_{i-1}) = \lambda_3 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_1 P(w_i), \quad \lambda_1 + \lambda_2 + \lambda_3 = 1

  • The weights are learned through the held-out method, i.e. tuned on a held-out portion of the data (see the sketch below)
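
A minimal sketch of linear interpolation for a trigram model over the same kind of toy corpus (the fixed lambda values here are placeholders; in practice they would be tuned on held-out data):

```python
from collections import Counter

tokens = "I love reading blogs about data science on Towards Science".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)

def interpolated_prob(w2, w1, w, lambdas=(0.6, 0.3, 0.1)):
    """P(w | w2 w1) as a weighted mix of trigram, bigram, and unigram
    estimates; the lambdas must sum to 1."""
    l3, l2, l1 = lambdas
    p3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p1 = uni[w] / total
    return l3 * p3 + l2 * p2 + l1 * p1

print(interpolated_prob("I", "love", "reading"))
```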

Stupid Backoff

  • Used in Web-Scale Models

    • For Retrieval Purposes
  • Fall back to lower-order n-gram estimates, discounted by a fixed factor (typically 0.4), when the current n-gram hasn’t been seen; the results are scores rather than true probabilities (see the sketch below)
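
Here is a sketch of stupid backoff over the same kind of toy counts (the corpus and function name are illustrative; the 0.4 discount is the commonly used value):

```python
from collections import Counter

tokens = "I love reading blogs about data science on Towards Science".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)

def stupid_backoff(w2, w1, w, alpha=0.4):
    """Score(w | w2 w1): relative frequency of the trigram if it was seen,
    otherwise back off to the bigram, then the unigram, multiplying by
    alpha at each step. These are scores, not probabilities."""
    if tri[(w2, w1, w)] > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi[(w1, w)] > 0:
        return alpha * bi[(w1, w)] / uni[w1]
    return alpha * alpha * uni[w] / total

print(stupid_backoff("I", "love", "reading"))      # seen trigram
print(stupid_backoff("data", "science", "blogs"))  # unseen: backs off twice
```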

Good-Turing

  • Relate the probability of unseen words to the probability of the rarest words in our corpus

  • Formula

    For Unknown Words:

    P_{GT}(\text{unseen}) = \frac{N_1}{N}

    where N_1 is the combined frequency count of the classes with the lowest frequency (the classes seen exactly once)

    For Known Words:

    c^* = (c + 1) \frac{N_{c+1}}{N_c}, \qquad P_{GT}(w) = \frac{c^*}{N}

    where N = Total Class Instances and N_c is the combined frequency count of the classes at the c-th frequency rank (a toy example follows below)
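
A toy Good-Turing re-estimation over unigram counts (the corpus is illustrative; real implementations first smooth the N_c values, since N_{c+1} can be zero for larger c, and this sketch skips that step):

```python
from collections import Counter

tokens = "I love reading blogs about data data science on on on Towards Science".split()
counts = Counter(tokens)        # frequency of each word (class)
N = sum(counts.values())        # total class instances
Nc = Counter(counts.values())   # N_c: number of classes seen exactly c times

# Probability mass reserved for unseen words: N_1 / N
p_unseen = Nc[1] / N

def good_turing_prob(word):
    """P_GT(word) = c* / N with the adjusted count c* = (c + 1) * N_{c+1} / N_c."""
    c = counts[word]
    if Nc[c + 1] == 0:          # no class with count c + 1: keep raw estimate
        return c / N
    c_star = (c + 1) * Nc[c + 1] / Nc[c]
    return c_star / N

print(p_unseen)                  # 8/13 of the mass goes to unseen words here
print(good_turing_prob("data"))  # c = 2 -> c* = 3 * N_3 / N_2
```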

Cons

  • No long-term dependencies: the model only sees a fixed window of previous tokens

  • This might limit the available context when predicting the next word

  • Generating text with n-grams might not be a good idea

    • n-gram models might overfit the data and start reproducing the same text they saw in the training data, so there won’t be any creativity