N-gram Language Models

In the realm of Natural Language Processing (NLP), language models play a pivotal role in understanding and generating human language. Among the various models, N-gram models stand out for their simplicity and effectiveness.

An N-gram model is a type of probabilistic language model used to predict the next item in a sequence, such as the next word in a sentence. The term “N-gram” refers to a contiguous sequence of N items from a given sample of text or speech. For example:

  • Unigram (1-gram): Predicts each word independently. Example: “the”
  • Bigram (2-gram): Considers the previous word. Example: “the cat”
  • Trigram (3-gram): Considers the two previous words. Example: “the cat sat”

These models rely on the Markov assumption, which simplifies the prediction of the next word by assuming it depends only on the previous (N-1) words. The general formula for an N-gram model is:

\[P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})\]

where \(P(w_n \mid w_{1:n-1})\) is the probability of the word \(w_n\) given the sequence of preceding words \(w_{1:n-1}\).
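To make the Markov assumption concrete, here is a minimal sketch of how bigram counts and maximum likelihood conditional probabilities might be computed. The tiny corpus, the `<s>`/`</s>` boundary markers, and the function names are illustrative assumptions, not part of any particular library.

```python
from collections import Counter

# Toy corpus; in practice these would be tokenized sentences from real data.
corpus = ["the cat sat", "the cat ran", "the dog sat"]

unigram_counts = Counter()
bigram_counts = Counter()

for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]  # add sentence boundary markers
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))   # count adjacent word pairs

def mle_bigram_prob(prev_word, word):
    """Maximum likelihood estimate of P(word | prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(mle_bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" in 2 of its 3 occurrences
```

The same pattern extends to trigrams by counting \((w_{n-2}, w_{n-1}, w_n)\) triples and conditioning on the two-word context.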

Evaluating Language Models: Perplexity

Perplexity is a crucial metric for evaluating the performance of language models. It measures how well a language model predicts a sample of text. Perplexity is defined as the inverse probability of the test set, normalized by the number of words. A lower perplexity score indicates a better-performing model. The formula for perplexity is:

\[\text{Perplexity}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}\]

or equivalently,

\[\text{Perplexity}(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{1:i-1})} \right)^{\frac{1}{N}}\]

This metric provides an intuitive measure: when the model is frequently unsure about the next word, it assigns low probabilities to the words that actually occur, and the resulting perplexity is higher, indicating greater uncertainty.
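As an illustration, a perplexity computation might look like the following sketch. The per-word probabilities are made up; in practice they would come from a trained model such as the bigram estimator sketched above. Working in log space avoids numerical underflow on long test sets.

```python
import math

def perplexity(conditional_probs):
    """Perplexity from the conditional probability assigned to each test word.

    Equivalent to the formula above, but computed in log space so that the
    product of many small probabilities does not underflow.
    """
    n = len(conditional_probs)
    log_prob_sum = sum(math.log(p) for p in conditional_probs)
    return math.exp(-log_prob_sum / n)

# Illustrative per-word probabilities P(w_i | w_{1:i-1}) for a 4-word test sequence.
probs = [0.2, 0.1, 0.25, 0.05]
print(perplexity(probs))  # higher values indicate a more "surprised" model
```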

The Challenge of Zero Probabilities

One of the primary challenges with N-gram models is the assignment of zero probability to unseen N-grams. For example, if a bigram model has never encountered the sequence “blue elephant” in the training data, it will assign a probability of zero to this sequence, regardless of its plausibility.
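The problem is easy to reproduce with unsmoothed maximum likelihood estimates; the counts below are invented solely for illustration.

```python
from collections import Counter

# Toy counts in which the bigram ("blue", "elephant") was never observed.
bigram_counts = Counter({("the", "cat"): 2, ("the", "dog"): 1})
unigram_counts = Counter({"the": 3, "cat": 2, "dog": 1, "blue": 1, "elephant": 1})

p = bigram_counts[("blue", "elephant")] / unigram_counts["blue"]
print(p)  # 0.0 -- any sentence containing this bigram gets probability 0,
          # so its perplexity is undefined (effectively infinite)
```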

Overview of Smoothing Techniques in N-Gram Language Models

Smoothing techniques are used to adjust the maximum likelihood estimates of N-gram probabilities to account for unseen N-grams in the training data. This ensures that the model can handle novel sequences of words gracefully by assigning them non-zero probabilities.

Common smoothing techniques include:

1. Laplace (Add-One) Smoothing

Laplace smoothing, also known as add-one smoothing, is the simplest form of smoothing. It adds one to each count to avoid zero probabilities.

Formula:
For a bigram model, the probability \(P_{\text{Laplace}}\) is calculated as:

\[P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + V}\]

where:

  • \(C(w_{n-1}, w_n)\) is the count of the bigram \((w_{n-1}, w_n)\),
  • \(C(w_{n-1})\) is the count of the preceding word,
  • \(V\) is the size of the vocabulary.

Example:
Consider a corpus with the following bigram counts:

  • “I love”: 1
  • “love programming”: 1
  • “programming is”: 0

With a vocabulary size \(V\) of 4 (I, love, programming, is), the Laplace smoothed probability of “programming is” becomes:

\[P_{\text{Laplace}}(\text{is} \mid \text{programming}) = \frac{0 + 1}{1 + 4} = \frac{1}{5} = 0.2\]
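The worked example can be reproduced with a short sketch; the counts, vocabulary, and function name are taken directly from the example above rather than from any standard library.

```python
# Bigram and unigram counts from the example; V is the vocabulary size.
bigram_counts = {("I", "love"): 1, ("love", "programming"): 1}
unigram_counts = {"I": 1, "love": 1, "programming": 1, "is": 1}
V = len(unigram_counts)  # 4

def laplace_prob(prev_word, word):
    """P_Laplace(word | prev_word) with add-one smoothing."""
    numerator = bigram_counts.get((prev_word, word), 0) + 1
    denominator = unigram_counts.get(prev_word, 0) + V
    return numerator / denominator

print(laplace_prob("programming", "is"))  # (0 + 1) / (1 + 4) = 0.2
```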

2. Add-K Smoothing

Add-K smoothing is a generalization of add-one smoothing where a fractional count \(k\) is added instead of one.

Formula:
\[P_{\text{Add-K}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + k}{C(w_{n-1}) + kV}\]

where \(k\) is a small positive constant.

Example:
Using the same corpus and setting \(k = 0.5\):

\[P_{\text{Add-K}}(\text{is} \mid \text{programming}) = \frac{0 + 0.5}{1 + 0.5 \times 4} = \frac{0.5}{3} = 0.167\]
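The Laplace sketch above generalizes directly: replacing the pseudo-count of 1 with a parameter k (reusing the illustrative bigram_counts, unigram_counts, and V defined there) reproduces the 0.167 figure for k = 0.5.

```python
def add_k_prob(prev_word, word, k=0.5):
    """P_Add-K(word | prev_word): add a fractional pseudo-count k to every bigram."""
    numerator = bigram_counts.get((prev_word, word), 0) + k
    denominator = unigram_counts.get(prev_word, 0) + k * V
    return numerator / denominator

print(add_k_prob("programming", "is", k=0.5))  # 0.5 / 3 ≈ 0.167
print(add_k_prob("programming", "is", k=1))    # reduces to the Laplace estimate, 0.2
```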

3. Good-Turing Smoothing

Good-Turing smoothing adjusts N-gram probabilities based on the frequency of frequencies (how many N-grams occur exactly once, exactly twice, and so on), effectively reallocating probability mass from observed N-grams to unseen ones.

Formula:
\[P_{\text{GT}}(w_n \mid w_{n-1}) = \frac{(C(w_{n-1}, w_n) + 1) \cdot N_{C+1}}{N_C \cdot C(w_{n-1})}\]

where:

  • \(C = C(w_{n-1}, w_n)\) is the observed count of the bigram,
  • \(N_C\) is the number of N-grams that occur exactly \(C\) times in the training data.

Example:
Suppose the bigram “I love” has a count of 2 and 10 distinct bigrams occur exactly once in the training corpus (\(N_1 = 10\)). Good-Turing discounts the count of “I love” to the adjusted value \((2+1) \cdot N_3 / N_2\), and the total probability mass reserved for unseen bigrams is \(N_1 / N\), where \(N\) is the total number of observed bigrams. In this way, probability is redistributed from observed N-grams to those never seen in training.
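A minimal sketch of the count adjustment at the heart of Good-Turing is shown below. The bigram counts are invented so that \(N_1 = 10\) and “I love” has a count of 2, matching the example; the fallback for a missing \(N_{C+1}\) stands in for the smoothing of the \(N_C\) values (e.g. Simple Good-Turing) that practical implementations require.

```python
from collections import Counter

# Illustrative bigram counts; real corpora would contain far more.
bigram_counts = Counter({
    ("I", "love"): 2,
    ("love", "programming"): 1, ("programming", "is"): 1, ("is", "fun"): 1,
    ("you", "love"): 1, ("love", "reading"): 1, ("reading", "is"): 1,
    ("it", "is"): 1, ("is", "hard"): 1, ("so", "hard"): 1, ("love", "it"): 1,
})

total = sum(bigram_counts.values())             # N: total observed bigram tokens
freq_of_freq = Counter(bigram_counts.values())  # N_c: how many bigrams occur exactly c times

def adjusted_count(c):
    """Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c."""
    if freq_of_freq[c + 1] == 0:
        return c  # fall back to the raw count when N_{c+1} is unavailable
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

print(freq_of_freq[1])          # N_1 = 10 bigrams seen exactly once
print(adjusted_count(1))        # c* for once-seen bigrams: 2 * N_2 / N_1 = 2 * 1 / 10 = 0.2
print(freq_of_freq[1] / total)  # probability mass reserved for unseen bigrams: N_1 / N
```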

4. Kneser-Ney Smoothing

Kneser-Ney smoothing is an advanced technique that not only discounts the counts but also redistributes the probability mass based on the likelihood of a word being a novel continuation.

Formula:
The bigram probability in interpolated Kneser-Ney smoothing is:

\[P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) P_{\text{CONT}}(w_i)\]

where:

  • \(d\) is a discount constant,
  • \(P_{\text{CONT}}(w_i)\) is the continuation probability of \(w_i\),
  • \(\lambda(w_{i-1})\) is a normalizing constant that distributes the discounted probability mass, ensuring the probabilities sum to 1.

Continuation Probability:
\[P_{\text{CONT}}(w) = \frac{| \{ v : C(vw) > 0 \} |}{\sum_{w'} | \{ v : C(vw') > 0 \} |}\]

Example:
For a bigram “reading glasses”, if “glasses” follows many different words, it will have a higher continuation probability compared to a word like “Kong”, which mainly follows “Hong”.
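The sketch below puts these pieces together for an interpolated Kneser-Ney bigram model with a fixed absolute discount (0.75 is a common default). The toy sentences are chosen only to echo the “glasses” versus “Kong” contrast, and none of the names come from an existing library.

```python
from collections import Counter, defaultdict

# Illustrative tokenized corpus.
sentences = [["hong", "kong", "is", "busy"],
             ["reading", "glasses", "are", "useful"],
             ["my", "glasses", "broke"],
             ["new", "glasses", "are", "expensive"]]

bigram_counts = Counter()
context_counts = Counter()
for tokens in sentences:
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

# For each word, the set of distinct words it has been seen to follow.
preceding_types = defaultdict(set)
for (prev, word) in bigram_counts:
    preceding_types[word].add(prev)
total_bigram_types = len(bigram_counts)

def continuation_prob(word):
    """P_CONT(w): fraction of distinct bigram types that end in w."""
    return len(preceding_types[word]) / total_bigram_types

def kneser_ney_prob(prev, word, d=0.75):
    """Interpolated Kneser-Ney bigram probability with absolute discount d."""
    context_total = context_counts[prev]
    if context_total == 0:
        return continuation_prob(word)  # unseen context: rely on the continuation distribution
    discounted = max(bigram_counts[(prev, word)] - d, 0) / context_total
    # lambda(prev): the discounted mass, spread according to P_CONT
    distinct_continuations = sum(1 for (p, _) in bigram_counts if p == prev)
    lam = (d / context_total) * distinct_continuations
    return discounted + lam * continuation_prob(word)

print(continuation_prob("glasses"))  # follows 3 distinct words -> relatively high (0.3)
print(continuation_prob("kong"))     # only ever follows "hong" -> low (0.1)
print(kneser_ney_prob("reading", "glasses"))
```

Using a single fixed discount follows common practice for a sketch; modified Kneser-Ney, often used in production toolkits, applies separate discounts for counts of 1, 2, and 3 or more.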

Practical Context and Usage

Laplace Smoothing is often used in text classification tasks, for example with Naive Bayes classifiers, where simplicity is preferred over precise probability estimates.

Add-K Smoothing generalizes Laplace smoothing; a fractional \(k\), typically tuned on a held-out set, avoids shifting as much probability mass toward unseen events as add-one smoothing does.

Good-Turing Smoothing is valuable in speech recognition and other applications where accurate estimation of low-frequency events is crucial.

Kneser-Ney Smoothing is particularly effective in machine translation and large-scale language models where capturing the diversity of contexts is important.

Limitations of N-Gram Models

Despite their usefulness, N-gram models have several limitations:

  1. Data Sparsity: As the value of \(N\) increases, the model requires exponentially more data to reliably estimate probabilities. This leads to sparsity, where many possible N-grams are never observed in the training data.
  2. Fixed Context Size: N-gram models consider only a fixed number of previous words (defined by \(N\)). This limitation means they cannot capture long-range dependencies in the text.
  3. Curse of Dimensionality: The number of parameters grows exponentially with \(N\), making it impractical to store and compute probabilities for large N-grams in extensive vocabularies.
  4. Independence Assumption: N-gram models rely on the Markov assumption that the probability of a word depends only on the preceding \(N-1\) words. This assumption often does not hold true, especially in natural language, where context and meaning can span longer phrases and sentences.

While smoothing techniques mitigate some issues by addressing unseen N-grams, these inherent limitations necessitate more advanced models for many NLP tasks. Modern neural network-based models, such as Recurrent Neural Networks (RNNs) and Transformer models, offer more sophisticated approaches to capturing language nuances and dependencies.