N-gram Language Models

In the realm of Natural Language Processing (NLP), language models play a pivotal role in understanding and generating human language. Among the various models, N-gram models stand out for their simplicity and effectiveness.

An N-gram model is a type of probabilistic language model used to predict the next item in a sequence, such as the next word in a sentence. The term “N-gram” refers to a contiguous sequence of N items from a given sample of text or speech. For example:

  • Unigram (1-gram): Predicts each word independently. Example: “the”
  • Bigram (2-gram): Considers the previous word. Example: “the cat”
  • Trigram (3-gram): Considers the two previous words. Example: “the cat sat”

These models rely on the Markov assumption, which simplifies the prediction of the next word by assuming it depends only on the previous (N-1) words. The general formula for an N-gram model is:

\[P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})\]

where \(P(w_n \mid w_{1:n-1})\) is the probability of the word \(w_n\) given the sequence of preceding words \(w_{1:n-1}\).
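To make the Markov assumption concrete, here is a minimal sketch of how bigram counts and maximum likelihood conditional probabilities might be computed. The tiny corpus, the `<s>`/`</s>` boundary markers, and the function names are illustrative assumptions, not part of any particular library.

```python
from collections import Counter

# Toy corpus; in practice these would be tokenized sentences from real data.
corpus = ["the cat sat", "the cat ran", "the dog sat"]

unigram_counts = Counter()
bigram_counts = Counter()

for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]  # add sentence boundary markers
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))   # count adjacent word pairs

def mle_bigram_prob(prev_word, word):
    """Maximum likelihood estimate of P(word | prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(mle_bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" in 2 of its 3 occurrences
```

The same pattern extends to trigrams by counting \((w_{n-2}, w_{n-1}, w_n)\) triples and conditioning on the two-word context.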

Evaluating Language Models: Perplexity

Perplexity is a crucial metric for evaluating the performance of language models. It measures how well a language model predicts a sample of text. Perplexity is defined as the inverse probability of the test set, normalized by the number of words. A lower perplexity score indicates a better-performing model. The formula for perplexity is:

\[\text{Perplexity}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}\]

or equivalently,

\[\text{Perplexity}(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{1:i-1})} \right)^{\frac{1}{N}}\]

This metric provides an intuitive measure: when the model is frequently unsure about the next word, it assigns low probabilities to the words that actually occur, and the resulting perplexity is higher, indicating greater uncertainty.
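As an illustration, a perplexity computation might look like the following sketch. The per-word probabilities are made up; in practice they would come from a trained model such as the bigram estimator sketched above. Working in log space avoids numerical underflow on long test sets.

```python
import math

def perplexity(conditional_probs):
    """Perplexity from the conditional probability assigned to each test word.

    Equivalent to the formula above, but computed in log space so that the
    product of many small probabilities does not underflow.
    """
    n = len(conditional_probs)
    log_prob_sum = sum(math.log(p) for p in conditional_probs)
    return math.exp(-log_prob_sum / n)

# Illustrative per-word probabilities P(w_i | w_{1:i-1}) for a 4-word test sequence.
probs = [0.2, 0.1, 0.25, 0.05]
print(perplexity(probs))  # higher values indicate a more "surprised" model
```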

The Challenge of Zero Probabilities

One of the primary challenges with N-gram models is the assignment of zero probability to unseen N-grams. For example, if a bigram model has never encountered the sequence “blue elephant” in the training data, it will assign a probability of zero to this sequence, regardless of its plausibility.
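The problem is easy to reproduce with unsmoothed maximum likelihood estimates; the counts below are invented solely for illustration.

```python
from collections import Counter

# Toy counts in which the bigram ("blue", "elephant") was never observed.
bigram_counts = Counter({("the", "cat"): 2, ("the", "dog"): 1})
unigram_counts = Counter({"the": 3, "cat": 2, "dog": 1, "blue": 1, "elephant": 1})

p = bigram_counts[("blue", "elephant")] / unigram_counts["blue"]
print(p)  # 0.0 -- any sentence containing this bigram gets probability 0,
          # so its perplexity is undefined (effectively infinite)
```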

Overview of Smoothing Techniques in N-Gram Language Models

Smoothing techniques are used to adjust the maximum likelihood estimates of N-gram probabilities to account for unseen N-grams in the training data. This ensures that the model can handle novel sequences of words gracefully by assigning them non-zero probabilities.

Common smoothing techniques include:

1. Laplace (Add-One) Smoothing

Laplace smoothing, also known as add-one smoothing, is the simplest form of smoothing. It adds one to each count to avoid zero probabilities.

Formula:
For a bigram model, the probability \(P_{\text{Laplace}}\) is calculated as:

\[P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + V}\]

where:

  • \(C(w_{n-1}, w_n)\) is the count of the bigram \((w_{n-1}, w_n)\),
  • \(C(w_{n-1})\) is the count of the preceding word,
  • \(V\) is the size of the vocabulary.

Example:
Consider a corpus with the following bigram counts:

  • “I love”: 1
  • “love programming”: 1
  • “programming is”: 0

With a vocabulary size \(V\) of 4 (I, love, programming, is), the Laplace smoothed probability of “programming is” becomes:

\[P_{\text{Laplace}}(\text{is} \mid \text{programming}) = \frac{0 + 1}{1 + 4} = \frac{1}{5} = 0.2\]
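The worked example can be reproduced with a short sketch; the counts, vocabulary, and function name are taken directly from the example above rather than from any standard library.

```python
# Bigram and unigram counts from the example; V is the vocabulary size.
bigram_counts = {("I", "love"): 1, ("love", "programming"): 1}
unigram_counts = {"I": 1, "love": 1, "programming": 1, "is": 1}
V = len(unigram_counts)  # 4

def laplace_prob(prev_word, word):
    """P_Laplace(word | prev_word) with add-one smoothing."""
    numerator = bigram_counts.get((prev_word, word), 0) + 1
    denominator = unigram_counts.get(prev_word, 0) + V
    return numerator / denominator

print(laplace_prob("programming", "is"))  # (0 + 1) / (1 + 4) = 0.2
```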

2. Add-K Smoothing

Add-K smoothing is a generalization of add-one smoothing where a fractional count \(k\) is added instead of one.

Formula:
\[P_{\text{Add-K}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + k}{C(w_{n-1}) + kV}\]

where \(k\) is a small positive constant.

Example:
Using the same corpus and setting \(k = 0.5\):

\[P_{\text{Add-K}}(\text{is} \mid \text{programming}) = \frac{0 + 0.5}{1 + 0.5 \times 4} = \frac{0.5}{3} = 0.167\]
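The Laplace sketch above generalizes directly: replacing the pseudo-count of 1 with a parameter k (reusing the illustrative bigram_counts, unigram_counts, and V defined there) reproduces the 0.167 figure for k = 0.5.

```python
def add_k_prob(prev_word, word, k=0.5):
    """P_Add-K(word | prev_word): add a fractional pseudo-count k to every bigram."""
    numerator = bigram_counts.get((prev_word, word), 0) + k
    denominator = unigram_counts.get(prev_word, 0) + k * V
    return numerator / denominator

print(add_k_prob("programming", "is", k=0.5))  # 0.5 / 3 ≈ 0.167
print(add_k_prob("programming", "is", k=1))    # reduces to the Laplace estimate, 0.2
```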

3. Good-Turing Smoothing

Good-Turing smoothing adjusts N-gram probabilities based on the frequency of frequencies (how many N-grams occur exactly once, exactly twice, and so on), effectively reallocating probability mass from observed N-grams to unseen ones.

Formula:
\[P_{\text{GT}}(w_n \mid w_{n-1}) = \frac{(C(w_{n-1}, w_n) + 1) \cdot N_{C+1}}{N_C \cdot C(w_{n-1})}\]

where:

  • \(C = C(w_{n-1}, w_n)\) is the observed count of the bigram,
  • \(N_C\) is the number of N-grams that occur exactly \(C\) times in the training data.

Example:
Suppose the bigram “I love” has a count of 2 and 10 distinct bigrams occur exactly once in the training corpus (\(N_1 = 10\)). Good-Turing discounts the count of “I love” to the adjusted value \((2+1) \cdot N_3 / N_2\), and the total probability mass reserved for unseen bigrams is \(N_1 / N\), where \(N\) is the total number of observed bigrams. In this way, probability is redistributed from observed N-grams to those never seen in training.
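A minimal sketch of the count adjustment at the heart of Good-Turing is shown below. The bigram counts are invented so that \(N_1 = 10\) and “I love” has a count of 2, matching the example; the fallback for a missing \(N_{C+1}\) stands in for the smoothing of the \(N_C\) values (e.g. Simple Good-Turing) that practical implementations require.

```python
from collections import Counter

# Illustrative bigram counts; real corpora would contain far more.
bigram_counts = Counter({
    ("I", "love"): 2,
    ("love", "programming"): 1, ("programming", "is"): 1, ("is", "fun"): 1,
    ("you", "love"): 1, ("love", "reading"): 1, ("reading", "is"): 1,
    ("it", "is"): 1, ("is", "hard"): 1, ("so", "hard"): 1, ("love", "it"): 1,
})

total = sum(bigram_counts.values())             # N: total observed bigram tokens
freq_of_freq = Counter(bigram_counts.values())  # N_c: how many bigrams occur exactly c times

def adjusted_count(c):
    """Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c."""
    if freq_of_freq[c + 1] == 0:
        return c  # fall back to the raw count when N_{c+1} is unavailable
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

print(freq_of_freq[1])          # N_1 = 10 bigrams seen exactly once
print(adjusted_count(1))        # c* for once-seen bigrams: 2 * N_2 / N_1 = 2 * 1 / 10 = 0.2
print(freq_of_freq[1] / total)  # probability mass reserved for unseen bigrams: N_1 / N
```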

4. Kneser-Ney Smoothing

Kneser-Ney smoothing is an advanced technique that not only discounts the counts but also redistributes the probability mass based on the likelihood of a word being a novel continuation.

Formula:
The bigram probability in interpolated Kneser-Ney smoothing is:

\[P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) P_{\text{CONT}}(w_i)\]

where:

  • \(d\) is a discount constant,
  • \(P_{\text{CONT}}(w_i)\) is the continuation probability of \(w_i\),
  • \(\lambda(w_{i-1})\) is a normalizing constant that distributes the discounted probability mass, ensuring the probabilities sum to 1.

Continuation Probability:
\[P_{\text{CONT}}(w) = \frac{| \{ v : C(vw) > 0 \} |}{\sum_{w'} | \{ v : C(vw') > 0 \} |}\]

Example:
For a bigram “reading glasses”, if “glasses” follows many different words, it will have a higher continuation probability compared to a word like “Kong”, which mainly follows “Hong”.
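The sketch below puts these pieces together for an interpolated Kneser-Ney bigram model with a fixed absolute discount (0.75 is a common default). The toy sentences are chosen only to echo the “glasses” versus “Kong” contrast, and none of the names come from an existing library.

```python
from collections import Counter, defaultdict

# Illustrative tokenized corpus.
sentences = [["hong", "kong", "is", "busy"],
             ["reading", "glasses", "are", "useful"],
             ["my", "glasses", "broke"],
             ["new", "glasses", "are", "expensive"]]

bigram_counts = Counter()
context_counts = Counter()
for tokens in sentences:
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

# For each word, the set of distinct words it has been seen to follow.
preceding_types = defaultdict(set)
for (prev, word) in bigram_counts:
    preceding_types[word].add(prev)
total_bigram_types = len(bigram_counts)

def continuation_prob(word):
    """P_CONT(w): fraction of distinct bigram types that end in w."""
    return len(preceding_types[word]) / total_bigram_types

def kneser_ney_prob(prev, word, d=0.75):
    """Interpolated Kneser-Ney bigram probability with absolute discount d."""
    context_total = context_counts[prev]
    if context_total == 0:
        return continuation_prob(word)  # unseen context: rely on the continuation distribution
    discounted = max(bigram_counts[(prev, word)] - d, 0) / context_total
    # lambda(prev): the discounted mass, spread according to P_CONT
    distinct_continuations = sum(1 for (p, _) in bigram_counts if p == prev)
    lam = (d / context_total) * distinct_continuations
    return discounted + lam * continuation_prob(word)

print(continuation_prob("glasses"))  # follows 3 distinct words -> relatively high (0.3)
print(continuation_prob("kong"))     # only ever follows "hong" -> low (0.1)
print(kneser_ney_prob("reading", "glasses"))
```

Using a single fixed discount follows common practice for a sketch; modified Kneser-Ney, often used in production toolkits, applies separate discounts for counts of 1, 2, and 3 or more.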

Practical Context and Usage

Laplace Smoothing is often used in text classification tasks, for example with Naive Bayes classifiers, where simplicity is preferred over precise probability estimates.

Add-K Smoothing generalizes Laplace smoothing; a fractional \(k\), typically tuned on a held-out set, avoids shifting as much probability mass toward unseen events as add-one smoothing does.

Good-Turing Smoothing is valuable in speech recognition and other applications where accurate estimation of low-frequency events is crucial.

Kneser-Ney Smoothing is particularly effective in machine translation and large-scale language models where capturing the diversity of contexts is important.

Limitations of N-Gram Models

Despite their usefulness, N-gram models have several limitations:

  1. Data Sparsity: As the value of \(N\) increases, the model requires exponentially more data to reliably estimate probabilities. This leads to sparsity, where many possible N-grams are never observed in the training data.
  2. Fixed Context Size: N-gram models consider only a fixed number of previous words (defined by \(N\)). This limitation means they cannot capture long-range dependencies in the text.
  3. Curse of Dimensionality: The number of parameters grows exponentially with \(N\), making it impractical to store and compute probabilities for large N-grams in extensive vocabularies.
  4. Independence Assumption: N-gram models rely on the Markov assumption that the probability of a word depends only on the preceding \(N-1\) words. This assumption often does not hold true, especially in natural language, where context and meaning can span longer phrases and sentences.

While smoothing techniques mitigate some issues by addressing unseen N-grams, these inherent limitations necessitate more advanced models for many NLP tasks. Modern neural network-based models, such as Recurrent Neural Networks (RNNs) and Transformer models, offer more sophisticated approaches to capturing language nuances and dependencies.