N-gram Language Models
In the realm of Natural Language Processing (NLP), language models play a pivotal role in understanding and generating human language. Among the various models, N-gram models stand out for their simplicity and effectiveness.
An N-gram model is a type of probabilistic language model used to predict the next item in a sequence, such as the next word in a sentence. The term “N-gram” refers to a contiguous sequence of N items from a given sample of text or speech. For example:
- Unigram (1-gram): Predicts each word independently. Example: “the”
- Bigram (2-gram): Considers the previous word. Example: “the cat”
- Trigram (3-gram): Considers the two previous words. Example: “the cat sat”
These models rely on the Markov assumption, which simplifies the prediction of the next word by assuming it depends only on the previous (N-1) words. The general formula for an N-gram model is:
\[P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})\]
where \(P(w_n \mid w_{1:n-1})\) is the probability of the word \(w_n\) given the sequence of preceding words \(w_{1:n-1}\).
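To make this concrete, here is a minimal Python sketch (the toy corpus, variable names, and function name are illustrative, not taken from any particular library) that estimates bigram probabilities from raw counts using the maximum likelihood estimate \(P(w_n \mid w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})\):

```python
from collections import Counter

# Toy corpus; in practice this would be a large tokenized text collection.
corpus = "the cat sat on the mat the cat slept".split()

# Count unigrams and bigrams.
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_mle(prev_word, word):
    """Maximum likelihood estimate P(word | prev_word) = C(prev, word) / C(prev)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_mle("the", "cat"))  # 2/3: "the" occurs 3 times, "the cat" occurs twice
print(bigram_mle("the", "dog"))  # 0.0: an unseen bigram receives zero probability
```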
Evaluating Language Models: Perplexity
Perplexity is a crucial metric for evaluating the performance of language models. It measures how well a language model predicts a sample of text. Perplexity is defined as the inverse probability of the test set, normalized by the number of words. A lower perplexity score indicates a better-performing model. The formula for perplexity is:
\[\text{Perplexity}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}\]
or equivalently,
\[\text{Perplexity}(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{1:i-1})} \right)^{\frac{1}{N}}\]
This metric has an intuitive interpretation: the more often the model is unsure about the next word, the lower the probabilities it assigns to the test set and the higher the resulting perplexity, indicating greater uncertainty.
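As a rough sketch of this computation (assuming a bigram approximation of \(P(w_i \mid w_{1:i-1})\) and a conditional-probability function such as `bigram_mle` above; the function name `perplexity` is illustrative), perplexity is usually computed in log space to avoid numerical underflow:

```python
import math

def perplexity(test_tokens, cond_prob):
    """Perplexity of a token sequence under cond_prob(prev, word) -> P(word | prev),
    normalized by the number of predicted tokens."""
    log_prob_sum = 0.0
    n = 0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        p = cond_prob(prev, word)
        if p == 0.0:
            return float("inf")  # one zero-probability step makes perplexity infinite
        log_prob_sum += math.log(p)
        n += 1
    if n == 0:
        raise ValueError("need at least two tokens to score")
    # Perplexity = exp(-(1/N) * sum of log probabilities)
    return math.exp(-log_prob_sum / n)

# e.g. perplexity("the cat sat on the mat".split(), bigram_mle) with the MLE sketch above
```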
The Challenge of Zero Probabilities
One of the primary challenges with N-gram models is the assignment of zero probability to unseen N-grams. For example, if a bigram model has never encountered the sequence “blue elephant” in the training data, it will assign a probability of zero to this sequence, regardless of its plausibility.
Overview of Smoothing Techniques in N-Gram Language Models
Smoothing techniques are used to adjust the maximum likelihood estimates of N-gram probabilities to account for unseen N-grams in the training data. This ensures that the model can handle novel sequences of words gracefully by assigning them non-zero probabilities.
Common smoothing techniques include:
1. Laplace (Add-One) Smoothing
Laplace smoothing, also known as add-one smoothing, is the simplest form of smoothing. It adds one to each count to avoid zero probabilities.
Formula:
For a bigram model, the probability \(P_{\text{Laplace}}\) is calculated as:
\[P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + V}\]
where:
- \(C(w_{n-1}, w_n)\) is the count of the bigram \((w_{n-1}, w_n)\),
- \(C(w_{n-1})\) is the count of the preceding word,
- \(V\) is the size of the vocabulary.
Example:
Consider a corpus with the following bigram counts:
- “I love”: 1
- “love programming”: 1
- “programming is”: 0
With a vocabulary size \(V\) of 4 (I, love, programming, is), the Laplace smoothed probability of “programming is” becomes:
\[P_{\text{Laplace}}(\text{is} \mid \text{programming}) = \frac{0 + 1}{1 + 4} = \frac{1}{5} = 0.2\]
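The same calculation as a small Python sketch, using the toy counts from the example (the dictionary and function names are illustrative):

```python
# Bigram and preceding-word counts from the example above.
bigram_counts = {("I", "love"): 1, ("love", "programming"): 1}
context_counts = {"I": 1, "love": 1, "programming": 1}
V = 4  # vocabulary size: {I, love, programming, is}

def laplace_prob(prev_word, word):
    """Add-one smoothed bigram probability: (C(prev, word) + 1) / (C(prev) + V)."""
    c_bigram = bigram_counts.get((prev_word, word), 0)
    c_prev = context_counts.get(prev_word, 0)
    return (c_bigram + 1) / (c_prev + V)

print(laplace_prob("programming", "is"))  # (0 + 1) / (1 + 4) = 0.2
```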
2. Add-K Smoothing
Add-K smoothing is a generalization of add-one smoothing where a fractional count \(k\) is added instead of one.
Formula:
\(P_{\text{Add-K}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + k}{C(w_{n-1}) + kV}\)
where \(k\) is a small positive constant.
Example:
Using the same corpus and setting \(k = 0.5\), the smoothed probability of “programming is” becomes:
\[P_{\text{Add-K}}(\text{is} \mid \text{programming}) = \frac{0 + 0.5}{1 + 0.5 \times 4} = \frac{0.5}{3} \approx 0.167\]
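A corresponding sketch, again with the toy counts from the Laplace example (names are illustrative); note that \(k = 1\) recovers add-one smoothing:

```python
# Same example counts as in the Laplace section.
bigram_counts = {("I", "love"): 1, ("love", "programming"): 1}
context_counts = {"I": 1, "love": 1, "programming": 1}
V = 4  # vocabulary size

def add_k_prob(prev_word, word, k=0.5):
    """Add-k smoothed bigram probability: (C(prev, word) + k) / (C(prev) + k * V)."""
    c_bigram = bigram_counts.get((prev_word, word), 0)
    c_prev = context_counts.get(prev_word, 0)
    return (c_bigram + k) / (c_prev + k * V)

print(add_k_prob("programming", "is", k=0.5))  # (0 + 0.5) / (1 + 2) ≈ 0.167
print(add_k_prob("programming", "is", k=1.0))  # 0.2, matching Laplace smoothing
```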
3. Good-Turing Smoothing
Good-Turing smoothing adjusts the probability of N-grams based on the frequency of frequencies, effectively redistributing some probability mass from frequent to less frequent and unseen N-grams.
Formula:
\(P_{\text{GT}}(w_n \mid w_{n-1}) = \frac{(C(w_{n-1}, w_n) + 1) \cdot N_{C+1}}{N_C \cdot C(w_{n-1})}\)
where:
- \(C(w_{n-1}, w_n)\) is the observed count of the bigram \((w_{n-1}, w_n)\),
- \(N_C\) is the number of N-grams that occur exactly \(C\) times in the training data.
Example:
If the bigram “I love” has a count of 2, its adjusted count is \(c^* = (2 + 1) \cdot N_3 / N_2\), where \(N_2\) and \(N_3\) are the numbers of bigrams seen exactly two and three times. The 10 bigrams with a count of 1 (\(N_1 = 10\)) determine how much probability mass is set aside for unseen bigrams: in total \(N_1 / N\), where \(N\) is the total number of bigram tokens in the training corpus.
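The following sketch computes Good-Turing adjusted counts \(c^* = (c + 1) N_{c+1} / N_c\) from a table of raw N-gram counts. It is a simplified illustration: practical implementations (e.g. Simple Good-Turing) smooth the \(N_c\) values before applying the formula, which this sketch omits.

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Good-Turing adjusted counts c* = (c + 1) * N_{c+1} / N_c.

    ngram_counts maps each observed n-gram to its raw count.
    """
    # N_c: how many distinct n-grams were seen exactly c times.
    freq_of_freqs = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c = freq_of_freqs[c]
        n_c_plus_1 = freq_of_freqs.get(c + 1, 0)
        # If no n-gram has count c + 1, fall back to the raw count.
        adjusted[ngram] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 else c
    return adjusted

# The total probability mass reserved for unseen n-grams is N_1 / N,
# where N is the total number of n-gram tokens observed in training.
```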
4. Kneser-Ney Smoothing
Kneser-Ney smoothing is an advanced technique that not only discounts the counts but also redistributes the probability mass based on the likelihood of a word being a novel continuation.
Formula:
The bigram probability in interpolated Kneser-Ney smoothing is:
\[P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) \, P_{\text{CONT}}(w_i)\]
where:
- \(d\) is a discount constant,
- \(P_{\text{CONT}}(w_i)\) is the continuation probability of \(w_i\),
- \(\lambda(w_{i-1})\) is a normalizing constant to ensure the probabilities sum to 1.
Continuation Probability:
\(P_{\text{CONT}}(w) = \frac{| \{ v : C(vw) > 0 \} |}{\sum_{w'} | \{ v : C(vw') > 0 \} |}\)
Example:
For a bigram “reading glasses”, if “glasses” follows many different words, it will have a higher continuation probability compared to a word like “Kong”, which mainly follows “Hong”.
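Below is a rough sketch of interpolated Kneser-Ney for bigrams, following the formula above (the discount \(d = 0.75\) is a common default; the function and variable names are illustrative, not from a specific library):

```python
from collections import Counter, defaultdict

def interpolated_kneser_ney(bigram_counts, d=0.75):
    """Return a function p_kn(prev, word) implementing the interpolated
    Kneser-Ney bigram estimate with absolute discount d."""
    context_totals = Counter()        # C(prev): total count of bigrams starting with prev
    continuations = defaultdict(set)  # prev -> {words that follow prev}
    histories = defaultdict(set)      # word -> {words that precede word}
    for (prev, word), c in bigram_counts.items():
        context_totals[prev] += c
        continuations[prev].add(word)
        histories[word].add(prev)
    # Denominator of P_CONT: the number of distinct bigram types.
    total_bigram_types = sum(len(h) for h in histories.values())

    def p_kn(prev, word):
        p_cont = len(histories[word]) / total_bigram_types
        c_prev = context_totals[prev]
        if c_prev == 0:
            # Unseen context: fall back to the continuation distribution alone.
            return p_cont
        c_bigram = bigram_counts.get((prev, word), 0)
        discounted = max(c_bigram - d, 0) / c_prev
        # lambda(prev): the discounted mass, redistributed via P_CONT.
        lam = d * len(continuations[prev]) / c_prev
        return discounted + lam * p_cont

    return p_kn

# Toy usage: "glasses" follows two different words, "Kong" only one.
counts = {("reading", "glasses"): 2, ("sun", "glasses"): 1, ("Hong", "Kong"): 3}
p_kn = interpolated_kneser_ney(counts)
print(p_kn("reading", "glasses"))  # higher, thanks to the larger continuation probability
print(p_kn("reading", "Kong"))     # lower: "Kong" has only one observed history
```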
Practical Context and Usage
- Laplace Smoothing: often used in text classification tasks where simplicity is preferred over precision.
- Add-K Smoothing: a generalization of Laplace smoothing with a tunable constant \(k\), used when adding a full count of one smooths too aggressively.
- Good-Turing Smoothing: valuable in speech recognition and other applications where accurate estimation of low-frequency events is crucial.
- Kneser-Ney Smoothing: particularly effective in machine translation and large-scale language models, where capturing the diversity of contexts is important.
Limitations of N-Gram Models
Despite their usefulness, N-gram models have several limitations:
- Data Sparsity: As the value of \(N\) increases, the model requires exponentially more data to reliably estimate probabilities. This leads to sparsity, where many possible N-grams are never observed in the training data.
- Fixed Context Size: N-gram models consider only a fixed number of previous words (defined by \(N\)). This limitation means they cannot capture long-range dependencies in the text.
- Curse of Dimensionality: The number of parameters grows exponentially with \(N\), making it impractical to store and compute probabilities for large N-grams in extensive vocabularies.
- Independence Assumption: N-gram models rely on the Markov assumption that the probability of a word depends only on the preceding \(N-1\) words. This assumption often does not hold true, especially in natural language, where context and meaning can span longer phrases and sentences.
While smoothing techniques mitigate some issues by addressing unseen N-grams, these inherent limitations necessitate more advanced models for many NLP tasks. Modern neural network-based models, such as Recurrent Neural Networks (RNNs) and Transformer models, offer more sophisticated approaches to capturing language nuances and dependencies.
