What Are Hallucinations?
Large language models can generate responses that seem logical or coherent but contain incorrect or inconsistent information. We refer to this phenomenon as a hallucination.
For example, a model might say something like, ‘Marseille is the capital of France.’ While this statement is false, it can sound perfectly plausible to a reader who doesn’t verify it against an external source of truth.
For instance, when asked about the health benefits of particular foods, a model may draw on internet sources and repeat what it has learned from them. However, not every piece of online information is true or relevant, so the model can easily rely on the wrong sources and give bad advice.
Another cause of such errors is that LLMs can misinterpret the context in which a prompt is given. This can lead to a response that is contextually inappropriate or inaccurate.
Causes of Hallucinations in Large Language Models
We’ll review the main factors contributing to this issue. Two of the most important are overfitting and underfitting of the underlying model.
Overfitting happens when we tune a machine learning model too closely to the training set. The model learns the training data too well but can’t generate good predictions for unseen data. As a result, an overfitted model produces low-accuracy results for data points it hasn’t seen during training, which in turn leads to suboptimal decisions.
A model that can’t produce sensible results on new data is also said to be unable to generalize. In this case, the model is too complex: it captures noise in the dataset rather than the underlying patterns. Such a high-variance model overfits.
Overfitted models produce good predictions for data points in the training set but perform poorly on new samples.
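As a quick illustration (my own sketch, not from the original text), the following snippet uses scikit-learn with an arbitrary noisy sine dataset and an unconstrained decision tree; the near-perfect training score next to a noticeably worse held-out score is the typical overfitting signature.

```python
# Minimal sketch of overfitting: an unconstrained decision tree memorizes
# noisy training data and scores much worse on unseen samples.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor()          # no depth limit: very high variance
tree.fit(X_train, y_train)

print("train R^2:", tree.score(X_train, y_train))   # ~1.0: fits the noise exactly
print("test  R^2:", tree.score(X_test, y_test))     # noticeably lower
```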
Underfitting occurs when a machine learning model is not tuned well enough to the training set. The resulting model fails to capture the relationship between input and output, so it doesn’t produce accurate predictions even for the training data. As a result, an underfitted model, just like an overfitted one, generates poor results that lead to high-error decisions.
An underfitted model is not complex enough to recognize the patterns in the dataset. It usually has a strong bias towards one output value: it treats the variation in the input data as noise and generates similar outputs regardless of the given input.
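Conversely, here is a minimal sketch of underfitting, again assuming scikit-learn and a made-up quadratic dataset: a straight line can’t capture the relationship, so even its training score stays low.

```python
# Minimal sketch of underfitting: a straight line cannot capture a quadratic
# relationship, so even the training score stays poor.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.2, size=200)   # quadratic target

line = LinearRegression().fit(X, y)
print("train R^2:", line.score(X, y))   # low even on the data it was fit on
```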
When training a model, we want it to fit the training data well, but we also want it to generalize and produce accurate predictions for unseen data. Consequently, we don’t want the resulting model to sit at either extreme.
Let’s say we have a dataset whose points lie on an S-shaped curve, such as a logistic (sigmoid) curve. It’s always possible to fit a high-degree polynomial that passes through the known points with zero error. Alternatively, we can fit a straight line, accepting a high error rate.
The first solution produces an overly complex model that fits the implicit noise as well as the underlying pattern. As a result, we can expect a high error for a new data point taken from the original S-shaped curve.
Conversely, the second model is far too simple to capture the relationship between input and output, so it will also perform poorly on new data.
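To make this concrete, here is a small sketch of the S-curve example using NumPy; the logistic curve, the ten sample points, the noise level, and the polynomial degree are illustrative choices of mine rather than values from the article.

```python
# Sketch of the S-curve example: a degree-9 polynomial passes (almost) exactly
# through the ten noisy observations, while a straight line ignores the S shape.
# We then compare each model's error on the points it was fit to and on fresh
# points taken from the true curve.
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x_known = np.linspace(-6, 6, 10)                              # the observed inputs
y_known = sigmoid(x_known) + rng.normal(scale=0.1, size=10)   # noisy observations

poly = np.polyfit(x_known, y_known, deg=9)   # overly complex: chases the noise
line = np.polyfit(x_known, y_known, deg=1)   # overly simple: misses the curvature

x_new = np.linspace(-6, 6, 500)              # fresh points on the true curve
y_new = sigmoid(x_new)

for name, coeffs in [("degree-9 polynomial", poly), ("straight line", line)]:
    known_mse = np.mean((np.polyval(coeffs, x_known) - y_known) ** 2)
    new_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"{name}: MSE on known points = {known_mse:.4f}, on new points = {new_mse:.4f}")
```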
Cures for Underfitting
To prevent underfitting, we need to ensure that the model is sufficiently complex.
The first method that comes to mind is to obtain more training data. However, this is not easy for most problems. In such cases, we can turn to data augmentation: we increase the amount of available data by creating slightly modified synthetic copies of the data points we already have.
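As a rough sketch of the augmentation idea for numerical data (assuming NumPy; the augment helper, the jitter scale, and the number of copies are hypothetical choices), we can generate perturbed duplicates of the existing samples and reuse their labels:

```python
# Data-augmentation sketch for numerical features: create slightly perturbed
# synthetic copies of the existing samples and reuse their labels.
import numpy as np

def augment(X, y, copies=3, jitter=0.01, seed=0):
    """Return the original samples plus `copies` jittered duplicates of each."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X], [y]
    for _ in range(copies):
        noise = rng.normal(scale=jitter * X.std(axis=0), size=X.shape)
        X_parts.append(X + noise)   # slightly modified copy of every sample
        y_parts.append(y)           # the labels stay the same
    return np.concatenate(X_parts), np.concatenate(y_parts)

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y = np.array([0, 1, 1])
X_big, y_big = augment(X, y)
print(X_big.shape, y_big.shape)     # (12, 2) (12,) -- four times the original data
```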
Similarly, increasing the number of passes over the training data is a viable approach for iterative algorithms. For example, increasing the number of epochs when training a neural network is a well-known way to help the model fit.
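For instance, with scikit-learn’s MLPClassifier the max_iter parameter caps the number of training epochs; the sketch below (dataset and values chosen arbitrarily) shows how the training accuracy improves when the network gets more passes over the data.

```python
# Sketch: more passes over the training data for an iterative learner.
# max_iter caps the number of epochs for scikit-learn's MLPClassifier
# (it may warn about non-convergence when the cap is very low).
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

for epochs in (5, 200):
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=epochs, random_state=0)
    net.fit(X, y)
    print(epochs, "epochs -> training accuracy:", round(net.score(X, y), 3))
```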
Another way to increase model complexity is to increase the number and size of model parameters. We can also introduce engineered features derived from the dataset. For example, taking the product of numerical features or increasing the n of n-grams generates new features.
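Two quick illustrations of this, assuming scikit-learn: PolynomialFeatures can add products of numerical features, and CountVectorizer’s ngram_range controls the n of the n-grams it extracts.

```python
# Sketch of two ways to grow the feature set:
#   - products of numerical features via PolynomialFeatures
#   - a larger n for n-grams via CountVectorizer's ngram_range
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_extraction.text import CountVectorizer

X = np.array([[2.0, 3.0], [4.0, 5.0]])
products = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(products.fit_transform(X))        # original columns plus their pairwise product

docs = ["the cat sat on the mat"]
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(len(unigrams.get_feature_names_out()), "unigram features")
print(len(bigrams.get_feature_names_out()), "unigram + bigram features")
```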
Alternatively, we can reduce regularization. Some implementations implicitly apply default regularization parameters to prevent overfitting, so checking the defaults is a good starting point. Since we’re trying to escape an overly limited model, there’s no need to introduce terms that restrict it further.
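As one concrete case (my example, not the article’s), scikit-learn’s Ridge regression applies a penalty with strength alpha=1.0 by default; lowering it relaxes the constraint:

```python
# Sketch: regularization is often on by default. Ridge's penalty strength
# alpha defaults to 1.0; shrinking it moves the model toward plain least
# squares, and the training fit can only improve as the constraint relaxes.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=30, n_features=20, noise=10.0, random_state=0)

for alpha in (1.0, 0.1, 0.001):     # 1.0 is the default; 0.001 is nearly unregularized
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: training R^2 = {model.score(X, y):.4f}")
```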
Changing the modeling approach altogether is another solution. For example, the kernel function chosen for an SVM determines the model’s complexity, so the choice of kernel can lead to either overfitting or underfitting.
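A short sketch of this, assuming scikit-learn’s SVC and a toy concentric-circles dataset: a linear kernel underfits the nonlinear boundary, while an RBF kernel has enough capacity to capture it.

```python
# Sketch: kernel choice sets the SVM's capacity. On concentric circles a
# linear kernel underfits badly, while an RBF kernel captures the boundary.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(f"{kernel} kernel: training accuracy = {clf.score(X, y):.2f}")
```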
Let’s summarize what we’ve discussed so far in a comparison table: