Uncovering the Math Behind Bias and Variance in Machine Learning

Abhay Mane
8 min read · Mar 3, 2024


Understanding bias and variance is essential in machine learning, yet the topic is often introduced in overly simplistic ways that make it seem straightforward. The reality is more nuanced than those treatments suggest. For a machine learning enthusiast, mastering this concept is key to grasping the challenges faced in practical applications. My inspiration comes from Caltech Prof. Yaser Abu-Mostafa’s “Learning from Data” lecture series, available on YouTube, which demystified bias and variance for me. This article, motivated by his clear explanations and the mathematical insights from his course, aims to make these concepts accessible to everyone. Consider the content my notes on making sense of these critical aspects of machine learning.

What is “Learning” in Machine Learning?

In machine learning, “learning” is the process of refining models to capture the underlying relationships within data. We begin with a hypothesis set and refine our models as we collect more data, aiming to closely approximate the true function these data points reflect.

Key to refining models are two concepts: approximation and generalization. Approximation assesses how well the model fits the training data — similar to how well an algorithm identifies known patterns. Generalization, on the other hand, gauges the model’s ability to apply what it has learned to new, unseen data.

The in-sample error measures how accurately the model predicts the data it was trained on, like an algorithm recognizing familiar patterns. The out-of-sample error assesses performance on new data, testing the model’s predictive power beyond what it has already learned.

Both types of error are essential indicators of a model’s performance. A high in-sample error suggests a model’s learning is inadequate, while a low in-sample error without good generalization can indicate overfitting — where a model learns the training data too well but does not predict anything beyond it.
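The distinction can be sketched numerically. In this minimal example (the sine target and the straight-line model are my own assumptions, chosen only so both errors can be measured exactly), we compare a model’s error on its training points against its error on fresh data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed target function, known here only so we can measure errors exactly.
def f(x):
    return np.sin(np.pi * x)

# Train a straight-line model on a small sample.
x_train = rng.uniform(-1, 1, 10)
y_train = f(x_train)
g = np.poly1d(np.polyfit(x_train, y_train, deg=1))

# In-sample error: mean squared error on the data the model was fit to.
e_in = np.mean((g(x_train) - y_train) ** 2)

# Out-of-sample error: mean squared error on new, unseen points.
x_new = rng.uniform(-1, 1, 100_000)
e_out = np.mean((g(x_new) - f(x_new)) ** 2)

print(f"E_in  = {e_in:.4f}")
print(f"E_out = {e_out:.4f}")
```

Typically the out-of-sample error is at least as large as the in-sample error, since the line was chosen specifically to fit the training points.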

The ultimate aim is a model that not only performs well on training data (low in-sample error) but also generalizes effectively to new data (low out-of-sample error). This balance reflects a model’s true learning capability. A deeper look into the out-of-sample error can reveal insights into the model’s bias and variance, elements crucial to understanding and improving model performance. So, let’s dive deeper into the concept.

The Foundation: Hypothesis Sets and the Search for the Perfect Model

The World of Models: The Hypothesis Set

When we delve into the field of machine learning, we are essentially trying to uncover patterns and make predictions. At the heart of this are our hypotheses, or educated guesses, about how things work. Each of these guesses is represented by a mathematical model, and collectively, they form what is known as a hypothesis set. For example, when predicting house prices based on size, our hypothesis set might include a range of models:

• Straight lines: Suggesting a direct, proportional relationship between size and price.

• Curvy lines: Suggesting that price changes at a different rate than size, perhaps rising more sharply for larger homes.

Each stands for a different theory on how the size of a house might influence its price.
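As a sketch, each theory can be represented as a polynomial of a different degree fit to the same observations (the house data below is made up for illustration):

```python
import numpy as np

# Hypothetical data: house size (in 100 sq ft) and price (in $1000s).
size = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
price = np.array([150.0, 190.0, 245.0, 290.0, 370.0, 460.0, 580.0])

# Two members of the hypothesis set:
line = np.poly1d(np.polyfit(size, price, deg=1))   # straight line
curve = np.poly1d(np.polyfit(size, price, deg=2))  # curvy (quadratic) line

# Sum of squared residuals measures how well each theory fits the data.
sse_line = float(np.sum((line(size) - price) ** 2))
sse_curve = float(np.sum((curve(size) - price) ** 2))
```

The quadratic can never fit worse than the line on the same data, because every straight line is itself a member of the quadratic family; whether the extra flexibility generalizes is exactly the question this article builds toward.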

The Model: Our Best Educated Guess

From our assortment of models, we choose the one that aligns best with the data we’ve observed. This selected model, our educated guess, is denoted as g(X). It’s a representation, to the best of our current knowledge and data, of the underlying pattern we’re investigating.

So, if our data on house prices shows that as houses increase in size, the prices go up consistently, we might choose a straight-line model as our best guess, g(X).

The Target Function: The Ultimate Solution

While we have our model g(X), there is an ultimate solution to our puzzle, which we call the target function, denoted as f(X). It’s the true underlying pattern that perfectly describes how our inputs (like house size) relate to our outputs (like price). The complication is that f(X) is often unknown in its entirety; it’s the ideal solution we strive to approximate.

In the context of our house pricing example, f(X) would perfectly account for all the factors influencing price: not just size, but also location, age, design, and more. However, we often work with limited data (in this case, just house size), so we use our model g(X) to approximate f(X) as closely as possible.

In summary, machine learning is our endeavour to navigate through our hypothesis set, using the data at hand, to find a model that closely mirrors the true pattern of the world. It’s a journey of approximation, where our chosen model g(X) is continually refined in the hopes of closely matching the elusive target function f(X).

Delving into Out-of-Sample Error

The out-of-sample error is our metric for gauging how well our model, which we’ve trained with certain data, will perform with new data it hasn’t seen before. It’s the expected squared difference between the model’s predictions and the actual values. The mathematical expression for this error is:

E_out(g^(D)) = E_x[ (g^(D)(x) - f(x))^2 ]

where:

• E_x denotes the expected value over the input space x.

• g^(D) represents the model learned from a particular dataset D. It’s a way of saying “given this specific set of data we’ve trained on, here’s the prediction our model makes.” This function g changes depending on the dataset D.

• f(x) is the target function our model is trying to approximate.

Expected Value:

You can consider the expected value as an average of sorts. In the context of Eout (out-of-sample error)​, the expected value gives us an average error over all possible new data points. It’s a way to anticipate how our model will fare, on average, in the vast sea of potential real-world scenarios.

Why squared error?

The squared error has a few advantages:

Positivity: By squaring, we ensure all errors are positive. This means errors in opposite directions won’t cancel each other out.

Penalty for Larger Errors: Squaring magnifies bigger errors more than smaller ones. This means our model is penalized more for making grossly inaccurate predictions.

Differentiability: In mathematical optimization, having a smooth, differentiable function (like a squared function) is beneficial for algorithms.
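A tiny numeric example of the first two points: raw errors of opposite sign average out to zero and hide the problem, while squared errors stay positive and weight the large miss most heavily:

```python
errors = [-2.0, 1.0, 1.0]  # prediction minus actual, for three points

# Raw average: the -2 and the two +1s cancel, misleadingly suggesting no error.
mean_raw = sum(errors) / len(errors)

# Squared average: (4 + 1 + 1) / 3; the single large miss dominates.
mean_squared = sum(e ** 2 for e in errors) / len(errors)

print(mean_raw, mean_squared)
```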

Now, let’s take the average over all the different possible datasets D that could be used to train the model. It’s a way of accounting for variability in the model’s performance due to the different data it could be trained on, not just one specific dataset:

E_D[ E_out(g^(D)) ] = E_D[ E_x[ (g^(D)(x) - f(x))^2 ] ] = E_x[ E_D[ (g^(D)(x) - f(x))^2 ] ]

Here, averaging the model itself over datasets gives an ‘average function’, which can be written as:

g_bar(x) = E_D[ g^(D)(x) ]

This average function g_bar(x) can be understood in the following way:

1. Creating Multiple Models: Imagine a machine learning scenario where we’re trying to predict house prices. We don’t just build one model; we experiment with many models, each trained on a slightly different set of historical data about house sales.

2. The Average Model: After training all these models, we calculate an average model, g_bar(x), which isn’t just any model but the central one that captures the core trend across all the models we’ve trained. It’s the essence of our predictions, distilled into a single, representative model.

3. Understanding g_bar(x): This considers that each set of data includes its quirks or randomness. It’s our way of saying, “If we look past the randomness, what is the stable prediction all these models are trying to make?”

4. The Unique Nature of g_bar(x): Interestingly, g_bar(x) might not match any individual model’s prediction exactly. It’s a theoretical model that represents the average outcome of all our actual models, smoothing out the peculiarities and noise found in individual datasets.
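Steps 1 and 2 can be simulated directly. Assuming a sine target and straight-line models (the classic example used in Abu-Mostafa’s course; the dataset size of 5 points is my own choice), we can train many models on independently drawn datasets and average their predictions:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    return np.sin(np.pi * x)  # assumed target function

x_grid = np.linspace(-1, 1, 201)
n_datasets = 2_000

# 1. Create many models, each fit to its own tiny dataset D.
preds = np.empty((n_datasets, x_grid.size))
for k in range(n_datasets):
    x_d = rng.uniform(-1, 1, 5)                      # a fresh dataset D
    g_d = np.poly1d(np.polyfit(x_d, f(x_d), deg=1))  # model learned from D
    preds[k] = g_d(x_grid)

# 2. The average model: g_bar(x) approximated by the mean of g^(D)(x) over D.
g_bar = preds.mean(axis=0)
```

Note that g_bar is a smoothed, central prediction; no individual fitted line need coincide with it, which is exactly the point made in step 4 above.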

Let’s continue with the expected out-of-sample error in terms of g_bar(x). Adding and subtracting g_bar(x) inside the square gives:

E_D[ (g^(D)(x) - f(x))^2 ] = E_D[ (g^(D)(x) - g_bar(x) + g_bar(x) - f(x))^2 ]

Expanding the square, this becomes:

E_D[ (g^(D)(x) - g_bar(x))^2 ] + (g_bar(x) - f(x))^2 + 2 (g_bar(x) - f(x)) E_D[ g^(D)(x) - g_bar(x) ]

The first term is the definition of variance: the expected value of the squared difference between the actual predictions and the average prediction. Let’s see why the cross term disappears. Since g_bar(x) is the average of g^(D)(x) over all data sets D, it is constant with respect to D, so:

E_D[ g^(D)(x) - g_bar(x) ] = g_bar(x) - g_bar(x) = 0

Therefore,

E_D[ (g^(D)(x) - f(x))^2 ] = E_D[ (g^(D)(x) - g_bar(x))^2 ] + (g_bar(x) - f(x))^2
Now that we have fully decomposed the out-of-sample error, let's define the terms in the above equation.

1. Bias

bias(x) = (g_bar(x) - f(x))^2

Bias measures the discrepancy between the predictions made by our machine learning model and the true outcomes it aims to predict. This discrepancy arises because the model, however much data it learns from, is inherently restricted by its design and cannot perfectly mimic the true underlying pattern. Essentially, bias is an indicator of a model’s limitations in replicating the complexity of real-world data, despite having access to extensive training data.

2. Variance

var(x) = E_D[ (g^(D)(x) - g_bar(x))^2 ]

This measures the variation in the final hypothesis, depending on the data set. In statistics, variance measures how widely individual values differ from the mean. In machine learning, the equation applies this concept to the predictions made by the model: it is the average of the squared differences between the model’s predictions on different training datasets and the average prediction across all datasets. This shows how sensitive the model is to the data it’s trained on: a high variance indicates that changing the training data can lead to significant changes in predictions, while a low variance suggests the model is more stable and less sensitive to fluctuations in the training data.

Finally, taking the expectation over the input space x, the expected out-of-sample error decomposes as:

E_D[ E_out(g^(D)) ] = E_x[ bias(x) + var(x) ] = bias + variance

We ignored the noise term in the above equation by assuming that the data is noiseless. In reality, the noise term is unavoidable no matter what we do.
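The decomposition can be checked numerically. Assuming again a sine target and straight-line models fit to many independent 5-point datasets (my own illustrative setup), the identity E_D[(g^(D)(x) - f(x))^2] = bias(x) + var(x) holds at every point of the grid, up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(7)

def f(x):
    return np.sin(np.pi * x)  # assumed, noiseless target

x_grid = np.linspace(-1, 1, 201)

# Fit a straight line to each of many independently drawn datasets.
preds = np.empty((5_000, x_grid.size))
for k in range(preds.shape[0]):
    x_d = rng.uniform(-1, 1, 5)
    preds[k] = np.poly1d(np.polyfit(x_d, f(x_d), deg=1))(x_grid)

g_bar = preds.mean(axis=0)                      # average model
mse = ((preds - f(x_grid)) ** 2).mean(axis=0)   # E_D[(g^(D)(x) - f(x))^2]
bias = (g_bar - f(x_grid)) ** 2                 # (g_bar(x) - f(x))^2
var = ((preds - g_bar) ** 2).mean(axis=0)       # E_D[(g^(D)(x) - g_bar(x))^2]

# The cross term vanished, so the two sides agree pointwise.
print(np.max(np.abs(mse - (bias + var))))
```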

While we can’t directly calculate bias and variance since they depend on the unknown target function and input distribution, understanding them is still crucial for model development. The main objectives are to reduce variance without notably increasing bias, and vice versa. This is often achieved through methods like regularization. Lowering bias typically needs prior knowledge about the target function, which varies by application. Conversely, reducing variance can be approached with more general techniques that don’t affect the bias.
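As a sketch of that last point, here is a minimal simulation assuming a noisy sine target and polynomial ridge regression (one common form of regularization; the penalty strength lam, the degree, and the noise level are all made-up choices for illustration). Shrinking the fitted weights makes the model less sensitive to which particular dataset it was trained on:

```python
import numpy as np

rng = np.random.default_rng(3)
x_grid = np.linspace(-1, 1, 101)

def f(x):
    return np.sin(np.pi * x)  # assumed target

def fit_ridge(x, y, lam, degree=3):
    # Ridge regression on polynomial features:
    # minimize ||Xw - y||^2 + lam * ||w||^2.
    X = np.vander(x, degree + 1)
    w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
    return np.vander(x_grid, degree + 1) @ w

def model_variance(lam, n_datasets=500, n_points=8, noise=0.3):
    # Train one model per dataset, then measure spread around the average model.
    preds = np.empty((n_datasets, x_grid.size))
    for k in range(n_datasets):
        x_d = rng.uniform(-1, 1, n_points)
        y_d = f(x_d) + rng.normal(0.0, noise, n_points)  # noisy observations
        preds[k] = fit_ridge(x_d, y_d, lam)
    g_bar = preds.mean(axis=0)
    return float(((preds - g_bar) ** 2).mean())

var_plain = model_variance(lam=0.0)  # no regularization
var_ridge = model_variance(lam=1.0)  # regularized

print(var_ridge, var_plain)
```

The regularized fits cluster much more tightly around their average model: the price is a possible increase in bias, which is exactly the trade-off described above.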
