Model Regularization

August 2019

In this post, we will continue with our odyssey through practical machine learning methodology by discussing model regularization. Regularization is an immensely important principle in machine learning and one of the most powerful ones in the practitioner’s toolkit. Excited? Let’s get started!

Regularization is another way to address the ever-present problem of model generalization. It is a technique we apply to deal with model overfitting, particularly when a model is overspecified for the problem we are tackling.

We have actually seen regularization before. Recall that when we were studying support vector machines, we introduced slack variables to make the model less susceptible to outliers and to let it handle data that is not linearly separable.

These slack variables represented a means of regularizing the model, through what we will show later is called $L_1$ regularization.

There are many types of regularization, some that can be applied to a broad range of models, and others that we will see are a bit more model-class specific (such as dropout for neural networks).

$L_2$ Regularization

Traditionally when we are building a supervised model, we have some number of features we extract from our data. During training, we learn weights that dictate how important each feature is for our model.

These weights tune the strength of the features through interactions ranging from the simple linear ones (as in the case of linear regression) to more complex interactions like those we saw with neural networks.

For the time being, let’s assume that we are dealing with linear regression. Therefore, we have a weight vector $A = (A_1, A_2, \ldots, A_k)$ for our $k$ features. $L_2$ regularization is the first type of regularization we will formally investigate, and it involves adding the sum of the squared weights to our cost function.

Mathematically, this looks as follows. Recall that for linear regression we were trying to minimize the least-squares cost:

$$C(X) = \frac{1}{2} \cdot \sum_{i=1}^n (F(X_i) - Y_i)^2$$

Adding an $L_2$ penalty modifies this cost function to the following:

$$C(X) = \frac{1}{2} \cdot \sum_{i=1}^n (F(X_i) - Y_i)^2 + L \cdot \sum_{i=1}^k A_i^2$$

So, minimizing this cost function now means balancing two terms: the original least-squares error and the penalty. Notice that our model must ensure that the squared magnitude of its weights doesn’t get too big, as that would lead to a larger overall value of our cost.

In practice, having smaller weight magnitudes serves the purpose of ensuring that any single feature is not weighted too heavily, effectively smoothing out our model fit. This is exactly what we want to do to prevent overfitting.
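To make the penalized cost concrete, here is a minimal sketch in plain Python. The function names (`predict`, `l2_regularized_cost`), the toy data, and the choice to omit an intercept are assumptions made for illustration; they are not taken from an earlier post.

```python
def predict(weights, x):
    """Linear model F(x) = sum_j A_j * x_j (no intercept, for brevity)."""
    return sum(a * xj for a, xj in zip(weights, x))


def l2_regularized_cost(weights, X, Y, L):
    """Least-squares cost plus the L2 penalty L * sum(A_i^2)."""
    data_term = 0.5 * sum((predict(weights, x) - y) ** 2 for x, y in zip(X, Y))
    penalty = L * sum(a ** 2 for a in weights)
    return data_term + penalty


# Toy data (made up for illustration): three examples, two features.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
Y = [5.0, 2.0, 5.0]

small_weights = [1.0, 2.0]   # fits the toy data exactly
large_weights = [3.0, -4.0]  # larger magnitudes and a worse fit

for L in (0.0, 1.0):
    print(L,
          l2_regularized_cost(small_weights, X, Y, L),
          l2_regularized_cost(large_weights, X, Y, L))
```

With $L = 0$ only the fit to the data matters. With $L = 1$ every weight vector pays an extra cost equal to its squared magnitude, so the larger weights are penalized more heavily.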

You may have noticed that we also have this extra term $L$ that multiplies our $L_2$ penalty. $L$, as you may remember, is called a hyperparameter and is typically tuned (i.e. a good value is chosen) through cross-validation or on a held-out validation set.

We can do some simple analysis to understand how $L$ affects our cost. If $L$ is really, really small (as in close to 0), it’s as if we are not at all applying an $L_2$ penalty and our cost function reduces to the original cost function we were optimizing before.

However, if $L$ is really, really big, then our cost will focus solely on minimizing the value of our $L_2$ penalty. In practice, this amounts to sending all of our weights toward 0. Our model basically ends up learning nothing!

This makes sense because if we focus very hard on counteracting the effects of overfitting, we may effectively end up underfitting. In practice, there is a sweet spot for the $L$ parameter which depends on our data and problem.
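The two extremes can be checked numerically in the simplest possible setting: a single feature with a single weight $A$. In that case, setting the derivative of the penalized cost to zero gives $A = \sum_i X_i Y_i / (\sum_i X_i^2 + 2L)$. The sketch below assumes this one-feature, no-intercept setup; the function name `l2_optimal_weight` and the toy data are made up for illustration.

```python
def l2_optimal_weight(xs, ys, L):
    """Minimizer of 0.5 * sum((A*x - y)^2) + L * A^2 for a single weight A.

    Setting the derivative with respect to A to zero gives
    A = sum(x*y) / (sum(x^2) + 2*L).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + 2.0 * L)


# Toy data (made up for illustration) that follows y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for L in (0.0, 1.0, 100.0, 1e6):
    print(L, l2_optimal_weight(xs, ys, L))
```

At $L = 0$ the weight is the unregularized least-squares solution, and as $L$ grows the weight shrinks steadily toward 0, which is the underfitting regime described above.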

A quick note on terminology: linear regression with an $L_2$ penalty is also known as ridge regression, though for our purposes we will continue to call the technique $L_2$ regularization. While we focused on linear regression to introduce $L_2$ regularization, practically speaking this technique can be applied to many other model classes.

$L_1$ Regularization

We can now move on to discussing $L_1$ regularization. This technique is conceptually similar to $L_2$ regularization, except instead of adding the term

$$L \cdot \sum_{i=1}^k A_i^2$$

to our cost, we add the term

$$L \cdot \sum_{i=1}^k |A_i|$$

That’s it! As mentioned previously, we’ve already seen $L_1$ regularization in our slack variables in the support vector machine cost. Notice how with our $L_1$ regularization term, we can use the same logic for tuning the $L$ parameter as with $L_2$ regularization.

While $L_1$ regularization seems pretty similar mathematically, it has quite different implications for feature selection. It turns out that one of the consequences of using $L_1$ regularization is that many weights are driven to exactly 0, rather than merely getting close to 0.

In that sense, $L_1$ regularization induces stricter sparsity in our feature set. This effectively means that many of the features aren’t counted at all in our model. This makes it more like a traditional feature selection algorithm, as compared to $L_2$ regularization, which achieves a smoother, continuous set of weights for our feature set.
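The difference in sparsity can be seen in a one-weight sketch. For a single weight $A$, write $\rho = \sum_i X_i Y_i$ and $s = \sum_i X_i^2$. The $L_2$-penalized cost is then minimized at $A = \rho / (s + 2L)$, while the $L_1$-penalized cost is minimized by soft-thresholding: $A = \operatorname{sign}(\rho) \cdot \max(|\rho| - L, 0) / s$. Treating each weight on its own like this is exact only for a single feature (or for features that are orthogonal to each other), so take it as an illustration under that assumption. The function names and the values of $\rho$ are made up.

```python
def l2_weight(rho, sxx, L):
    """Minimizer of 0.5*sum((A*x - y)^2) + L*A^2 for one weight A,
    where rho = sum(x*y) and sxx = sum(x^2)."""
    return rho / (sxx + 2.0 * L)


def l1_weight(rho, sxx, L):
    """Minimizer of 0.5*sum((A*x - y)^2) + L*|A| for one weight A
    (soft-thresholding): the weight is exactly 0 whenever |rho| <= L."""
    if rho > L:
        return (rho - L) / sxx
    if rho < -L:
        return (rho + L) / sxx
    return 0.0


L = 1.0
sxx = 1.0  # assume the feature is scaled so that sum(x^2) = 1

# One strongly informative feature and two weak ones (illustrative values).
for rho in (3.0, 0.5, -0.2):
    print(rho, l2_weight(rho, sxx, L), l1_weight(rho, sxx, L))
```

The $L_2$ weights for the two weak features are shrunk but remain nonzero, while the $L_1$ weights for those features are exactly 0, so the features are dropped from the model.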

Final Thoughts

In addition to these regularization techniques, there are many more ways to regularize a model out there, which we won’t cover. In practice, the type of regularization you use very often depends on how you want to control your feature set. But regardless, it is a hugely important technique to keep under your belt as you venture into the machine learning jungle!
