
High-Dimensional Logistic Regression with L1 and L2 Regularisation in R


Feature Selection, Model Stability, and Classification Performance on NIR Spectral Data.

Introduction

In many modern AI discussions, the spotlight is on deep neural networks, transformers, and large-scale models. However, a significant portion of real-world prediction problems — especially in engineering, healthcare, and industrial analytics — are still solved effectively using classical statistical learning techniques.

This post explores a practical workflow for high-dimensional binary classification using logistic regression with L1 and L2 regularisation. The dataset consists of Near-Infrared (NIR) spectrometry measurements of pharmaceutical tablets, where the goal is to classify tablets into high or low active-ingredient categories based on spectral features.

The key themes are:

  • Managing high-dimensional feature spaces

  • Preventing overfitting

  • Feature selection vs coefficient shrinkage

  • Model evaluation using misclassification rate and confusion matrices

Although the techniques here are classical, the underlying ideas — optimisation, regularisation, bias-variance trade-offs — are the same foundations that support modern deep learning systems.

Core Concepts

Logistic Regression

Logistic regression is a probabilistic classification model used when the output is categorical. Instead of predicting a raw value, the model predicts a probability between 0 and 1. A threshold (commonly 0.5) converts this probability into a class label. Logistic regression models the probability that an observation belongs to class 1 given a vector of input features x:

$$P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$$

where:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

The decision rule is:

$$\hat{y} = \begin{cases} 1 & \text{if } P(y=1 \mid x) > \tau \\ 0 & \text{otherwise} \end{cases}$$
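As a minimal sketch in R, with made-up coefficients purely for illustration (not taken from the tablet model), the sigmoid and decision rule look like:

```r
# Sigmoid maps the linear predictor z onto a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Illustrative values: beta_0 = -1, beta_1 = 2, single feature x1 = 1.5
z <- -1 + 2 * 1.5               # linear predictor, z = 2
p <- sigmoid(z)                 # P(y = 1 | x), about 0.88
y_hat <- ifelse(p > 0.5, 1, 0)  # decision rule with tau = 0.5 -> class 1
```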

High-Dimensional Data

High-dimensional datasets contain many more input variables (features) than observations. This increases the risk of overfitting, where a model memorises training data instead of learning general patterns.

Regularisation

Regularisation reduces overfitting by adding a penalty term to the optimisation objective.
Instead of minimising only prediction error, the model minimises:

$$\text{Loss} + \lambda \cdot \text{Penalty}$$

Where:

  • λ controls the strength of regularisation

  • Larger λ → stronger penalty → simpler model

L1 Regularisation (Lasso):

  • Drives many coefficients exactly to zero

  • Acts as a built-in feature selector

  • Adds the sum of the absolute values of the coefficients as the penalty when minimising the loss function:

$$\mathcal{L}_{L1}(\beta) = \mathcal{L}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|$$

L2 Regularisation (Ridge):

  • Shrinks coefficients toward zero without eliminating them

  • Retains most features but reduces their magnitudes

  • Adds the sum of the squared coefficients as the penalty when minimising the loss function:

$$\mathcal{L}_{L2}(\beta) = \mathcal{L}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2$$
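In glmnet, both penalties are expressed through the elastic-net mixing parameter alpha: alpha = 1 gives the lasso (L1) and alpha = 0 gives ridge (L2). A minimal sketch on synthetic data (the dimensions and labels here are made up for illustration):

```r
library(glmnet)

set.seed(1)
x_demo <- matrix(rnorm(100 * 20), nrow = 100)  # 100 observations, 20 features
y_demo <- rbinom(100, 1, 0.5)                  # synthetic binary labels

# alpha = 1 -> L1 (lasso); alpha = 0 -> L2 (ridge)
fit_l1 <- glmnet(x_demo, y_demo, family = "binomial", alpha = 1, lambda = 0.1)
fit_l2 <- glmnet(x_demo, y_demo, family = "binomial", alpha = 0, lambda = 0.1)

# The lasso zeroes out coefficients; ridge only shrinks them
sum(coef(fit_l1)[-1] == 0)  # many slopes exactly zero
sum(coef(fit_l2)[-1] == 0)  # typically none
```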

Overfitting vs Generalisation

A well-performing model must balance:

  • Bias – being too simple

  • Variance – being too sensitive to noise

Regularisation helps achieve this balance.

Visualising the Data

All Code and Data available on Github

The dataset contains Near-Infrared (NIR) spectrometry measurements of a sample of pharmaceutical tablets. The task of the logistic regression model is to classify the tablets into those that contain a “Low (0)” level of active ingredient and those that contain a “High (1)” level. The target variable Y takes on the value “Low” or “High” and is predicted using the 650 variables that make up the NIR explanatory variables. The training dataset comprises 195 observations, while the test dataset comprises a further 460 observations.

Method

A logistic regression model with an L1 penalty is fitted to the standardised training data. This “lasso” model has the effect of reducing the explanatory variables used by the model to a subset of the 650 NIR readings for each tablet. To find the optimal value of lambda for the L1 penalty, a vector of 100 values of lambda in the range 0.005 to 0.150 is evaluated. The standardised training dataset is split so that 70% of the data is used to fit the model and the remaining 30% is held out for validation. Because the training/validation split is sampled at random, the fitting procedure is repeated over 100 iterations to average out the effect of the sampling.

In each iteration, a logistic regression model with an L1 penalty is fitted to a random 70% sample of the training data for each of the 100 values of lambda. The model is then used to estimate the probability that each observation in the training and validation data has a high or low active-ingredient content. A threshold function with tau = 0.5 assigns the class: if the predicted probability is less than or equal to 0.5 the observation is classified as “Low (0)”, and if it is greater than the threshold it is classified as “High (1)”.

For each value of lambda, the misclassification rate is then calculated for the training and validation data as the number of incorrectly classified observations divided by the total number of observations. After the 100 iterations, the value of lambda with the lowest mean validation misclassification rate is selected as the optimal value of lambda.

The optimal value of lambda is then used to fit a logistic regression model with an L1 penalty to the full standardised training data, and the fitted model is applied to the test data. Classes are assigned to the test observations using the same threshold function with tau = 0.5 described earlier. The misclassification rate for the test data is then reported as the number of incorrectly classified observations in the test dataset divided by the total number of observations in the test dataset.

Model Implementation in R

Environment Setup

Clear the workspace, set a seed for reproducibility, and load the required libraries.

rm(list = ls())
set.seed(128)

library(Rtsne)
library(glmnet)

Load Dataset

Load the dataset containing the training and test splits.

load("data_nir_tablets.RData")

Inspect Dataset Dimensions

Understand the size of the feature matrices and label vectors.

dim(x)
length(y)

dim(x_test)
length(y_test)

Class Balance Check

Check whether the classes are balanced or skewed.

table(y)
table(y_test)

Data Visualisation with t-SNE

t-Distributed Stochastic Neighbour Embedding (t-SNE) reduces dimensionality to visualise structure and separability.

x_all <- rbind(x, x_test)
y_all <- c(y, y_test)

rtsne <- Rtsne(x_all, perplexity = 30)
colours <- c("red", "blue")[y_all + 1]

plot(
  rtsne$Y,
  pch = 19,
  col = adjustcolor(colours, 0.3),
  main = "All Data – t-SNE Projection"
)

rtsne_train <- Rtsne(x, perplexity = 30)
colours_train <- c("red", "blue")[y + 1]

plot(
  rtsne_train$Y,
  pch = 19,
  col = adjustcolor(colours_train, 0.3),
  main = "Training Data – t-SNE Projection"
)

Feature Standardisation

Scaling ensures each feature contributes equally during optimisation. The class labels are left as-is (they are categories, not measurements), and the test features are scaled using the training means and standard deviations so that no information leaks from the test set. Note that glmnet also standardises features internally by default (standardize = TRUE).

x_stand <- scale(x, center = TRUE, scale = TRUE)

# Apply the training centre and scale to the test features
x_test_stand <- scale(
  x_test,
  center = attr(x_stand, "scaled:center"),
  scale  = attr(x_stand, "scaled:scale")
)

Utility Functions

Define helper functions for classification error and probability thresholding.

classification_error <- function(y, yhat){
  tab <- table(y, yhat)
  1 - (sum(diag(tab)) / length(y))
}

tau <- 0.5

assign_class <- function(probability){
  ifelse(probability > tau, 1, 0)
}
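As a quick sanity check with made-up labels (not taken from the dataset), the helpers behave as expected:

```r
y_true <- c(0, 1, 1, 0)
y_pred <- c(0, 1, 0, 0)

classification_error(y_true, y_pred)  # 1 error out of 4 -> 0.25
assign_class(c(0.2, 0.7))             # probabilities -> classes 0, 1
```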

Logistic Regression with Regularisation

This function performs repeated train/validation splits, selects the optimal lambda, and evaluates performance.

run_logistic_regression <- function(regularisation, num_lambda, training_iterations, validation_ratio){

  L_penalty_option <- ifelse(regularisation == 1, "L1 Regularization", "L2 Regularization")
  legend_location  <- ifelse(regularisation == 1, "bottomright", "topright")

  lambda <- seq(0.005, 0.150, length = num_lambda)

  error_train <- matrix(NA, training_iterations, num_lambda)
  error_val   <- matrix(NA, training_iterations, num_lambda)

  training_rows <- length(y)
  L <- floor(training_rows * validation_ratio)

  for (b in 1:training_iterations){

    val   <- sample(1:training_rows, L)
    train <- setdiff(1:training_rows, val)

    fit <- glmnet(x[train, ], y[train], family="binomial", alpha=regularisation, lambda=lambda)

    probabilities_train <- predict(fit, newx=x[train,], type="response")
    classes_train <- apply(probabilities_train, 2, assign_class)

    probabilities_val <- predict(fit, newx=x[val,], type="response")
    classes_val <- apply(probabilities_val, 2, assign_class)

    error_train[b,] <- sapply(1:num_lambda, function(l)
      classification_error(y[train], classes_train[,l]))

    error_val[b,] <- sapply(1:num_lambda, function(l)
      classification_error(y[val], classes_val[,l]))
  }

  lambda_best_val <- lambda[which.min(colMeans(error_val))]

  fit_test <- glmnet(x, y, family='binomial', lambda=lambda_best_val, alpha=regularisation)

  probabilities_test <- predict(fit_test, newx=x_test, type="response")
  classes_test <- apply(probabilities_test, 2, assign_class)

  print(table(y_test, classes_test))
  cat("\nMisclassification rate:",
      classification_error(y_test, classes_test), "\n")
}

Run L1 (Lasso) Regression

run_logistic_regression(
  regularisation = 1,
  num_lambda = 100,
  training_iterations = 100,
  validation_ratio = 0.3
)

Run L2 (Ridge) Regression

run_logistic_regression(
  regularisation = 0,
  num_lambda = 100,
  training_iterations = 100,
  validation_ratio = 0.3
)

Results

The optimal value of lambda selected by the training iterations was 0.09287879. Applying this value to the test data resulted in a misclassification rate of 0.2608696: 120 of the 460 observations in the test dataset were incorrectly classified. 78 tablets were classified as having a “High (1)” level of active ingredient when they in fact contained a “Low (0)” level, and a further 42 tablets were classified as “Low (0)” when the actual level was “High (1)”. The remaining 340 tablets were correctly classified. With lambda = 0.09287879 in the L1 penalty, the beta coefficients of 646 explanatory variables were shrunk to exactly 0; the model identified 4 features from the infrared spectrometry observations as useful for classifying the tablets by their active-ingredient content.

Table 1 – Confusion matrix for the test data (L1 penalty)

                  Predicted Low   Predicted High
  Actual Low            152              78
  Actual High            42             188

Table 2 – Comparison of L1 and L2 penalties

                              L1 Penalty   L2 Penalty
  Optimal Training Lambda         0.1500       0.1500
  Optimal Validation Lambda       0.0929       0.0753
  Misclassification Rate          0.2609       0.2696
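The sparsity reported above can be inspected directly from the fitted glmnet object. A sketch, assuming the `x` and `y` objects from the loaded dataset and the selected lambda of roughly 0.0929:

```r
library(glmnet)

# Refit the lasso on the full training data at the selected lambda
fit_l1 <- glmnet(x, y, family = "binomial", alpha = 1, lambda = 0.0929)

# coef() returns the intercept plus one slope per NIR wavelength
betas    <- as.numeric(coef(fit_l1))[-1]  # drop the intercept
selected <- which(betas != 0)             # indices of the retained wavelengths
length(selected)                          # features surviving the L1 penalty
```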

Discussion

The logistic regression model's misclassification rate of 0.26 on the test dataset is still relatively high. Whether this model is acceptable depends on the consequences of incorrectly classifying a tablet. If the active ingredient is crucial to the treatment of a serious condition, then false positives, which wrongly designate a tablet as “High” in the active ingredient, would have a detrimental effect on the treatment of that condition. If the active ingredient is dangerous when combined with other medications, then a false negative, which wrongly designates a tablet as “Low”, could put patients at risk.

It is noted that the L1 penalty resulted in the selection of only four of the near-infrared spectrometry readings as explanatory variables. This produces a sparse model of much lower complexity, which generalises better to the test observations. It also means that the process for collecting the spectrometry readings could be simplified and made more efficient by measuring only the readings the model identifies as significant.

It is also noted that the optimal value of lambda selected on the validation data is lower than the optimal value identified on the training data. On the training data the maximum value of lambda, 0.150, was selected, as shown in Illustration 3. However, this value of lambda did not generalise well when applied to the validation data.

Finally, it is noted that the logistic regression model with L1 penalty achieves a slightly lower misclassification rate than the equivalent model with L2 penalty, as shown in Table 2. This may be because L1 regularisation produces a much sparser, simpler model that generalises better to new test data.