<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[The Confidence Interval]]></title><description><![CDATA[Artificial Intelligence, Machine Learning and Software Architecture.]]></description><link>https://theconfidenceinterval.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 11:54:15 GMT</lastBuildDate><atom:link href="https://theconfidenceinterval.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[High-Dimensional Logistic Regression with L1 and L2 Regularisation in R]]></title><description><![CDATA[Feature Selection, Model Stability, and Classification Performance on NIR Spectral Data.
Introduction
In many modern AI discussions, the spotlight is on deep neural networks, transformers, and large-scale models. However, a significant portion of rea...]]></description><link>https://theconfidenceinterval.com/high-dimensional-logistic-regression-with-l1-and-l2-regularisation-in-r</link><guid isPermaLink="true">https://theconfidenceinterval.com/high-dimensional-logistic-regression-with-l1-and-l2-regularisation-in-r</guid><category><![CDATA[logistic regression]]></category><category><![CDATA[regularization]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Alan Flood]]></dc:creator><pubDate>Fri, 30 Jan 2026 18:45:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/obMUS2F3MzM/upload/70a61fa6e41f78c047b4b9adc4d1b3f7.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Feature Selection, Model Stability, and Classification Performance on NIR Spectral Data.</em></p>
<h2 id="heading-introduction">Introduction</h2>
<p>In many modern AI discussions, the spotlight is on deep neural networks, transformers, and large-scale models. However, a significant portion of real-world prediction problems — especially in engineering, healthcare, and industrial analytics — are still solved effectively using <strong>classical statistical learning techniques</strong>.</p>
<p>This post explores a practical workflow for <strong>high-dimensional binary classification</strong> using <strong>logistic regression with L1 and L2 regularisation</strong>. The dataset consists of Near-Infrared (NIR) spectrometry measurements of pharmaceutical tablets, where the goal is to classify tablets into <strong>high</strong> or <strong>low</strong> active-ingredient categories based on spectral features.</p>
<p>The key themes are:</p>
<ul>
<li><p>Managing high-dimensional feature spaces</p>
</li>
<li><p>Preventing overfitting</p>
</li>
<li><p>Feature selection vs coefficient shrinkage</p>
</li>
<li><p>Model evaluation using misclassification rate and confusion matrices</p>
</li>
</ul>
<p>Although the techniques here are classical, the underlying ideas — optimisation, regularisation, bias-variance trade-offs — are the same foundations that support modern deep learning systems.</p>
<h2 id="heading-core-concepts">Core Concepts</h2>
<h3 id="heading-logistic-regression">Logistic Regression</h3>
<p>Logistic regression is a probabilistic classification model used when the output is categorical. Instead of predicting a raw value, the model predicts a <strong>probability</strong> between 0 and 1. A threshold (commonly 0.5) converts this probability into a class label. Logistic regression models the probability that an observation belongs to class 1 given a vector of input features x:</p>
<p>$$P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$$</p><p>where:</p>
<p>$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$</p><p>The decision rule is:</p>
<p>$$\hat{y} = \begin{cases} 1 &amp; \text{if } P(y=1 \mid x) &gt; \tau \\ 0 &amp; \text{otherwise} \end{cases}$$</p><h3 id="heading-high-dimensional-data">High-Dimensional Data</h3>
<p>High-dimensional datasets contain many more input variables (features) than observations. This increases the risk of <strong>overfitting</strong>, where a model memorises training data instead of learning general patterns.</p>
<h3 id="heading-regularisation">Regularisation</h3>
<p>Regularisation reduces overfitting by adding a <strong>penalty term</strong> to the optimisation objective.<br />Instead of minimising only prediction error, the model minimises:</p>
<p>$$\text{Loss} + \lambda \cdot \text{Penalty}$$</p><p>Where:</p>
<ul>
<li><p>λ controls the strength of regularisation</p>
</li>
<li><p>Larger λ → stronger penalty → simpler model</p>
</li>
</ul>
<p><strong>L1 Regularisation (Lasso):</strong></p>
<ul>
<li><p>Drives many coefficients exactly to zero</p>
</li>
<li><p>Acts as a built-in <strong>feature selector</strong></p>
</li>
<li><p>Adds the sum of the <strong>absolute values</strong> of the coefficients as the penalty when minimizing the loss function:</p>
</li>
</ul>
<p>$$\mathcal{L}_{L2}(\beta) = \mathcal{L}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2$$</p><p><strong>L2 Regularisation (Ridge):</strong></p>
<ul>
<li><p>Shrinks coefficients toward zero</p>
</li>
<li><p>Retains most features but reduces magnitude</p>
</li>
<li><p>Adds the sum of the <strong>squared</strong> coefficients as the penalty when minimizing the loss function:</p>
</li>
</ul>
<p>$$\mathcal{L}_{L1}(\beta) = \mathcal{L}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|$$</p><h3 id="heading-overfitting-vs-generalisation">Overfitting vs Generalisation</h3>
<p>A well-performing model must balance:</p>
<ul>
<li><p><strong>Bias</strong> – being too simple</p>
</li>
<li><p><strong>Variance</strong> – being too sensitive to noise</p>
</li>
</ul>
<p>Regularisation helps achieve this balance.</p>
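<p>To make these ideas concrete, here is a small base-R sketch using toy values (not the NIR data). In one dimension, minimising the squared loss plus the penalty has a closed form: ridge gives b / (1 + &lambda;), while lasso gives the soft-threshold sign(b)&middot;max(|b| &minus; &lambda;/2, 0), which can be exactly zero.</p>

```r
# Toy illustration of the sigmoid, the decision threshold, and
# one-dimensional ridge vs lasso shrinkage (values are illustrative only).
sigmoid <- function(z) 1 / (1 + exp(-z))

p    <- sigmoid(0.8)          # probability for a single observation
yhat <- as.integer(p > 0.5)   # threshold at tau = 0.5 -> class 1

# Penalty terms for a toy coefficient vector
beta <- c(2.0, -0.5, 0.0, 1.5)
l1_penalty <- sum(abs(beta))  # lasso penalty: sum of absolute values = 4
l2_penalty <- sum(beta^2)     # ridge penalty: sum of squares = 6.5

# Closed-form minimisers of (b - beta)^2 + lambda * penalty in 1 dimension:
# ridge shrinks towards zero, lasso can reach exactly zero.
ridge_solution <- function(b, lambda) b / (1 + lambda)
lasso_solution <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)

ridge_solution(1, 10)  # 1/11: small but never exactly zero
lasso_solution(1, 10)  # 0: the coefficient is eliminated
```

<p>This is why, at a large enough &lambda;, the lasso drops variables entirely while ridge merely shrinks them.</p>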
<h2 id="heading-visualising-the-data">Visualising the Data</h2>
<p><a target="_blank" href="https://github.com/alanflood/the-confidence-interval/tree/develop/posts/2026-01-30-high-dimensional-logistic-regression">All Code and Data available on Github</a></p>
<p>The dataset contains near infrared spectrometry (NIR) measurements for a sample of pharmaceutical tablets. The task of the logistic regression model is to classify the tablets into those that contain a “Low (0)” level of active ingredient and those that contain a “High (1)” level. The target variable Y takes the value “Low” or “High” and is predicted using the 650 NIR explanatory variables. The training dataset comprises 195 observations, while the test dataset comprises a further 460 observations.</p>
<p><img src="https://the-confidence-interval.ghost.io/content/images/2026/01/image-14.png" alt class="image--center mx-auto" /></p>
<p><img src="https://the-confidence-interval.ghost.io/content/images/2026/01/image-15.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-method">Method</h2>
<p>A logistic regression model with an L1 penalty function is fitted to the standardized training data. This “Lasso” regression model has the effect of reducing the explanatory variables used in the model to a subset of the 650 NIR readings for each tablet. To ascertain the optimal value of lambda for the L1 penalty function, a vector of 100 values of lambda in the range 0.005 to 0.150 is evaluated. The standardized training dataset is split so that 70% of the data is used to fit the model and the remaining 30% is used as validation data, with the split sampled at random. A series of 100 iterations is therefore used to fit the model, to account for this random sampling.</p>
<p>In each iteration, a logistic regression model with an L1 penalty function is fit to a random 70% sample of the training data for each of the 100 values of lambda. The model is then used to calculate the probability that each observation in the training and validation data has a high or low active-ingredient content. A threshold function with tau = 0.5 then assigns the class: if the probability predicted by the model is less than or equal to 0.5 the observation is assigned the class “Low (0)”, and if it is greater than the threshold it is classified as “High (1)”.</p>
<p>For each value of lambda, the misclassification rate is then calculated for the training and validation data. This is the number of incorrectly classified observations divided by the total number of observations. After 100 iterations of fitting the model, the value of lambda with the lowest mean validation misclassification rate is selected as the optimal value of lambda.</p>
<p>The optimal value of lambda is then used to fit a logistic regression model with an L1 penalty function to the full standardized training data, and this model is used to classify the test data. Classes are assigned to the test observations using the same threshold function with tau = 0.5 described earlier. The misclassification rate for the test data is then reported as the number of incorrectly classified observations in the test dataset divided by the total number of observations in the test dataset.</p>
<h2 id="heading-model-implementation-in-r">Model Implementation in R</h2>
<h3 id="heading-environment-setup">Environment Setup</h3>
<p>Clear the workspace, set a seed for reproducibility, and load the required libraries.</p>
<pre><code class="lang-R">rm(list = ls())
set.seed(<span class="hljs-number">128</span>)

<span class="hljs-keyword">library</span>(Rtsne)
<span class="hljs-keyword">library</span>(glmnet)
</code></pre>
<h3 id="heading-load-dataset">Load Dataset</h3>
<p>Load the dataset containing the training and test splits.</p>
<pre><code class="lang-R">load(<span class="hljs-string">"data_nir_tablets.RData"</span>)
</code></pre>
<h3 id="heading-inspect-dataset-dimensions">Inspect Dataset Dimensions</h3>
<p>Understand the size of the feature matrices and label vectors.</p>
<pre><code class="lang-R">dim(x)
length(y)

dim(x_test)
length(y_test)
</code></pre>
<h3 id="heading-class-balance-check">Class Balance Check</h3>
<p>Check whether the classes are balanced or skewed.</p>
<pre><code class="lang-R">table(y)
table(y_test)
</code></pre>
<h3 id="heading-data-visualisation-with-t-sne">Data Visualisation with t-SNE</h3>
<p>t-Distributed Stochastic Neighbour Embedding (t-SNE) reduces dimensionality to visualise structure and separability.</p>
<pre><code class="lang-R">x_all &lt;- rbind(x, x_test)
y_all &lt;- c(y, y_test)

rtsne &lt;- Rtsne(x_all, perplexity = <span class="hljs-number">30</span>)
colours &lt;- c(<span class="hljs-string">"red"</span>, <span class="hljs-string">"blue"</span>)[y_all + <span class="hljs-number">1</span>]

plot(
  rtsne$Y,
  pch = <span class="hljs-number">19</span>,
  col = adjustcolor(colours, <span class="hljs-number">0.3</span>),
  main = <span class="hljs-string">"All Data – t-SNE Projection"</span>
)

rtsne_train &lt;- Rtsne(x, perplexity = <span class="hljs-number">30</span>)
colours_train &lt;- c(<span class="hljs-string">"red"</span>, <span class="hljs-string">"blue"</span>)[y + <span class="hljs-number">1</span>]

plot(
  rtsne_train$Y,
  pch = <span class="hljs-number">19</span>,
  col = adjustcolor(colours_train, <span class="hljs-number">0.3</span>),
  main = <span class="hljs-string">"Training Data – t-SNE Projection"</span>
)
</code></pre>
<h3 id="heading-feature-standardisation">Feature Standardisation</h3>
<p>Scaling ensures each feature contributes equally during optimisation.</p>
<pre><code class="lang-R">x_stand &lt;- scale(x, center = <span class="hljs-literal">TRUE</span>, scale = <span class="hljs-literal">TRUE</span>)

<span class="hljs-comment"># Standardize the test features with the TRAINING means and standard</span>
<span class="hljs-comment"># deviations so both sets share the same scale. The binary labels y and</span>
<span class="hljs-comment"># y_test are class indicators and are not standardized.</span>
x_test_stand &lt;- scale(
  x_test,
  center = attr(x_stand, <span class="hljs-string">"scaled:center"</span>),
  scale  = attr(x_stand, <span class="hljs-string">"scaled:scale"</span>)
)
</code></pre>
<h3 id="heading-utility-functions">Utility Functions</h3>
<p>Define helper functions for classification error and probability thresholding.</p>
<pre><code class="lang-R">classification_error &lt;- <span class="hljs-keyword">function</span>(y, yhat){
  tab &lt;- table(y, yhat)
  <span class="hljs-number">1</span> - (sum(diag(tab)) / length(y))
}

tau &lt;- <span class="hljs-number">0.5</span>

assign_class &lt;- <span class="hljs-keyword">function</span>(probability){
  ifelse(probability &gt; tau, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
}
</code></pre>
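<p>A quick sanity check of these helpers on a tiny hand-made example (the definitions are repeated here so the snippet runs on its own):</p>

```r
# Misclassification rate and thresholding helpers, as defined above.
classification_error <- function(y, yhat){
  tab <- table(y, yhat)
  1 - (sum(diag(tab)) / length(y))
}

tau <- 0.5
assign_class <- function(probability){
  ifelse(probability > tau, 1, 0)
}

# Four observations, two of which will be misclassified.
y_true <- c(0, 0, 1, 1)
probs  <- c(0.2, 0.7, 0.9, 0.4)

yhat <- assign_class(probs)          # 0 1 1 0
classification_error(y_true, yhat)   # 0.5
```

<p>One caveat worth knowing: because <code>classification_error</code> builds a contingency table, it assumes both classes appear among the predictions; if a model predicts only one class, the table collapses to a single column and the diagonal no longer lines up with the correct cells.</p>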
<h3 id="heading-logistic-regression-with-regularisation">Logistic Regression with Regularisation</h3>
<p>This function performs repeated train/validation splits, selects the optimal lambda, and evaluates performance.</p>
<pre><code class="lang-R">run_logistic_regression &lt;- <span class="hljs-keyword">function</span>(regularisation, num_lambda, training_iterations, validation_ratio){

  L_penalty_option &lt;- ifelse(regularisation == <span class="hljs-number">1</span>, <span class="hljs-string">"L1 Regularization"</span>, <span class="hljs-string">"L2 Regularization"</span>)
  legend_location  &lt;- ifelse(regularisation == <span class="hljs-number">1</span>, <span class="hljs-string">"bottomright"</span>, <span class="hljs-string">"topright"</span>)

  lambda &lt;- seq(<span class="hljs-number">0.005</span>, <span class="hljs-number">0.150</span>, length = num_lambda)

  error_train &lt;- matrix(<span class="hljs-literal">NA</span>, training_iterations, num_lambda)
  error_val   &lt;- matrix(<span class="hljs-literal">NA</span>, training_iterations, num_lambda)

  training_rows &lt;- length(y)
  L &lt;- floor(training_rows * validation_ratio)

  <span class="hljs-keyword">for</span> (b <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:training_iterations){

    val   &lt;- sample(<span class="hljs-number">1</span>:training_rows, L)
    train &lt;- setdiff(<span class="hljs-number">1</span>:training_rows, val)

    fit &lt;- glmnet(x[train, ], y[train], family=<span class="hljs-string">"binomial"</span>, alpha=regularisation, lambda=lambda)

    probabilities_train &lt;- predict(fit, newx=x[train,], type=<span class="hljs-string">"response"</span>)
    classes_train &lt;- apply(probabilities_train, <span class="hljs-number">2</span>, assign_class)

    probabilities_val &lt;- predict(fit, newx=x[val,], type=<span class="hljs-string">"response"</span>)
    classes_val &lt;- apply(probabilities_val, <span class="hljs-number">2</span>, assign_class)

    error_train[b,] &lt;- sapply(<span class="hljs-number">1</span>:num_lambda, <span class="hljs-keyword">function</span>(l)
      classification_error(y[train], classes_train[,l]))

    error_val[b,] &lt;- sapply(<span class="hljs-number">1</span>:num_lambda, <span class="hljs-keyword">function</span>(l)
      classification_error(y[val], classes_val[,l]))
  }

  lambda_best_val &lt;- lambda[which.min(colMeans(error_val))]

  fit_test &lt;- glmnet(x, y, family=<span class="hljs-string">'binomial'</span>, lambda=lambda_best_val, alpha=regularisation)

  probabilities_test &lt;- predict(fit_test, newx=x_test, type=<span class="hljs-string">"response"</span>)
  classes_test &lt;- apply(probabilities_test, <span class="hljs-number">2</span>, assign_class)

  print(table(y_test, classes_test))
  cat(<span class="hljs-string">"\nMisclassification rate:"</span>,
      classification_error(y_test, classes_test), <span class="hljs-string">"\n"</span>)
}
</code></pre>
<h3 id="heading-run-l1-lasso-regression">Run L1 (Lasso) Regression</h3>
<pre><code class="lang-R">run_logistic_regression(
  regularisation = <span class="hljs-number">1</span>,
  num_lambda = <span class="hljs-number">100</span>,
  training_iterations = <span class="hljs-number">100</span>,
  validation_ratio = <span class="hljs-number">0.3</span>
)
</code></pre>
<h3 id="heading-run-l2-ridge-regression">Run L2 (Ridge) Regression</h3>
<pre><code class="lang-r">run_logistic_regression(
    regularisation=<span class="hljs-number">0</span>, 
    num_lambda=<span class="hljs-number">100</span>, 
    training_iterations=<span class="hljs-number">100</span>, 
    validation_ratio=<span class="hljs-number">0.3</span>
)
</code></pre>
<h2 id="heading-results">Results</h2>
<p>The optimal value of lambda selected by the training iterations was 0.09287879. Applying this value of lambda to the test data resulted in a misclassification rate of 0.2608696: 120 of the 460 observations in the test dataset were incorrectly classified. 78 tablets were classified as having a “High (1)” level of active ingredient when in fact they contained a “Low (0)” level, and a further 42 tablets were classified as having a “Low (0)” level when the actual level was “High (1)”. The remaining 340 tablets were correctly classified. Using a value of 0.09287879 in the L1 penalty function shrank the beta coefficients of 646 explanatory variables to exactly 0, so the model identified 4 features from the infrared spectrometry observations as useful for classifying the tablets by their active-ingredient content.</p>
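<p>The reported rate can be reproduced directly from the confusion matrix:</p>

```r
# Recompute the test misclassification rate from the reported confusion
# matrix (rows = actual class, columns = predicted class).
confusion <- matrix(
  c(152, 42,    # predicted Low  for actual Low, High
     78, 188),  # predicted High for actual Low, High
  nrow = 2,
  dimnames = list(Actual = c("Low", "High"), Predicted = c("Low", "High"))
)

errors <- sum(confusion) - sum(diag(confusion))  # 78 + 42 = 120
rate   <- errors / sum(confusion)                # 120 / 460
round(rate, 7)                                   # 0.2608696
```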
<p><img src="https://the-confidence-interval.ghost.io/content/images/2026/01/image-16.png" alt class="image--center mx-auto" /></p>
<p><img src="https://the-confidence-interval.ghost.io/content/images/2026/01/image-17.png" alt class="image--center mx-auto" /></p>
<p><img src="https://the-confidence-interval.ghost.io/content/images/2026/01/image-18.png" alt class="image--center mx-auto" /></p>
<p><img src="https://the-confidence-interval.ghost.io/content/images/2026/01/image-19.png" alt class="image--center mx-auto" /></p>
<div class="hn-table">
<table>
<thead>
<tr>
<th></th><th>Predicted Low</th><th>Predicted High</th></tr>
</thead>
<tbody>
<tr>
<td><strong>Actual Low</strong></td><td>152</td><td>78</td></tr>
<tr>
<td><strong>Actual High</strong></td><td>42</td><td>188</td></tr>
</tbody>
</table>
</div><div class="hn-table">
<table>
<thead>
<tr>
<th></th><th>L1 Penalty</th><th>L2 Penalty</th></tr>
</thead>
<tbody>
<tr>
<td><strong>Optimal Training Lambda</strong></td><td>0.1500</td><td>0.1500</td></tr>
<tr>
<td><strong>Optimal Validation Lambda</strong></td><td>0.0929</td><td>0.0753</td></tr>
<tr>
<td><strong>Misclassification Rate</strong></td><td>0.2609</td><td>0.2696</td></tr>
</tbody>
</table>
</div><h2 id="heading-discussion">Discussion</h2>
<p>The logistic regression model’s misclassification rate of 0.26 on the test dataset is still relatively high. Whether this model is acceptable depends on the consequences of incorrectly classifying each tablet. If the active ingredient is crucial to the treatment of a serious condition, then false positives, which falsely designate a tablet as “High” in the active ingredient, would have a detrimental effect on the treatment of that condition. If the active ingredient is dangerous when combined with other medications, then a false negative, which falsely designates the tablet as “Low”, could put patients at risk.</p>
<p>It is noted that the L1 Penalty Function resulted in the selection of only four of the near infrared spectrometry readings as explanatory variables. This resulted in a sparse model of much lower complexity which generalizes better to test observations. This means that the process for collecting the infrared spectrometry readings can be simplified and made more efficient by only collecting the readings identified by the model as being significant.</p>
<p>It is also noted that the optimal value of lambda selected for the validation data is lower than the optimal value of lambda identified for the training data. When using the training data the maximum value of lambda of 0.150 was selected as shown in illustration 3. However, this value of lambda did not generalise well when applied to the validation data.</p>
<p>Finally, it is noted that the logistic regression model with the L1 penalty function results in a slightly lower misclassification rate than the comparable model with the L2 penalty function, as shown in Table 2. This may be because L1 regularization produces a much sparser, simpler model that generalizes better to new test data.</p>
]]></content:encoded></item></channel></rss>