Example: logistic regression
This tutorial was generated using Literate.jl.
This tutorial was originally contributed by François Pacaud.
This tutorial shows how to solve a logistic regression problem with JuMP. Logistic regression is a well-known method in machine learning, useful when we want to classify binary outcomes with the help of a given set of features. To this end, we find the parameter vector that maximizes the (log-)likelihood over a training set.
Required packages
This tutorial uses the following packages:
using JuMP
import MathOptInterface as MOI
import Random
import SCS
Random.seed!(2713);
Formulating the logistic regression problem
Suppose we have a set of training data points $i = 1, \cdots, n$, where for each $i$ we have a vector of features $x_i \in \mathbb{R}^p$ and a categorical observation $y_i \in \{-1, 1\}$.
The log-likelihood is given by
\[l(\theta) = \sum_{i=1}^n \log(\dfrac{1}{1 + \exp(-y_i \theta^\top x_i)})\]
and the optimal $\theta$ minimizes the logistic loss function:
\[\min_{\theta}\; \sum_{i=1}^n \log(1 + \exp(-y_i \theta^\top x_i)).\]
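This loss comes from the standard logistic model, which sets the conditional probability of observing the label $y_i \in \{-1, 1\}$ given the features $x_i$ to
\[\mathbb{P}(y_i \mid x_i; \theta) = \dfrac{1}{1 + \exp(-y_i \theta^\top x_i)},\]
so maximizing the log-likelihood $l(\theta)$ is the same as minimizing its negative, which is exactly the loss above.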
Most of the time, instead of solving the previous optimization problem directly, we prefer to add a regularization term:
\[\min_{\theta}\; \sum_{i=1}^n \log(1 + \exp(-y_i \theta^\top x_i)) + \lambda \| \theta \|\]
with $\lambda \in \mathbb{R}_+$ a penalty parameter and $\|.\|$ a norm function. By adding such a regularization term, we avoid overfitting the training set and usually achieve a better score in cross-validation.
Reformulation as a conic optimization problem
By introducing auxiliary variables $t_1, \cdots, t_n$ and $r$, the optimization problem is equivalent to
\[\begin{aligned} \min_{t, r, \theta} \;& \sum_{i=1}^n t_i + \lambda r \\ \text{subject to } & \quad t_i \geq \log(1 + \exp(- y_i \theta^\top x_i)) \\ & \quad r \geq \|\theta\| \end{aligned}\]
Now, the trick is to reformulate the constraints $t_i \geq \log(1 + \exp(- y_i \theta^\top x_i))$ with the help of the exponential cone
\[K_{exp} = \{ (x, y, z) \in \mathbb{R}^3 : \; y \exp(x / y) \leq z, \; y > 0 \} .\]
Indeed, by exponentiating both sides and dividing by $\exp(t_i)$, we see that for all $i=1, \cdots, n$, the constraint $t_i \geq \log(1 + \exp(- y_i \theta^\top x_i))$ is equivalent to
\[\exp(-t_i) + \exp(u_i - t_i) \leq 1\]
with $u_i = -y_i \theta^\top x_i$. Then, by adding two auxiliary variables $z_{i1}$ and $z_{i2}$ such that $z_{i1} \geq \exp(u_i-t_i)$ and $z_{i2} \geq \exp(-t_i)$, we get the equivalent formulation
\[\left\{ \begin{aligned} (u_i -t_i , 1, z_{i1}) & \in K_{exp} \\ (-t_i , 1, z_{i2}) & \in K_{exp} \\ z_{i1} + z_{i2} & \leq 1 \end{aligned} \right.\]
In this setting, the conic formulation of the logistic regression problem is
\[\begin{aligned} \min_{t, z, r, \theta}& \; \sum_{i=1}^n t_i + \lambda r \\ \text{subject to } & \quad (u_i -t_i , 1, z_{i1}) \in K_{exp} \\ & \quad (-t_i , 1, z_{i2}) \in K_{exp} \\ & \quad z_{i1} + z_{i2} \leq 1 \\ & \quad u_i = -y_i x_i^\top \theta \\ & \quad r \geq \|\theta\| \end{aligned}\]
and thus encompasses $3n + p + 1$ variables and $3n + 1$ constraints ($u_i = -y_i \theta^\top x_i$ is only a notational shortcut, not an additional constraint). Hence, if $n \gg 1$, we get a large number of variables and constraints.
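For example, the instance solved below has $n = 200$ samples and $p = 11$ features (the 10 generated features plus an intercept column), which already amounts to $3 \times 200 + 11 + 1 = 612$ variables and $3 \times 200 + 1 = 601$ constraints.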
Fitting logistic regression with a conic solver
We start by implementing a function to generate a fake dataset in which we can tune the correlation between the feature variables. The function is a direct transcription of the one used in this blog post.
function generate_dataset(n_samples = 100, n_features = 10; shift = 0.0)
X = randn(n_samples, n_features)
w = randn(n_features)
y = sign.(X * w)
X .+= 0.8 * randn(n_samples, n_features) # add noise
X .+= shift # shift the points in the feature space
X = hcat(X, ones(n_samples, 1))
return X, y
end
generate_dataset (generic function with 3 methods)
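As a quick illustration (this check is not part of the original tutorial), note that the returned matrix carries an extra column of ones acting as an intercept term:
# Hypothetical quick check: X has n_features + 1 columns because
# generate_dataset appends a column of ones as an intercept.
# Note: this extra call advances the global RNG, so the numbers shown later
# in this tutorial will differ if you run it.
X_demo, y_demo = generate_dataset(5, 2)
size(X_demo)  # (5, 3)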
We write a softplus function to formulate each constraint $t \geq \log(1 + \exp(u))$ with two exponential cones.
function softplus(model, t, u)
z = @variable(model, [1:2], lower_bound = 0.0)
@constraint(model, sum(z) <= 1.0)
@constraint(model, [u - t, 1, z[1]] in MOI.ExponentialCone())
@constraint(model, [-t, 1, z[2]] in MOI.ExponentialCone())
end
softplus (generic function with 1 method)
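As a small sanity check (a sketch, not part of the original tutorial), minimizing $t$ subject to a single softplus constraint with a fixed $u$ should recover $t \approx \log(1 + \exp(u))$, up to solver tolerance:
# Sketch: verify the conic reformulation on a one-variable model.
check = Model(SCS.Optimizer)
set_silent(check)
@variable(check, t)
softplus(check, t, 0.5)  # u = 0.5 is an arbitrary test value
@objective(check, Min, t)
optimize!(check)
isapprox(value(t), log(1 + exp(0.5)); atol = 1e-3)  # log(1 + exp(0.5)) ≈ 0.974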
$\ell_2$ regularized logistic regression
Then, with the help of the softplus function, we can write our optimization model. In the $\ell_2$ regularization case, the constraint $r \geq \|\theta\|_2$ can be written as a second-order cone constraint.
function build_logit_model(X, y, λ)
n, p = size(X)
model = Model()
@variable(model, θ[1:p])
@variable(model, t[1:n])
for i in 1:n
u = -(X[i, :]' * θ) * y[i]
softplus(model, t[i], u)
end
# Add ℓ2 regularization
@variable(model, 0.0 <= reg)
@constraint(model, [reg; θ] in SecondOrderCone())
# Define objective
@objective(model, Min, sum(t) + λ * reg)
return model
end
build_logit_model (generic function with 1 method)
We generate the dataset.
Be careful here: for large $n$ and $p$, SCS could fail to converge.
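If that happens, one possible workaround (a sketch; the option names "max_iters", "eps_abs", and "eps_rel" are assumptions about SCS's raw settings and may change between SCS versions) is to attach the solver with a higher iteration limit or looser tolerances:
# Sketch: bundle assumed SCS options with the optimizer; pass `scs` to
# set_optimizer instead of the plain SCS.Optimizer used below.
scs = optimizer_with_attributes(
    SCS.Optimizer,
    "max_iters" => 200_000,  # assumed option: iteration limit
    "eps_abs" => 1e-3,       # assumed option: absolute tolerance
    "eps_rel" => 1e-3,       # assumed option: relative tolerance
)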
n, p = 200, 10
X, y = generate_dataset(n, p; shift = 10.0);
# We can now solve the logistic regression problem
λ = 10.0
model = build_logit_model(X, y, λ)
set_optimizer(model, SCS.Optimizer)
set_silent(model)
optimize!(model)
@assert is_solved_and_feasible(model)
θ♯ = value.(model[:θ])
11-element Vector{Float64}:
0.020413203134136172
0.16139903246564635
0.35700771838417933
-0.307880821875803
-0.3939184218208195
-0.059140401509885365
0.34717365911626014
-0.881212323544203
0.20125660911983811
0.5409398851900536
0.0809041969978851
It appears that the speed of convergence is not greatly impacted by the correlation of the dataset, nor by the penalty $\lambda$.
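One quick way to check this claim (a sketch, not part of the original tutorial) is to time the solve for a few penalty values with solve_time:
# Sketch: re-solve the ℓ2 model for several λ values and report solver time.
for λ_test in (0.1, 1.0, 10.0)
    m = build_logit_model(X, y, λ_test)
    set_optimizer(m, SCS.Optimizer)
    set_silent(m)
    optimize!(m)
    println("λ = ", λ_test, ": solved in ", round(solve_time(m); digits = 3), " s")
end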
$\ell_1$ regularized logistic regression
We now formulate the logistic problem with an $\ell_1$ regularization term. The $\ell_1$ regularization ensures sparsity in the optimal solution of the resulting optimization problem. Luckily, the $\ell_1$ norm is implemented as a set in MathOptInterface. Thus, we can formulate the sparse logistic regression problem with the help of a MOI.NormOneCone set.
function build_sparse_logit_model(X, y, λ)
n, p = size(X)
model = Model()
@variable(model, θ[1:p])
@variable(model, t[1:n])
for i in 1:n
u = -(X[i, :]' * θ) * y[i]
softplus(model, t[i], u)
end
# Add ℓ1 regularization
@variable(model, 0.0 <= reg)
@constraint(model, [reg; θ] in MOI.NormOneCone(p + 1))
# Define objective
@objective(model, Min, sum(t) + λ * reg)
return model
end
# Auxiliary function to count non-zero components:
count_nonzero(v::Vector; tol = 1e-6) = sum(abs.(v) .>= tol)
# We solve the sparse logistic regression problem on the same dataset as before.
λ = 10.0
sparse_model = build_sparse_logit_model(X, y, λ)
set_optimizer(sparse_model, SCS.Optimizer)
set_silent(sparse_model)
optimize!(sparse_model)
@assert is_solved_and_feasible(sparse_model)
θ♯ = value.(sparse_model[:θ])
println(
"Number of non-zero components: ",
count_nonzero(θ♯),
" (out of ",
p,
" features)",
)
Number of non-zero components: 8 (out of 10 features)
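For comparison (this check is not part of the original tutorial), applying the same count to the $\ell_2$-regularized solution computed earlier shows no sparsity, as the printed θ♯ above already suggests:
# Sketch: the ℓ2 penalty shrinks coefficients but does not drive them to zero.
count_nonzero(value.(model[:θ]))  # all 11 components of the ℓ2 solution exceed the 1e-6 tolerance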
Extensions
A direct extension would be to consider the sparse logistic regression with hard thresholding, which, contrary to the soft version using an $\ell_1$ regularization, adds an explicit cardinality constraint to its formulation:
\[\begin{aligned} \min_{\theta} & \; \sum_{i=1}^n \log(1 + \exp(-y_i \theta^\top x_i)) + \lambda \| \theta \|_2^2 \\ \text{subject to } & \quad \| \theta \|_0 \leq k \end{aligned}\]
where $k$ is the maximum number of non-zero components in the vector $\theta$, and $\|.\|_0$ is the $\ell_0$ pseudo-norm:
\[\| x\|_0 = \#\{i : \; x_i \neq 0\}\]
The cardinality constraint $\|\theta\|_0 \leq k$ can be reformulated with binary variables, so the hard sparse regression problem can be solved by any solver supporting mixed-integer conic problems.
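A minimal sketch of that binary reformulation, assuming a valid big-M bound on each $|\theta_j|$ and a solver that supports both exponential cones and binary variables (neither is prescribed by this tutorial), could look as follows:
# Sketch only: M is an assumed bound on |θ_j|; too small a value cuts off
# valid solutions, too large a value weakens the relaxation.
function add_cardinality_constraint!(model, θ, k; M = 10.0)
    p = length(θ)
    b = @variable(model, [1:p], binary = true)
    # θ_j is forced to zero whenever b_j = 0
    @constraint(model, [j in 1:p], θ[j] <= M * b[j])
    @constraint(model, [j in 1:p], θ[j] >= -M * b[j])
    # at most k coefficients may be non-zero
    @constraint(model, sum(b) <= k)
    return b
end
Calling add_cardinality_constraint!(model, model[:θ], k) on one of the models built above would then turn it into a mixed-integer conic problem.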