# Auto-tuning Hyperparameters

This example shows how to learn a hyperparameter in Ridge Regression using a gradient descent routine. Let the regularized regression problem be formulated as:

\[\begin{equation} \min_{w} \quad \frac{1}{2nd} \sum_{i=1}^{n} (w^T x_{i} - y_i)^2 + \frac{\alpha}{2d} \| w \|_2^2 \end{equation}\]

where

`x`

,`y`

are the data points`w`

are the learned weights`α`

is the hyperparameter acting on regularization.

The main optimization model will be formulated with JuMP. Using the gradient of the optimal weights with respect to the regularization parameters computed with DiffOpt, we can perform a gradient descent on top of the inner model to minimize the test loss.

This tutorial uses the following packages

```
using JuMP # The mathematical programming modelling language
import DiffOpt # JuMP extension for differentiable optimization
import Ipopt # Optimization solver that handles quadratic programs
import Plots # Graphing tool
import LinearAlgebra: norm, dot
import Random
```

## Generating a noisy regression dataset

```
Random.seed!(42)
N = 100
D = 20
noise = 5
w_real = 10 * randn(D)
X = 10 * randn(N, D)
y = X * w_real + noise * randn(N)
l = N ÷ 2 # test train split
X_train = X[1:l, :]
X_test = X[l+1:N, :]
y_train = y[1:l]
y_test = y[l+1:N];
```

## Defining the regression problem

We implement the regularized regression problem as a function taking the problem data, building a JuMP model and solving it.

```
function fit_ridge(model, X, y, α)
JuMP.empty!(model)
set_silent(model)
N, D = size(X)
@variable(model, w[1:D])
@expression(model, err_term, X * w - y)
@objective(
model,
Min,
dot(err_term, err_term) / (2 * N * D) + α * dot(w, w) / (2 * D),
)
optimize!(model)
@assert termination_status(model) in [MOI.OPTIMAL, MOI.LOCALLY_SOLVED, MOI.ALMOST_LOCALLY_SOLVED]
return w
end
```

`fit_ridge (generic function with 1 method)`

We can solve the problem for several values of α to visualize the effect of regularization on the testing and training loss.

```
αs = 0.00:0.01:0.50
mse_test = Float64[]
mse_train = Float64[]
model = Model(() -> DiffOpt.diff_optimizer(Ipopt.Optimizer))
(Ntest, D) = size(X_test)
(Ntrain, D) = size(X_train)
for α in αs
w = fit_ridge(model, X_train, y_train, α)
ŵ = value.(w)
ŷ_test = X_test * ŵ
ŷ_train = X_train * ŵ
push!(mse_test, norm(ŷ_test - y_test)^2 / (2 * Ntest * D))
push!(mse_train, norm(ŷ_train - y_train)^2 / (2 * Ntrain * D))
end
```

Visualize the Mean Score Error metric

```
Plots.plot(
αs, mse_test ./ sum(mse_test),
label="MSE test", xaxis = "α", yaxis="MSE", legend=(0.8, 0.2),
width=3,
)
Plots.plot!(
αs, mse_train ./ sum(mse_train),
label="MSE train",
linestyle=:dash,
width=3,
)
Plots.title!("Normalized MSE on training and testing sets")
```

## Leveraging differentiable optimization: computing the derivative of the solution

Using DiffOpt, we can compute `∂w_i/∂α`

, the derivative of the learned solution `̂w`

w.r.t. the regularization parameter.

```
function compute_dw_dα(model, w)
D = length(w)
dw_dα = zeros(D)
MOI.set(
model,
DiffOpt.ForwardObjectiveFunction(),
dot(w, w) / (2 * D),
)
DiffOpt.forward_differentiate!(model)
for i in 1:D
dw_dα[i] = MOI.get(
model,
DiffOpt.ForwardVariablePrimal(),
w[i],
)
end
return dw_dα
end
```

`compute_dw_dα (generic function with 1 method)`

Using `∂w_i/∂α`

computed with `compute_dw_dα`

, we can compute the derivative of the test loss w.r.t. the parameter α by composing derivatives.

```
function d_testloss_dα(model, X_test, y_test, w, ŵ)
N, D = size(X_test)
dw_dα = compute_dw_dα(model, w)
err_term = X_test * ŵ - y_test
return sum(eachindex(err_term)) do i
dot(X_test[i,:], dw_dα) * err_term[i]
end / (N * D)
end
```

`d_testloss_dα (generic function with 1 method)`

We can define a meta-optimizer function performing gradient descent on the test loss w.r.t. the regularization parameter.

```
function descent(α0, max_iters=100; fixed_step = 0.01, grad_tol=1e-3)
α_s = Float64[]
∂α_s = Float64[]
test_loss = Float64[]
α = α0
N, D = size(X_test)
model = Model(() -> DiffOpt.diff_optimizer(Ipopt.Optimizer))
for iter in 1:max_iters
w = fit_ridge(model, X_train, y_train, α)
ŵ = value.(w)
err_term = X_test * ŵ - y_test
∂α = d_testloss_dα(model, X_test, y_test, w, ŵ)
push!(α_s, α)
push!(∂α_s, ∂α)
push!(test_loss, norm(err_term)^2 / (2 * N * D))
α -= fixed_step * ∂α
if abs(∂α) ≤ grad_tol
break
end
end
return α_s, ∂α_s, test_loss
end
ᾱ, ∂ᾱ, msē = descent(0.10, 500)
iters = 1:length(ᾱ);
```

Visualize gradient descent and convergence

```
Plots.plot(
αs, mse_test,
label="MSE test", xaxis = ("α"),
legend=:topleft,
width=2,
)
Plots.plot!(
ᾱ, msē,
label="learned α", width = 5,
style=:dot,
)
Plots.title!("Regularizer learning")
```

Visualize the convergence of α to its optimal value

```
Plots.plot(
iters, ᾱ, label = nothing, color = :blue,
xaxis = ("Iterations"), legend=:bottom,
title = "Convergence of α"
)
```

Visualize the convergence of the objective function

```
Plots.plot(
iters, msē, label = nothing, color = :red,
xaxis = ("Iterations"), legend=:bottom,
title = "Convergence of MSE"
)
```

Visualize the convergence of the derivative to zero

```
Plots.plot(
iters, ∂ᾱ, label = nothing, color = :green,
xaxis = ("Iterations"), legend=:bottom,
title = "Convergence of ∂α"
)
```

*This page was generated using Literate.jl.*