Benders Quantile Regression
We will apply Norm-1 regression to the linear regression problem. Linear regression is a statistical tool used to obtain the relation between one dependent variable and a set of explanatory variables. In other words, given a set of $n$ explanatory variables $X = \{ X_1, \dots, X_n \}$, we would like to obtain the best possible estimate for $Y$. To accomplish this task we make the hypothesis that $Y$ is approximately a linear function of $X$:
\[Y = \sum_{j =1}^n \beta_j X_j + \varepsilon\]
where $\varepsilon$ is some random error.
The estimation of the $\beta$ values relies on observations of the variables: $\{y^i, x_1^i, \dots, x_n^i\}_i$.
In this example we will solve a problem where the explanatory variables are sinusoids of different frequencies. First, we define the number of explanatory variables and observations
using ParametricOptInterface, MathOptInterface, JuMP, HiGHS
using TimerOutputs, LinearAlgebra, Random
const POI = ParametricOptInterface
const MOI = MathOptInterface
const OPTIMIZER = HiGHS.Optimizer;
const N_Candidates = 200
const N_Observations = 2000
const N_Nodes = 200
const Observations = 1:N_Observations
const Candidates = 1:N_Candidates
const Nodes = 1:N_Nodes;
Initialize a random number generator to keep results deterministic
rng = Random.MersenneTwister(123);
Build the regressors (explanatory variables) as sinusoids
const X = zeros(N_Candidates, N_Observations)
const time = [obs / N_Observations * 1 for obs in Observations]
for obs in Observations, cand in Candidates
    t = time[obs]
    f = cand
    X[cand, obs] = sin(2 * pi * f * t)
end
Define coefficients
β = zeros(N_Candidates)
for i in Candidates
    if rand(rng) <= (1 - i / N_Candidates)^2 && i <= 100
        β[i] = 4 * rand(rng) / i
    end
end
Create noisy observations
const y = X' * β .+ 0.1 * randn(rng, N_Observations)
Benders Decomposition
Benders decomposition is used to solve large optimization problems with a special structure. LPs can be solved with classical linear optimization methods such as the simplex method or interior point methods provided by solvers like HiGHS. However, these methods do not scale linearly with the problem size. In the Benders decomposition framework we break the problem into two pieces: an outer problem and an inner problem. Of course, some variables will belong to both problems, and this is where the cleverness of Benders kicks in: the outer problem is solved and passes the shared variables to the inner problem. The inner problem is solved with the shared variables FIXED to the values given by the outer problem. The solution of the inner problem can be used to generate a constraint for the outer problem that describes a linear approximation of the cost function of the shared variables. In many cases, like stochastic programming, the inner problems have an interesting structure and can be broken into smaller problems to be solved in parallel.
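Schematically, one iteration of the method looks like the comment-only sketch below (with hypothetical helper names; the concrete implementation for our regression problem is developed in the rest of this section):
# while the gap between the lower and upper bounds is large:
#     solve the outer problem with the cuts collected so far, obtaining a candidate x̂
#     for each inner problem k:
#         solve it with the shared variables fixed at x̂
#         recover z_k(x̂) and a subgradient π_k(x̂)
#         add the cut z_k ≥ z_k(x̂) + π_k(x̂)ᵀ (x - x̂) to the outer problem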
We will describe the decomposition similarly to what is done in Introduction to Linear Optimization, Bertsimas & Tsitsiklis (Chapter 6.5), where the problem in question has the form
\[\begin{align} & \min_{x, y_k} && c^T x && + f_1^T y_1 && + \dots && + f_n^T y_n && \notag \\ & \text{subject to} && Ax && && && && = b \notag \\ & && B_1 x && + D_1 y_1 && && && = d_1 \notag \\ & && \dots && && \dots && && \notag \\ & && B_n x && && && + D_n y_n && = d_n \notag \\ & && x, && y_1, && \dots, && y_n && \geq 0 \notag \\ \end{align}\]
Inner Problem
Given a solution for the $x$ variables we can define the inner problem as
\[\begin{align} z_k(x) \ = \ & \min_{y_k} && f_k^T y_k && \notag \\ & \text{subject to} && D_k y_k && = d_k - B_k x \notag \\ & && y_k && \geq 0 \notag \\ \end{align}\]
The $z_k(x)$ function represents the cost of the subproblem given a solution for $x$. This function is convex because $x$ affects only the right-hand side of the problem (this is a standard result in LP theory).
For the special case of the Norm-1 regression the problem is written as:
\[\begin{align} z_k(\beta) \ = \ & \min_{\varepsilon^{up}, \varepsilon^{dw}} && \sum_{i \in ObsSet(k)} {\varepsilon^{up}}_i + {\varepsilon^{dw}}_i && \notag \\ & \text{subject to} && {\varepsilon^{up}}_i \geq + y_i - \sum_{j \in Candidates} \beta_j x_{i,j} && \forall i \in ObsSet(k) \notag \\ & && {\varepsilon^{dw}}_i \geq - y_i + \sum_{j \in Candidates} \beta_j x_{i,j} && \forall i \in ObsSet(k) \notag \\ & && {\varepsilon^{up}}_i, {\varepsilon^{dw}}_i \geq 0 && \forall i \in ObsSet(k) \notag \\ \end{align}\]
The collection $ObsSet(k)$ is a subset of the N_Observations observations. Any partition of the observations is valid. In this example we will partition them with the function:
function ObsSet(K)
    obs_per_block = div(N_Observations, N_Nodes)
    return (1+(K-1)*obs_per_block):(K*obs_per_block)
end
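For instance, with the constants defined above (N_Observations = 2000 and N_Nodes = 200), each block holds div(2000, 200) = 10 consecutive observations:
ObsSet(1)   # 1:10
ObsSet(200) # 1991:2000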
The inner problem itself can be written with POI as follows:
function inner_model(K)
    # initialize the POI model
    inner = direct_model(POI.Optimizer(OPTIMIZER()))
    # define local optimization variables for the norm-1 error
    @variables(inner, begin
        ɛ_up[ObsSet(K)] >= 0
        ɛ_dw[ObsSet(K)] >= 0
    end)
    # create the regression coefficient representation as parameters
    β = [@variable(inner, set = MOI.Parameter(0.0)) for i in 1:N_Candidates]
    for (i, βi) in enumerate(β)
        set_name(βi, "β[$i]")
    end
    # create local constraints
    # Note that *parameter* algebra is implemented just like variable
    # algebra. We can multiply parameters by constants, add parameters,
    # sum parameters and variables, and so on.
    @constraints(
        inner,
        begin
            ɛ_up_ctr[i in ObsSet(K)],
            ɛ_up[i] >= +sum(X[j, i] * β[j] for j in Candidates) - y[i]
            ɛ_dw_ctr[i in ObsSet(K)],
            ɛ_dw[i] >= -sum(X[j, i] * β[j] for j in Candidates) + y[i]
        end
    )
    # create the local objective function
    @objective(inner, Min, sum(ɛ_up[i] + ɛ_dw[i] for i in ObsSet(K)))
    # return the model and the parameter references
    return (inner, β)
end
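As a quick illustrative check (hypothetical usage, not part of the original script), we can build the inner problem for the first block and confirm that one parameter is created per candidate regressor:
inner_1, β_params = inner_model(1)
length(β_params) # == N_Candidates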
Outer Problem
Now that all pieces of the original problem can be represented by the convex $z_k(x)$ functions, we can recast the problem in the equivalent form:
\[\begin{align} & \min_{x} && c^T x + z_1(x) + \dots + z_n(x) && \notag \\ & \text{subject to} && Ax = b && \notag \\ & && x \geq 0 && \notag \\ \end{align}\]
However, we cannot pass a problem in this form to a linear programming solver (though it could be passed to other kinds of solvers).
Another standard result of optimization theory is that a convex function can be represented by its supporting hyperplanes:
\[\begin{align} z_k(x) \ = \ & \min_{z} && z && \notag \\ & \text{subject to} && z \geq \pi_k(\hat{x}) (x - \hat{x}) + z_k(\hat{x}), \ \forall \hat{x} \in dom(z_k) && \notag \\ \end{align}\]
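For intuition, take the scalar convex function $z(x) = |x|$, the building block of Norm-1 regression. Its supporting hyperplane at a point $\hat{x} \neq 0$ is $z \geq \mathrm{sign}(\hat{x})(x - \hat{x}) + |\hat{x}| = \mathrm{sign}(\hat{x}) \, x$, so the representation above reduces to
\[|x| \ = \ \min_{z} \ z \quad \text{subject to} \quad z \geq x, \ z \geq -x\]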
Then we can re-write (again) the outer problem as
\[\begin{align} & \min_{x, z_k} && c^T x + z_1 + \dots + z_n \notag \\ & \text{subject to} && z_i \geq \pi_i(\hat{x}) (x - \hat{x}) + z_i(\hat{x}), \ \forall \hat{x} \in dom(z_i), i \in \{1, \dots, n\} \notag \\ & && Ax = b \notag \\ & && x \geq 0 \notag \\ \end{align}\]
This is a linear program! However, it has infinitely many constraints!
We can relax the infinitely many constraints and write:
\[\begin{align} & \min_{x, z_k} && c^T x + z_1 + \dots + z_n \notag \\ & \text{subject to} && Ax = b \notag \\ & && x \geq 0 \notag \\ \end{align}\]
But now it is only an underestimation of the original problem. In the case of our problem it can be written as:
\[\begin{align} & \min_{\varepsilon, \beta} && \sum_{i \in Nodes} \varepsilon_i \notag \\ & \text{subject to} && \varepsilon_i \geq 0, \ \forall i \in Nodes \notag \\ \end{align}\]
This model can be written in JuMP:
function outer_model()
    outer = Model(OPTIMIZER)
    @variables(outer, begin
        ɛ[Nodes] >= 0
        β[1:N_Candidates]
    end)
    @objective(outer, Min, sum(ɛ[i] for i in Nodes))
    sol = zeros(N_Candidates)
    return (outer, ɛ, β, sol)
end
The method to solve the outer problem and query its solution is given here:
function outer_solve(outer_model)
    model = outer_model[1]
    β = outer_model[3]
    optimize!(model)
    return (value.(β), objective_value(model))
end
Supporting Hyperplanes
With these building blocks in hand, we can start building the algorithm. So far we know how to:
- Solve the relaxed outer problem
- Obtain the solution for the $\hat{x}$ (or $\beta$ in our case)
Now we can:
- Fix the values of $\hat{x}$ in the inner problems
- Solve the inner problems
- Query the solution of the inner problems to obtain the supporting hyperplane: the value of $z_k(\hat{x})$, which is the objective value of the inner problem, and the derivative $\pi_k(\hat{x}) = \frac{d z_k(x)}{d x} \Big|_{x = \hat{x}}$, which is the dual variable associated with $\hat{x}$ and results from applying the chain rule to the duals of the constraints.
These new steps are executed by the function:
function inner_solve(model, outer_solution)
    β0 = outer_solution[1]
    inner = model[1]
    # the first step is to fix the values given by the outer problem
    @timeit "fix" begin
        β = model[2]
        MOI.set.(inner, POI.ParameterValue(), β, β0)
    end
    # here the inner problem is solved
    @timeit "opt" optimize!(inner)
    # query the dual variables, which are sensitivities:
    # they represent the subgradient (almost a derivative)
    # of the objective function for infinitesimal variations
    # of the constants in the linear constraints.
    # POI: we can query dual values of *parameters*
    π = MOI.get.(inner, POI.ParameterDual(), β)
    obj = objective_value(inner)
    rhs = obj - dot(π, β0)
    return (rhs, π, obj)
end
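Note that, because rhs $= z_k(\hat{\beta}) - \pi_k(\hat{\beta})^T \hat{\beta}$, the values returned by inner_solve encode exactly a supporting hyperplane from the previous section:
\[\varepsilon_k \ \geq \ \pi_k(\hat{\beta})^T \beta + \text{rhs} \ = \ z_k(\hat{\beta}) + \pi_k(\hat{\beta})^T (\beta - \hat{\beta})\]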
Now that we have the cutting planes in hand, we can add them to the outer problem
function outer_add_cut(outer_model, cut_info, node)
    outer = outer_model[1]
    ɛ = outer_model[2]
    β = outer_model[3]
    rhs = cut_info[1]
    π = cut_info[2]
    @constraint(outer, ɛ[node] >= sum(π[j] * β[j] for j in Candidates) + rhs)
end
Algorithm wrap up
The complete algorithm is:
- Solve the relaxed outer problem
- Obtain the solution for $\hat{x}$ (or $\beta$ in our case)
- Fix the values of $\hat{x}$ in the inner problems
- Solve the inner problems
- Query the solution of the inner problems to obtain the supporting hyperplanes
- Add the hyperplanes to the outer problem
- Repeat
Now we gather all the pieces that we built and write the Benders algorithm by calling the above functions in the proper order.
The @timeit macros are used to time each step of the algorithm.
function decomposed_model(; print_timer_outputs::Bool = true)
    reset_timer!() # reset the timer for comparison
    time_init = @elapsed @timeit "Init" begin
        # create the outer problem with no cuts
        @timeit "outer" outer = outer_model()
        # initialize the solution for the regression coefficients at zero
        @timeit "Sol" solution = (zeros(N_Candidates), Inf)
        best_sol = deepcopy(solution)
        # create the inner problems, one per node
        @timeit "inners" inners = [inner_model(i) for i in Nodes]
        # solve the initial version of the inner problems and create
        # the first set of cuts
        @timeit "Cuts" cuts = [inner_solve(inners[i], solution) for i in Nodes]
    end
    UB = +Inf
    LB = -Inf
    # iterative step
    time_loop = @elapsed @timeit "Loop" for k in 1:80
        # add the cuts generated from each inner problem to the outer problem
        @timeit "add cuts" for i in Nodes
            outer_add_cut(outer, cuts[i], i)
        end
        # solve the outer problem with the new set of cuts and
        # obtain a new candidate solution for the regression coefficients
        @timeit "solve outer" solution = outer_solve(outer)
        # pass the new candidate solution to each of the inner problems,
        # solve them and obtain new cutting planes
        @timeit "solve nodes" for i in Nodes
            cuts[i] = inner_solve(inners[i], solution)
        end
        LB = solution[2]
        new_UB = sum(cuts[i][3] for i in Nodes)
        if new_UB <= UB
            best_sol = deepcopy(solution)
        end
        UB = min(UB, new_UB)
        if abs(UB - LB) / (abs(UB) + abs(LB)) < 0.05
            break
        end
    end
    print_timer_outputs && print_timer()
    return best_sol[1]
end
Run Benders decomposition with POI
β2 = decomposed_model(; print_timer_outputs = false);
GC.gc()
β2 = decomposed_model();
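As a final illustrative check (an assumed addition, not part of the original script), the recovered coefficients can be compared against the true ones used to generate the data:
# largest absolute deviation between the estimated and the true coefficients
maximum(abs.(β2 .- β))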