# Logistic regression

using DataFrames
using Plots
using RDatasets
using Convex
using SCS

This is an example logistic regression using RDatasets's iris data. Our goal is to predict whether the iris species is versicolor using the sepal length and width and petal length and width.

iris = dataset("datasets", "iris");
iris[1:10, :]
10×5 DataFrame
RowSepalLengthSepalWidthPetalLengthPetalWidthSpecies
Float64Float64Float64Float64Cat…
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
85.03.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa

We'll define Y as the outcome variable: +1 for versicolor, -1 otherwise.

Y = [species == "versicolor" ? 1.0 : -1.0 for species in iris.Species]
150-element Vector{Float64}:
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
⋮
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0

We'll create our data matrix with one column for each feature (first column corresponds to offset).

X = hcat(
ones(size(iris, 1)),
iris.SepalLength,
iris.SepalWidth,
iris.PetalLength,
iris.PetalWidth,
);

Now to solve the logistic regression problem.

n, p = size(X)
beta = Variable(p)
problem = minimize(logisticloss(-Y .* (X * beta)))
solve!(problem, SCS.Optimizer; silent = true)
Problem statistics
problem is DCP         : true
number of variables    : 1 (5 scalar elements)
number of constraints  : 0 (0 scalar elements)
number of coefficients : 1_050
number of atoms        : 6

Solution summary
termination status : OPTIMAL
primal status      : FEASIBLE_POINT
dual status        : FEASIBLE_POINT
objective value    : 72.535

Expression graph
minimize
└─ sum (convex; real)
└─ logsumexp (convex; real)
└─ hcat (affine; real)
├─ …
└─ …


Let's see how well the model fits.

using Plots
logistic(x::Real) = inv(exp(-x) + one(x))
perm = sortperm(vec(X * evaluate(beta)))
plot(1:n, (Y[perm] .+ 1) / 2, st = :scatter)
plot!(1:n, logistic.(X * evaluate(beta))[perm])