Geographical clustering

Originally Contributed by: Matthew Helm (with help from Mathieu Tanneau on Julia Discourse)

The goal of this exercise is to cluster $n$ cities into $k$ groups, minimizing the total pairwise distance between cities and ensuring that the variance in the total populations of each group is relatively small.

This tutorial uses the following packages:

using JuMP
import DataFrames
import HiGHS
import LinearAlgebra

For this example, we'll use the 20 most populous cities in the United States.

cities = DataFrames.DataFrame(
    Union{String,Float64}[
        "New York, NY" 8.405 40.7127 -74.0059
        "Los Angeles, CA" 3.884 34.0522 -118.2436
        "Chicago, IL" 2.718 41.8781 -87.6297
        "Houston, TX" 2.195 29.7604 -95.3698
        "Philadelphia, PA" 1.553 39.9525 -75.1652
        "Phoenix, AZ" 1.513 33.4483 -112.0740
        "San Antonio, TX" 1.409 29.4241 -98.4936
        "San Diego, CA" 1.355 32.7157 -117.1610
        "Dallas, TX" 1.257 32.7766 -96.7969
        "San Jose, CA" 0.998 37.3382 -121.8863
        "Austin, TX" 0.885 30.2671 -97.7430
        "Indianapolis, IN" 0.843 39.7684 -86.1580
        "Jacksonville, FL" 0.842 30.3321 -81.6556
        "San Francisco, CA" 0.837 37.7749 -122.4194
        "Columbus, OH" 0.822 39.9611 -82.9987
        "Charlotte, NC" 0.792 35.2270 -80.8431
        "Fort Worth, TX" 0.792 32.7554 -97.3307
        "Detroit, MI" 0.688 42.3314 -83.0457
        "El Paso, TX" 0.674 31.7775 -106.4424
        "Memphis, TN" 0.653 35.1495 -90.0489
    ],
    ["city", "population", "lat", "lon"],
)
20×4 DataFrame
Rowcitypopulationlatlon
Union…Union…Union…Union…
1New York, NY8.40540.7127-74.0059
2Los Angeles, CA3.88434.0522-118.244
3Chicago, IL2.71841.8781-87.6297
4Houston, TX2.19529.7604-95.3698
5Philadelphia, PA1.55339.9525-75.1652
6Phoenix, AZ1.51333.4483-112.074
7San Antonio, TX1.40929.4241-98.4936
8San Diego, CA1.35532.7157-117.161
9Dallas, TX1.25732.7766-96.7969
10San Jose, CA0.99837.3382-121.886
11Austin, TX0.88530.2671-97.743
12Indianapolis, IN0.84339.7684-86.158
13Jacksonville, FL0.84230.3321-81.6556
14San Francisco, CA0.83737.7749-122.419
15Columbus, OH0.82239.9611-82.9987
16Charlotte, NC0.79235.227-80.8431
17Fort Worth, TX0.79232.7554-97.3307
18Detroit, MI0.68842.3314-83.0457
19El Paso, TX0.67431.7775-106.442
20Memphis, TN0.65335.1495-90.0489

Model Specifics

We will cluster these 20 cities into 3 different groups and we will assume that the ideal or target population $P$ for a group is simply the total population of the 20 cities divided by 3:

n = size(cities, 1)
k = 3
P = sum(cities.population) / k
11.038333333333334

Obtaining the distances between each city

Let's compute the pairwise Haversine distance between each of the cities in our data set and store the result in a variable we'll call dm:

"""
    haversine(lat1, long1, lat2, long2, r = 6372.8)

Compute the haversine distance between two points on a sphere of radius `r`,
where the points are given by the latitude/longitude pairs `lat1/long1` and
`lat2/long2` (in degrees).
"""
function haversine(lat1, long1, lat2, long2, r = 6372.8)
    lat1, long1 = deg2rad(lat1), deg2rad(long1)
    lat2, long2 = deg2rad(lat2), deg2rad(long2)
    hav(a, b) = sin((b - a) / 2)^2
    inner_term = hav(lat1, lat2) + cos(lat1) * cos(lat2) * hav(long1, long2)
    d = 2 * r * asin(sqrt(inner_term))
    # Round distance to nearest kilometer.
    return round(Int, d)
end
Main.haversine

Our distance matrix is symmetric so we'll convert it to a LowerTriangular matrix so that we can better interpret the objective value of our model:

dm = LinearAlgebra.LowerTriangular([
    haversine(cities.lat[i], cities.lon[i], cities.lat[j], cities.lon[j])
    for i in 1:n, j in 1:n
])
20×20 LinearAlgebra.LowerTriangular{Int64, Matrix{Int64}}:
    0     ⋅     ⋅     ⋅     ⋅     ⋅     ⋅  …     ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 3937     0     ⋅     ⋅     ⋅     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 1145  2805     0     ⋅     ⋅     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 2282  2207  1516     0     ⋅     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
  130  3845  1068  2157     0     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 3445   574  2337  1633  3345     0     ⋅  …     ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 2546  1934  1695   304  2423  1363     0        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 3908   179  2787  2094  3812   481  1813        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 2206  1993  1295   362  2089  1424   406        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 4103   492  2958  2588  4023   989  2336        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 2432  1972  1577   235  2310  1398   118  …     ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 1036  2907   265  1394   938  2409  1609        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 1345  3450  1391  1321  1221  2883  1626        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 4130   559  2986  2644  4052  1051  2394        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
  767  3177   444  1598   668  2679  1834        0     ⋅     ⋅     ⋅     ⋅  ⋅
  855  3405   946  1490   725  2863  1777  …   560     0     ⋅     ⋅     ⋅  ⋅
 2251  1944  1327   382  2134  1375   387     1511  1543     0     ⋅     ⋅  ⋅
  774  3186   382  1780   711  2716  1994      264   813  1646     0     ⋅  ⋅
 3054  1130  2010  1081  2945   559   804     2292  2398   864  2374     0  ⋅
 1534  2576   777   780  1415  2028  1017      820   837   722  1003  1565  0

Build the model

Now that we have the basics taken care of, we can set up our model, create decision variables, add constraints, and then solve.

First, we'll set up a model that leverages the Cbc solver. Next, we'll set up a binary variable $x_{i,k}$ that takes the value $1$ if city $i$ is in group $k$ and $0$ otherwise. Each city must be in a group, so we'll add the constraint $\sum_k x_{i,k} = 1$ for every $i$.

model = Model(HiGHS.Optimizer)
set_silent(model)
@variable(model, x[1:n, 1:k], Bin)
@constraint(model, [i = 1:n], sum(x[i, :]) == 1);
# To reduce symmetry, we fix the first city to belong to the first group.
fix(x[1, 1], 1; force = true)

The total population of a group $k$ is $Q_k = \sum_ix_{i,k}q_i$ where $q_i$ is simply the $i$th value from the population column in our cities DataFrame. Let's add constraints so that $\alpha \leq (Q_k - P) \leq \beta$. We'll set $\alpha$ equal to $-3$ million and $\beta$ equal to $3$. By adjusting these thresholds you'll find that there is a tradeoff between having relatively even populations between groups and having geographically close cities within each group. In other words, the larger the absolute values of $\alpha$ and $\beta$, the closer together the cities in a group will be but the variance between the group populations will be higher.

@variable(model, -3 <= population_diff[1:k] <= 3)
@constraint(model, population_diff .== x' * cities.population .- P)
3-element Vector{ConstraintRef{Model, MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64}, MathOptInterface.EqualTo{Float64}}, ScalarShape}}:
 -8.405 x[1,1] - 3.884 x[2,1] - 2.718 x[3,1] - 2.195 x[4,1] - 1.553 x[5,1] - 1.513 x[6,1] - 1.409 x[7,1] - 1.355 x[8,1] - 1.257 x[9,1] - 0.998 x[10,1] - 0.885 x[11,1] - 0.843 x[12,1] - 0.842 x[13,1] - 0.837 x[14,1] - 0.822 x[15,1] - 0.792 x[16,1] - 0.792 x[17,1] - 0.688 x[18,1] - 0.674 x[19,1] - 0.653 x[20,1] + population_diff[1] = -11.038333333333334
 -8.405 x[1,2] - 3.884 x[2,2] - 2.718 x[3,2] - 2.195 x[4,2] - 1.553 x[5,2] - 1.513 x[6,2] - 1.409 x[7,2] - 1.355 x[8,2] - 1.257 x[9,2] - 0.998 x[10,2] - 0.885 x[11,2] - 0.843 x[12,2] - 0.842 x[13,2] - 0.837 x[14,2] - 0.822 x[15,2] - 0.792 x[16,2] - 0.792 x[17,2] - 0.688 x[18,2] - 0.674 x[19,2] - 0.653 x[20,2] + population_diff[2] = -11.038333333333334
 -8.405 x[1,3] - 3.884 x[2,3] - 2.718 x[3,3] - 2.195 x[4,3] - 1.553 x[5,3] - 1.513 x[6,3] - 1.409 x[7,3] - 1.355 x[8,3] - 1.257 x[9,3] - 0.998 x[10,3] - 0.885 x[11,3] - 0.843 x[12,3] - 0.842 x[13,3] - 0.837 x[14,3] - 0.822 x[15,3] - 0.792 x[16,3] - 0.792 x[17,3] - 0.688 x[18,3] - 0.674 x[19,3] - 0.653 x[20,3] + population_diff[3] = -11.038333333333334

Now we need to add one last binary variable $z_{i,j}$ to our model that we'll use to compute the total distance between the cities in our groups, defined as $\sum_{i,j}d_{i,j}z_{i,j}$. Variable $z_{i,j}$ will equal $1$ if cities $i$ and $j$ are in the same group, and $0$ if they are not in the same group.

To ensure that $z_{i,j} = 1$ if and only if cities $i$ and $j$ are in the same group, we add the constraints $z_{i,j} \geq x_{i,k} + x_{j,k} - 1$ for every pair $i,j$ and every $k$:

@variable(model, z[i = 1:n, j = 1:i], Bin)
for k in 1:k, i in 1:n, j in 1:i
    @constraint(model, z[i, j] >= x[i, k] + x[j, k] - 1)
end

We can now add an objective to our model which will simply be to minimize the dot product of $z$ and our distance matrix, dm.

@objective(model, Min, sum(dm[i, j] * z[i, j] for i in 1:n, j in 1:i));

We can then call optimize! and review the results.

optimize!(model)

Reviewing the Results

Now that we have results, we can add a column to our cities DataFrame for the group and then loop through our $x$ variable to assign each city to its group. Once we have that, we can look at the total population for each group and also look at the cities in each group to verify that they are grouped by geographic proximity.

cities.group = zeros(n)

for i in 1:n, j in 1:k
    if round(Int, value(x[i, j])) == 1
        cities.group[i] = j
    end
end

for group in DataFrames.groupby(cities, :group)
    @show group
    println("")
    @show sum(group.population)
    println("")
end
group = 7×5 SubDataFrame
 Row │ city              population  lat      lon       group
     │ Union…            Union…      Union…   Union…    Float64
─────┼──────────────────────────────────────────────────────────
   1 │ New York, NY      8.405       40.7127  -74.0059      1.0
   2 │ Philadelphia, PA  1.553       39.9525  -75.1652      1.0
   3 │ Indianapolis, IN  0.843       39.7684  -86.158       1.0
   4 │ Jacksonville, FL  0.842       30.3321  -81.6556      1.0
   5 │ Columbus, OH      0.822       39.9611  -82.9987      1.0
   6 │ Charlotte, NC     0.792       35.227   -80.8431      1.0
   7 │ Detroit, MI       0.688       42.3314  -83.0457      1.0

sum(group.population) = 13.944999999999999

group = 7×5 SubDataFrame
 Row │ city             population  lat      lon       group
     │ Union…           Union…      Union…   Union…    Float64
─────┼─────────────────────────────────────────────────────────
   1 │ Chicago, IL      2.718       41.8781  -87.6297      2.0
   2 │ Houston, TX      2.195       29.7604  -95.3698      2.0
   3 │ San Antonio, TX  1.409       29.4241  -98.4936      2.0
   4 │ Dallas, TX       1.257       32.7766  -96.7969      2.0
   5 │ Austin, TX       0.885       30.2671  -97.743       2.0
   6 │ Fort Worth, TX   0.792       32.7554  -97.3307      2.0
   7 │ Memphis, TN      0.653       35.1495  -90.0489      2.0

sum(group.population) = 9.909

group = 6×5 SubDataFrame
 Row │ city               population  lat      lon       group
     │ Union…             Union…      Union…   Union…    Float64
─────┼───────────────────────────────────────────────────────────
   1 │ Los Angeles, CA    3.884       34.0522  -118.244      3.0
   2 │ Phoenix, AZ        1.513       33.4483  -112.074      3.0
   3 │ San Diego, CA      1.355       32.7157  -117.161      3.0
   4 │ San Jose, CA       0.998       37.3382  -121.886      3.0
   5 │ San Francisco, CA  0.837       37.7749  -122.419      3.0
   6 │ El Paso, TX        0.674       31.7775  -106.442      3.0

sum(group.population) = 9.261000000000001

Tip

This tutorial was generated using Literate.jl. View the source .jl file on GitHub.