Geographical clustering

This tutorial was generated using Literate.jl. Download the source as a .jl file.

This tutorial was originally contributed by Matthew Helm and Mathieu Tanneau.

The goal of this exercise is to cluster $n$ cities into $k$ groups, minimizing the total pairwise distance between cities and ensuring that the variance in the total populations of each group is relatively small.

This tutorial uses the following packages:

using JuMP
import DataFrames
import HiGHS
import LinearAlgebra

For this example, we'll use the 20 most populous cities in the United States.

cities = DataFrames.DataFrame(
    Union{String,Float64}[
        "New York, NY" 8.405 40.7127 -74.0059
        "Los Angeles, CA" 3.884 34.0522 -118.2436
        "Chicago, IL" 2.718 41.8781 -87.6297
        "Houston, TX" 2.195 29.7604 -95.3698
        "Philadelphia, PA" 1.553 39.9525 -75.1652
        "Phoenix, AZ" 1.513 33.4483 -112.0740
        "San Antonio, TX" 1.409 29.4241 -98.4936
        "San Diego, CA" 1.355 32.7157 -117.1610
        "Dallas, TX" 1.257 32.7766 -96.7969
        "San Jose, CA" 0.998 37.3382 -121.8863
        "Austin, TX" 0.885 30.2671 -97.7430
        "Indianapolis, IN" 0.843 39.7684 -86.1580
        "Jacksonville, FL" 0.842 30.3321 -81.6556
        "San Francisco, CA" 0.837 37.7749 -122.4194
        "Columbus, OH" 0.822 39.9611 -82.9987
        "Charlotte, NC" 0.792 35.2270 -80.8431
        "Fort Worth, TX" 0.792 32.7554 -97.3307
        "Detroit, MI" 0.688 42.3314 -83.0457
        "El Paso, TX" 0.674 31.7775 -106.4424
        "Memphis, TN" 0.653 35.1495 -90.0489
    ],
    ["city", "population", "lat", "lon"],
)

20×4 DataFrame

Row	city	population	lat	lon
	Union…	Union…	Union…	Union…
1	New York, NY	8.405	40.7127	-74.0059
2	Los Angeles, CA	3.884	34.0522	-118.244
3	Chicago, IL	2.718	41.8781	-87.6297
4	Houston, TX	2.195	29.7604	-95.3698
5	Philadelphia, PA	1.553	39.9525	-75.1652
6	Phoenix, AZ	1.513	33.4483	-112.074
7	San Antonio, TX	1.409	29.4241	-98.4936
8	San Diego, CA	1.355	32.7157	-117.161
9	Dallas, TX	1.257	32.7766	-96.7969
10	San Jose, CA	0.998	37.3382	-121.886
11	Austin, TX	0.885	30.2671	-97.743
12	Indianapolis, IN	0.843	39.7684	-86.158
13	Jacksonville, FL	0.842	30.3321	-81.6556
14	San Francisco, CA	0.837	37.7749	-122.419
15	Columbus, OH	0.822	39.9611	-82.9987
16	Charlotte, NC	0.792	35.227	-80.8431
17	Fort Worth, TX	0.792	32.7554	-97.3307
18	Detroit, MI	0.688	42.3314	-83.0457
19	El Paso, TX	0.674	31.7775	-106.442
20	Memphis, TN	0.653	35.1495	-90.0489

Model Specifics

We will cluster these 20 cities into 3 different groups and we will assume that the ideal or target population $P$ for a group is simply the total population of the 20 cities divided by 3:

n = size(cities, 1)
k = 3
P = sum(cities.population) / k

11.038333333333334

Obtaining the distances between each city

Let's compute the pairwise Haversine distance between each of the cities in our data set and store the result in a variable we'll call dm:

"""
    haversine(lat1, long1, lat2, long2, r = 6372.8)

Compute the haversine distance between two points on a sphere of radius `r`,
where the points are given by the latitude/longitude pairs `lat1/long1` and
`lat2/long2` (in degrees).
"""
function haversine(lat1, long1, lat2, long2, r = 6372.8)
    lat1, long1 = deg2rad(lat1), deg2rad(long1)
    lat2, long2 = deg2rad(lat2), deg2rad(long2)
    hav(a, b) = sin((b - a) / 2)^2
    inner_term = hav(lat1, lat2) + cos(lat1) * cos(lat2) * hav(long1, long2)
    d = 2 * r * asin(sqrt(inner_term))
    # Round distance to nearest kilometer.
    return round(Int, d)
end

Main.haversine

Our distance matrix is symmetric so we'll convert it to a LowerTriangular matrix so that we can better interpret the objective value of our model:

dm = LinearAlgebra.LowerTriangular([
    haversine(cities.lat[i], cities.lon[i], cities.lat[j], cities.lon[j])
    for i in 1:n, j in 1:n
])

20×20 LinearAlgebra.LowerTriangular{Int64, Matrix{Int64}}:
    0     ⋅     ⋅     ⋅     ⋅     ⋅     ⋅  …     ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 3937     0     ⋅     ⋅     ⋅     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 1145  2805     0     ⋅     ⋅     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 2282  2207  1516     0     ⋅     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
  130  3845  1068  2157     0     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 3445   574  2337  1633  3345     0     ⋅  …     ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 2546  1934  1695   304  2423  1363     0        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 3908   179  2787  2094  3812   481  1813        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 2206  1993  1295   362  2089  1424   406        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 4103   492  2958  2588  4023   989  2336        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 2432  1972  1577   235  2310  1398   118  …     ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 1036  2907   265  1394   938  2409  1609        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 1345  3450  1391  1321  1221  2883  1626        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
 4130   559  2986  2644  4052  1051  2394        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
  767  3177   444  1598   668  2679  1834        0     ⋅     ⋅     ⋅     ⋅  ⋅
  855  3405   946  1490   725  2863  1777  …   560     0     ⋅     ⋅     ⋅  ⋅
 2251  1944  1327   382  2134  1375   387     1511  1543     0     ⋅     ⋅  ⋅
  774  3186   382  1780   711  2716  1994      264   813  1646     0     ⋅  ⋅
 3054  1130  2010  1081  2945   559   804     2292  2398   864  2374     0  ⋅
 1534  2576   777   780  1415  2028  1017      820   837   722  1003  1565  0

Build the model

Now that we have the basics taken care of, we can set up our model, create decision variables, add constraints, and then solve.

First, we'll set up a model that leverages the Cbc solver. Next, we'll set up a binary variable $x_{i,k}$ that takes the value $1$ if city $i$ is in group $k$ and $0$ otherwise. Each city must be in a group, so we'll add the constraint $\sum_k x_{i,k} = 1$ for every $i$.

model = Model(HiGHS.Optimizer)
set_silent(model)
@variable(model, x[1:n, 1:k], Bin)
@constraint(model, [i = 1:n], sum(x[i, :]) == 1);
# To reduce symmetry, we fix the first city to belong to the first group.
fix(x[1, 1], 1; force = true)

The total population of a group $k$ is $Q_k = \sum_ix_{i,k}q_i$ where $q_i$ is simply the $i$-th value from the population column in our cities DataFrame. Let's add constraints so that $\alpha \leq (Q_k - P) \leq \beta$. We'll set $\alpha$ equal to $-3$ million and $\beta$ equal to $3$. By adjusting these thresholds you'll find that there is a tradeoff between having relatively even populations between groups and having geographically close cities within each group. In other words, the larger the absolute values of $\alpha$ and $\beta$, the closer together the cities in a group will be but the variance between the group populations will be higher.

@variable(model, -3 <= population_diff[1:k] <= 3)
@constraint(model, population_diff .== x' * cities.population .- P)

3-element Vector{ConstraintRef{Model, MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64}, MathOptInterface.EqualTo{Float64}}, ScalarShape}}:
 -8.405 x[1,1] - 3.884 x[2,1] - 2.718 x[3,1] - 2.195 x[4,1] - 1.553 x[5,1] - 1.513 x[6,1] - 1.409 x[7,1] - 1.355 x[8,1] - 1.257 x[9,1] - 0.998 x[10,1] - 0.885 x[11,1] - 0.843 x[12,1] - 0.842 x[13,1] - 0.837 x[14,1] - 0.822 x[15,1] - 0.792 x[16,1] - 0.792 x[17,1] - 0.688 x[18,1] - 0.674 x[19,1] - 0.653 x[20,1] + population_diff[1] = -11.038333333333334
 -8.405 x[1,2] - 3.884 x[2,2] - 2.718 x[3,2] - 2.195 x[4,2] - 1.553 x[5,2] - 1.513 x[6,2] - 1.409 x[7,2] - 1.355 x[8,2] - 1.257 x[9,2] - 0.998 x[10,2] - 0.885 x[11,2] - 0.843 x[12,2] - 0.842 x[13,2] - 0.837 x[14,2] - 0.822 x[15,2] - 0.792 x[16,2] - 0.792 x[17,2] - 0.688 x[18,2] - 0.674 x[19,2] - 0.653 x[20,2] + population_diff[2] = -11.038333333333334
 -8.405 x[1,3] - 3.884 x[2,3] - 2.718 x[3,3] - 2.195 x[4,3] - 1.553 x[5,3] - 1.513 x[6,3] - 1.409 x[7,3] - 1.355 x[8,3] - 1.257 x[9,3] - 0.998 x[10,3] - 0.885 x[11,3] - 0.843 x[12,3] - 0.842 x[13,3] - 0.837 x[14,3] - 0.822 x[15,3] - 0.792 x[16,3] - 0.792 x[17,3] - 0.688 x[18,3] - 0.674 x[19,3] - 0.653 x[20,3] + population_diff[3] = -11.038333333333334

Now we need to add one last binary variable $z_{i,j}$ to our model that we'll use to compute the total distance between the cities in our groups, defined as $\sum_{i,j}d_{i,j}z_{i,j}$. Variable $z_{i,j}$ will equal $1$ if cities $i$ and $j$ are in the same group, and $0$ if they are not in the same group.

To ensure that $z_{i,j} = 1$ if and only if cities $i$ and $j$ are in the same group, we add the constraints $z_{i,j} \geq x_{i,k} + x_{j,k} - 1$ for every pair $i,j$ and every $k$:

@variable(model, z[i = 1:n, j = 1:i], Bin)
for k in 1:k, i in 1:n, j in 1:i
    @constraint(model, z[i, j] >= x[i, k] + x[j, k] - 1)
end

We can now add an objective to our model which will simply be to minimize the dot product of $z$ and our distance matrix, dm.

@objective(model, Min, sum(dm[i, j] * z[i, j] for i in 1:n, j in 1:i));

We can then call optimize! and review the results.

optimize!(model)
assert_is_solved_and_feasible(model)

Reviewing the Results

Now that we have results, we can add a column to our cities DataFrame for the group and then loop through our $x$ variable to assign each city to its group. Once we have that, we can look at the total population for each group and also look at the cities in each group to verify that they are grouped by geographic proximity.

cities.group = zeros(n)

for i in 1:n, j in 1:k
    if round(Int, value(x[i, j])) == 1
        cities.group[i] = j
    end
end

for group in DataFrames.groupby(cities, :group)
    @show group
    println("")
    @show sum(group.population)
    println("")
end

group = 7×5 SubDataFrame
 Row │ city              population  lat      lon       group
     │ Union…            Union…      Union…   Union…    Float64
─────┼──────────────────────────────────────────────────────────
   1 │ New York, NY      8.405       40.7127  -74.0059      1.0
   2 │ Philadelphia, PA  1.553       39.9525  -75.1652      1.0
   3 │ Indianapolis, IN  0.843       39.7684  -86.158       1.0
   4 │ Jacksonville, FL  0.842       30.3321  -81.6556      1.0
   5 │ Columbus, OH      0.822       39.9611  -82.9987      1.0
   6 │ Charlotte, NC     0.792       35.227   -80.8431      1.0
   7 │ Detroit, MI       0.688       42.3314  -83.0457      1.0

sum(group.population) = 13.944999999999999

group = 7×5 SubDataFrame
 Row │ city             population  lat      lon       group
     │ Union…           Union…      Union…   Union…    Float64
─────┼─────────────────────────────────────────────────────────
   1 │ Chicago, IL      2.718       41.8781  -87.6297      2.0
   2 │ Houston, TX      2.195       29.7604  -95.3698      2.0
   3 │ San Antonio, TX  1.409       29.4241  -98.4936      2.0
   4 │ Dallas, TX       1.257       32.7766  -96.7969      2.0
   5 │ Austin, TX       0.885       30.2671  -97.743       2.0
   6 │ Fort Worth, TX   0.792       32.7554  -97.3307      2.0
   7 │ Memphis, TN      0.653       35.1495  -90.0489      2.0

sum(group.population) = 9.909

group = 6×5 SubDataFrame
 Row │ city               population  lat      lon       group
     │ Union…             Union…      Union…   Union…    Float64
─────┼───────────────────────────────────────────────────────────
   1 │ Los Angeles, CA    3.884       34.0522  -118.244      3.0
   2 │ Phoenix, AZ        1.513       33.4483  -112.074      3.0
   3 │ San Diego, CA      1.355       32.7157  -117.161      3.0
   4 │ San Jose, CA       0.998       37.3382  -121.886      3.0
   5 │ San Francisco, CA  0.837       37.7749  -122.419      3.0
   6 │ El Paso, TX        0.674       31.7775  -106.442      3.0

sum(group.population) = 9.261000000000001