# Geographical clustering

Originally Contributed by: Matthew Helm (with help from Mathieu Tanneau on Julia Discourse)

The goal of this exercise is to cluster $n$ cities into $k$ groups, minimizing the total pairwise distance between cities and ensuring that the variance in the total populations of each group is relatively small.

This tutorial uses the following packages:

using JuMP
import DataFrames
import GLPK
import LinearAlgebra

For this example, we'll use the 20 most populous cities in the United States.

cities = DataFrames.DataFrame(
city = [
"New York, NY",
"Los Angeles, CA",
"Chicago, IL",
"Houston, TX",
"Phoenix, AZ",
"San Antonio, TX",
"San Diego, CA",
"Dallas, TX",
"San Jose, CA",
"Austin, TX",
"Indianapolis, IN",
"Jacksonville, FL",
"San Francisco, CA",
"Columbus, OH",
"Charlotte, NC",
"Fort Worth, TX",
"Detroit, MI",
"El Paso, TX",
"Memphis, TN",
],
population = [
8.405,
3.884,
2.718,
2.195,
1.553,
1.513,
1.409,
1.355,
1.257,
0.998,
0.885,
0.843,
0.842,
0.837,
0.822,
0.792,
0.792,
0.688,
0.674,
0.653,
],
lat = [
40.7127,
34.0522,
41.8781,
29.7604,
39.9525,
33.4483,
29.4241,
32.7157,
32.7766,
37.3382,
30.2671,
39.7684,
30.3321,
37.7749,
39.9611,
35.2270,
32.7554,
42.3314,
31.7775,
35.1495,
],
lon = [
-74.0059,
-118.2436,
-87.6297,
-95.3698,
-75.1652,
-112.0740,
-98.4936,
-117.1610,
-96.7969,
-121.8863,
-97.7430,
-86.1580,
-81.6556,
-122.4194,
-82.9987,
-80.8431,
-97.3307,
-83.0457,
-106.4424,
-90.0489,
],
)

20 rows × 4 columns

citypopulationlatlon
StringFloat64Float64Float64
1New York, NY8.40540.7127-74.0059
2Los Angeles, CA3.88434.0522-118.244
3Chicago, IL2.71841.8781-87.6297
4Houston, TX2.19529.7604-95.3698
6Phoenix, AZ1.51333.4483-112.074
7San Antonio, TX1.40929.4241-98.4936
8San Diego, CA1.35532.7157-117.161
9Dallas, TX1.25732.7766-96.7969
10San Jose, CA0.99837.3382-121.886
11Austin, TX0.88530.2671-97.743
12Indianapolis, IN0.84339.7684-86.158
13Jacksonville, FL0.84230.3321-81.6556
14San Francisco, CA0.83737.7749-122.419
15Columbus, OH0.82239.9611-82.9987
16Charlotte, NC0.79235.227-80.8431
17Fort Worth, TX0.79232.7554-97.3307
18Detroit, MI0.68842.3314-83.0457
19El Paso, TX0.67431.7775-106.442
20Memphis, TN0.65335.1495-90.0489

### Model Specifics

We will cluster these 20 cities into 3 different groups and we will assume that the ideal or target population $P$ for a group is simply the total population of the 20 cities divided by 3:

n = size(cities, 1)
k = 3
P = sum(cities.population) / k
11.038333333333334

### Obtaining the distances between each city

Let's compute the pairwise Haversine distance between each of the cities in our data set and store the result in a variable we'll call dm:

"""
haversine(lat1, long1, lat2, long2, r = 6372.8)

Compute the haversine distance between two points on a sphere of radius r,
where the points are given by the latitude/longitude pairs lat1/long1 and
lat2/long2 (in degrees).
"""
function haversine(lat1, long1, lat2, long2, r = 6372.8)
hav(a, b) = sin((b - a) / 2)^2
inner_term = hav(lat1, lat2) + cos(lat1) * cos(lat2) * hav(long1, long2)
d = 2 * r * asin(sqrt(inner_term))
# Round distance to nearest kilometer.
return round(Int, d)
end
Main.haversine

Our distance matrix is symmetric so we'll convert it to a LowerTriangular matrix so that we can better interpret the objective value of our model:

dm = LinearAlgebra.LowerTriangular([
haversine(cities.lat[i], cities.lon[i], cities.lat[j], cities.lon[j])
for i in 1:n, j in 1:n
])
20×20 LinearAlgebra.LowerTriangular{Int64, Matrix{Int64}}:
0     ⋅     ⋅     ⋅     ⋅     ⋅     ⋅  …     ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
3937     0     ⋅     ⋅     ⋅     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
1145  2805     0     ⋅     ⋅     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
2282  2207  1516     0     ⋅     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
130  3845  1068  2157     0     ⋅     ⋅        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
3445   574  2337  1633  3345     0     ⋅  …     ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
2546  1934  1695   304  2423  1363     0        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
3908   179  2787  2094  3812   481  1813        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
2206  1993  1295   362  2089  1424   406        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
4103   492  2958  2588  4023   989  2336        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
2432  1972  1577   235  2310  1398   118  …     ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
1036  2907   265  1394   938  2409  1609        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
1345  3450  1391  1321  1221  2883  1626        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
4130   559  2986  2644  4052  1051  2394        ⋅     ⋅     ⋅     ⋅     ⋅  ⋅
767  3177   444  1598   668  2679  1834        0     ⋅     ⋅     ⋅     ⋅  ⋅
855  3405   946  1490   725  2863  1777  …   560     0     ⋅     ⋅     ⋅  ⋅
2251  1944  1327   382  2134  1375   387     1511  1543     0     ⋅     ⋅  ⋅
774  3186   382  1780   711  2716  1994      264   813  1646     0     ⋅  ⋅
3054  1130  2010  1081  2945   559   804     2292  2398   864  2374     0  ⋅
1534  2576   777   780  1415  2028  1017      820   837   722  1003  1565  0

### Build the model

Now that we have the basics taken care of, we can set up our model, create decision variables, add constraints, and then solve.

First, we'll set up a model that leverages the Cbc solver. Next, we'll set up a binary variable $x_{i,k}$ that takes the value $1$ if city $i$ is in group $k$ and $0$ otherwise. Each city must be in a group, so we'll add the constraint $\sum_kx_{i,k} = 1$ for every $i$.

model = Model(GLPK.Optimizer)
set_silent(model)
@variable(model, x[1:n, 1:k], Bin)
20×3 Matrix{VariableRef}:
x[1,1]   x[1,2]   x[1,3]
x[2,1]   x[2,2]   x[2,3]
x[3,1]   x[3,2]   x[3,3]
x[4,1]   x[4,2]   x[4,3]
x[5,1]   x[5,2]   x[5,3]
x[6,1]   x[6,2]   x[6,3]
x[7,1]   x[7,2]   x[7,3]
x[8,1]   x[8,2]   x[8,3]
x[9,1]   x[9,2]   x[9,3]
x[10,1]  x[10,2]  x[10,3]
x[11,1]  x[11,2]  x[11,3]
x[12,1]  x[12,2]  x[12,3]
x[13,1]  x[13,2]  x[13,3]
x[14,1]  x[14,2]  x[14,3]
x[15,1]  x[15,2]  x[15,3]
x[16,1]  x[16,2]  x[16,3]
x[17,1]  x[17,2]  x[17,3]
x[18,1]  x[18,2]  x[18,3]
x[19,1]  x[19,2]  x[19,3]
x[20,1]  x[20,2]  x[20,3]
@constraint(model, [i = 1:n], sum(x[i, :]) == 1)
20-element Vector{ConstraintRef{Model, MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64}, MathOptInterface.EqualTo{Float64}}, ScalarShape}}:
x[1,1] + x[1,2] + x[1,3] = 1.0
x[2,1] + x[2,2] + x[2,3] = 1.0
x[3,1] + x[3,2] + x[3,3] = 1.0
x[4,1] + x[4,2] + x[4,3] = 1.0
x[5,1] + x[5,2] + x[5,3] = 1.0
x[6,1] + x[6,2] + x[6,3] = 1.0
x[7,1] + x[7,2] + x[7,3] = 1.0
x[8,1] + x[8,2] + x[8,3] = 1.0
x[9,1] + x[9,2] + x[9,3] = 1.0
x[10,1] + x[10,2] + x[10,3] = 1.0
x[11,1] + x[11,2] + x[11,3] = 1.0
x[12,1] + x[12,2] + x[12,3] = 1.0
x[13,1] + x[13,2] + x[13,3] = 1.0
x[14,1] + x[14,2] + x[14,3] = 1.0
x[15,1] + x[15,2] + x[15,3] = 1.0
x[16,1] + x[16,2] + x[16,3] = 1.0
x[17,1] + x[17,2] + x[17,3] = 1.0
x[18,1] + x[18,2] + x[18,3] = 1.0
x[19,1] + x[19,2] + x[19,3] = 1.0
x[20,1] + x[20,2] + x[20,3] = 1.0

To reduce symmetry, we fix the first city to belong to the first group.

fix(x[1, 1], 1; force = true)

The total population of a group $k$ is $Q_k = \sum_ix_{i,k}q_i$ where $q_i$ is simply the $i$th value from the population column in our cities DataFrame. Let's add constraints so that $\alpha \leq (Q_k - P) \leq \beta$. We'll set $\alpha$ equal to $-3$ million and $\beta$ equal to $3$. By adjusting these thresholds you'll find that there is a tradeoff between having relatively even populations between groups and having geographically close cities within each group. In other words, the larger the absolute values of $\alpha$ and $\beta$, the closer together the cities in a group will be but the variance between the group populations will be higher.

@variable(model, -3 <= population_diff[1:k] <= 3)
@constraint(model, population_diff .== x' * cities.population .- P)
3-element Vector{ConstraintRef{Model, MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64}, MathOptInterface.EqualTo{Float64}}, ScalarShape}}:
-8.405 x[1,1] - 3.884 x[2,1] - 2.718 x[3,1] - 2.195 x[4,1] - 1.553 x[5,1] - 1.513 x[6,1] - 1.409 x[7,1] - 1.355 x[8,1] - 1.257 x[9,1] - 0.998 x[10,1] - 0.885 x[11,1] - 0.843 x[12,1] - 0.842 x[13,1] - 0.837 x[14,1] - 0.822 x[15,1] - 0.792 x[16,1] - 0.792 x[17,1] - 0.688 x[18,1] - 0.674 x[19,1] - 0.653 x[20,1] + population_diff = -11.038333333333334
-8.405 x[1,2] - 3.884 x[2,2] - 2.718 x[3,2] - 2.195 x[4,2] - 1.553 x[5,2] - 1.513 x[6,2] - 1.409 x[7,2] - 1.355 x[8,2] - 1.257 x[9,2] - 0.998 x[10,2] - 0.885 x[11,2] - 0.843 x[12,2] - 0.842 x[13,2] - 0.837 x[14,2] - 0.822 x[15,2] - 0.792 x[16,2] - 0.792 x[17,2] - 0.688 x[18,2] - 0.674 x[19,2] - 0.653 x[20,2] + population_diff = -11.038333333333334
-8.405 x[1,3] - 3.884 x[2,3] - 2.718 x[3,3] - 2.195 x[4,3] - 1.553 x[5,3] - 1.513 x[6,3] - 1.409 x[7,3] - 1.355 x[8,3] - 1.257 x[9,3] - 0.998 x[10,3] - 0.885 x[11,3] - 0.843 x[12,3] - 0.842 x[13,3] - 0.837 x[14,3] - 0.822 x[15,3] - 0.792 x[16,3] - 0.792 x[17,3] - 0.688 x[18,3] - 0.674 x[19,3] - 0.653 x[20,3] + population_diff = -11.038333333333334

Now we need to add one last binary variable $z_{i,j}$ to our model that we'll use to compute the total distance between the cities in our groups, defined as $\sum_{i,j}d_{i,j}z_{i,j}$. Variable $z_{i,j}$ will equal $1$ if cities $i$ and $j$ are in the same group, and $0$ if they are not in the same group.

To ensure that $z_{i,j} = 1$ if and only if cities $i$ and $j$ are in the same group, we add the constraints $z_{i,j} \geq x_{i,k} + x_{j,k} - 1$ for every pair $i,j$ and every $k$:

@variable(model, z[i = 1:n, j = 1:i], Bin)
JuMP.Containers.SparseAxisArray{VariableRef, 2, Tuple{Int64, Int64}} with 210 entries:
[12, 10]  =  z[12,10]
[12, 2 ]  =  z[12,2]
[12, 3 ]  =  z[12,3]
[16, 12]  =  z[16,12]
[16, 14]  =  z[16,14]
[16, 16]  =  z[16,16]
[17, 12]  =  z[17,12]
[18, 14]  =  z[18,14]
⋮
[18, 16]  =  z[18,16]
[18, 18]  =  z[18,18]
[19, 12]  =  z[19,12]
[19, 14]  =  z[19,14]
[19, 16]  =  z[19,16]
[19, 19]  =  z[19,19]
[20, 15]  =  z[20,15]
for k in 1:k, i in 1:n, j in 1:i
@constraint(model, z[i, j] >= x[i, k] + x[j, k] - 1)
end

We can now add an objective to our model which will simply be to minimize the dot product of $z$ and our distance matrix, dm.

@objective(model, Min, sum(dm[i, j] * z[i, j] for i in 1:n, j in 1:i))
$$$3937 z_{2,1} + 1145 z_{3,1} + 2805 z_{3,2} + 2282 z_{4,1} + 2207 z_{4,2} + 1516 z_{4,3} + 130 z_{5,1} + 3845 z_{5,2} + 1068 z_{5,3} + 2157 z_{5,4} + 3445 z_{6,1} + 574 z_{6,2} + 2337 z_{6,3} + 1633 z_{6,4} + 3345 z_{6,5} + 2546 z_{7,1} + 1934 z_{7,2} + 1695 z_{7,3} + 304 z_{7,4} + 2423 z_{7,5} + 1363 z_{7,6} + 3908 z_{8,1} + 179 z_{8,2} + 2787 z_{8,3} + 2094 z_{8,4} + 3812 z_{8,5} + 481 z_{8,6} + 1813 z_{8,7} + 2206 z_{9,1} + 1993 z_{9,2} + 1295 z_{9,3} + 362 z_{9,4} + 2089 z_{9,5} + 1424 z_{9,6} + 406 z_{9,7} + 1902 z_{9,8} + 4103 z_{10,1} + 492 z_{10,2} + 2958 z_{10,3} + 2588 z_{10,4} + 4023 z_{10,5} + 989 z_{10,6} + 2336 z_{10,7} + 670 z_{10,8} + 2333 z_{10,9} + 2432 z_{11,1} + 1972 z_{11,2} + 1577 z_{11,3} + 235 z_{11,4} + 2310 z_{11,5} + 1398 z_{11,6} + 118 z_{11,7} + 1859 z_{11,8} + 293 z_{11,9} + 2358 z_{11,10} + 1036 z_{12,1} + 2907 z_{12,2} + 265 z_{12,3} + 1394 z_{12,4} + 938 z_{12,5} + 2409 z_{12,6} + 1609 z_{12,7} + 2874 z_{12,8} + 1229 z_{12,9} + 3099 z_{12,10} + 1491 z_{12,11} + 1345 z_{13,1} + 3450 z_{13,2} + 1391 z_{13,3} + 1321 z_{13,4} + 1221 z_{13,5} + 2883 z_{13,6} + 1626 z_{13,7} + 3361 z_{13,8} + 1459 z_{13,9} + 3768 z_{13,10} + 1544 z_{13,11} + 1126 z_{13,12} + 4130 z_{14,1} + 559 z_{14,2} + 2986 z_{14,3} + 2644 z_{14,4} + 4052 z_{14,5} + 1051 z_{14,6} + 2394 z_{14,7} + 738 z_{14,8} + 2384 z_{14,9} + 68 z_{14,10} + 2414 z_{14,11} + 3131 z_{14,12} + 3815 z_{14,13} + 767 z_{15,1} + 3177 z_{15,2} + 444 z_{15,3} + 1598 z_{15,4} + 668 z_{15,5} + 2679 z_{15,6} + 1834 z_{15,7} + 3144 z_{15,8} + 1469 z_{15,9} + 3364 z_{15,10} + 1717 z_{15,11} + 271 z_{15,12} + 1078 z_{15,13} + 3395 z_{15,14} + 855 z_{16,1} + 3405 z_{16,2} + 946 z_{16,3} + 1490 z_{16,4} + 725 z_{16,5} + 2863 z_{16,6} + 1777 z_{16,7} + 3343 z_{16,8} + 1494 z_{16,9} + 3659 z_{16,10} + 1672 z_{16,11} + 689 z_{16,12} + 550 z_{16,13} + 3698 z_{16,14} + 560 z_{16,15} + 2251 z_{17,1} + 1944 z_{17,2} + 1327 z_{17,3} + 382 z_{17,4} + 2134 z_{17,5} + 1375 z_{17,6} + 387 z_{17,7} + 1853 z_{17,8} + 50 z_{17,9} + 2286 z_{17,10} + 280 z_{17,11} + 1268 z_{17,12} + 1509 z_{17,13} + 2338 z_{17,14} + 1511 z_{17,15} + 1543 z_{17,16} + 774 z_{18,1} + 3186 z_{18,2} + 382 z_{18,3} + 1780 z_{18,4} + 711 z_{18,5} + 2716 z_{18,6} + 1994 z_{18,7} + 3169 z_{18,8} + 1609 z_{18,9} + 3334 z_{18,10} + 1875 z_{18,11} + 386 z_{18,12} + 1340 z_{18,13} + 3360 z_{18,14} + 264 z_{18,15} + 813 z_{18,16} + 1646 z_{18,17} + 3054 z_{19,1} + 1130 z_{19,2} + 2010 z_{19,3} + 1081 z_{19,4} + 2945 z_{19,5} + 559 z_{19,6} + 804 z_{19,7} + 1013 z_{19,8} + 914 z_{19,9} + 1541 z_{19,10} + 846 z_{19,11} + 2028 z_{19,12} + 2362 z_{19,13} + 1602 z_{19,14} + 2292 z_{19,15} + 2398 z_{19,16} + 864 z_{19,17} + 2374 z_{19,18} + 1534 z_{20,1} + 2576 z_{20,2} + 777 z_{20,3} + 780 z_{20,4} + 1415 z_{20,5} + 2028 z_{20,6} + 1017 z_{20,7} + 2509 z_{20,8} + 676 z_{20,9} + 2853 z_{20,10} + 901 z_{20,11} + 618 z_{20,12} + 950 z_{20,13} + 2896 z_{20,14} + 820 z_{20,15} + 837 z_{20,16} + 722 z_{20,17} + 1003 z_{20,18} + 1565 z_{20,19}$$$

We can then call optimize! and review the results.

optimize!(model)

### Reviewing the Results

Now that we have results, we can add a column to our cities DataFrame for the group and then loop through our $x$ variable to assign each city to its group. Once we have that, we can look at the total population for each group and also look at the cities in each group to verify that they are grouped by geographic proximity.

cities.group = zeros(n)

for i in 1:n, j in 1:k
if round(Int, value(x[i, j])) == 1
cities.group[i] = j
end
end

for group in DataFrames.groupby(cities, :group)
@show group
println("")
@show sum(group.population)
println("")
end
group = 7×5 SubDataFrame
Row │ city              population  lat      lon       group
│ String            Float64     Float64  Float64   Float64
─────┼──────────────────────────────────────────────────────────
1 │ New York, NY           8.405  40.7127  -74.0059      1.0
2 │ Philadelphia, PA       1.553  39.9525  -75.1652      1.0
3 │ Indianapolis, IN       0.843  39.7684  -86.158       1.0
4 │ Jacksonville, FL       0.842  30.3321  -81.6556      1.0
5 │ Columbus, OH           0.822  39.9611  -82.9987      1.0
6 │ Charlotte, NC          0.792  35.227   -80.8431      1.0
7 │ Detroit, MI            0.688  42.3314  -83.0457      1.0

sum(group.population) = 13.944999999999999

group = 6×5 SubDataFrame
Row │ city               population  lat      lon       group
│ String             Float64     Float64  Float64   Float64
─────┼───────────────────────────────────────────────────────────
1 │ Los Angeles, CA         3.884  34.0522  -118.244      2.0
2 │ Phoenix, AZ             1.513  33.4483  -112.074      2.0
3 │ San Diego, CA           1.355  32.7157  -117.161      2.0
4 │ San Jose, CA            0.998  37.3382  -121.886      2.0
5 │ San Francisco, CA       0.837  37.7749  -122.419      2.0
6 │ El Paso, TX             0.674  31.7775  -106.442      2.0

sum(group.population) = 9.261000000000001

group = 7×5 SubDataFrame
Row │ city             population  lat      lon       group
│ String           Float64     Float64  Float64   Float64
─────┼─────────────────────────────────────────────────────────
1 │ Chicago, IL           2.718  41.8781  -87.6297      3.0
2 │ Houston, TX           2.195  29.7604  -95.3698      3.0
3 │ San Antonio, TX       1.409  29.4241  -98.4936      3.0
4 │ Dallas, TX            1.257  32.7766  -96.7969      3.0
5 │ Austin, TX            0.885  30.2671  -97.743       3.0
6 │ Fort Worth, TX        0.792  32.7554  -97.3307      3.0
7 │ Memphis, TN           0.653  35.1495  -90.0489      3.0

sum(group.population) = 9.909

Tip