Geographical clustering
Originally Contributed by: Matthew Helm (with help from Mathieu Tanneau on Julia Discourse)
The goal of this exercise is to cluster $n$ cities into $k$ groups, minimizing the total pairwise distance between cities and ensuring that the variance in the total populations of each group is relatively small.
This tutorial uses the following packages:
using JuMP
import DataFrames
import HiGHS
import LinearAlgebra
For this example, we'll use the 20 most populous cities in the United States.
cities = DataFrames.DataFrame(
city = [
"New York, NY",
"Los Angeles, CA",
"Chicago, IL",
"Houston, TX",
"Philadelphia, PA",
"Phoenix, AZ",
"San Antonio, TX",
"San Diego, CA",
"Dallas, TX",
"San Jose, CA",
"Austin, TX",
"Indianapolis, IN",
"Jacksonville, FL",
"San Francisco, CA",
"Columbus, OH",
"Charlotte, NC",
"Fort Worth, TX",
"Detroit, MI",
"El Paso, TX",
"Memphis, TN",
],
population = [
8.405,
3.884,
2.718,
2.195,
1.553,
1.513,
1.409,
1.355,
1.257,
0.998,
0.885,
0.843,
0.842,
0.837,
0.822,
0.792,
0.792,
0.688,
0.674,
0.653,
],
lat = [
40.7127,
34.0522,
41.8781,
29.7604,
39.9525,
33.4483,
29.4241,
32.7157,
32.7766,
37.3382,
30.2671,
39.7684,
30.3321,
37.7749,
39.9611,
35.2270,
32.7554,
42.3314,
31.7775,
35.1495,
],
lon = [
-74.0059,
-118.2436,
-87.6297,
-95.3698,
-75.1652,
-112.0740,
-98.4936,
-117.1610,
-96.7969,
-121.8863,
-97.7430,
-86.1580,
-81.6556,
-122.4194,
-82.9987,
-80.8431,
-97.3307,
-83.0457,
-106.4424,
-90.0489,
],
)
20 rows × 4 columns
city | population | lat | lon | |
---|---|---|---|---|
String | Float64 | Float64 | Float64 | |
1 | New York, NY | 8.405 | 40.7127 | -74.0059 |
2 | Los Angeles, CA | 3.884 | 34.0522 | -118.244 |
3 | Chicago, IL | 2.718 | 41.8781 | -87.6297 |
4 | Houston, TX | 2.195 | 29.7604 | -95.3698 |
5 | Philadelphia, PA | 1.553 | 39.9525 | -75.1652 |
6 | Phoenix, AZ | 1.513 | 33.4483 | -112.074 |
7 | San Antonio, TX | 1.409 | 29.4241 | -98.4936 |
8 | San Diego, CA | 1.355 | 32.7157 | -117.161 |
9 | Dallas, TX | 1.257 | 32.7766 | -96.7969 |
10 | San Jose, CA | 0.998 | 37.3382 | -121.886 |
11 | Austin, TX | 0.885 | 30.2671 | -97.743 |
12 | Indianapolis, IN | 0.843 | 39.7684 | -86.158 |
13 | Jacksonville, FL | 0.842 | 30.3321 | -81.6556 |
14 | San Francisco, CA | 0.837 | 37.7749 | -122.419 |
15 | Columbus, OH | 0.822 | 39.9611 | -82.9987 |
16 | Charlotte, NC | 0.792 | 35.227 | -80.8431 |
17 | Fort Worth, TX | 0.792 | 32.7554 | -97.3307 |
18 | Detroit, MI | 0.688 | 42.3314 | -83.0457 |
19 | El Paso, TX | 0.674 | 31.7775 | -106.442 |
20 | Memphis, TN | 0.653 | 35.1495 | -90.0489 |
Model Specifics
We will cluster these 20 cities into 3 different groups and we will assume that the ideal or target population $P$ for a group is simply the total population of the 20 cities divided by 3:
n = size(cities, 1)
k = 3
P = sum(cities.population) / k
11.038333333333334
Obtaining the distances between each city
Let's compute the pairwise Haversine distance between each of the cities in our data set and store the result in a variable we'll call dm
:
"""
haversine(lat1, long1, lat2, long2, r = 6372.8)
Compute the haversine distance between two points on a sphere of radius `r`,
where the points are given by the latitude/longitude pairs `lat1/long1` and
`lat2/long2` (in degrees).
"""
function haversine(lat1, long1, lat2, long2, r = 6372.8)
lat1, long1 = deg2rad(lat1), deg2rad(long1)
lat2, long2 = deg2rad(lat2), deg2rad(long2)
hav(a, b) = sin((b - a) / 2)^2
inner_term = hav(lat1, lat2) + cos(lat1) * cos(lat2) * hav(long1, long2)
d = 2 * r * asin(sqrt(inner_term))
# Round distance to nearest kilometer.
return round(Int, d)
end
Main.haversine
Our distance matrix is symmetric so we'll convert it to a LowerTriangular
matrix so that we can better interpret the objective value of our model:
dm = LinearAlgebra.LowerTriangular([
haversine(cities.lat[i], cities.lon[i], cities.lat[j], cities.lon[j])
for i in 1:n, j in 1:n
])
20×20 LinearAlgebra.LowerTriangular{Int64, Matrix{Int64}}:
0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
3937 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
1145 2805 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
2282 2207 1516 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
130 3845 1068 2157 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
3445 574 2337 1633 3345 0 ⋅ … ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
2546 1934 1695 304 2423 1363 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
3908 179 2787 2094 3812 481 1813 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
2206 1993 1295 362 2089 1424 406 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
4103 492 2958 2588 4023 989 2336 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
2432 1972 1577 235 2310 1398 118 … ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
1036 2907 265 1394 938 2409 1609 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
1345 3450 1391 1321 1221 2883 1626 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
4130 559 2986 2644 4052 1051 2394 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
767 3177 444 1598 668 2679 1834 0 ⋅ ⋅ ⋅ ⋅ ⋅
855 3405 946 1490 725 2863 1777 … 560 0 ⋅ ⋅ ⋅ ⋅
2251 1944 1327 382 2134 1375 387 1511 1543 0 ⋅ ⋅ ⋅
774 3186 382 1780 711 2716 1994 264 813 1646 0 ⋅ ⋅
3054 1130 2010 1081 2945 559 804 2292 2398 864 2374 0 ⋅
1534 2576 777 780 1415 2028 1017 820 837 722 1003 1565 0
Build the model
Now that we have the basics taken care of, we can set up our model, create decision variables, add constraints, and then solve.
First, we'll set up a model that leverages the Cbc solver. Next, we'll set up a binary variable $x_{i,k}$ that takes the value $1$ if city $i$ is in group $k$ and $0$ otherwise. Each city must be in a group, so we'll add the constraint $\sum_kx_{i,k} = 1$ for every $i$.
model = Model(HiGHS.Optimizer)
set_silent(model)
@variable(model, x[1:n, 1:k], Bin)
20×3 Matrix{VariableRef}:
x[1,1] x[1,2] x[1,3]
x[2,1] x[2,2] x[2,3]
x[3,1] x[3,2] x[3,3]
x[4,1] x[4,2] x[4,3]
x[5,1] x[5,2] x[5,3]
x[6,1] x[6,2] x[6,3]
x[7,1] x[7,2] x[7,3]
x[8,1] x[8,2] x[8,3]
x[9,1] x[9,2] x[9,3]
x[10,1] x[10,2] x[10,3]
x[11,1] x[11,2] x[11,3]
x[12,1] x[12,2] x[12,3]
x[13,1] x[13,2] x[13,3]
x[14,1] x[14,2] x[14,3]
x[15,1] x[15,2] x[15,3]
x[16,1] x[16,2] x[16,3]
x[17,1] x[17,2] x[17,3]
x[18,1] x[18,2] x[18,3]
x[19,1] x[19,2] x[19,3]
x[20,1] x[20,2] x[20,3]
@constraint(model, [i = 1:n], sum(x[i, :]) == 1)
20-element Vector{ConstraintRef{Model, MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64}, MathOptInterface.EqualTo{Float64}}, ScalarShape}}:
x[1,1] + x[1,2] + x[1,3] = 1.0
x[2,1] + x[2,2] + x[2,3] = 1.0
x[3,1] + x[3,2] + x[3,3] = 1.0
x[4,1] + x[4,2] + x[4,3] = 1.0
x[5,1] + x[5,2] + x[5,3] = 1.0
x[6,1] + x[6,2] + x[6,3] = 1.0
x[7,1] + x[7,2] + x[7,3] = 1.0
x[8,1] + x[8,2] + x[8,3] = 1.0
x[9,1] + x[9,2] + x[9,3] = 1.0
x[10,1] + x[10,2] + x[10,3] = 1.0
x[11,1] + x[11,2] + x[11,3] = 1.0
x[12,1] + x[12,2] + x[12,3] = 1.0
x[13,1] + x[13,2] + x[13,3] = 1.0
x[14,1] + x[14,2] + x[14,3] = 1.0
x[15,1] + x[15,2] + x[15,3] = 1.0
x[16,1] + x[16,2] + x[16,3] = 1.0
x[17,1] + x[17,2] + x[17,3] = 1.0
x[18,1] + x[18,2] + x[18,3] = 1.0
x[19,1] + x[19,2] + x[19,3] = 1.0
x[20,1] + x[20,2] + x[20,3] = 1.0
To reduce symmetry, we fix the first city to belong to the first group.
fix(x[1, 1], 1; force = true)
The total population of a group $k$ is $Q_k = \sum_ix_{i,k}q_i$ where $q_i$ is simply the $i$th value from the population
column in our cities
DataFrame. Let's add constraints so that $\alpha \leq (Q_k - P) \leq \beta$. We'll set $\alpha$ equal to $-3$ million and $\beta$ equal to $3$. By adjusting these thresholds you'll find that there is a tradeoff between having relatively even populations between groups and having geographically close cities within each group. In other words, the larger the absolute values of $\alpha$ and $\beta$, the closer together the cities in a group will be but the variance between the group populations will be higher.
@variable(model, -3 <= population_diff[1:k] <= 3)
@constraint(model, population_diff .== x' * cities.population .- P)
3-element Vector{ConstraintRef{Model, MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64}, MathOptInterface.EqualTo{Float64}}, ScalarShape}}:
-8.405 x[1,1] - 3.884 x[2,1] - 2.718 x[3,1] - 2.195 x[4,1] - 1.553 x[5,1] - 1.513 x[6,1] - 1.409 x[7,1] - 1.355 x[8,1] - 1.257 x[9,1] - 0.998 x[10,1] - 0.885 x[11,1] - 0.843 x[12,1] - 0.842 x[13,1] - 0.837 x[14,1] - 0.822 x[15,1] - 0.792 x[16,1] - 0.792 x[17,1] - 0.688 x[18,1] - 0.674 x[19,1] - 0.653 x[20,1] + population_diff[1] = -11.038333333333334
-8.405 x[1,2] - 3.884 x[2,2] - 2.718 x[3,2] - 2.195 x[4,2] - 1.553 x[5,2] - 1.513 x[6,2] - 1.409 x[7,2] - 1.355 x[8,2] - 1.257 x[9,2] - 0.998 x[10,2] - 0.885 x[11,2] - 0.843 x[12,2] - 0.842 x[13,2] - 0.837 x[14,2] - 0.822 x[15,2] - 0.792 x[16,2] - 0.792 x[17,2] - 0.688 x[18,2] - 0.674 x[19,2] - 0.653 x[20,2] + population_diff[2] = -11.038333333333334
-8.405 x[1,3] - 3.884 x[2,3] - 2.718 x[3,3] - 2.195 x[4,3] - 1.553 x[5,3] - 1.513 x[6,3] - 1.409 x[7,3] - 1.355 x[8,3] - 1.257 x[9,3] - 0.998 x[10,3] - 0.885 x[11,3] - 0.843 x[12,3] - 0.842 x[13,3] - 0.837 x[14,3] - 0.822 x[15,3] - 0.792 x[16,3] - 0.792 x[17,3] - 0.688 x[18,3] - 0.674 x[19,3] - 0.653 x[20,3] + population_diff[3] = -11.038333333333334
Now we need to add one last binary variable $z_{i,j}$ to our model that we'll use to compute the total distance between the cities in our groups, defined as $\sum_{i,j}d_{i,j}z_{i,j}$. Variable $z_{i,j}$ will equal $1$ if cities $i$ and $j$ are in the same group, and $0$ if they are not in the same group.
To ensure that $z_{i,j} = 1$ if and only if cities $i$ and $j$ are in the same group, we add the constraints $z_{i,j} \geq x_{i,k} + x_{j,k} - 1$ for every pair $i,j$ and every $k$:
@variable(model, z[i = 1:n, j = 1:i], Bin)
JuMP.Containers.SparseAxisArray{VariableRef, 2, Tuple{Int64, Int64}} with 210 entries:
[12, 10] = z[12,10]
[12, 2 ] = z[12,2]
[12, 3 ] = z[12,3]
[16, 12] = z[16,12]
[16, 14] = z[16,14]
[16, 16] = z[16,16]
[17, 12] = z[17,12]
[18, 14] = z[18,14]
⋮
[18, 16] = z[18,16]
[18, 18] = z[18,18]
[19, 12] = z[19,12]
[19, 14] = z[19,14]
[19, 16] = z[19,16]
[19, 19] = z[19,19]
[20, 15] = z[20,15]
for k in 1:k, i in 1:n, j in 1:i
@constraint(model, z[i, j] >= x[i, k] + x[j, k] - 1)
end
We can now add an objective to our model which will simply be to minimize the dot product of $z$ and our distance matrix, dm
.
@objective(model, Min, sum(dm[i, j] * z[i, j] for i in 1:n, j in 1:i))
\[ 3937 z_{2,1} + 1145 z_{3,1} + 2805 z_{3,2} + 2282 z_{4,1} + 2207 z_{4,2} + 1516 z_{4,3} + 130 z_{5,1} + 3845 z_{5,2} + 1068 z_{5,3} + 2157 z_{5,4} + 3445 z_{6,1} + 574 z_{6,2} + 2337 z_{6,3} + 1633 z_{6,4} + 3345 z_{6,5} + 2546 z_{7,1} + 1934 z_{7,2} + 1695 z_{7,3} + 304 z_{7,4} + 2423 z_{7,5} + 1363 z_{7,6} + 3908 z_{8,1} + 179 z_{8,2} + 2787 z_{8,3} + 2094 z_{8,4} + 3812 z_{8,5} + 481 z_{8,6} + 1813 z_{8,7} + 2206 z_{9,1} + 1993 z_{9,2} + 1295 z_{9,3} + 362 z_{9,4} + 2089 z_{9,5} + 1424 z_{9,6} + 406 z_{9,7} + 1902 z_{9,8} + 4103 z_{10,1} + 492 z_{10,2} + 2958 z_{10,3} + 2588 z_{10,4} + 4023 z_{10,5} + 989 z_{10,6} + 2336 z_{10,7} + 670 z_{10,8} + 2333 z_{10,9} + 2432 z_{11,1} + 1972 z_{11,2} + 1577 z_{11,3} + 235 z_{11,4} + 2310 z_{11,5} + 1398 z_{11,6} + 118 z_{11,7} + 1859 z_{11,8} + 293 z_{11,9} + 2358 z_{11,10} + 1036 z_{12,1} + 2907 z_{12,2} + 265 z_{12,3} + 1394 z_{12,4} + 938 z_{12,5} + 2409 z_{12,6} + 1609 z_{12,7} + 2874 z_{12,8} + 1229 z_{12,9} + 3099 z_{12,10} + 1491 z_{12,11} + 1345 z_{13,1} + 3450 z_{13,2} + 1391 z_{13,3} + 1321 z_{13,4} + 1221 z_{13,5} + 2883 z_{13,6} + 1626 z_{13,7} + 3361 z_{13,8} + 1459 z_{13,9} + 3768 z_{13,10} + 1544 z_{13,11} + 1126 z_{13,12} + 4130 z_{14,1} + 559 z_{14,2} + 2986 z_{14,3} + 2644 z_{14,4} + 4052 z_{14,5} + 1051 z_{14,6} + 2394 z_{14,7} + 738 z_{14,8} + 2384 z_{14,9} + 68 z_{14,10} + 2414 z_{14,11} + 3131 z_{14,12} + 3815 z_{14,13} + 767 z_{15,1} + 3177 z_{15,2} + 444 z_{15,3} + 1598 z_{15,4} + 668 z_{15,5} + 2679 z_{15,6} + 1834 z_{15,7} + 3144 z_{15,8} + 1469 z_{15,9} + 3364 z_{15,10} + 1717 z_{15,11} + 271 z_{15,12} + 1078 z_{15,13} + 3395 z_{15,14} + 855 z_{16,1} + 3405 z_{16,2} + 946 z_{16,3} + 1490 z_{16,4} + 725 z_{16,5} + 2863 z_{16,6} + 1777 z_{16,7} + 3343 z_{16,8} + 1494 z_{16,9} + 3659 z_{16,10} + 1672 z_{16,11} + 689 z_{16,12} + 550 z_{16,13} + 3698 z_{16,14} + 560 z_{16,15} + 2251 z_{17,1} + 1944 z_{17,2} + 1327 z_{17,3} + 382 z_{17,4} + 2134 z_{17,5} + 1375 z_{17,6} + 387 z_{17,7} + 1853 z_{17,8} + 50 z_{17,9} + 2286 z_{17,10} + 280 z_{17,11} + 1268 z_{17,12} + 1509 z_{17,13} + 2338 z_{17,14} + 1511 z_{17,15} + 1543 z_{17,16} + 774 z_{18,1} + 3186 z_{18,2} + 382 z_{18,3} + 1780 z_{18,4} + 711 z_{18,5} + 2716 z_{18,6} + 1994 z_{18,7} + 3169 z_{18,8} + 1609 z_{18,9} + 3334 z_{18,10} + 1875 z_{18,11} + 386 z_{18,12} + 1340 z_{18,13} + 3360 z_{18,14} + 264 z_{18,15} + 813 z_{18,16} + 1646 z_{18,17} + 3054 z_{19,1} + 1130 z_{19,2} + 2010 z_{19,3} + 1081 z_{19,4} + 2945 z_{19,5} + 559 z_{19,6} + 804 z_{19,7} + 1013 z_{19,8} + 914 z_{19,9} + 1541 z_{19,10} + 846 z_{19,11} + 2028 z_{19,12} + 2362 z_{19,13} + 1602 z_{19,14} + 2292 z_{19,15} + 2398 z_{19,16} + 864 z_{19,17} + 2374 z_{19,18} + 1534 z_{20,1} + 2576 z_{20,2} + 777 z_{20,3} + 780 z_{20,4} + 1415 z_{20,5} + 2028 z_{20,6} + 1017 z_{20,7} + 2509 z_{20,8} + 676 z_{20,9} + 2853 z_{20,10} + 901 z_{20,11} + 618 z_{20,12} + 950 z_{20,13} + 2896 z_{20,14} + 820 z_{20,15} + 837 z_{20,16} + 722 z_{20,17} + 1003 z_{20,18} + 1565 z_{20,19} \]
We can then call optimize!
and review the results.
optimize!(model)
Reviewing the Results
Now that we have results, we can add a column to our cities
DataFrame for the group and then loop through our $x$ variable to assign each city to its group. Once we have that, we can look at the total population for each group and also look at the cities in each group to verify that they are grouped by geographic proximity.
cities.group = zeros(n)
for i in 1:n, j in 1:k
if round(Int, value(x[i, j])) == 1
cities.group[i] = j
end
end
for group in DataFrames.groupby(cities, :group)
@show group
println("")
@show sum(group.population)
println("")
end
group = 7×5 SubDataFrame
Row │ city population lat lon group
│ String Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────
1 │ New York, NY 8.405 40.7127 -74.0059 1.0
2 │ Philadelphia, PA 1.553 39.9525 -75.1652 1.0
3 │ Indianapolis, IN 0.843 39.7684 -86.158 1.0
4 │ Jacksonville, FL 0.842 30.3321 -81.6556 1.0
5 │ Columbus, OH 0.822 39.9611 -82.9987 1.0
6 │ Charlotte, NC 0.792 35.227 -80.8431 1.0
7 │ Detroit, MI 0.688 42.3314 -83.0457 1.0
sum(group.population) = 13.944999999999999
group = 6×5 SubDataFrame
Row │ city population lat lon group
│ String Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────
1 │ Los Angeles, CA 3.884 34.0522 -118.244 2.0
2 │ Phoenix, AZ 1.513 33.4483 -112.074 2.0
3 │ San Diego, CA 1.355 32.7157 -117.161 2.0
4 │ San Jose, CA 0.998 37.3382 -121.886 2.0
5 │ San Francisco, CA 0.837 37.7749 -122.419 2.0
6 │ El Paso, TX 0.674 31.7775 -106.442 2.0
sum(group.population) = 9.261000000000001
group = 7×5 SubDataFrame
Row │ city population lat lon group
│ String Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────
1 │ Chicago, IL 2.718 41.8781 -87.6297 3.0
2 │ Houston, TX 2.195 29.7604 -95.3698 3.0
3 │ San Antonio, TX 1.409 29.4241 -98.4936 3.0
4 │ Dallas, TX 1.257 32.7766 -96.7969 3.0
5 │ Austin, TX 0.885 30.2671 -97.743 3.0
6 │ Fort Worth, TX 0.792 32.7554 -97.3307 3.0
7 │ Memphis, TN 0.653 35.1495 -90.0489 3.0
sum(group.population) = 9.909
This tutorial was generated using Literate.jl. View the source .jl
file on GitHub.