Geographical clustering
This tutorial was generated using Literate.jl. Download the source as a .jl
file.
This tutorial was originally contributed by Matthew Helm and Mathieu Tanneau.
The goal of this exercise is to cluster $n$ cities into $k$ groups, minimizing the total pairwise distance between cities and ensuring that the variance in the total populations of each group is relatively small.
This tutorial uses the following packages:
using JuMP
import DataFrames
import HiGHS
import LinearAlgebra
For this example, we'll use the 20 most populous cities in the United States.
cities = DataFrames.DataFrame(
Union{String,Float64}[
"New York, NY" 8.405 40.7127 -74.0059
"Los Angeles, CA" 3.884 34.0522 -118.2436
"Chicago, IL" 2.718 41.8781 -87.6297
"Houston, TX" 2.195 29.7604 -95.3698
"Philadelphia, PA" 1.553 39.9525 -75.1652
"Phoenix, AZ" 1.513 33.4483 -112.0740
"San Antonio, TX" 1.409 29.4241 -98.4936
"San Diego, CA" 1.355 32.7157 -117.1610
"Dallas, TX" 1.257 32.7766 -96.7969
"San Jose, CA" 0.998 37.3382 -121.8863
"Austin, TX" 0.885 30.2671 -97.7430
"Indianapolis, IN" 0.843 39.7684 -86.1580
"Jacksonville, FL" 0.842 30.3321 -81.6556
"San Francisco, CA" 0.837 37.7749 -122.4194
"Columbus, OH" 0.822 39.9611 -82.9987
"Charlotte, NC" 0.792 35.2270 -80.8431
"Fort Worth, TX" 0.792 32.7554 -97.3307
"Detroit, MI" 0.688 42.3314 -83.0457
"El Paso, TX" 0.674 31.7775 -106.4424
"Memphis, TN" 0.653 35.1495 -90.0489
],
["city", "population", "lat", "lon"],
)
Row | city | population | lat | lon |
---|---|---|---|---|
Union… | Union… | Union… | Union… | |
1 | New York, NY | 8.405 | 40.7127 | -74.0059 |
2 | Los Angeles, CA | 3.884 | 34.0522 | -118.244 |
3 | Chicago, IL | 2.718 | 41.8781 | -87.6297 |
4 | Houston, TX | 2.195 | 29.7604 | -95.3698 |
5 | Philadelphia, PA | 1.553 | 39.9525 | -75.1652 |
6 | Phoenix, AZ | 1.513 | 33.4483 | -112.074 |
7 | San Antonio, TX | 1.409 | 29.4241 | -98.4936 |
8 | San Diego, CA | 1.355 | 32.7157 | -117.161 |
9 | Dallas, TX | 1.257 | 32.7766 | -96.7969 |
10 | San Jose, CA | 0.998 | 37.3382 | -121.886 |
11 | Austin, TX | 0.885 | 30.2671 | -97.743 |
12 | Indianapolis, IN | 0.843 | 39.7684 | -86.158 |
13 | Jacksonville, FL | 0.842 | 30.3321 | -81.6556 |
14 | San Francisco, CA | 0.837 | 37.7749 | -122.419 |
15 | Columbus, OH | 0.822 | 39.9611 | -82.9987 |
16 | Charlotte, NC | 0.792 | 35.227 | -80.8431 |
17 | Fort Worth, TX | 0.792 | 32.7554 | -97.3307 |
18 | Detroit, MI | 0.688 | 42.3314 | -83.0457 |
19 | El Paso, TX | 0.674 | 31.7775 | -106.442 |
20 | Memphis, TN | 0.653 | 35.1495 | -90.0489 |
Model Specifics
We will cluster these 20 cities into 3 different groups and we will assume that the ideal or target population $P$ for a group is simply the total population of the 20 cities divided by 3:
n = size(cities, 1)
k = 3
P = sum(cities.population) / k
11.038333333333334
Obtaining the distances between each city
Let's compute the pairwise Haversine distance between each of the cities in our data set and store the result in a variable we'll call dm
:
"""
haversine(lat1, long1, lat2, long2, r = 6372.8)
Compute the haversine distance between two points on a sphere of radius `r`,
where the points are given by the latitude/longitude pairs `lat1/long1` and
`lat2/long2` (in degrees).
"""
function haversine(lat1, long1, lat2, long2, r = 6372.8)
lat1, long1 = deg2rad(lat1), deg2rad(long1)
lat2, long2 = deg2rad(lat2), deg2rad(long2)
hav(a, b) = sin((b - a) / 2)^2
inner_term = hav(lat1, lat2) + cos(lat1) * cos(lat2) * hav(long1, long2)
d = 2 * r * asin(sqrt(inner_term))
# Round distance to nearest kilometer.
return round(Int, d)
end
Main.haversine
Our distance matrix is symmetric so we'll convert it to a LowerTriangular
matrix so that we can better interpret the objective value of our model:
dm = LinearAlgebra.LowerTriangular([
haversine(cities.lat[i], cities.lon[i], cities.lat[j], cities.lon[j])
for i in 1:n, j in 1:n
])
20×20 LinearAlgebra.LowerTriangular{Int64, Matrix{Int64}}:
0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
3937 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
1145 2805 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
2282 2207 1516 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
130 3845 1068 2157 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
3445 574 2337 1633 3345 0 ⋅ … ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
2546 1934 1695 304 2423 1363 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
3908 179 2787 2094 3812 481 1813 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
2206 1993 1295 362 2089 1424 406 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
4103 492 2958 2588 4023 989 2336 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
2432 1972 1577 235 2310 1398 118 … ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
1036 2907 265 1394 938 2409 1609 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
1345 3450 1391 1321 1221 2883 1626 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
4130 559 2986 2644 4052 1051 2394 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
767 3177 444 1598 668 2679 1834 0 ⋅ ⋅ ⋅ ⋅ ⋅
855 3405 946 1490 725 2863 1777 … 560 0 ⋅ ⋅ ⋅ ⋅
2251 1944 1327 382 2134 1375 387 1511 1543 0 ⋅ ⋅ ⋅
774 3186 382 1780 711 2716 1994 264 813 1646 0 ⋅ ⋅
3054 1130 2010 1081 2945 559 804 2292 2398 864 2374 0 ⋅
1534 2576 777 780 1415 2028 1017 820 837 722 1003 1565 0
Build the model
Now that we have the basics taken care of, we can set up our model, create decision variables, add constraints, and then solve.
First, we'll set up a model that leverages the Cbc solver. Next, we'll set up a binary variable $x_{i,k}$ that takes the value $1$ if city $i$ is in group $k$ and $0$ otherwise. Each city must be in a group, so we'll add the constraint $\sum_k x_{i,k} = 1$ for every $i$.
model = Model(HiGHS.Optimizer)
set_silent(model)
@variable(model, x[1:n, 1:k], Bin)
@constraint(model, [i = 1:n], sum(x[i, :]) == 1);
# To reduce symmetry, we fix the first city to belong to the first group.
fix(x[1, 1], 1; force = true)
The total population of a group $k$ is $Q_k = \sum_ix_{i,k}q_i$ where $q_i$ is simply the $i$-th value from the population
column in our cities
DataFrame. Let's add constraints so that $\alpha \leq (Q_k - P) \leq \beta$. We'll set $\alpha$ equal to $-3$ million and $\beta$ equal to $3$. By adjusting these thresholds you'll find that there is a tradeoff between having relatively even populations between groups and having geographically close cities within each group. In other words, the larger the absolute values of $\alpha$ and $\beta$, the closer together the cities in a group will be but the variance between the group populations will be higher.
@variable(model, -3 <= population_diff[1:k] <= 3)
@constraint(model, population_diff .== x' * cities.population .- P)
3-element Vector{ConstraintRef{Model, MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64}, MathOptInterface.EqualTo{Float64}}, ScalarShape}}:
-8.405 x[1,1] - 3.884 x[2,1] - 2.718 x[3,1] - 2.195 x[4,1] - 1.553 x[5,1] - 1.513 x[6,1] - 1.409 x[7,1] - 1.355 x[8,1] - 1.257 x[9,1] - 0.998 x[10,1] - 0.885 x[11,1] - 0.843 x[12,1] - 0.842 x[13,1] - 0.837 x[14,1] - 0.822 x[15,1] - 0.792 x[16,1] - 0.792 x[17,1] - 0.688 x[18,1] - 0.674 x[19,1] - 0.653 x[20,1] + population_diff[1] = -11.038333333333334
-8.405 x[1,2] - 3.884 x[2,2] - 2.718 x[3,2] - 2.195 x[4,2] - 1.553 x[5,2] - 1.513 x[6,2] - 1.409 x[7,2] - 1.355 x[8,2] - 1.257 x[9,2] - 0.998 x[10,2] - 0.885 x[11,2] - 0.843 x[12,2] - 0.842 x[13,2] - 0.837 x[14,2] - 0.822 x[15,2] - 0.792 x[16,2] - 0.792 x[17,2] - 0.688 x[18,2] - 0.674 x[19,2] - 0.653 x[20,2] + population_diff[2] = -11.038333333333334
-8.405 x[1,3] - 3.884 x[2,3] - 2.718 x[3,3] - 2.195 x[4,3] - 1.553 x[5,3] - 1.513 x[6,3] - 1.409 x[7,3] - 1.355 x[8,3] - 1.257 x[9,3] - 0.998 x[10,3] - 0.885 x[11,3] - 0.843 x[12,3] - 0.842 x[13,3] - 0.837 x[14,3] - 0.822 x[15,3] - 0.792 x[16,3] - 0.792 x[17,3] - 0.688 x[18,3] - 0.674 x[19,3] - 0.653 x[20,3] + population_diff[3] = -11.038333333333334
Now we need to add one last binary variable $z_{i,j}$ to our model that we'll use to compute the total distance between the cities in our groups, defined as $\sum_{i,j}d_{i,j}z_{i,j}$. Variable $z_{i,j}$ will equal $1$ if cities $i$ and $j$ are in the same group, and $0$ if they are not in the same group.
To ensure that $z_{i,j} = 1$ if and only if cities $i$ and $j$ are in the same group, we add the constraints $z_{i,j} \geq x_{i,k} + x_{j,k} - 1$ for every pair $i,j$ and every $k$:
@variable(model, z[i = 1:n, j = 1:i], Bin)
for k in 1:k, i in 1:n, j in 1:i
@constraint(model, z[i, j] >= x[i, k] + x[j, k] - 1)
end
We can now add an objective to our model which will simply be to minimize the dot product of $z$ and our distance matrix, dm
.
@objective(model, Min, sum(dm[i, j] * z[i, j] for i in 1:n, j in 1:i));
We can then call optimize!
and review the results.
optimize!(model)
@assert is_solved_and_feasible(model)
Reviewing the Results
Now that we have results, we can add a column to our cities
DataFrame for the group and then loop through our $x$ variable to assign each city to its group. Once we have that, we can look at the total population for each group and also look at the cities in each group to verify that they are grouped by geographic proximity.
cities.group = zeros(n)
for i in 1:n, j in 1:k
if round(Int, value(x[i, j])) == 1
cities.group[i] = j
end
end
for group in DataFrames.groupby(cities, :group)
@show group
println("")
@show sum(group.population)
println("")
end
group = 7×5 SubDataFrame
Row │ city population lat lon group
│ Union… Union… Union… Union… Float64
─────┼──────────────────────────────────────────────────────────
1 │ New York, NY 8.405 40.7127 -74.0059 1.0
2 │ Philadelphia, PA 1.553 39.9525 -75.1652 1.0
3 │ Indianapolis, IN 0.843 39.7684 -86.158 1.0
4 │ Jacksonville, FL 0.842 30.3321 -81.6556 1.0
5 │ Columbus, OH 0.822 39.9611 -82.9987 1.0
6 │ Charlotte, NC 0.792 35.227 -80.8431 1.0
7 │ Detroit, MI 0.688 42.3314 -83.0457 1.0
sum(group.population) = 13.944999999999999
group = 7×5 SubDataFrame
Row │ city population lat lon group
│ Union… Union… Union… Union… Float64
─────┼─────────────────────────────────────────────────────────
1 │ Chicago, IL 2.718 41.8781 -87.6297 2.0
2 │ Houston, TX 2.195 29.7604 -95.3698 2.0
3 │ San Antonio, TX 1.409 29.4241 -98.4936 2.0
4 │ Dallas, TX 1.257 32.7766 -96.7969 2.0
5 │ Austin, TX 0.885 30.2671 -97.743 2.0
6 │ Fort Worth, TX 0.792 32.7554 -97.3307 2.0
7 │ Memphis, TN 0.653 35.1495 -90.0489 2.0
sum(group.population) = 9.909
group = 6×5 SubDataFrame
Row │ city population lat lon group
│ Union… Union… Union… Union… Float64
─────┼───────────────────────────────────────────────────────────
1 │ Los Angeles, CA 3.884 34.0522 -118.244 3.0
2 │ Phoenix, AZ 1.513 33.4483 -112.074 3.0
3 │ San Diego, CA 1.355 32.7157 -117.161 3.0
4 │ San Jose, CA 0.998 37.3382 -121.886 3.0
5 │ San Francisco, CA 0.837 37.7749 -122.419 3.0
6 │ El Paso, TX 0.674 31.7775 -106.442 3.0
sum(group.population) = 9.261000000000001