Getting started with data and plotting

This tutorial was generated using Literate.jl. Download the source as a .jl file.

In this tutorial we will learn how to read tabular data into Julia, and some of the basics of plotting.

If you're new to Julia, start by reading Getting started with Julia and Getting started with JuMP first.

Note

There are multiple ways to read the same kind of data into Julia. This tutorial focuses on DataFrames.jl because it provides the ecosystem to work with most of the required file types in a straightforward manner.

Before we get started, we need this constant to point to where the data files are.

import JuMP
const DATA_DIR = joinpath(
    dirname(pathof(JuMP)),
    joinpath("..", "docs", "src", "tutorials", "getting_started", "data"),
);

Where to get help

Read the documentation

Preliminaries

To get started, we need to install some packages.

DataFrames.jl

The DataFrames package provides a set of tools for working with tabular data. It is available through the Julia package manager.

using Pkg
Pkg.add("DataFrames")
import DataFrames
What is a DataFrame?

A DataFrame is a data structure like a table or spreadsheet. You can use it for storing and exploring a set of related data values. Think of it as a smarter array for holding tabular data.

Plots.jl

The Plots package provides a set of tools for plotting. It is available through the Julia package manager.

using Pkg
Pkg.add("Plots")
import Plots

CSV .jl

CSV and other delimited text files can be read by the CSV.jl package.

Pkg.add("CSV")
import CSV

DataFrame basics

To read a CSV file into a DataFrame, we use the CSV.read function.

csv_df = CSV.read(joinpath(DATA_DIR, "StarWars.csv"), DataFrames.DataFrame)
20×13 DataFrame
RowNameGenderHeightWeightEyecolorHaircolorSkincolorHomelandBornDiedJediSpeciesWeapon
String31String7Float64String7String15String7String15String15String15String15String7String15String15
1Anakin Skywalkermale1.8884blueblondfairTatooine41.9BBY4ABYjedihumanlightsaber
2Padme Amidalafemale1.6545brownbrownlightNaboo46BBY19BBYno_jedihumanunarmed
3Luke Skywalkermale1.7277blueblondfairTatooine19BBYunk_diedjedihumanlightsaber
4Leia Skywalkerfemale1.549brownbrownlightAlderaan19BBYunk_diedno_jedihumanblaster
5Qui-Gon Jinnmale1.9388.5bluebrownlightunk_planet92BBY32BBYjedihumanlightsaber
6Obi-Wan Kenobimale1.8277bluegrayauburnfairStewjon57BBY0BBYjedihumanlightsaber
7Han Solomale1.880brownbrownlightCorellia29BBYunk_diedno_jedihumanblaster
8Sheev Palpatinemale1.7375blueredpaleNaboo82BBY10ABYno_jedihumanforce-lightning
9R2-D2male0.9632NANANANaboo33BBYunk_diedno_jedidroidunarmed
10C-3POmale1.6775NANANATatooine112BBY3ABYno_jedidroidunarmed
11Yodamale0.6617brownbrowngreenunk_planet896BBY4ABYjediyodalightsaber
12Darth Maulmale1.7580yellownoneredDathomir54BBYunk_diedno_jedidathomirianlightsaber
13Dookumale1.9386brownbrownlightSerenno102BBY19BBYjedihumanlightsaber
14Chewbaccamale2.28112bluebrownNAKashyyyk200BBY25ABYno_jediwookieebowcaster
15Jabbamale3.9NAyellownonetan-greenTatooineunk_born4ABYno_jedihuttunarmed
16Lando Calrissianmale1.7879brownblankdarkSocorro31BBYunk_diedno_jedihumanblaster
17Boba Fettmale1.8378brownblackbrownKamino31.5BBYunk_diedno_jedihumanblaster
18Jango Fettmale1.8379brownblackbrownConcordDawn66BBY22BBYno_jedihumanblaster
19Grievousmale2.16159goldblackorangeKaleeunk_born19BBYno_jedikaleeshslugthrower
20Chief Chirpamale1.050blackgraybrownEndorunk_born4ABYno_jediewokspear

Let's try plotting some of this data

Plots.scatter(
    csv_df.Weight,
    csv_df.Height;
    xlabel = "Weight",
    ylabel = "Height",
)
Example block output

That doesn't look right. What happened? If you look at the dataframe above, it read Weight in as a String column because there are "NA" fields. Let's correct that, by telling CSV to consider "NA" as missing.

csv_df = CSV.read(
    joinpath(DATA_DIR, "StarWars.csv"),
    DataFrames.DataFrame;
    missingstring = "NA",
)
20×13 DataFrame
RowNameGenderHeightWeightEyecolorHaircolorSkincolorHomelandBornDiedJediSpeciesWeapon
String31String7Float64Float64?String15?String7?String15?String15String15String15String7String15String15
1Anakin Skywalkermale1.8884.0blueblondfairTatooine41.9BBY4ABYjedihumanlightsaber
2Padme Amidalafemale1.6545.0brownbrownlightNaboo46BBY19BBYno_jedihumanunarmed
3Luke Skywalkermale1.7277.0blueblondfairTatooine19BBYunk_diedjedihumanlightsaber
4Leia Skywalkerfemale1.549.0brownbrownlightAlderaan19BBYunk_diedno_jedihumanblaster
5Qui-Gon Jinnmale1.9388.5bluebrownlightunk_planet92BBY32BBYjedihumanlightsaber
6Obi-Wan Kenobimale1.8277.0bluegrayauburnfairStewjon57BBY0BBYjedihumanlightsaber
7Han Solomale1.880.0brownbrownlightCorellia29BBYunk_diedno_jedihumanblaster
8Sheev Palpatinemale1.7375.0blueredpaleNaboo82BBY10ABYno_jedihumanforce-lightning
9R2-D2male0.9632.0missingmissingmissingNaboo33BBYunk_diedno_jedidroidunarmed
10C-3POmale1.6775.0missingmissingmissingTatooine112BBY3ABYno_jedidroidunarmed
11Yodamale0.6617.0brownbrowngreenunk_planet896BBY4ABYjediyodalightsaber
12Darth Maulmale1.7580.0yellownoneredDathomir54BBYunk_diedno_jedidathomirianlightsaber
13Dookumale1.9386.0brownbrownlightSerenno102BBY19BBYjedihumanlightsaber
14Chewbaccamale2.28112.0bluebrownmissingKashyyyk200BBY25ABYno_jediwookieebowcaster
15Jabbamale3.9missingyellownonetan-greenTatooineunk_born4ABYno_jedihuttunarmed
16Lando Calrissianmale1.7879.0brownblankdarkSocorro31BBYunk_diedno_jedihumanblaster
17Boba Fettmale1.8378.0brownblackbrownKamino31.5BBYunk_diedno_jedihumanblaster
18Jango Fettmale1.8379.0brownblackbrownConcordDawn66BBY22BBYno_jedihumanblaster
19Grievousmale2.16159.0goldblackorangeKaleeunk_born19BBYno_jedikaleeshslugthrower
20Chief Chirpamale1.050.0blackgraybrownEndorunk_born4ABYno_jediewokspear

Then let's re-plot our data

Plots.scatter(
    csv_df.Weight,
    csv_df.Height;
    title = "Height vs Weight of StarWars characters",
    xlabel = "Weight",
    ylabel = "Height",
    label = false,
    ylims = (0, 3),
)
Example block output

That looks better.

Tip

Read the CSV documentation for other parsing options.

DataFrames.jl supports manipulation using functions similar to pandas. For example, split the dataframe into groups based on eye-color:

by_eyecolor = DataFrames.groupby(csv_df, :Eyecolor)

GroupedDataFrame with 7 groups based on key: Eyecolor

First Group (5 rows): Eyecolor = "blue"
RowNameGenderHeightWeightEyecolorHaircolorSkincolorHomelandBornDiedJediSpeciesWeapon
String31String7Float64Float64?String15?String7?String15?String15String15String15String7String15String15
1Anakin Skywalkermale1.8884.0blueblondfairTatooine41.9BBY4ABYjedihumanlightsaber
2Luke Skywalkermale1.7277.0blueblondfairTatooine19BBYunk_diedjedihumanlightsaber
3Qui-Gon Jinnmale1.9388.5bluebrownlightunk_planet92BBY32BBYjedihumanlightsaber
4Sheev Palpatinemale1.7375.0blueredpaleNaboo82BBY10ABYno_jedihumanforce-lightning
5Chewbaccamale2.28112.0bluebrownmissingKashyyyk200BBY25ABYno_jediwookieebowcaster

Last Group (1 row): Eyecolor = "black"
RowNameGenderHeightWeightEyecolorHaircolorSkincolorHomelandBornDiedJediSpeciesWeapon
String31String7Float64Float64?String15?String7?String15?String15String15String15String7String15String15
1Chief Chirpamale1.050.0blackgraybrownEndorunk_born4ABYno_jediewokspear

Then recombine into a single dataframe based on a function operating over the split dataframes:

eyecolor_count = DataFrames.combine(by_eyecolor) do df
    return DataFrames.nrow(df)
end
7×2 DataFrame
RowEyecolorx1
String15?Int64
1blue5
2brown8
3bluegray1
4missing2
5yellow2
6gold1
7black1

We can rename columns:

DataFrames.rename!(eyecolor_count, :x1 => :count)
7×2 DataFrame
RowEyecolorcount
String15?Int64
1blue5
2brown8
3bluegray1
4missing2
5yellow2
6gold1
7black1

Drop some missing rows:

DataFrames.dropmissing!(eyecolor_count, :Eyecolor)
6×2 DataFrame
RowEyecolorcount
String15Int64
1blue5
2brown8
3bluegray1
4yellow2
5gold1
6black1

Then we can visualize the data:

sort!(eyecolor_count, :count; rev = true)
Plots.bar(
    eyecolor_count.Eyecolor,
    eyecolor_count.count;
    xlabel = "Eye color",
    ylabel = "Number of characters",
    label = false,
)
Example block output

Other Delimited Files

We can also use the CSV.jl package to read any other delimited text file format.

By default, CSV.File will try to detect a file's delimiter from the first 10 lines of the file.

Candidate delimiters include ',', '\t', ' ', '|', ';', and ':'. If it can't auto-detect the delimiter, it will assume ','.

Let's take the example of space separated data.

ss_df = CSV.read(joinpath(DATA_DIR, "Cereal.txt"), DataFrames.DataFrame)
23×10 DataFrame
RowNameCupsCaloriesCarbsFatFiberPotassiumProteinSodiumSugars
String31Float64Int64Float64Int64Float64Int64Int64Int64Int64
1CapnCrunch0.7512012.020.035122012
2CocoaPuffs1.011012.010.055118013
3Trix1.011013.010.025114012
4AppleJacks1.011011.001.030212514
5CornChex1.011022.000.02522803
6CornFlakes1.010021.001.03522902
7Nut&Honey0.6712015.010.04021909
8Smacks0.751109.011.04027015
9MultiGrain1.010015.012.09022206
10CracklinOat0.511010.034.016031407
11GrapeNuts0.2511017.003.09031793
12HoneyNutCheerios0.7511011.511.590325010
13NutriGrain0.6714021.023.013032207
14Product191.010020.001.04533203
15TotalRaisinBran1.014015.014.0230319014
16WheatChex0.6710017.013.011532303
17Oatmeal0.513013.521.5120317010
18Life0.6710012.022.09541506
19Maypo1.010016.010.095403
20QuakerOats0.510014.012.011041356
21Muesli1.015016.033.0170415011
22Cheerios1.2511017.022.010562901
23SpecialK1.011016.001.05562303

We can also specify the delimiter as follows:

delim_df = CSV.read(
    joinpath(DATA_DIR, "Soccer.txt"),
    DataFrames.DataFrame;
    delim = "::",
)
20×7 DataFrame
RowTeamPlayedWinsDrawsLossesGoals_forGoals_against
String31Int64Int64Int64Int64String15String15
1Barcelona383044110 goals21 goals
2Real Madrid383026118 goals38 goals
3Atletico Madrid38239667 goals29 goals
4Valencia382211570 goals32 goals
5Seville38237871 goals45 goals
6Villarreal3816121048 goals37 goals
7Athletic Bilbao3815101342 goals41 goals
8Celta Vigo3813121347 goals44 goals
9Malaga381481642 goals48 goals
10Espanyol3813101547 goals51 goals
11Rayo Vallecano381541946 goals68 goals
12Real Sociedad3811131444 goals51 goals
13Elche381181935 goals62 goals
14Levante389101934 goals67 goals
15Getafe381072133 goals64 goals
16Deportivo La Coruna387141735 goals60 goals
17Granada387141729 goals64 goals
18Eibar38982134 goals55 goals
19Almeria38882235 goals64 goals
20Cordoba383112422 goals68 goals

Working with DataFrames

Now that we have read the required data into a DataFrame, let us look at some basic operations we can perform on it.

Querying Basic Information

The size function gets us the dimensions of the DataFrame:

DataFrames.size(ss_df)
(23, 10)

We can also use the nrow and ncol functions to get the number of rows and columns respectively:

DataFrames.nrow(ss_df), DataFrames.ncol(ss_df)
(23, 10)

The describe function gives basic summary statistics of data in a DataFrame:

DataFrames.describe(ss_df)
10×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64DataType
1NameAppleJacksWheatChex0String31
2Cups0.8230430.251.01.250Float64
3Calories113.043100110.01500Int64
4Carbs15.04359.015.022.00Float64
5Fat1.1304301.030Int64
6Fiber1.565220.01.54.00Float64
7Potassium86.30432590.02300Int64
8Protein2.9130413.060Int64
9Sodium189.9570190.03200Int64
10Sugars7.5217417.0150Int64

Names of every column can be obtained by the names function:

DataFrames.names(ss_df)
10-element Vector{String}:
 "Name"
 "Cups"
 "Calories"
 "Carbs"
 "Fat"
 "Fiber"
 "Potassium"
 "Protein"
 "Sodium"
 "Sugars"

Corresponding data types are obtained using the broadcasted eltype function:

eltype.(ss_df)
23×10 DataFrame
RowNameCupsCaloriesCarbsFatFiberPotassiumProteinSodiumSugars
DataTypeDataTypeDataTypeDataTypeDataTypeDataTypeDataTypeDataTypeDataTypeDataType
1CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
2CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
3CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
4CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
5CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
6CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
7CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
8CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
9CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
10CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
11CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
12CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
13CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
14CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
15CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
16CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
17CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
18CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
19CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
20CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
21CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
22CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64
23CharFloat64Int64Float64Int64Float64Int64Int64Int64Int64

Accessing the Data

Similar to regular arrays, we use numerical indexing to access elements of a DataFrame:

csv_df[1, 1]
"Anakin Skywalker"

The following are different ways to access a column:

csv_df[!, 1]
20-element Vector{InlineStrings.String31}:
 "Anakin Skywalker"
 "Padme Amidala"
 "Luke Skywalker"
 "Leia Skywalker"
 "Qui-Gon Jinn"
 "Obi-Wan Kenobi"
 "Han Solo"
 "Sheev Palpatine"
 "R2-D2"
 "C-3PO"
 "Yoda"
 "Darth Maul"
 "Dooku"
 "Chewbacca"
 "Jabba"
 "Lando Calrissian"
 "Boba Fett"
 "Jango Fett"
 "Grievous"
 "Chief Chirpa"
csv_df[!, :Name]
20-element Vector{InlineStrings.String31}:
 "Anakin Skywalker"
 "Padme Amidala"
 "Luke Skywalker"
 "Leia Skywalker"
 "Qui-Gon Jinn"
 "Obi-Wan Kenobi"
 "Han Solo"
 "Sheev Palpatine"
 "R2-D2"
 "C-3PO"
 "Yoda"
 "Darth Maul"
 "Dooku"
 "Chewbacca"
 "Jabba"
 "Lando Calrissian"
 "Boba Fett"
 "Jango Fett"
 "Grievous"
 "Chief Chirpa"
csv_df.Name
20-element Vector{InlineStrings.String31}:
 "Anakin Skywalker"
 "Padme Amidala"
 "Luke Skywalker"
 "Leia Skywalker"
 "Qui-Gon Jinn"
 "Obi-Wan Kenobi"
 "Han Solo"
 "Sheev Palpatine"
 "R2-D2"
 "C-3PO"
 "Yoda"
 "Darth Maul"
 "Dooku"
 "Chewbacca"
 "Jabba"
 "Lando Calrissian"
 "Boba Fett"
 "Jango Fett"
 "Grievous"
 "Chief Chirpa"
csv_df[:, 1] # Note that this creates a copy.
20-element Vector{InlineStrings.String31}:
 "Anakin Skywalker"
 "Padme Amidala"
 "Luke Skywalker"
 "Leia Skywalker"
 "Qui-Gon Jinn"
 "Obi-Wan Kenobi"
 "Han Solo"
 "Sheev Palpatine"
 "R2-D2"
 "C-3PO"
 "Yoda"
 "Darth Maul"
 "Dooku"
 "Chewbacca"
 "Jabba"
 "Lando Calrissian"
 "Boba Fett"
 "Jango Fett"
 "Grievous"
 "Chief Chirpa"

The following are different ways to access a row:

csv_df[1:1, :]
1×13 DataFrame
RowNameGenderHeightWeightEyecolorHaircolorSkincolorHomelandBornDiedJediSpeciesWeapon
String31String7Float64Float64?String15?String7?String15?String15String15String15String7String15String15
1Anakin Skywalkermale1.8884.0blueblondfairTatooine41.9BBY4ABYjedihumanlightsaber
csv_df[1, :] # This produces a DataFrameRow.
DataFrameRow (13 columns)
RowNameGenderHeightWeightEyecolorHaircolorSkincolorHomelandBornDiedJediSpeciesWeapon
String31String7Float64Float64?String15?String7?String15?String15String15String15String7String15String15
1Anakin Skywalkermale1.8884.0blueblondfairTatooine41.9BBY4ABYjedihumanlightsaber

We can change the values just as we normally assign values.

Assign a range to scalar:

csv_df[1:3, :Height] .= 1.83
3-element view(::Vector{Float64}, 1:3) with eltype Float64:
 1.83
 1.83
 1.83

Assign a vector:

csv_df[4:6, :Height] = [1.8, 1.6, 1.8]
3-element Vector{Float64}:
 1.8
 1.6
 1.8
csv_df
20×13 DataFrame
RowNameGenderHeightWeightEyecolorHaircolorSkincolorHomelandBornDiedJediSpeciesWeapon
String31String7Float64Float64?String15?String7?String15?String15String15String15String7String15String15
1Anakin Skywalkermale1.8384.0blueblondfairTatooine41.9BBY4ABYjedihumanlightsaber
2Padme Amidalafemale1.8345.0brownbrownlightNaboo46BBY19BBYno_jedihumanunarmed
3Luke Skywalkermale1.8377.0blueblondfairTatooine19BBYunk_diedjedihumanlightsaber
4Leia Skywalkerfemale1.849.0brownbrownlightAlderaan19BBYunk_diedno_jedihumanblaster
5Qui-Gon Jinnmale1.688.5bluebrownlightunk_planet92BBY32BBYjedihumanlightsaber
6Obi-Wan Kenobimale1.877.0bluegrayauburnfairStewjon57BBY0BBYjedihumanlightsaber
7Han Solomale1.880.0brownbrownlightCorellia29BBYunk_diedno_jedihumanblaster
8Sheev Palpatinemale1.7375.0blueredpaleNaboo82BBY10ABYno_jedihumanforce-lightning
9R2-D2male0.9632.0missingmissingmissingNaboo33BBYunk_diedno_jedidroidunarmed
10C-3POmale1.6775.0missingmissingmissingTatooine112BBY3ABYno_jedidroidunarmed
11Yodamale0.6617.0brownbrowngreenunk_planet896BBY4ABYjediyodalightsaber
12Darth Maulmale1.7580.0yellownoneredDathomir54BBYunk_diedno_jedidathomirianlightsaber
13Dookumale1.9386.0brownbrownlightSerenno102BBY19BBYjedihumanlightsaber
14Chewbaccamale2.28112.0bluebrownmissingKashyyyk200BBY25ABYno_jediwookieebowcaster
15Jabbamale3.9missingyellownonetan-greenTatooineunk_born4ABYno_jedihuttunarmed
16Lando Calrissianmale1.7879.0brownblankdarkSocorro31BBYunk_diedno_jedihumanblaster
17Boba Fettmale1.8378.0brownblackbrownKamino31.5BBYunk_diedno_jedihumanblaster
18Jango Fettmale1.8379.0brownblackbrownConcordDawn66BBY22BBYno_jedihumanblaster
19Grievousmale2.16159.0goldblackorangeKaleeunk_born19BBYno_jedikaleeshslugthrower
20Chief Chirpamale1.050.0blackgraybrownEndorunk_born4ABYno_jediewokspear
Tip

There are a lot more things which can be done with a DataFrame. Read the docs for more information.

For information on dplyr-type syntax:

Example: the passport problem

Let's now apply what we have learned to solve a real problem.

Data manipulation

The Passport Index Dataset lists travel visa requirements for 199 countries, in .csv format. Our task is to find the minimum number of passports required to visit all countries.

passport_data = CSV.read(
    joinpath(DATA_DIR, "passport-index-matrix.csv"),
    DataFrames.DataFrame,
);

In this dataset, the first column represents a passport (=from) and each remaining column represents a foreign country (=to).

The values in each cell are as follows:

  • 3 = visa-free travel
  • 2 = eTA is required
  • 1 = visa can be obtained on arrival
  • 0 = visa is required
  • -1 is for all instances where passport and destination are the same

Our task is to find out the minimum number of passports needed to visit every country without requiring a visa.

The values we are interested in are -1 and 3. Let's modify the dataframe so that the -1 and 3 are 1 (true), and all others are 0 (false):

function modifier(x)
    if x == -1 || x == 3
        return 1
    else
        return 0
    end
end

for country in passport_data.Passport
    passport_data[!, country] = modifier.(passport_data[!, country])
end

The values in the cells now represent:

  • 1 = no visa required for travel
  • 0 = visa required for travel

JuMP Modeling

To model the problem as a mixed-integer linear program, we need a binary decision variable $x_c$ for each country $c$. $x_c$ is $1$ if we select passport $c$ and $0$ otherwise. Our objective is to minimize the sum $\sum x_c$ over all countries.

Since we wish to visit all the countries, for every country, we must own at least one passport that lets us travel to that country visa free. For one destination, this can be mathematically represented as $\sum_{c \in C} a_{c,d} \cdot x_{d} \geq 1$, where $a$ is the passport_data dataframe.

Thus, we can represent this problem using the following model:

\[\begin{aligned} \min && \sum_{c \in C} x_c \\ \text{s.t.} && \sum_{c \in C} a_{c,d} x_c \geq 1 && \forall d \in C \\ && x_c \in \{0,1\} && \forall c \in C. \end{aligned}\]

We'll now solve the problem using JuMP:

using JuMP
import HiGHS

First, create the set of countries:

C = passport_data.Passport
199-element Vector{String}:
 "Afghanistan"
 "Albania"
 "Algeria"
 "Andorra"
 "Angola"
 "Antigua and Barbuda"
 "Argentina"
 "Armenia"
 "Australia"
 "Austria"
 ⋮
 "Uruguay"
 "Uzbekistan"
 "Vanuatu"
 "Vatican"
 "Venezuela"
 "Viet Nam"
 "Yemen"
 "Zambia"
 "Zimbabwe"

Then, create the model and initialize the decision variables:

model = Model(HiGHS.Optimizer)
set_silent(model)
@variable(model, x[C], Bin)
@objective(model, Min, sum(x))
@constraint(model, [d in C], passport_data[!, d]' * x >= 1)
model
A JuMP Model
├ solver: HiGHS
├ objective_sense: MIN_SENSE
│ └ objective_function_type: AffExpr
├ num_variables: 199
├ num_constraints: 398
│ ├ AffExpr in MOI.GreaterThan{Float64}: 199
│ └ VariableRef in MOI.ZeroOne: 199
└ Names registered in the model
  └ :x

Now optimize:

optimize!(model)

We can use the solution_summary function to get an overview of the solution:

solution_summary(model)
* Solver : HiGHS

* Status
  Result count       : 1
  Termination status : OPTIMAL
  Message from the solver:
  "kHighsModelStatusOptimal"

* Candidate solution (result #1)
  Primal status      : FEASIBLE_POINT
  Dual status        : NO_SOLUTION
  Objective value    : 2.30000e+01
  Objective bound    : 2.30000e+01
  Relative gap       : 0.00000e+00
  Dual objective value : NaN

* Work counters
  Solve time (sec)   : 9.13119e-03
  Simplex iterations : 26
  Barrier iterations : -1
  Node count         : 1

Just to be sure, check that the solver found an optimal solution:

@assert is_solved_and_feasible(model)

Solution

Let's have a look at the solution in more detail:

println("Minimum number of passports needed: ", objective_value(model))
Minimum number of passports needed: 23.0
println("Optimal passports:")
for c in C
    if value(x[c]) > 0.5
        println(" * ", c)
    end
end
Optimal passports:
 * Afghanistan
 * Chad
 * Comoros
 * Djibouti
 * Georgia
 * Hong Kong
 * India
 * Luxembourg
 * Madagascar
 * Maldives
 * Mali
 * New Zealand
 * North Korea
 * Papua New Guinea
 * Singapore
 * Somalia
 * Sri Lanka
 * Tunisia
 * Turkey
 * Uganda
 * United Arab Emirates
 * United States
 * Zimbabwe

We need some passports, like New Zealand and the United States, which have widespread access to a large number of countries. However, we also need passports like North Korea which only have visa-free access to a very limited number of countries.

Note

We use value(x[c]) > 0.5 rather than value(x[c]) == 1 to avoid excluding solutions like x[c] = 0.99999 that are "1" to some tolerance.