Getting started with data and plotting

This tutorial was generated using Literate.jl. Download the source as a .jl file.

In this tutorial we will learn how to read tabular data into Julia, and some of the basics of plotting.

If you're new to Julia, read Getting started with Julia and Getting started with JuMP first.

Note

There are multiple ways to read the same kind of data into Julia. This tutorial focuses on DataFrames.jl because it provides the ecosystem to work with most of the required file types in a straightforward manner.

Before we get started, we need this constant to point to where the data files are.

import JuMP
const DATA_DIR = joinpath(
    dirname(pathof(JuMP)),
    joinpath("..", "docs", "src", "tutorials", "getting_started", "data"),
);

Where to get help

Read the documentation

Plots.jl: http://docs.juliaplots.org/latest/
CSV.jl: http://csv.juliadata.org/stable
DataFrames.jl: https://dataframes.juliadata.org/stable/

Preliminaries

To get started, we need to install some packages.

DataFrames.jl

The DataFrames package provides a set of tools for working with tabular data. It is available through the Julia package manager.

using Pkg
Pkg.add("DataFrames")

import DataFrames

What is a DataFrame?

A DataFrame is a data structure like a table or spreadsheet. You can use it for storing and exploring a set of related data values. Think of it as a smarter array for holding tabular data.
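
As a minimal sketch of that idea, a DataFrame can also be built by hand from named columns. The column names and values below are made up for illustration and are not part of the tutorial's data files.

```julia
import DataFrames

# Each keyword argument becomes a column; all columns must have the same length.
people = DataFrames.DataFrame(;
    Name = ["Ada", "Grace", "Linus"],
    Height = [1.65, 1.70, 1.80],
)

# Columns are ordinary vectors, so the usual array functions apply.
tallest = people.Name[argmax(people.Height)]
```

Here `people` has three rows and two columns, and `tallest` picks the name in the row with the largest height.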

Plots.jl

The Plots package provides a set of tools for plotting. It is available through the Julia package manager.

using Pkg
Pkg.add("Plots")

import Plots

CSV.jl

CSV and other delimited text files can be read by the CSV.jl package.

Pkg.add("CSV")

import CSV

DataFrame basics

To read a CSV file into a DataFrame, we use the CSV.read function.

csv_df = CSV.read(joinpath(DATA_DIR, "StarWars.csv"), DataFrames.DataFrame)

20×13 DataFrame

Row	Name	Gender	Height	Weight	Eyecolor	Haircolor	Skincolor	Homeland	Born	Died	Jedi	Species	Weapon
	String31	String7	Float64	String7	String15	String7	String15	String15	String15	String15	String7	String15	String15
1	Anakin Skywalker	male	1.88	84	blue	blond	fair	Tatooine	41.9BBY	4ABY	jedi	human	lightsaber
2	Padme Amidala	female	1.65	45	brown	brown	light	Naboo	46BBY	19BBY	no_jedi	human	unarmed
3	Luke Skywalker	male	1.72	77	blue	blond	fair	Tatooine	19BBY	unk_died	jedi	human	lightsaber
4	Leia Skywalker	female	1.5	49	brown	brown	light	Alderaan	19BBY	unk_died	no_jedi	human	blaster
5	Qui-Gon Jinn	male	1.93	88.5	blue	brown	light	unk_planet	92BBY	32BBY	jedi	human	lightsaber
6	Obi-Wan Kenobi	male	1.82	77	bluegray	auburn	fair	Stewjon	57BBY	0BBY	jedi	human	lightsaber
7	Han Solo	male	1.8	80	brown	brown	light	Corellia	29BBY	unk_died	no_jedi	human	blaster
8	Sheev Palpatine	male	1.73	75	blue	red	pale	Naboo	82BBY	10ABY	no_jedi	human	force-lightning
9	R2-D2	male	0.96	32	NA	NA	NA	Naboo	33BBY	unk_died	no_jedi	droid	unarmed
10	C-3PO	male	1.67	75	NA	NA	NA	Tatooine	112BBY	3ABY	no_jedi	droid	unarmed
11	Yoda	male	0.66	17	brown	brown	green	unk_planet	896BBY	4ABY	jedi	yoda	lightsaber
12	Darth Maul	male	1.75	80	yellow	none	red	Dathomir	54BBY	unk_died	no_jedi	dathomirian	lightsaber
13	Dooku	male	1.93	86	brown	brown	light	Serenno	102BBY	19BBY	jedi	human	lightsaber
14	Chewbacca	male	2.28	112	blue	brown	NA	Kashyyyk	200BBY	25ABY	no_jedi	wookiee	bowcaster
15	Jabba	male	3.9	NA	yellow	none	tan-green	Tatooine	unk_born	4ABY	no_jedi	hutt	unarmed
16	Lando Calrissian	male	1.78	79	brown	blank	dark	Socorro	31BBY	unk_died	no_jedi	human	blaster
17	Boba Fett	male	1.83	78	brown	black	brown	Kamino	31.5BBY	unk_died	no_jedi	human	blaster
18	Jango Fett	male	1.83	79	brown	black	brown	ConcordDawn	66BBY	22BBY	no_jedi	human	blaster
19	Grievous	male	2.16	159	gold	black	orange	Kalee	unk_born	19BBY	no_jedi	kaleesh	slugthrower
20	Chief Chirpa	male	1.0	50	black	gray	brown	Endor	unk_born	4ABY	no_jedi	ewok	spear

Let's try plotting some of this data:

Plots.scatter(
    csv_df.Weight,
    csv_df.Height;
    xlabel = "Weight",
    ylabel = "Height",
)

That doesn't look right. What happened? If you look at the dataframe above, Weight was read in as a String column because some of its fields contain "NA". Let's correct that by telling CSV.jl to treat "NA" as missing.

csv_df = CSV.read(
    joinpath(DATA_DIR, "StarWars.csv"),
    DataFrames.DataFrame;
    missingstring = "NA",
)

20×13 DataFrame

Row	Name	Gender	Height	Weight	Eyecolor	Haircolor	Skincolor	Homeland	Born	Died	Jedi	Species	Weapon
	String31	String7	Float64	Float64?	String15?	String7?	String15?	String15	String15	String15	String7	String15	String15
1	Anakin Skywalker	male	1.88	84.0	blue	blond	fair	Tatooine	41.9BBY	4ABY	jedi	human	lightsaber
2	Padme Amidala	female	1.65	45.0	brown	brown	light	Naboo	46BBY	19BBY	no_jedi	human	unarmed
3	Luke Skywalker	male	1.72	77.0	blue	blond	fair	Tatooine	19BBY	unk_died	jedi	human	lightsaber
4	Leia Skywalker	female	1.5	49.0	brown	brown	light	Alderaan	19BBY	unk_died	no_jedi	human	blaster
5	Qui-Gon Jinn	male	1.93	88.5	blue	brown	light	unk_planet	92BBY	32BBY	jedi	human	lightsaber
6	Obi-Wan Kenobi	male	1.82	77.0	bluegray	auburn	fair	Stewjon	57BBY	0BBY	jedi	human	lightsaber
7	Han Solo	male	1.8	80.0	brown	brown	light	Corellia	29BBY	unk_died	no_jedi	human	blaster
8	Sheev Palpatine	male	1.73	75.0	blue	red	pale	Naboo	82BBY	10ABY	no_jedi	human	force-lightning
9	R2-D2	male	0.96	32.0	missing	missing	missing	Naboo	33BBY	unk_died	no_jedi	droid	unarmed
10	C-3PO	male	1.67	75.0	missing	missing	missing	Tatooine	112BBY	3ABY	no_jedi	droid	unarmed
11	Yoda	male	0.66	17.0	brown	brown	green	unk_planet	896BBY	4ABY	jedi	yoda	lightsaber
12	Darth Maul	male	1.75	80.0	yellow	none	red	Dathomir	54BBY	unk_died	no_jedi	dathomirian	lightsaber
13	Dooku	male	1.93	86.0	brown	brown	light	Serenno	102BBY	19BBY	jedi	human	lightsaber
14	Chewbacca	male	2.28	112.0	blue	brown	missing	Kashyyyk	200BBY	25ABY	no_jedi	wookiee	bowcaster
15	Jabba	male	3.9	missing	yellow	none	tan-green	Tatooine	unk_born	4ABY	no_jedi	hutt	unarmed
16	Lando Calrissian	male	1.78	79.0	brown	blank	dark	Socorro	31BBY	unk_died	no_jedi	human	blaster
17	Boba Fett	male	1.83	78.0	brown	black	brown	Kamino	31.5BBY	unk_died	no_jedi	human	blaster
18	Jango Fett	male	1.83	79.0	brown	black	brown	ConcordDawn	66BBY	22BBY	no_jedi	human	blaster
19	Grievous	male	2.16	159.0	gold	black	orange	Kalee	unk_born	19BBY	no_jedi	kaleesh	slugthrower
20	Chief Chirpa	male	1.0	50.0	black	gray	brown	Endor	unk_born	4ABY	no_jedi	ewok	spear
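
The effect of missingstring can be reproduced on a tiny in-memory example. This is a sketch under the assumption that CSV.read accepts an IOBuffer as its source; the three rows are made up and stand in for the real file.

```julia
import CSV
import DataFrames

# A tiny CSV with one "NA" entry in an otherwise numeric column (made-up data).
text = """
Name,Weight
A,84
B,88.5
C,NA
"""

raw = CSV.read(IOBuffer(text), DataFrames.DataFrame)
clean = CSV.read(IOBuffer(text), DataFrames.DataFrame; missingstring = "NA")

# Without `missingstring`, the "NA" entry forces the whole column to be text.
raw_is_text = eltype(raw.Weight) <: AbstractString

# With it, the column is numeric and is allowed to contain `missing`.
clean_type = eltype(clean.Weight)

# `skipmissing` lets us aggregate while ignoring the missing entry.
total = sum(skipmissing(clean.Weight))
```

The question mark in a column header such as Float64? is how DataFrames.jl displays a column whose element type is Union{Missing, Float64}.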

Then let's re-plot our data:

Plots.scatter(
    csv_df.Weight,
    csv_df.Height;
    title = "Height vs Weight of StarWars characters",
    xlabel = "Weight",
    ylabel = "Height",
    label = false,
    ylims = (0, 3),
)

That looks better.

Tip

Read the CSV documentation for other parsing options.

DataFrames.jl supports manipulation using functions similar to those in pandas. For example, split the dataframe into groups based on eye color:

by_eyecolor = DataFrames.groupby(csv_df, :Eyecolor)

GroupedDataFrame with 7 groups based on key: Eyecolor

First Group (5 rows): Eyecolor = "blue"

Row	Name	Gender	Height	Weight	Eyecolor	Haircolor	Skincolor	Homeland	Born	Died	Jedi	Species	Weapon
	String31	String7	Float64	Float64?	String15?	String7?	String15?	String15	String15	String15	String7	String15	String15
1	Anakin Skywalker	male	1.88	84.0	blue	blond	fair	Tatooine	41.9BBY	4ABY	jedi	human	lightsaber
2	Luke Skywalker	male	1.72	77.0	blue	blond	fair	Tatooine	19BBY	unk_died	jedi	human	lightsaber
3	Qui-Gon Jinn	male	1.93	88.5	blue	brown	light	unk_planet	92BBY	32BBY	jedi	human	lightsaber
4	Sheev Palpatine	male	1.73	75.0	blue	red	pale	Naboo	82BBY	10ABY	no_jedi	human	force-lightning
5	Chewbacca	male	2.28	112.0	blue	brown	missing	Kashyyyk	200BBY	25ABY	no_jedi	wookiee	bowcaster

⋮

Last Group (1 row): Eyecolor = "black"

Row	Name	Gender	Height	Weight	Eyecolor	Haircolor	Skincolor	Homeland	Born	Died	Jedi	Species	Weapon
	String31	String7	Float64	Float64?	String15?	String7?	String15?	String15	String15	String15	String7	String15	String15
1	Chief Chirpa	male	1.0	50.0	black	gray	brown	Endor	unk_born	4ABY	no_jedi	ewok	spear

Then recombine into a single dataframe based on a function operating over the split dataframes:

eyecolor_count = DataFrames.combine(by_eyecolor) do df
    return DataFrames.nrow(df)
end

7×2 DataFrame

Row	Eyecolor	x1
	String15?	Int64
1	blue	5
2	brown	8
3	bluegray	1
4	missing	2
5	yellow	2
6	gold	1
7	black	1
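
The same split-and-recombine pattern can be sketched on a small made-up table. This variant passes `nrow => :count` to combine, which counts the rows in each group and names the resulting column in one step; the column name `color` and its values are illustrative only.

```julia
import DataFrames

df = DataFrames.DataFrame(; color = ["blue", "brown", "blue", "brown", "brown"])

# Split by `color`, then count the rows in each group and call the result `count`.
counts = DataFrames.combine(
    DataFrames.groupby(df, :color),
    DataFrames.nrow => :count,
)

# Collect the result into a dictionary so it can be looked up by color.
count_by_color = Dict(counts.color .=> counts.count)
```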

We can rename columns:

DataFrames.rename!(eyecolor_count, :x1 => :count)

7×2 DataFrame

Row	Eyecolor	count
	String15?	Int64
1	blue	5
2	brown	8
3	bluegray	1
4	missing	2
5	yellow	2
6	gold	1
7	black	1

Drop the rows in which Eyecolor is missing:

DataFrames.dropmissing!(eyecolor_count, :Eyecolor)

6×2 DataFrame

Row	Eyecolor	count
	String15	Int64
1	blue	5
2	brown	8
3	bluegray	1
4	yellow	2
5	gold	1
6	black	1
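
Note that the column type changed from String15? to String15 once the missing rows were removed. A minimal sketch on made-up data, using the non-mutating dropmissing so that the original table is left untouched:

```julia
import DataFrames

df = DataFrames.DataFrame(; color = ["blue", missing, "gold"], count = [5, 2, 1])

# Non-mutating variant: `df` keeps its missing row, `kept` does not.
kept = DataFrames.dropmissing(df, :color)

# The element type of the cleaned column no longer allows `missing`.
before_type = eltype(df.color)
after_type = eltype(kept.color)
```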

Then we can visualize the data:

sort!(eyecolor_count, :count; rev = true)
Plots.bar(
    eyecolor_count.Eyecolor,
    eyecolor_count.count;
    xlabel = "Eye color",
    ylabel = "Number of characters",
    label = false,
)

Other Delimited Files

We can also use the CSV.jl package to read any other delimited text file format.

By default, CSV.File will try to detect a file's delimiter from the first 10 lines of the file.

Candidate delimiters include ',', '\t', ' ', '|', ';', and ':'. If it can't auto-detect the delimiter, it will assume ','.

Let's take the example of space separated data.

ss_df = CSV.read(joinpath(DATA_DIR, "Cereal.txt"), DataFrames.DataFrame)

23×10 DataFrame

Row	Name	Cups	Calories	Carbs	Fat	Fiber	Potassium	Protein	Sodium	Sugars
	String31	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
1	CapnCrunch	0.75	120	12.0	2	0.0	35	1	220	12
2	CocoaPuffs	1.0	110	12.0	1	0.0	55	1	180	13
3	Trix	1.0	110	13.0	1	0.0	25	1	140	12
4	AppleJacks	1.0	110	11.0	0	1.0	30	2	125	14
5	CornChex	1.0	110	22.0	0	0.0	25	2	280	3
6	CornFlakes	1.0	100	21.0	0	1.0	35	2	290	2
7	Nut&Honey	0.67	120	15.0	1	0.0	40	2	190	9
8	Smacks	0.75	110	9.0	1	1.0	40	2	70	15
9	MultiGrain	1.0	100	15.0	1	2.0	90	2	220	6
10	CracklinOat	0.5	110	10.0	3	4.0	160	3	140	7
11	GrapeNuts	0.25	110	17.0	0	3.0	90	3	179	3
12	HoneyNutCheerios	0.75	110	11.5	1	1.5	90	3	250	10
13	NutriGrain	0.67	140	21.0	2	3.0	130	3	220	7
14	Product19	1.0	100	20.0	0	1.0	45	3	320	3
15	TotalRaisinBran	1.0	140	15.0	1	4.0	230	3	190	14
16	WheatChex	0.67	100	17.0	1	3.0	115	3	230	3
17	Oatmeal	0.5	130	13.5	2	1.5	120	3	170	10
18	Life	0.67	100	12.0	2	2.0	95	4	150	6
19	Maypo	1.0	100	16.0	1	0.0	95	4	0	3
20	QuakerOats	0.5	100	14.0	1	2.0	110	4	135	6
21	Muesli	1.0	150	16.0	3	3.0	170	4	150	11
22	Cheerios	1.25	110	17.0	2	2.0	105	6	290	1
23	SpecialK	1.0	110	16.0	0	1.0	55	6	230	3

We can also specify the delimiter as follows:

delim_df = CSV.read(
    joinpath(DATA_DIR, "Soccer.txt"),
    DataFrames.DataFrame;
    delim = "::",
)

20×7 DataFrame

Row	Team	Played	Wins	Draws	Losses	Goals_for	Goals_against
	String31	Int64	Int64	Int64	Int64	String15	String15
1	Barcelona	38	30	4	4	110 goals	21 goals
2	Real Madrid	38	30	2	6	118 goals	38 goals
3	Atletico Madrid	38	23	9	6	67 goals	29 goals
4	Valencia	38	22	11	5	70 goals	32 goals
5	Seville	38	23	7	8	71 goals	45 goals
6	Villarreal	38	16	12	10	48 goals	37 goals
7	Athletic Bilbao	38	15	10	13	42 goals	41 goals
8	Celta Vigo	38	13	12	13	47 goals	44 goals
9	Malaga	38	14	8	16	42 goals	48 goals
10	Espanyol	38	13	10	15	47 goals	51 goals
11	Rayo Vallecano	38	15	4	19	46 goals	68 goals
12	Real Sociedad	38	11	13	14	44 goals	51 goals
13	Elche	38	11	8	19	35 goals	62 goals
14	Levante	38	9	10	19	34 goals	67 goals
15	Getafe	38	10	7	21	33 goals	64 goals
16	Deportivo La Coruna	38	7	14	17	35 goals	60 goals
17	Granada	38	7	14	17	29 goals	64 goals
18	Eibar	38	9	8	21	34 goals	55 goals
19	Almeria	38	8	8	22	35 goals	64 goals
20	Cordoba	38	3	11	24	22 goals	68 goals
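
In the output above, Goals_for and Goals_against were read as strings because each value carries a " goals" suffix. One possible way to recover the numbers is sketched below on a made-up two-row sample; the team names and the new column name `Goals` are hypothetical.

```julia
import CSV
import DataFrames

text = """
Team::Played::Goals_for
Alpha FC::38::110 goals
Beta United::38::67 goals
"""

df = CSV.read(IOBuffer(text), DataFrames.DataFrame; delim = "::")

# Strip the textual suffix, then parse what is left as an integer.
df.Goals = [parse(Int, replace(String(s), " goals" => "")) for s in df.Goals_for]
```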

Working with DataFrames

Now that we have read the required data into a DataFrame, let us look at some basic operations we can perform on it.

Querying Basic Information

The size function gets us the dimensions of the DataFrame:

DataFrames.size(ss_df)

(23, 10)

We can also use the nrow and ncol functions to get the number of rows and columns respectively:

DataFrames.nrow(ss_df), DataFrames.ncol(ss_df)

(23, 10)

The describe function gives basic summary statistics of data in a DataFrame:

DataFrames.describe(ss_df)

10×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	DataType
1	Name		AppleJacks		WheatChex	0	String31
2	Cups	0.823043	0.25	1.0	1.25	0	Float64
3	Calories	113.043	100	110.0	150	0	Int64
4	Carbs	15.0435	9.0	15.0	22.0	0	Float64
5	Fat	1.13043	0	1.0	3	0	Int64
6	Fiber	1.56522	0.0	1.5	4.0	0	Float64
7	Potassium	86.3043	25	90.0	230	0	Int64
8	Protein	2.91304	1	3.0	6	0	Int64
9	Sodium	189.957	0	190.0	320	0	Int64
10	Sugars	7.52174	1	7.0	15	0	Int64

Names of every column can be obtained by the names function:

DataFrames.names(ss_df)

10-element Vector{String}:
 "Name"
 "Cups"
 "Calories"
 "Carbs"
 "Fat"
 "Fiber"
 "Potassium"
 "Protein"
 "Sodium"
 "Sugars"

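Broadcasting the eltype function over the DataFrame returns the element type of each cell. Because it is applied cell by cell, the Name column reports Char (the element type of a string) rather than the column type String31. For one type per column, broadcast over eachcol(ss_df) instead. The cell-by-cell form looks like this: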

eltype.(ss_df)

23×10 DataFrame

Row	Name	Cups	Calories	Carbs	Fat	Fiber	Potassium	Protein	Sodium	Sugars
	DataType	DataType	DataType	DataType	DataType	DataType	DataType	DataType	DataType	DataType
1	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
2	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
3	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
4	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
5	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
6	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
7	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
8	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
9	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
10	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
11	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
12	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
13	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
14	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
15	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
16	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
17	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
18	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
19	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
20	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
21	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
22	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
23	Char	Float64	Int64	Float64	Int64	Float64	Int64	Int64	Int64	Int64
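
If one type per column is what you are after, a sketch along the following lines may be closer to the intent. The small table is made up for illustration.

```julia
import DataFrames

df = DataFrames.DataFrame(;
    Name = ["Oats", "Bran"],
    Cups = [0.5, 1.0],
    Calories = [100, 140],
)

# `eltype` of a single string is `Char`, which is why a cell-by-cell
# broadcast reports `Char` for text columns.
per_cell = eltype("Oats")

# Iterating over the columns instead gives one element type per column.
per_column = eltype.(eachcol(df))
```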

Accessing the Data

Similar to regular arrays, we use numerical indexing to access elements of a DataFrame:

csv_df[1, 1]

"Anakin Skywalker"

The following are different ways to access a column:

csv_df[!, 1]

20-element Vector{InlineStrings.String31}:
 "Anakin Skywalker"
 "Padme Amidala"
 "Luke Skywalker"
 "Leia Skywalker"
 "Qui-Gon Jinn"
 "Obi-Wan Kenobi"
 "Han Solo"
 "Sheev Palpatine"
 "R2-D2"
 "C-3PO"
 "Yoda"
 "Darth Maul"
 "Dooku"
 "Chewbacca"
 "Jabba"
 "Lando Calrissian"
 "Boba Fett"
 "Jango Fett"
 "Grievous"
 "Chief Chirpa"

csv_df[!, :Name]

20-element Vector{InlineStrings.String31}:
 "Anakin Skywalker"
 "Padme Amidala"
 "Luke Skywalker"
 "Leia Skywalker"
 "Qui-Gon Jinn"
 "Obi-Wan Kenobi"
 "Han Solo"
 "Sheev Palpatine"
 "R2-D2"
 "C-3PO"
 "Yoda"
 "Darth Maul"
 "Dooku"
 "Chewbacca"
 "Jabba"
 "Lando Calrissian"
 "Boba Fett"
 "Jango Fett"
 "Grievous"
 "Chief Chirpa"

csv_df.Name

20-element Vector{InlineStrings.String31}:
 "Anakin Skywalker"
 "Padme Amidala"
 "Luke Skywalker"
 "Leia Skywalker"
 "Qui-Gon Jinn"
 "Obi-Wan Kenobi"
 "Han Solo"
 "Sheev Palpatine"
 "R2-D2"
 "C-3PO"
 "Yoda"
 "Darth Maul"
 "Dooku"
 "Chewbacca"
 "Jabba"
 "Lando Calrissian"
 "Boba Fett"
 "Jango Fett"
 "Grievous"
 "Chief Chirpa"

csv_df[:, 1] # Note that this creates a copy.

20-element Vector{InlineStrings.String31}:
 "Anakin Skywalker"
 "Padme Amidala"
 "Luke Skywalker"
 "Leia Skywalker"
 "Qui-Gon Jinn"
 "Obi-Wan Kenobi"
 "Han Solo"
 "Sheev Palpatine"
 "R2-D2"
 "C-3PO"
 "Yoda"
 "Darth Maul"
 "Dooku"
 "Chewbacca"
 "Jabba"
 "Lando Calrissian"
 "Boba Fett"
 "Jango Fett"
 "Grievous"
 "Chief Chirpa"

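The difference between `!` and `:` only matters once you modify the result. A minimal sketch on a made-up one-column table:

```julia
import DataFrames

df = DataFrames.DataFrame(; a = [1, 2, 3])

no_copy = df[!, :a]   # the column stored inside `df`
copy_col = df[:, :a]  # a freshly allocated copy

copy_col[1] = 100     # changes only the copy
no_copy[2] = 200      # writes through to `df`
```

After these assignments `df.a` contains 1, 200, 3: the write through `no_copy` is visible in the table, while the write to `copy_col` is not.
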
The following are different ways to access a row:

csv_df[1:1, :]

1×13 DataFrame

Row	Name	Gender	Height	Weight	Eyecolor	Haircolor	Skincolor	Homeland	Born	Died	Jedi	Species	Weapon
	String31	String7	Float64	Float64?	String15?	String7?	String15?	String15	String15	String15	String7	String15	String15
1	Anakin Skywalker	male	1.88	84.0	blue	blond	fair	Tatooine	41.9BBY	4ABY	jedi	human	lightsaber

csv_df[1, :] # This produces a DataFrameRow.

DataFrameRow (13 columns)

Row	Name	Gender	Height	Weight	Eyecolor	Haircolor	Skincolor	Homeland	Born	Died	Jedi	Species	Weapon
	String31	String7	Float64	Float64?	String15?	String7?	String15?	String15	String15	String15	String7	String15	String15
1	Anakin Skywalker	male	1.88	84.0	blue	blond	fair	Tatooine	41.9BBY	4ABY	jedi	human	lightsaber
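
The two row forms also behave differently under mutation: a DataFrameRow is a view into its parent table, whereas slicing with a range produces a new DataFrame. A small sketch with made-up data:

```julia
import DataFrames

df = DataFrames.DataFrame(; a = [1, 2], b = ["x", "y"])

row_view = df[1, :]    # DataFrameRow: a view into `df`
row_copy = df[1:1, :]  # 1-row DataFrame: an independent copy

row_view.a = 10        # writes through to `df`
row_copy[1, :a] = -1   # changes only the copy
```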

We can change values using ordinary assignment syntax.

Assign a scalar to a range of rows:

csv_df[1:3, :Height] .= 1.83

3-element view(::Vector{Float64}, 1:3) with eltype Float64:
 1.83
 1.83
 1.83

Assign a vector:

csv_df[4:6, :Height] = [1.8, 1.6, 1.8]

3-element Vector{Float64}:
 1.8
 1.6
 1.8

csv_df

20×13 DataFrame

Row	Name	Gender	Height	Weight	Eyecolor	Haircolor	Skincolor	Homeland	Born	Died	Jedi	Species	Weapon
	String31	String7	Float64	Float64?	String15?	String7?	String15?	String15	String15	String15	String7	String15	String15
1	Anakin Skywalker	male	1.83	84.0	blue	blond	fair	Tatooine	41.9BBY	4ABY	jedi	human	lightsaber
2	Padme Amidala	female	1.83	45.0	brown	brown	light	Naboo	46BBY	19BBY	no_jedi	human	unarmed
3	Luke Skywalker	male	1.83	77.0	blue	blond	fair	Tatooine	19BBY	unk_died	jedi	human	lightsaber
4	Leia Skywalker	female	1.8	49.0	brown	brown	light	Alderaan	19BBY	unk_died	no_jedi	human	blaster
5	Qui-Gon Jinn	male	1.6	88.5	blue	brown	light	unk_planet	92BBY	32BBY	jedi	human	lightsaber
6	Obi-Wan Kenobi	male	1.8	77.0	bluegray	auburn	fair	Stewjon	57BBY	0BBY	jedi	human	lightsaber
7	Han Solo	male	1.8	80.0	brown	brown	light	Corellia	29BBY	unk_died	no_jedi	human	blaster
8	Sheev Palpatine	male	1.73	75.0	blue	red	pale	Naboo	82BBY	10ABY	no_jedi	human	force-lightning
9	R2-D2	male	0.96	32.0	missing	missing	missing	Naboo	33BBY	unk_died	no_jedi	droid	unarmed
10	C-3PO	male	1.67	75.0	missing	missing	missing	Tatooine	112BBY	3ABY	no_jedi	droid	unarmed
11	Yoda	male	0.66	17.0	brown	brown	green	unk_planet	896BBY	4ABY	jedi	yoda	lightsaber
12	Darth Maul	male	1.75	80.0	yellow	none	red	Dathomir	54BBY	unk_died	no_jedi	dathomirian	lightsaber
13	Dooku	male	1.93	86.0	brown	brown	light	Serenno	102BBY	19BBY	jedi	human	lightsaber
14	Chewbacca	male	2.28	112.0	blue	brown	missing	Kashyyyk	200BBY	25ABY	no_jedi	wookiee	bowcaster
15	Jabba	male	3.9	missing	yellow	none	tan-green	Tatooine	unk_born	4ABY	no_jedi	hutt	unarmed
16	Lando Calrissian	male	1.78	79.0	brown	blank	dark	Socorro	31BBY	unk_died	no_jedi	human	blaster
17	Boba Fett	male	1.83	78.0	brown	black	brown	Kamino	31.5BBY	unk_died	no_jedi	human	blaster
18	Jango Fett	male	1.83	79.0	brown	black	brown	ConcordDawn	66BBY	22BBY	no_jedi	human	blaster
19	Grievous	male	2.16	159.0	gold	black	orange	Kalee	unk_born	19BBY	no_jedi	kaleesh	slugthrower
20	Chief Chirpa	male	1.0	50.0	black	gray	brown	Endor	unk_born	4ABY	no_jedi	ewok	spear
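
Rows can also be selected with a Boolean mask instead of a range. The following sketch caps every height above 2.0 in a made-up table; the threshold is arbitrary.

```julia
import DataFrames

df = DataFrames.DataFrame(; Name = ["A", "B", "C"], Height = [1.5, 2.5, 1.8])

# `df.Height .> 2.0` is a vector of Bools that selects the rows to overwrite.
df[df.Height .> 2.0, :Height] .= 2.0
```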

Tip

There are a lot more things which can be done with a DataFrame. Read the docs for more information.

For information on dplyr-type syntax:

Read the DataFrames.jl documentation
Check out DataFramesMeta.jl

Example: the passport problem

Let's now apply what we have learned to solve a real problem.

Data manipulation

The Passport Index Dataset lists travel visa requirements for 199 countries, in .csv format. Our task is to find the minimum number of passports required to visit all countries.

passport_data = CSV.read(
    joinpath(DATA_DIR, "passport-index-matrix.csv"),
    DataFrames.DataFrame,
);

In this dataset, the first column represents a passport (=from) and each remaining column represents a foreign country (=to).

The values in each cell are as follows:

3 = visa-free travel
2 = eTA is required
1 = visa can be obtained on arrival
0 = visa is required
-1 is for all instances where passport and destination are the same

Our task is to find out the minimum number of passports needed to visit every country without requiring a visa.

The values we are interested in are -1 and 3. Let's modify the dataframe so that the -1 and 3 are 1 (true), and all others are 0 (false):

function modifier(x)
    if x == -1 || x == 3
        return 1
    else
        return 0
    end
end

for country in passport_data.Passport
    passport_data[!, country] = modifier.(passport_data[!, country])
end

The values in the cells now represent:

1 = no visa required for travel
0 = visa required for travel
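
The recoding can be checked on a plain vector before applying it to the whole table. The helper below is a hypothetical one-line equivalent of modifier, named `visa_free` here only for illustration.

```julia
# 1 if the code means "no visa needed" (-1 or 3), 0 otherwise.
visa_free(x) = Int(x in (-1, 3))

# Apply it to one value of each kind that appears in the dataset.
codes = [3, 2, 1, 0, -1]
recoded = visa_free.(codes)
```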

JuMP Modeling

To model the problem as a mixed-integer linear program, we need a binary decision variable $x_c$ for each country $c$. $x_c$ is $1$ if we select passport $c$ and $0$ otherwise. Our objective is to minimize the sum $\sum x_c$ over all countries.

Since we wish to visit all the countries, for every destination country we must own at least one passport that lets us travel there visa free. For one destination $d$, this can be mathematically represented as $\sum_{c \in C} a_{c,d} \cdot x_{c} \geq 1$, where $a$ is the passport_data dataframe.

Thus, we can represent this problem using the following model:

\[\begin{aligned} \min && \sum_{c \in C} x_c \\ \text{s.t.} && \sum_{c \in C} a_{c,d} x_c \geq 1 && \forall d \in C \\ && x_c \in \{0,1\} && \forall c \in C. \end{aligned}\]
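
To see what these constraints demand before handing the model to a solver, here is a brute-force sketch on a made-up three-country matrix. It enumerates every 0/1 vector $x$ and keeps the feasible one with the fewest passports. This is only practical for tiny instances; the function name `min_cover` and the matrix are hypothetical.

```julia
# a[c, d] == 1 means passport c gives visa-free access to destination d.
a = [
    1 1 0
    0 1 0
    0 1 1
]

function min_cover(a)
    n = size(a, 1)
    # Owning every passport is always feasible because a[c, c] == 1.
    best = ones(Int, n)
    for mask in 0:(2^n - 1)
        # Decode the bits of `mask` into a 0/1 selection vector.
        x = [(mask >> (c - 1)) & 1 for c in 1:n]
        # Every destination d needs at least one selected passport with a[c, d] == 1.
        feasible = all(sum(a[c, d] * x[c] for c in 1:n) >= 1 for d in 1:n)
        if feasible && sum(x) < sum(best)
            best = x
        end
    end
    return best
end

best = min_cover(a)
```

In this instance destination 1 is reachable only with passport 1 and destination 3 only with passport 3, so both must be selected, and together they also cover destination 2.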

We'll now solve the problem using JuMP:

using JuMP
import HiGHS

First, create the set of countries:

C = passport_data.Passport

199-element Vector{String}:
 "Afghanistan"
 "Albania"
 "Algeria"
 "Andorra"
 "Angola"
 "Antigua and Barbuda"
 "Argentina"
 "Armenia"
 "Australia"
 "Austria"
 ⋮
 "Uruguay"
 "Uzbekistan"
 "Vanuatu"
 "Vatican"
 "Venezuela"
 "Viet Nam"
 "Yemen"
 "Zambia"
 "Zimbabwe"

Then, create the model and initialize the decision variables:

model = Model(HiGHS.Optimizer)
set_silent(model)
@variable(model, x[C], Bin)
@objective(model, Min, sum(x))
@constraint(model, [d in C], passport_data[!, d]' * x >= 1)
model

A JuMP Model
├ solver: HiGHS
├ objective_sense: MIN_SENSE
│ └ objective_function_type: AffExpr
├ num_variables: 199
├ num_constraints: 398
│ ├ AffExpr in MOI.GreaterThan{Float64}: 199
│ └ VariableRef in MOI.ZeroOne: 199
└ Names registered in the model
  └ :x

Now optimize:

optimize!(model)

We can use the solution_summary function to get an overview of the solution:

solution_summary(model)

solution_summary(; result = 1, verbose = false)
├ solver_name          : HiGHS
├ Termination
│ ├ termination_status : OPTIMAL
│ ├ result_count       : 1
│ ├ raw_status         : kHighsModelStatusOptimal
│ └ objective_bound    : 2.30000e+01
├ Solution (result = 1)
│ ├ primal_status        : FEASIBLE_POINT
│ ├ dual_status          : NO_SOLUTION
│ ├ objective_value      : 2.30000e+01
│ ├ dual_objective_value : NaN
│ └ relative_gap         : 0.00000e+00
└ Work counters
  ├ solve_time (sec)   : 7.19881e-03
  ├ simplex_iterations : 26
  ├ barrier_iterations : -1
  └ node_count         : 1

Just to be sure, check that the solver found an optimal solution:

assert_is_solved_and_feasible(model)

Solution

Let's have a look at the solution in more detail:

println("Minimum number of passports needed: ", objective_value(model))

Minimum number of passports needed: 23.0

println("Optimal passports:")
for c in C
    if value(x[c]) > 0.5
        println(" * ", c)
    end
end

Optimal passports:
 * Afghanistan
 * Chad
 * Comoros
 * Djibouti
 * Georgia
 * Hong Kong
 * India
 * Luxembourg
 * Madagascar
 * Maldives
 * Mali
 * New Zealand
 * North Korea
 * Papua New Guinea
 * Singapore
 * Somalia
 * Sri Lanka
 * Tunisia
 * Turkey
 * Uganda
 * United Arab Emirates
 * United States
 * Zimbabwe

We need some passports, like New Zealand and the United States, which have visa-free access to a large number of countries. However, we also need passports like North Korea, which have visa-free access to only a very limited number of countries.

Note

We use value(x[c]) > 0.5 rather than value(x[c]) == 1 to avoid excluding solutions like x[c] = 0.99999 that are "1" to some tolerance.
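
The difference can be seen with a plain floating-point number standing in for a solver value. The number 0.99999 is the one from the note above; rounding is shown as one further option.

```julia
v = 0.99999

strict = v == 1          # exact comparison rejects the near-integer value
tolerant = v > 0.5       # threshold comparison accepts it
rounded = round(Int, v)  # rounding recovers the intended integer
```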