Getting started with data and plotting
This tutorial was generated using Literate.jl. Download the source as a .jl
file.
In this tutorial we will learn how to read tabular data into Julia, and some of the basics of plotting.
If you're new to Julia, start by reading Getting started with Julia and Getting started with JuMP first.
There are multiple ways to read the same kind of data into Julia. This tutorial focuses on DataFrames.jl because it provides the ecosystem to work with most of the required file types in a straightforward manner.
Before we get started, we need this constant to point to where the data files are.
import JuMP
const DATA_DIR = joinpath(
dirname(pathof(JuMP)),
joinpath("..", "docs", "src", "tutorials", "getting_started", "data"),
);
Where to get help
Read the documentation
- Plots.jl: http://docs.juliaplots.org/latest/
- CSV.jl: http://csv.juliadata.org/stable
- DataFrames.jl: https://dataframes.juliadata.org/stable/
Preliminaries
To get started, we need to install some packages.
DataFrames.jl
The DataFrames
package provides a set of tools for working with tabular data. It is available through the Julia package manager.
using Pkg
Pkg.add("DataFrames")
import DataFrames
A DataFrame is a data structure like a table or spreadsheet. You can use it for storing and exploring a set of related data values. Think of it as a smarter array for holding tabular data.
Plots.jl
The Plots
package provides a set of tools for plotting. It is available through the Julia package manager.
using Pkg
Pkg.add("Plots")
import Plots
CSV .jl
CSV and other delimited text files can be read by the CSV.jl package.
Pkg.add("CSV")
import CSV
DataFrame basics
To read a CSV file into a DataFrame, we use the CSV.read
function.
csv_df = CSV.read(joinpath(DATA_DIR, "StarWars.csv"), DataFrames.DataFrame)
Row | Name | Gender | Height | Weight | Eyecolor | Haircolor | Skincolor | Homeland | Born | Died | Jedi | Species | Weapon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
String31 | String7 | Float64 | String7 | String15 | String7 | String15 | String15 | String15 | String15 | String7 | String15 | String15 | |
1 | Anakin Skywalker | male | 1.88 | 84 | blue | blond | fair | Tatooine | 41.9BBY | 4ABY | jedi | human | lightsaber |
2 | Padme Amidala | female | 1.65 | 45 | brown | brown | light | Naboo | 46BBY | 19BBY | no_jedi | human | unarmed |
3 | Luke Skywalker | male | 1.72 | 77 | blue | blond | fair | Tatooine | 19BBY | unk_died | jedi | human | lightsaber |
4 | Leia Skywalker | female | 1.5 | 49 | brown | brown | light | Alderaan | 19BBY | unk_died | no_jedi | human | blaster |
5 | Qui-Gon Jinn | male | 1.93 | 88.5 | blue | brown | light | unk_planet | 92BBY | 32BBY | jedi | human | lightsaber |
6 | Obi-Wan Kenobi | male | 1.82 | 77 | bluegray | auburn | fair | Stewjon | 57BBY | 0BBY | jedi | human | lightsaber |
7 | Han Solo | male | 1.8 | 80 | brown | brown | light | Corellia | 29BBY | unk_died | no_jedi | human | blaster |
8 | Sheev Palpatine | male | 1.73 | 75 | blue | red | pale | Naboo | 82BBY | 10ABY | no_jedi | human | force-lightning |
9 | R2-D2 | male | 0.96 | 32 | NA | NA | NA | Naboo | 33BBY | unk_died | no_jedi | droid | unarmed |
10 | C-3PO | male | 1.67 | 75 | NA | NA | NA | Tatooine | 112BBY | 3ABY | no_jedi | droid | unarmed |
11 | Yoda | male | 0.66 | 17 | brown | brown | green | unk_planet | 896BBY | 4ABY | jedi | yoda | lightsaber |
12 | Darth Maul | male | 1.75 | 80 | yellow | none | red | Dathomir | 54BBY | unk_died | no_jedi | dathomirian | lightsaber |
13 | Dooku | male | 1.93 | 86 | brown | brown | light | Serenno | 102BBY | 19BBY | jedi | human | lightsaber |
14 | Chewbacca | male | 2.28 | 112 | blue | brown | NA | Kashyyyk | 200BBY | 25ABY | no_jedi | wookiee | bowcaster |
15 | Jabba | male | 3.9 | NA | yellow | none | tan-green | Tatooine | unk_born | 4ABY | no_jedi | hutt | unarmed |
16 | Lando Calrissian | male | 1.78 | 79 | brown | blank | dark | Socorro | 31BBY | unk_died | no_jedi | human | blaster |
17 | Boba Fett | male | 1.83 | 78 | brown | black | brown | Kamino | 31.5BBY | unk_died | no_jedi | human | blaster |
18 | Jango Fett | male | 1.83 | 79 | brown | black | brown | ConcordDawn | 66BBY | 22BBY | no_jedi | human | blaster |
19 | Grievous | male | 2.16 | 159 | gold | black | orange | Kalee | unk_born | 19BBY | no_jedi | kaleesh | slugthrower |
20 | Chief Chirpa | male | 1.0 | 50 | black | gray | brown | Endor | unk_born | 4ABY | no_jedi | ewok | spear |
Let's try plotting some of this data
Plots.scatter(
csv_df.Weight,
csv_df.Height;
xlabel = "Weight",
ylabel = "Height",
)
That doesn't look right. What happened? If you look at the dataframe above, it read Weight
in as a String
column because there are "NA" fields. Let's correct that, by telling CSV to consider "NA" as missing
.
csv_df = CSV.read(
joinpath(DATA_DIR, "StarWars.csv"),
DataFrames.DataFrame;
missingstring = "NA",
)
Row | Name | Gender | Height | Weight | Eyecolor | Haircolor | Skincolor | Homeland | Born | Died | Jedi | Species | Weapon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
String31 | String7 | Float64 | Float64? | String15 | String7 | String15 | String15 | String15 | String15 | String7 | String15 | String15 | |
1 | Anakin Skywalker | male | 1.88 | 84.0 | blue | blond | fair | Tatooine | 41.9BBY | 4ABY | jedi | human | lightsaber |
2 | Padme Amidala | female | 1.65 | 45.0 | brown | brown | light | Naboo | 46BBY | 19BBY | no_jedi | human | unarmed |
3 | Luke Skywalker | male | 1.72 | 77.0 | blue | blond | fair | Tatooine | 19BBY | unk_died | jedi | human | lightsaber |
4 | Leia Skywalker | female | 1.5 | 49.0 | brown | brown | light | Alderaan | 19BBY | unk_died | no_jedi | human | blaster |
5 | Qui-Gon Jinn | male | 1.93 | 88.5 | blue | brown | light | unk_planet | 92BBY | 32BBY | jedi | human | lightsaber |
6 | Obi-Wan Kenobi | male | 1.82 | 77.0 | bluegray | auburn | fair | Stewjon | 57BBY | 0BBY | jedi | human | lightsaber |
7 | Han Solo | male | 1.8 | 80.0 | brown | brown | light | Corellia | 29BBY | unk_died | no_jedi | human | blaster |
8 | Sheev Palpatine | male | 1.73 | 75.0 | blue | red | pale | Naboo | 82BBY | 10ABY | no_jedi | human | force-lightning |
9 | R2-D2 | male | 0.96 | 32.0 | missing | missing | missing | Naboo | 33BBY | unk_died | no_jedi | droid | unarmed |
10 | C-3PO | male | 1.67 | 75.0 | missing | missing | missing | Tatooine | 112BBY | 3ABY | no_jedi | droid | unarmed |
11 | Yoda | male | 0.66 | 17.0 | brown | brown | green | unk_planet | 896BBY | 4ABY | jedi | yoda | lightsaber |
12 | Darth Maul | male | 1.75 | 80.0 | yellow | none | red | Dathomir | 54BBY | unk_died | no_jedi | dathomirian | lightsaber |
13 | Dooku | male | 1.93 | 86.0 | brown | brown | light | Serenno | 102BBY | 19BBY | jedi | human | lightsaber |
14 | Chewbacca | male | 2.28 | 112.0 | blue | brown | missing | Kashyyyk | 200BBY | 25ABY | no_jedi | wookiee | bowcaster |
15 | Jabba | male | 3.9 | missing | yellow | none | tan-green | Tatooine | unk_born | 4ABY | no_jedi | hutt | unarmed |
16 | Lando Calrissian | male | 1.78 | 79.0 | brown | blank | dark | Socorro | 31BBY | unk_died | no_jedi | human | blaster |
17 | Boba Fett | male | 1.83 | 78.0 | brown | black | brown | Kamino | 31.5BBY | unk_died | no_jedi | human | blaster |
18 | Jango Fett | male | 1.83 | 79.0 | brown | black | brown | ConcordDawn | 66BBY | 22BBY | no_jedi | human | blaster |
19 | Grievous | male | 2.16 | 159.0 | gold | black | orange | Kalee | unk_born | 19BBY | no_jedi | kaleesh | slugthrower |
20 | Chief Chirpa | male | 1.0 | 50.0 | black | gray | brown | Endor | unk_born | 4ABY | no_jedi | ewok | spear |
Then let's re-plot our data
Plots.scatter(
csv_df.Weight,
csv_df.Height;
title = "Height vs Weight of StarWars characters",
xlabel = "Weight",
ylabel = "Height",
label = false,
ylims = (0, 3),
)
That looks better.
Read the CSV documentation for other parsing options.
DataFrames.jl supports manipulation using functions similar to pandas. For example, split the dataframe into groups based on eye-color:
by_eyecolor = DataFrames.groupby(csv_df, :Eyecolor)
GroupedDataFrame with 7 groups based on key: Eyecolor
Row | Name | Gender | Height | Weight | Eyecolor | Haircolor | Skincolor | Homeland | Born | Died | Jedi | Species | Weapon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
String31 | String7 | Float64 | Float64? | String15 | String7 | String15 | String15 | String15 | String15 | String7 | String15 | String15 | |
1 | Anakin Skywalker | male | 1.88 | 84.0 | blue | blond | fair | Tatooine | 41.9BBY | 4ABY | jedi | human | lightsaber |
2 | Luke Skywalker | male | 1.72 | 77.0 | blue | blond | fair | Tatooine | 19BBY | unk_died | jedi | human | lightsaber |
3 | Qui-Gon Jinn | male | 1.93 | 88.5 | blue | brown | light | unk_planet | 92BBY | 32BBY | jedi | human | lightsaber |
4 | Sheev Palpatine | male | 1.73 | 75.0 | blue | red | pale | Naboo | 82BBY | 10ABY | no_jedi | human | force-lightning |
5 | Chewbacca | male | 2.28 | 112.0 | blue | brown | missing | Kashyyyk | 200BBY | 25ABY | no_jedi | wookiee | bowcaster |
⋮
Row | Name | Gender | Height | Weight | Eyecolor | Haircolor | Skincolor | Homeland | Born | Died | Jedi | Species | Weapon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
String31 | String7 | Float64 | Float64? | String15 | String7 | String15 | String15 | String15 | String15 | String7 | String15 | String15 | |
1 | Chief Chirpa | male | 1.0 | 50.0 | black | gray | brown | Endor | unk_born | 4ABY | no_jedi | ewok | spear |
Then recombine into a single dataframe based on a function operating over the split dataframes:
eyecolor_count = DataFrames.combine(by_eyecolor) do df
return DataFrames.nrow(df)
end
Row | Eyecolor | x1 |
---|---|---|
String15 | Int64 | |
1 | blue | 5 |
2 | brown | 8 |
3 | bluegray | 1 |
4 | missing | 2 |
5 | yellow | 2 |
6 | gold | 1 |
7 | black | 1 |
We can rename columns:
DataFrames.rename!(eyecolor_count, :x1 => :count)
Row | Eyecolor | count |
---|---|---|
String15 | Int64 | |
1 | blue | 5 |
2 | brown | 8 |
3 | bluegray | 1 |
4 | missing | 2 |
5 | yellow | 2 |
6 | gold | 1 |
7 | black | 1 |
Drop some missing rows:
DataFrames.dropmissing!(eyecolor_count, :Eyecolor)
Row | Eyecolor | count |
---|---|---|
String15 | Int64 | |
1 | blue | 5 |
2 | brown | 8 |
3 | bluegray | 1 |
4 | yellow | 2 |
5 | gold | 1 |
6 | black | 1 |
Then we can visualize the data:
sort!(eyecolor_count, :count; rev = true)
Plots.bar(
eyecolor_count.Eyecolor,
eyecolor_count.count;
xlabel = "Eye color",
ylabel = "Number of characters",
label = false,
)
Other Delimited Files
We can also use the CSV.jl
package to read any other delimited text file format.
By default, CSV.File will try to detect a file's delimiter from the first 10 lines of the file.
Candidate delimiters include ','
, '\t'
, ' '
, '|'
, ';'
, and ':'
. If it can't auto-detect the delimiter, it will assume ','
.
Let's take the example of space separated data.
ss_df = CSV.read(joinpath(DATA_DIR, "Cereal.txt"), DataFrames.DataFrame)
Row | Name | Cups | Calories | Carbs | Fat | Fiber | Potassium | Protein | Sodium | Sugars |
---|---|---|---|---|---|---|---|---|---|---|
String31 | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 | |
1 | CapnCrunch | 0.75 | 120 | 12.0 | 2 | 0.0 | 35 | 1 | 220 | 12 |
2 | CocoaPuffs | 1.0 | 110 | 12.0 | 1 | 0.0 | 55 | 1 | 180 | 13 |
3 | Trix | 1.0 | 110 | 13.0 | 1 | 0.0 | 25 | 1 | 140 | 12 |
4 | AppleJacks | 1.0 | 110 | 11.0 | 0 | 1.0 | 30 | 2 | 125 | 14 |
5 | CornChex | 1.0 | 110 | 22.0 | 0 | 0.0 | 25 | 2 | 280 | 3 |
6 | CornFlakes | 1.0 | 100 | 21.0 | 0 | 1.0 | 35 | 2 | 290 | 2 |
7 | Nut&Honey | 0.67 | 120 | 15.0 | 1 | 0.0 | 40 | 2 | 190 | 9 |
8 | Smacks | 0.75 | 110 | 9.0 | 1 | 1.0 | 40 | 2 | 70 | 15 |
9 | MultiGrain | 1.0 | 100 | 15.0 | 1 | 2.0 | 90 | 2 | 220 | 6 |
10 | CracklinOat | 0.5 | 110 | 10.0 | 3 | 4.0 | 160 | 3 | 140 | 7 |
11 | GrapeNuts | 0.25 | 110 | 17.0 | 0 | 3.0 | 90 | 3 | 179 | 3 |
12 | HoneyNutCheerios | 0.75 | 110 | 11.5 | 1 | 1.5 | 90 | 3 | 250 | 10 |
13 | NutriGrain | 0.67 | 140 | 21.0 | 2 | 3.0 | 130 | 3 | 220 | 7 |
14 | Product19 | 1.0 | 100 | 20.0 | 0 | 1.0 | 45 | 3 | 320 | 3 |
15 | TotalRaisinBran | 1.0 | 140 | 15.0 | 1 | 4.0 | 230 | 3 | 190 | 14 |
16 | WheatChex | 0.67 | 100 | 17.0 | 1 | 3.0 | 115 | 3 | 230 | 3 |
17 | Oatmeal | 0.5 | 130 | 13.5 | 2 | 1.5 | 120 | 3 | 170 | 10 |
18 | Life | 0.67 | 100 | 12.0 | 2 | 2.0 | 95 | 4 | 150 | 6 |
19 | Maypo | 1.0 | 100 | 16.0 | 1 | 0.0 | 95 | 4 | 0 | 3 |
20 | QuakerOats | 0.5 | 100 | 14.0 | 1 | 2.0 | 110 | 4 | 135 | 6 |
21 | Muesli | 1.0 | 150 | 16.0 | 3 | 3.0 | 170 | 4 | 150 | 11 |
22 | Cheerios | 1.25 | 110 | 17.0 | 2 | 2.0 | 105 | 6 | 290 | 1 |
23 | SpecialK | 1.0 | 110 | 16.0 | 0 | 1.0 | 55 | 6 | 230 | 3 |
We can also specify the delimiter as follows:
delim_df = CSV.read(
joinpath(DATA_DIR, "Soccer.txt"),
DataFrames.DataFrame;
delim = "::",
)
Row | Team | Played | Wins | Draws | Losses | Goals_for | Goals_against |
---|---|---|---|---|---|---|---|
String31 | Int64 | Int64 | Int64 | Int64 | String15 | String15 | |
1 | Barcelona | 38 | 30 | 4 | 4 | 110 goals | 21 goals |
2 | Real Madrid | 38 | 30 | 2 | 6 | 118 goals | 38 goals |
3 | Atletico Madrid | 38 | 23 | 9 | 6 | 67 goals | 29 goals |
4 | Valencia | 38 | 22 | 11 | 5 | 70 goals | 32 goals |
5 | Seville | 38 | 23 | 7 | 8 | 71 goals | 45 goals |
6 | Villarreal | 38 | 16 | 12 | 10 | 48 goals | 37 goals |
7 | Athletic Bilbao | 38 | 15 | 10 | 13 | 42 goals | 41 goals |
8 | Celta Vigo | 38 | 13 | 12 | 13 | 47 goals | 44 goals |
9 | Malaga | 38 | 14 | 8 | 16 | 42 goals | 48 goals |
10 | Espanyol | 38 | 13 | 10 | 15 | 47 goals | 51 goals |
11 | Rayo Vallecano | 38 | 15 | 4 | 19 | 46 goals | 68 goals |
12 | Real Sociedad | 38 | 11 | 13 | 14 | 44 goals | 51 goals |
13 | Elche | 38 | 11 | 8 | 19 | 35 goals | 62 goals |
14 | Levante | 38 | 9 | 10 | 19 | 34 goals | 67 goals |
15 | Getafe | 38 | 10 | 7 | 21 | 33 goals | 64 goals |
16 | Deportivo La Coruna | 38 | 7 | 14 | 17 | 35 goals | 60 goals |
17 | Granada | 38 | 7 | 14 | 17 | 29 goals | 64 goals |
18 | Eibar | 38 | 9 | 8 | 21 | 34 goals | 55 goals |
19 | Almeria | 38 | 8 | 8 | 22 | 35 goals | 64 goals |
20 | Cordoba | 38 | 3 | 11 | 24 | 22 goals | 68 goals |
Working with DataFrames
Now that we have read the required data into a DataFrame, let us look at some basic operations we can perform on it.
Querying Basic Information
The size
function gets us the dimensions of the DataFrame:
DataFrames.size(ss_df)
(23, 10)
We can also use the nrow
and ncol
functions to get the number of rows and columns respectively:
DataFrames.nrow(ss_df), DataFrames.ncol(ss_df)
(23, 10)
The describe
function gives basic summary statistics of data in a DataFrame:
DataFrames.describe(ss_df)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
1 | Name | AppleJacks | WheatChex | 0 | String31 | ||
2 | Cups | 0.823043 | 0.25 | 1.0 | 1.25 | 0 | Float64 |
3 | Calories | 113.043 | 100 | 110.0 | 150 | 0 | Int64 |
4 | Carbs | 15.0435 | 9.0 | 15.0 | 22.0 | 0 | Float64 |
5 | Fat | 1.13043 | 0 | 1.0 | 3 | 0 | Int64 |
6 | Fiber | 1.56522 | 0.0 | 1.5 | 4.0 | 0 | Float64 |
7 | Potassium | 86.3043 | 25 | 90.0 | 230 | 0 | Int64 |
8 | Protein | 2.91304 | 1 | 3.0 | 6 | 0 | Int64 |
9 | Sodium | 189.957 | 0 | 190.0 | 320 | 0 | Int64 |
10 | Sugars | 7.52174 | 1 | 7.0 | 15 | 0 | Int64 |
Names of every column can be obtained by the names
function:
DataFrames.names(ss_df)
10-element Vector{String}:
"Name"
"Cups"
"Calories"
"Carbs"
"Fat"
"Fiber"
"Potassium"
"Protein"
"Sodium"
"Sugars"
Corresponding data types are obtained using the broadcasted eltype
function:
eltype.(ss_df)
Row | Name | Cups | Calories | Carbs | Fat | Fiber | Potassium | Protein | Sodium | Sugars |
---|---|---|---|---|---|---|---|---|---|---|
DataType | DataType | DataType | DataType | DataType | DataType | DataType | DataType | DataType | DataType | |
1 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
2 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
3 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
4 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
5 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
6 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
7 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
8 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
9 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
10 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
11 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
12 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
13 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
14 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
15 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
16 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
17 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
18 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
19 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
20 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
21 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
22 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
23 | Char | Float64 | Int64 | Float64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
Accessing the Data
Similar to regular arrays, we use numerical indexing to access elements of a DataFrame:
csv_df[1, 1]
"Anakin Skywalker"
The following are different ways to access a column:
csv_df[!, 1]
20-element Vector{InlineStrings.String31}:
"Anakin Skywalker"
"Padme Amidala"
"Luke Skywalker"
"Leia Skywalker"
"Qui-Gon Jinn"
"Obi-Wan Kenobi"
"Han Solo"
"Sheev Palpatine"
"R2-D2"
"C-3PO"
"Yoda"
"Darth Maul"
"Dooku"
"Chewbacca"
"Jabba"
"Lando Calrissian"
"Boba Fett"
"Jango Fett"
"Grievous"
"Chief Chirpa"
csv_df[!, :Name]
20-element Vector{InlineStrings.String31}:
"Anakin Skywalker"
"Padme Amidala"
"Luke Skywalker"
"Leia Skywalker"
"Qui-Gon Jinn"
"Obi-Wan Kenobi"
"Han Solo"
"Sheev Palpatine"
"R2-D2"
"C-3PO"
"Yoda"
"Darth Maul"
"Dooku"
"Chewbacca"
"Jabba"
"Lando Calrissian"
"Boba Fett"
"Jango Fett"
"Grievous"
"Chief Chirpa"
csv_df.Name
20-element Vector{InlineStrings.String31}:
"Anakin Skywalker"
"Padme Amidala"
"Luke Skywalker"
"Leia Skywalker"
"Qui-Gon Jinn"
"Obi-Wan Kenobi"
"Han Solo"
"Sheev Palpatine"
"R2-D2"
"C-3PO"
"Yoda"
"Darth Maul"
"Dooku"
"Chewbacca"
"Jabba"
"Lando Calrissian"
"Boba Fett"
"Jango Fett"
"Grievous"
"Chief Chirpa"
csv_df[:, 1] # Note that this creates a copy.
20-element Vector{InlineStrings.String31}:
"Anakin Skywalker"
"Padme Amidala"
"Luke Skywalker"
"Leia Skywalker"
"Qui-Gon Jinn"
"Obi-Wan Kenobi"
"Han Solo"
"Sheev Palpatine"
"R2-D2"
"C-3PO"
"Yoda"
"Darth Maul"
"Dooku"
"Chewbacca"
"Jabba"
"Lando Calrissian"
"Boba Fett"
"Jango Fett"
"Grievous"
"Chief Chirpa"
The following are different ways to access a row:
csv_df[1:1, :]
Row | Name | Gender | Height | Weight | Eyecolor | Haircolor | Skincolor | Homeland | Born | Died | Jedi | Species | Weapon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
String31 | String7 | Float64 | Float64? | String15 | String7 | String15 | String15 | String15 | String15 | String7 | String15 | String15 | |
1 | Anakin Skywalker | male | 1.88 | 84.0 | blue | blond | fair | Tatooine | 41.9BBY | 4ABY | jedi | human | lightsaber |
csv_df[1, :] # This produces a DataFrameRow.
Row | Name | Gender | Height | Weight | Eyecolor | Haircolor | Skincolor | Homeland | Born | Died | Jedi | Species | Weapon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
String31 | String7 | Float64 | Float64? | String15 | String7 | String15 | String15 | String15 | String15 | String7 | String15 | String15 | |
1 | Anakin Skywalker | male | 1.88 | 84.0 | blue | blond | fair | Tatooine | 41.9BBY | 4ABY | jedi | human | lightsaber |
We can change the values just as we normally assign values.
Assign a range to scalar:
csv_df[1:3, :Height] .= 1.83
3-element view(::Vector{Float64}, 1:3) with eltype Float64:
1.83
1.83
1.83
Assign a vector:
csv_df[4:6, :Height] = [1.8, 1.6, 1.8]
3-element Vector{Float64}:
1.8
1.6
1.8
csv_df
Row | Name | Gender | Height | Weight | Eyecolor | Haircolor | Skincolor | Homeland | Born | Died | Jedi | Species | Weapon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
String31 | String7 | Float64 | Float64? | String15 | String7 | String15 | String15 | String15 | String15 | String7 | String15 | String15 | |
1 | Anakin Skywalker | male | 1.83 | 84.0 | blue | blond | fair | Tatooine | 41.9BBY | 4ABY | jedi | human | lightsaber |
2 | Padme Amidala | female | 1.83 | 45.0 | brown | brown | light | Naboo | 46BBY | 19BBY | no_jedi | human | unarmed |
3 | Luke Skywalker | male | 1.83 | 77.0 | blue | blond | fair | Tatooine | 19BBY | unk_died | jedi | human | lightsaber |
4 | Leia Skywalker | female | 1.8 | 49.0 | brown | brown | light | Alderaan | 19BBY | unk_died | no_jedi | human | blaster |
5 | Qui-Gon Jinn | male | 1.6 | 88.5 | blue | brown | light | unk_planet | 92BBY | 32BBY | jedi | human | lightsaber |
6 | Obi-Wan Kenobi | male | 1.8 | 77.0 | bluegray | auburn | fair | Stewjon | 57BBY | 0BBY | jedi | human | lightsaber |
7 | Han Solo | male | 1.8 | 80.0 | brown | brown | light | Corellia | 29BBY | unk_died | no_jedi | human | blaster |
8 | Sheev Palpatine | male | 1.73 | 75.0 | blue | red | pale | Naboo | 82BBY | 10ABY | no_jedi | human | force-lightning |
9 | R2-D2 | male | 0.96 | 32.0 | missing | missing | missing | Naboo | 33BBY | unk_died | no_jedi | droid | unarmed |
10 | C-3PO | male | 1.67 | 75.0 | missing | missing | missing | Tatooine | 112BBY | 3ABY | no_jedi | droid | unarmed |
11 | Yoda | male | 0.66 | 17.0 | brown | brown | green | unk_planet | 896BBY | 4ABY | jedi | yoda | lightsaber |
12 | Darth Maul | male | 1.75 | 80.0 | yellow | none | red | Dathomir | 54BBY | unk_died | no_jedi | dathomirian | lightsaber |
13 | Dooku | male | 1.93 | 86.0 | brown | brown | light | Serenno | 102BBY | 19BBY | jedi | human | lightsaber |
14 | Chewbacca | male | 2.28 | 112.0 | blue | brown | missing | Kashyyyk | 200BBY | 25ABY | no_jedi | wookiee | bowcaster |
15 | Jabba | male | 3.9 | missing | yellow | none | tan-green | Tatooine | unk_born | 4ABY | no_jedi | hutt | unarmed |
16 | Lando Calrissian | male | 1.78 | 79.0 | brown | blank | dark | Socorro | 31BBY | unk_died | no_jedi | human | blaster |
17 | Boba Fett | male | 1.83 | 78.0 | brown | black | brown | Kamino | 31.5BBY | unk_died | no_jedi | human | blaster |
18 | Jango Fett | male | 1.83 | 79.0 | brown | black | brown | ConcordDawn | 66BBY | 22BBY | no_jedi | human | blaster |
19 | Grievous | male | 2.16 | 159.0 | gold | black | orange | Kalee | unk_born | 19BBY | no_jedi | kaleesh | slugthrower |
20 | Chief Chirpa | male | 1.0 | 50.0 | black | gray | brown | Endor | unk_born | 4ABY | no_jedi | ewok | spear |
There are a lot more things which can be done with a DataFrame. Read the docs for more information.
For information on dplyr-type syntax:
- Read the DataFrames.jl documentation
- Check out DataFramesMeta.jl
Example: the passport problem
Let's now apply what we have learned to solve a real problem.
Data manipulation
The Passport Index Dataset lists travel visa requirements for 199 countries, in .csv
format. Our task is to find the minimum number of passports required to visit all countries.
passport_data = CSV.read(
joinpath(DATA_DIR, "passport-index-matrix.csv"),
DataFrames.DataFrame,
);
In this dataset, the first column represents a passport (=from) and each remaining column represents a foreign country (=to).
The values in each cell are as follows:
- 3 = visa-free travel
- 2 = eTA is required
- 1 = visa can be obtained on arrival
- 0 = visa is required
- -1 is for all instances where passport and destination are the same
Our task is to find out the minimum number of passports needed to visit every country without requiring a visa.
The values we are interested in are -1 and 3. Let's modify the dataframe so that the -1 and 3 are 1
(true), and all others are 0
(false):
function modifier(x)
if x == -1 || x == 3
return 1
else
return 0
end
end
for country in passport_data.Passport
passport_data[!, country] = modifier.(passport_data[!, country])
end
The values in the cells now represent:
- 1 = no visa required for travel
- 0 = visa required for travel
JuMP Modeling
To model the problem as a mixed-integer linear program, we need a binary decision variable $x_c$ for each country $c$. $x_c$ is $1$ if we select passport $c$ and $0$ otherwise. Our objective is to minimize the sum $\sum x_c$ over all countries.
Since we wish to visit all the countries, for every country, we must own at least one passport that lets us travel to that country visa free. For one destination, this can be mathematically represented as $\sum_{c \in C} a_{c,d} \cdot x_{d} \geq 1$, where $a$ is the passport_data
dataframe.
Thus, we can represent this problem using the following model:
\[\begin{aligned} \min && \sum_{c \in C} x_c \\ \text{s.t.} && \sum_{c \in C} a_{c,d} x_c \geq 1 && \forall d \in C \\ && x_c \in \{0,1\} && \forall c \in C. \end{aligned}\]
We'll now solve the problem using JuMP:
using JuMP
import HiGHS
First, create the set of countries:
C = passport_data.Passport
199-element Vector{String}:
"Afghanistan"
"Albania"
"Algeria"
"Andorra"
"Angola"
"Antigua and Barbuda"
"Argentina"
"Armenia"
"Australia"
"Austria"
⋮
"Uruguay"
"Uzbekistan"
"Vanuatu"
"Vatican"
"Venezuela"
"Viet Nam"
"Yemen"
"Zambia"
"Zimbabwe"
Then, create the model and initialize the decision variables:
model = Model(HiGHS.Optimizer)
set_silent(model)
@variable(model, x[C], Bin)
@objective(model, Min, sum(x))
@constraint(model, [d in C], passport_data[!, d]' * x >= 1)
model
A JuMP Model
├ solver: HiGHS
├ objective_sense: MIN_SENSE
│ └ objective_function_type: AffExpr
├ num_variables: 199
├ num_constraints: 398
│ ├ AffExpr in MOI.GreaterThan{Float64}: 199
│ └ VariableRef in MOI.ZeroOne: 199
└ Names registered in the model
└ :x
Now optimize:
optimize!(model)
We can use the solution_summary
function to get an overview of the solution:
solution_summary(model)
* Solver : HiGHS
* Status
Result count : 1
Termination status : OPTIMAL
Message from the solver:
"kHighsModelStatusOptimal"
* Candidate solution (result #1)
Primal status : FEASIBLE_POINT
Dual status : NO_SOLUTION
Objective value : 2.30000e+01
Objective bound : 2.30000e+01
Relative gap : 0.00000e+00
* Work counters
Solve time (sec) : 6.19411e-03
Simplex iterations : 26
Barrier iterations : -1
Node count : 1
Just to be sure, check that the solver found an optimal solution:
@assert is_solved_and_feasible(model)
Solution
Let's have a look at the solution in more detail:
println("Minimum number of passports needed: ", objective_value(model))
Minimum number of passports needed: 23.0
println("Optimal passports:")
for c in C
if value(x[c]) > 0.5
println(" * ", c)
end
end
Optimal passports:
* Afghanistan
* Chad
* Comoros
* Djibouti
* Georgia
* Hong Kong
* India
* Luxembourg
* Madagascar
* Maldives
* Mali
* New Zealand
* North Korea
* Papua New Guinea
* Singapore
* Somalia
* Sri Lanka
* Tunisia
* Turkey
* Uganda
* United Arab Emirates
* United States
* Zimbabwe
We need some passports, like New Zealand and the United States, which have widespread access to a large number of countries. However, we also need passports like North Korea which only have visa-free access to a very limited number of countries.
We use value(x[c]) > 0.5
rather than value(x[c]) == 1
to avoid excluding solutions like x[c] = 0.99999
that are "1" to some tolerance.