The World Cup starts today! The tournament, which runs from June 14 through July 15, is probably the most popular sporting event in the world. If you are a soccer fan, you know that learning about the players and their teams, and talking about it all with your friends, greatly enhances the experience. In this post, I will show you how to gather and explore data for the 736 players from the 32 teams at the 2018 FIFA World Cup. Have fun and enjoy the games. I will be watching with you.
Download Player Data
Official PDF
FIFA has made several official player lists available, conveniently changing the format each time. For this exercise, I use the one from early June. The tabulizer package makes extracting information from tables included in a PDF document relatively easy. (The other (later) version of the official PDF is here. Strangely, the weight variable has been dropped.)
suppressMessages(library(tidyverse))
library(stringr)
suppressMessages(library(lubridate))
suppressMessages(library(cowplot))
# Note that I set warnings to FALSE because of some annoying (and intermittent)
# issues with RJavaTools.
library(tabulizer)
url <- "https://github.com/davidkane9/wc18/raw/master/fifa_player_list_1.pdf"
out <- extract_tables(url, output = "data.frame")
We now have a 32-element list, each item a data frame of information about the 23 players on one team. Let’s combine this information into a single tidy tibble.
# Note how bind_rows() makes it very easy to combine a list of compatible
# dataframes.
pdf_data <- bind_rows(out) %>%
  as_tibble() %>%
  # Make the variable names more tidy-like.
  rename(team = Team,
         number = X.,
         position = Pos.,
         name = FIFA.Popular.Name,
         birth_date = Birth.Date,
         shirt_name = Shirt.Name,
         club = Club,
         height = Height,
         weight = Weight) %>%
  # Country names are contentious issues. I modify two names because I will
  # later need to merge this tibble with data from Wikipedia, which uses
  # different names.
  mutate(team = case_when(
    team == "Korea Republic" ~ "South Korea",
    team == "IR Iran" ~ "Iran",
    TRUE ~ team)) %>%
  # league and club should be separate variables. We also want birth_date to be
  # a date and to have an age variable already calculated.
  mutate(birth_date = dmy(birth_date),
         league = str_sub(club, -4, -2),
         club = str_sub(club, end = -7),
         age = interval(birth_date, "2018-06-14") / years(1))
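To see what the last `mutate()` step does, here is a minimal sketch on two made-up rows. It assumes the raw club strings end in a three-letter league code in parentheses, such as "(ESP)", and that the raw birth dates are written day-month-year. The `toy` tibble and the dot-separated date format are my assumptions for illustration, not the actual PDF extract.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Made-up rows that mimic the raw PDF columns (not the real extract).
toy <- tibble(
  club       = c("FC Barcelona (ESP)", "Everton FC (ENG)"),
  birth_date = c("24.01.1987", "08.09.1989")
)

toy <- toy %>%
  mutate(birth_date = dmy(birth_date),           # parse day-month-year text
         league = str_sub(club, -4, -2),         # the 3 letters inside "( )"
         club = str_sub(club, end = -7),         # drop the trailing " (XXX)"
         age = interval(birth_date, ymd("2018-06-14")) / years(1))

print(toy)
```

The order inside `mutate()` matters: `league` has to be extracted before `club` is overwritten, because both are computed from the original club string.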
Here is a sample of the data:
## # A tibble: 10 x 10
##    team      number position birth_date shirt_name    club     height weight league   age
##    <chr>      <int> <chr>    <date>     <chr>         <chr>     <int>  <int> <chr>  <dbl>
##  1 Denmark        3 DF       1992-08-03 VESTERGAARD   VfL Bor…    200     98 GER     25.9
##  2 Argentina     18 DF       1990-07-13 SALVIO        SL Benf…    167     69 POR     27.9
##  3 Croatia       15 DF       1996-09-17 ĆALETA-CAR    FC Red …    192     89 AUT     21.7
##  4 Croatia       21 DF       1989-04-29 VIDA          Besikta…    184     76 TUR     29.1
##  5 Japan          3 DF       1992-12-11 SHOJI         Kashima…    182     74 JPN     25.5
##  6 Colombia       7 FW       1986-09-08 BACCA         Villarr…    181     77 ESP     31.8
##  7 Iceland       10 MF       1989-09-08 G. SIGURDSSON Everton…    186     82 ENG     28.8
##  8 Germany       17 DF       1988-09-03 BOATENG       FC Baye…    192     90 GER     29.8
##  9 Poland         4 DF       1986-04-21 CIONEK        SPAL Fe…    184     81 ITA     32.1
## 10 Uruguay        9 FW       1987-01-24 L. SUAREZ     FC Barc…    182     85 ESP     31.4
Perform some error checking.
stopifnot(length(unique(pdf_data$team)) == 32) # There are 32 teams.
stopifnot(all(range(table(pdf_data$team)) == 23)) # Each team has 23 players.
stopifnot(pdf_data %>%
            filter(position == "GK") %>%
            group_by(team) %>%
            tally() %>%
            filter(n != 3) %>%
            nrow() == 0) # All teams have 3 goalkeepers.
stopifnot(all(pdf_data$position %in%
                c("GK", "DF", "MF", "FW"))) # All players assigned to 1 of 4 positions.
Wikipedia Data
Wikipedia includes other player information which might be interesting, especially the number of caps for each player. A “cap” is an appearance in a game for the national team. The rvest package makes scraping data from Wikipedia fairly easy.
suppressMessages(library(rvest))
html <- read_html("https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads")
# Once we have read in all the html, we need to identify the location of the
# data we want. The rvest vignette provides guidance, but the key trick is the
# use of SelectorGadget to find the correct CSS node.
# First, we need the country and the shirt number of each player so that we can
# merge this data with that from the PDF.
country <- html_nodes(html, ".mw-headline") %>%
  html_text() %>%
  as_tibble() %>%
  filter(! str_detect(value, "Group")) %>%
  slice(1:32)
number <- html_nodes(html, ".plainrowheaders td:nth-child(1)") %>%
  html_text()
# We don't need the name of each player but I like to grab it, both because I
# prefer the Wikipedia formatting and to use this as a cross-check on the
# accuracy of our country/number merge.
name <- html_nodes(html, "th a") %>%
  html_text() %>%
  as_tibble() %>%
  filter(! str_detect(value, "^captain$")) %>%
  slice(1:736)
# caps is the variable we care about, but the Wikipedia page also includes the
# number of goals that each player has scored for the national team. Try adding
# that information on your own.
caps <- html_nodes(html, ".plainrowheaders td:nth-child(5)") %>%
  html_text()
# Create a tibble. Note that we are relying on all the vectors being in the
# correct order.
wiki_data <- tibble(
  number = as.numeric(number),
  name = name$value,
  team = rep(country$value, each = 23),
  caps = as.numeric(caps))
# I prefer the name from Wikipedia. Exercise for the reader: How might we use
# name (from Wikipedia) and shirt_name (from the PDF) to confirm that we have
# lined up the data correctly?
x <- left_join(select(pdf_data, -name), wiki_data, by = c("team", "number"))
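One possible answer to the exercise, as a rough sketch rather than a definitive check: take the last word of the Wikipedia name, upper-case it, and see whether it appears in the shirt name. The `toy` tibble below is made up (its third row is deliberately misaligned), and the column names `surname` and `name_ok` are my own.

```r
library(dplyr)
library(stringr)

# Made-up joined rows; the third row is deliberately mismatched.
toy <- tibble(
  team       = c("Uruguay", "Germany", "Poland"),
  number     = c(9, 17, 4),
  shirt_name = c("L. SUAREZ", "BOATENG", "CIONEK"),
  name       = c("Luis Suarez", "Jerome Boateng", "Robert Lewandowski")
)

check <- toy %>%
  mutate(surname = str_to_upper(word(name, -1)),            # last word of name
         name_ok = str_detect(shirt_name, fixed(surname)))  # literal match

# Rows that deserve a manual look.
suspect <- filter(check, !name_ok)
print(suspect)
```

On the real data this will flag false alarms as well as true mismatches. Wikipedia names carry accents that shirt names often drop (stripping them first, for example with `stringi::stri_trans_general(name, "Latin-ASCII")`, should help), and some players wear a nickname or first name on their shirt. Treat the flagged rows as a list to inspect by hand, not as proof of a bad merge.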
Data Exploration
With this information, there are a variety of topics to explore.
Birth Month
For the entire sample of 736 players, there is a clear birth month effect, visible both when looking at calendar months and when aggregating to calendar quarters. Players are much more likely to have birthdays earlier in the year. The most common explanation is that players born in January have an advantage over players born in December of the same calendar year: the former will be older than the latter whenever they compete for spots on the same age-group team, given that the cut-offs are always (?) December 31. This advantage in youth soccer bleeds into adult soccer because of the extra opportunities it provides for expert coaching. (See “A Star Is Made,” by Stephen J. Dubner and Steven D. Levitt, May 7, 2006, New York Times Magazine.)
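A minimal sketch of how the month and quarter counts might be tabulated, using made-up birth dates in place of `x$birth_date`. The `toy` tibble and the object names `by_month` and `by_quarter` are assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Made-up birth dates; the real analysis would use x$birth_date.
toy <- tibble(
  birth_date = ymd(c("1990-01-15", "1991-02-03", "1989-03-22", "1992-01-30",
                     "1988-05-11", "1993-04-02", "1990-08-19", "1987-11-05"))
)

# Players per calendar month (months with no births are dropped).
by_month <- toy %>%
  count(month = month(birth_date, label = TRUE))

# Players per calendar quarter (1 = Jan-Mar, ..., 4 = Oct-Dec).
by_quarter <- toy %>%
  count(quarter = quarter(birth_date))

print(by_month)
print(by_quarter)
```

On the full data, passing the four quarter counts to `chisq.test()` would be one rough way to ask whether the imbalance is larger than chance alone would produce, since by default it tests the counts against equal proportions.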
Strangely, the effect holds only for players who will be 25 or older at the start of the World Cup, about 75% of the sample.
Why would that be true? Note that there are many fewer players starting the tournament at age 24 than one might expect:
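One way such an age tabulation might be produced is sketched here on made-up ages standing in for `x$age`. The `toy` tibble and the names `age_years`, `group` and `share_25_plus` are my own.

```r
library(dplyr)

# Made-up ages; the real analysis would use x$age.
toy <- tibble(age = c(22.4, 23.9, 24.1, 25.0, 25.7, 26.3, 28.8, 31.4))

# Count players by completed years of age, split at the 25-year mark.
by_age <- toy %>%
  mutate(age_years = floor(age),
         group = if_else(age >= 25, "25 and over", "under 25")) %>%
  count(group, age_years)

# Share of players aged 25 or older at the start of the tournament.
share_25_plus <- mean(toy$age >= 25)

print(by_age)
print(share_25_plus)
```

Repeating the birth-month tabulation separately within each `group` would then show whether the effect differs between the two age bands.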
Are the “missing” score or so 24-year-olds a sign of something meaningful or just random noise?
Team Quality
We don’t have good measures of player (or team) quality in this data. But we do know whether an individual plays for a club in one of the countries that host the five highest-quality leagues: England (ENG), Spain (ESP), Germany (GER), Italy (ITA) and France (FRA). (It is no coincidence that these leagues account for the largest share of the players.)
x %>%
  group_by(league) %>%
  tally() %>%
  arrange(desc(n))
## # A tibble: 55 x 2
##    league     n
##    <chr>  <int>
##  1 ENG      124
##  2 ESP       81
##  3 GER       67
##  4 ITA       58
##  5 FRA       49
##  6 RUS       36
##  7 KSA       30
##  8 MEX       22
##  9 TUR       22
## 10 POR       19
## # ... with 45 more rows
Any World Cup team with very few players who play in these 5 leagues is unlikely to be a good team. The best leagues have teams with so much money that they (almost) always are able to hire the best players. The vast majority of players in, for example, the Saudi Arabian or Turkish leagues are not wanted by any team in the best leagues. So, one measure of team quality is the percentage of players who play for teams in those 5 elite leagues. Here are the top 8 and bottom 4:
x %>%
  group_by(team) %>%
  summarise(elite = mean(league %in%
                           c("ENG", "ESP", "GER", "ITA", "FRA"))) %>%
  arrange(desc(elite)) %>%
  slice(c(1:8, 29:32))
## # A tibble: 12 x 2
##    team        elite
##    <chr>       <dbl>
##  1 England     1
##  2 France      1
##  3 Germany     1
##  4 Spain       1
##  5 Belgium     0.826
##  6 Switzerland 0.826
##  7 Senegal     0.783
##  8 Brazil      0.739
##  9 Iran        0.0435
## 10 Panama      0.0435
## 11 Peru        0.0435
## 12 Russia      0.0435
This measure captures the fact that teams like England, France, Spain and Germany are likely to do well while teams like Iran, Panama and Peru are not. Russia, as the host country, is a more difficult case. There are many problems with this analysis. Feel free to point them out in the comments. A better approach would look at the quality of the clubs that individuals play for or, even better, at measures of individual player quality.
What can you do with this data?
David Kane teaches at Harvard University and co-organizes the Boston R User Group.