Player Data for the 2018 FIFA World Cup

2018-06-14

by David Kane

The World Cup starts today! The tournament, which runs from June 14 through July 15, is probably the most popular sporting event in the world. If you are a soccer fan, you know that learning about the players and their teams, and talking about it all with your friends, greatly enhances the experience. In this post, I will show you how to gather and explore data for the 736 players from the 32 teams at the 2018 FIFA World Cup. Have fun and enjoy the games. I will be watching with you.

Download Player Data

Official PDF

FIFA has made several official player lists available, conveniently changing the format each time. For this exercise, I use the one from early June. The tabulizer package makes it relatively easy to extract information from tables in a PDF document. (A later version of the official PDF is here. Strangely, it drops the weight variable.)

suppressMessages(library(tidyverse))
library(stringr)
suppressMessages(library(lubridate))
suppressMessages(library(cowplot))

# Note that I set warnings to FALSE because of some annoying (and intermittent)
# issues with RJavaTools.

library(tabulizer)
url <- "https://github.com/davidkane9/wc18/raw/master/fifa_player_list_1.pdf"
out <- extract_tables(url, output = "data.frame")

We now have a 32-element list, one data frame per team, each holding information about that team's 23 players. Let’s combine this information into a single tidy tibble.

# Note how bind_rows() makes it very easy to combine a list of compatible
# dataframes.

pdf_data <- bind_rows(out) %>% 
  as_tibble() %>% 
  
  # Make the variable names more tidy-like.
  
  rename(team = Team,
         number = X.,
         position = Pos.,
         name = FIFA.Popular.Name,
         birth_date = Birth.Date,
         shirt_name = Shirt.Name,
         club = Club,
         height = Height,
         weight = Weight) %>% 
  
  # Country names are contentious issues. I modify two names because I will
  # later need to merge this tibble with data from Wikipedia, which uses
  # different names.
  
  mutate(team = case_when(
         team == "Korea Republic" ~ "South Korea",
         team == "IR Iran" ~ "Iran",
         TRUE ~ team)) %>% 
  
  # league and club should be separate variables. We also want birth_date to be
  # a date and to have an age variable already calculated.
  
  mutate(birth_date = dmy(birth_date),
         league = str_sub(club, -4, -2),
         club = str_sub(club, end = -7),
         age = interval(birth_date, "2018-06-14") / years(1))

Here is a sample of the data:

## # A tibble: 10 x 10
##    team      number position birth_date shirt_name    club     height weight league   age
##    <chr>      <int> <chr>    <date>     <chr>         <chr>     <int>  <int> <chr>  <dbl>
##  1 Denmark        3 DF       1992-08-03 VESTERGAARD   VfL Bor…    200     98 GER     25.9
##  2 Argentina     18 DF       1990-07-13 SALVIO        SL Benf…    167     69 POR     27.9
##  3 Croatia       15 DF       1996-09-17 ĆALETA-CAR    FC Red …    192     89 AUT     21.7
##  4 Croatia       21 DF       1989-04-29 VIDA          Besikta…    184     76 TUR     29.1
##  5 Japan          3 DF       1992-12-11 SHOJI         Kashima…    182     74 JPN     25.5
##  6 Colombia       7 FW       1986-09-08 BACCA         Villarr…    181     77 ESP     31.8
##  7 Iceland       10 MF       1989-09-08 G. SIGURDSSON Everton…    186     82 ENG     28.8
##  8 Germany       17 DF       1988-09-03 BOATENG       FC Baye…    192     90 GER     29.8
##  9 Poland         4 DF       1986-04-21 CIONEK        SPAL Fe…    184     81 ITA     32.1
## 10 Uruguay        9 FW       1987-01-24 L. SUAREZ     FC Barc…    182     85 ESP     31.4
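
The two least obvious steps above are the negative indexing in str_sub() and the age calculation. Here is a minimal sketch of both on made-up inputs. The club strings and the date format are assumptions about what the PDF columns look like, inferred from the code and the sample output, not copied from the PDF itself.

```r
library(stringr)
library(lubridate)

# Toy club strings in the assumed "Club Name (XXX)" shape.
club_raw <- c("FC Barcelona (ESP)", "Everton FC (ENG)")

# Counting from the end, characters -4 through -2 are the three-letter code.
league <- str_sub(club_raw, -4, -2)

# Keeping everything up to the 7th character from the end drops " (XXX)".
club <- str_sub(club_raw, end = -7)

# Age in years on the opening day: an interval divided by a one-year period.
# The day.month.year input format here is hypothetical.
birth_date <- dmy("24.01.1987")
age <- interval(birth_date, ymd("2018-06-14")) / years(1)

print(league)
print(club)
print(age)
```

Note that this approach only works if every league code is exactly three letters. A regular expression would be more robust if that assumption ever failed.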

Perform some error checking.

stopifnot(length(unique(pdf_data$team)) == 32)      # There are 32 teams.
stopifnot(all(range(table(pdf_data$team)) == 23))   # Each team has 23 players.
stopifnot(pdf_data %>% 
            filter(position == "GK") %>% 
            group_by(team) %>% 
            tally() %>% 
            filter(n != 3) %>% 
            nrow() == 0)                     # All teams have 3 goalkeepers.
stopifnot(all(pdf_data$position %in% 
                c("GK", "DF", "MF", "FW")))  # All players assigned to 1 of 4 positions.

Wikipedia Data

Wikipedia includes other player information which might be interesting, especially the number of caps for each player. A “cap” is an appearance in a game for the national team. The rvest package makes scraping data from Wikipedia fairly easy.
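
Before touching the live page, it can help to see how CSS selectors of this kind behave on a tiny table. The HTML below is entirely made up. It only mimics the layout I assume the squad tables have: number, position, player name in a header cell, birth date, then caps.

```r
library(rvest)

# A hypothetical two-row table, not real Wikipedia markup.
toy <- read_html('
<table class="plainrowheaders">
  <tr><td>1</td><td>GK</td><th><a href="#">Player A</a></th><td>1990</td><td>12</td></tr>
  <tr><td>2</td><td>DF</td><th><a href="#">Player B</a></th><td>1992</td><td>34</td></tr>
</table>')

# td:nth-child(k) keeps a <td> only if it is the k-th child of its row,
# so the <th> holding the name still counts as the third child.
toy_number <- html_text(html_nodes(toy, ".plainrowheaders td:nth-child(1)"))
toy_caps   <- html_text(html_nodes(toy, ".plainrowheaders td:nth-child(5)"))

# Links inside header cells give the player names.
toy_name <- html_text(html_nodes(toy, "th a"))

print(toy_number)
print(toy_name)
print(toy_caps)
```

Everything comes back as character vectors in document order, which is why the conversion to numeric and the reliance on ordering matter later on.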

suppressMessages(library(rvest))
html <- read_html("https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads")

# Once we have read in all the html, we need to identify the location of the
# data we want. The rvest vignette provides guidance, but the key trick is the
# use of SelectorGadget to find the correct CSS node.

# First, we need the country and the shirt number of each player so that we can
# merge this data with that from the PDF.

country <- html_nodes(html, ".mw-headline") %>% 
  html_text() %>%
  as_tibble() %>% 
  filter(! str_detect(value, "Group")) %>% 
  slice(1:32)

number <- html_nodes(html, ".plainrowheaders td:nth-child(1)") %>% 
  html_text()

# We don't need the name of each player but I like to grab it, both because I
# prefer the Wikipedia formatting and to use this as a cross-check on the
# accuracy of our country/number merge.

name <- html_nodes(html, "th a") %>% 
  html_text() %>% 
  as_tibble() %>% 
  filter(! str_detect(value, "^captain$")) %>% 
  slice(1:736)

# caps is the variable we care about, but the Wikipedia page also includes the
# number of goals that each player has scored for the national team. Try adding
# that information on your own.

caps <- html_nodes(html, ".plainrowheaders td:nth-child(5)") %>% 
  html_text()

# Create a tibble. Note that we are relying on all the vectors being in the
# correct order.

wiki_data <- tibble(
  number = as.numeric(number),
  name = name$value,
  team = rep(country$value, each = 23),
  caps = as.numeric(caps))

# I prefer the name from Wikipedia. Exercise for the reader: How might we use
# name (from Wikipedia) and shirt_name (from the PDF) to confirm that we have
# lined up the data correctly?
  
x <- left_join(select(pdf_data, -name), wiki_data, by = c("team", "number"))
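
One possible answer to the exercise above, as a rough sketch rather than a definitive check: fold case and accents, then test whether the last word of shirt_name appears somewhere in the Wikipedia name. The three rows below are hypothetical stand-ins for rows of x, and the last one is deliberately misaligned. I use stringi::stri_trans_general() for the accent folding, which is my choice and not part of the original workflow.

```r
library(dplyr)
library(stringr)

# Hypothetical joined rows; the third pairs the wrong player with the shirt.
toy <- tibble(
  shirt_name = c("BOATENG", "L. SUAREZ", "VIDA"),
  name       = c("Jérôme Boateng", "Luis Suárez", "Luka Modrić")
)

check <- toy %>%
  mutate(
    # Fold accents and case so that "Suárez" can match "SUAREZ".
    name_key  = str_to_upper(stringi::stri_trans_general(name, "Latin-ASCII")),
    # Use the last word of the shirt name, which drops initials such as "L.".
    shirt_key = stringi::stri_trans_general(word(shirt_name, -1), "Latin-ASCII"),
    matches   = str_detect(name_key, fixed(shirt_key))
  )

# Rows that fail deserve a manual look; they are not necessarily errors.
print(filter(check, !matches))
```

On the real data, expect some false alarms: nicknames, transliterations and hyphenated names will not always line up, so a failed match is a prompt to inspect the row, not proof of a bad merge.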

Data Exploration

With this information, there are a variety of topics to explore.

Birth Month

For the entire sample of 736 players, there is a clear birth month effect, visible both when looking at calendar months and when aggregating to calendar quarters. Players are much more likely to have birthdays earlier in the year. The most common explanation is that players born in January have an advantage over players born in December of the same calendar year because the former will be older than the latter whenever they compete for spots on the same age-group team, given that the cut-offs are always (?) December 31. This advantage in youth soccer bleeds into adult soccer because of the extra opportunities it provides for expert coaching. (See “A Star Is Made,” by Stephen J. Dubner and Steven D. Levitt, May 7, 2006, New York Times Magazine.)
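
A minimal sketch of how one might tabulate births by month and by quarter. The six birth dates are invented for illustration; with the real data you would start from x instead of toy.

```r
library(dplyr)
library(lubridate)

# Hypothetical birth dates standing in for the birth_date column of x.
toy <- tibble(birth_date = ymd(c("1990-01-15", "1991-02-03", "1989-03-30",
                                 "1992-05-11", "1988-08-21", "1993-12-02")))

births <- toy %>%
  mutate(birth_month   = month(birth_date, label = TRUE),
         birth_quarter = quarter(birth_date))

# Number of players born in each calendar month and each calendar quarter.
by_month   <- count(births, birth_month)
by_quarter <- count(births, birth_quarter)

print(by_month)
print(by_quarter)
```

On the full sample, chisq.test(by_quarter$n) would be one rough way to compare the quarterly counts against equal shares. It ignores the slightly unequal lengths of the quarters, so treat it as a first look only.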

Strangely, the effect holds only for players who will be 25 or older at the start of the World Cup, about 75% of the sample.
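
One simple way to look at this split, sketched on invented data: compute, for each age group, the share of players born in the first half of the year. The birth dates are hypothetical, and the first-half share is my own summary statistic, chosen only because it is easy to read.

```r
library(dplyr)
library(lubridate)

# Hypothetical birth dates; age is computed as of the opening day.
toy <- tibble(birth_date = ymd(c("1985-01-10", "1988-02-20", "1990-11-05",
                                 "1995-03-01", "1996-09-09", "1997-10-30"))) %>%
  mutate(age = interval(birth_date, ymd("2018-06-14")) / years(1))

by_group <- toy %>%
  mutate(age_group  = if_else(age >= 25, "25 and over", "under 25"),
         first_half = month(birth_date) <= 6) %>%
  group_by(age_group) %>%
  summarise(n = n(), share_first_half = mean(first_half))

print(by_group)
```

A share well above 0.5 in one group and close to 0.5 in the other would be consistent with the pattern described above, though with only a couple of hundred younger players the estimate will be noisy.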

Why would that be true? Note that there are many fewer players starting the tournament at age 24 than one might expect.

Are the score or so “missing” 24-year-olds a sign of something meaningful, or just random noise?
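
To see the gap for yourself, count players by age in whole years. This is a sketch on made-up ages, not the real counts.

```r
library(dplyr)

# Hypothetical ages in years at the start of the tournament.
toy <- tibble(age = c(23.2, 23.9, 24.5, 25.1, 25.7, 25.9, 26.3))

age_counts <- toy %>%
  mutate(age_years = floor(age)) %>%
  count(age_years)

print(age_counts)
```

Comparing the count at 24 with its neighbours at 23 and 25 on the real data gives a first sense of the size of the gap. Deciding whether that gap is noise takes more than a table.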

Team Quality

We don’t have good measures of player (or team) quality in this data. But we do know whether an individual plays for a club in one of the countries that host the five highest-quality leagues: England (ENG), Spain (ESP), Germany (GER), Italy (ITA) and France (FRA). (It is no coincidence that these leagues account for the largest share of the players.)

x %>% 
  group_by(league) %>% 
  tally() %>% 
  arrange(desc(n))

## # A tibble: 55 x 2
##    league     n
##    <chr>  <int>
##  1 ENG      124
##  2 ESP       81
##  3 GER       67
##  4 ITA       58
##  5 FRA       49
##  6 RUS       36
##  7 KSA       30
##  8 MEX       22
##  9 TUR       22
## 10 POR       19
## # ... with 45 more rows

Any World Cup team with very few players who play in these 5 leagues is unlikely to be a good team. The best leagues have teams with so much money that they (almost) always are able to hire the best players. The vast majority of players in, for example, the Saudi Arabian or Turkish leagues are not wanted by any team in the best leagues. So, one measure of team quality is the percentage of players who play for teams in those 5 elite leagues. Here are the top 8 and bottom 4:

x %>% 
  group_by(team) %>% 
  summarise(elite = mean(league %in% 
                           c("ENG", "ESP", "GER", "ITA", "FRA"))) %>%
  arrange(desc(elite)) %>% 
  slice(c(1:8, 29:32))

## # A tibble: 12 x 2
##    team         elite
##    <chr>        <dbl>
##  1 England     1     
##  2 France      1     
##  3 Germany     1     
##  4 Spain       1     
##  5 Belgium     0.826 
##  6 Switzerland 0.826 
##  7 Senegal     0.783 
##  8 Brazil      0.739 
##  9 Iran        0.0435
## 10 Panama      0.0435
## 11 Peru        0.0435
## 12 Russia      0.0435

This measure captures the fact that teams like England, France, Spain and Germany are likely to do well while teams like Iran, Panama and Peru are not. Russia, as the host country, is a more difficult case. There are many problems with this analysis. Feel free to point them out in the comments. A better approach would look at the quality of the clubs that individuals play for or, even better, at measures of individual player quality.

What can you do with this data?

David Kane teaches at Harvard University and co-organizes the Boston R User Group.