Load R packages

library(cricketdata)
library(tidyverse)

Data

In this workshop we will use the cricketdata R package by Rob Hyndman and his team to get a better understanding of data exploration, visualization, and potential analyses.

This post was inspired by Rob Hyndman’s recent post on “The cricketdata package”.

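There are four key functions in the cricketdata package:

  • fetch_cricinfo(): fetches data on international cricket matches from ESPNCricinfo.
  • fetch_cricsheet(): fetches ball-by-ball, match, and player data from Cricsheet.
  • fetch_player_data(): fetches data for an individual player from ESPNCricinfo.
  • find_player_id(): searches for a player’s ESPNCricinfo ID by name.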

Sri Lanka men’s ODI data by innings

This example shows Sri Lankan men’s ODI batting results by innings.

menODI <- fetch_cricinfo("ODI", "Men", "Batting", type = "innings", country = "Sri Lanka")

colnames(menODI)
##  [1] "Date"          "Player"        "Runs"          "NotOut"       
##  [5] "Minutes"       "BallsFaced"    "Fours"         "Sixes"        
##  [9] "StrikeRate"    "Innings"       "Participation" "Opposition"   
## [13] "Ground"

Data Import and Export

# Export Data
write_csv(menODI, "SLmenODI.csv")
# Import Data
data <- read_csv("SLmenODI.csv")
head(data)
## # A tibble: 6 × 13
##   Date       Player        Runs NotOut Minutes BallsFaced Fours Sixes StrikeRate
##   <date>     <chr>        <dbl> <lgl>    <dbl>      <dbl> <dbl> <dbl>      <dbl>
## 1 2000-10-29 ST Jayasuri…   189 FALSE       NA        161    21     4      117. 
## 2 2013-07-02 WU Tharanga    174 TRUE       228        159    19     3      109. 
## 3 2013-07-20 KC Sangakka…   169 FALSE      200        137    18     6      123. 
## 4 2015-02-26 TM Dilshan     161 TRUE       221        146    22     0      110. 
## 5 2012-02-28 TM Dilshan     160 TRUE       202        165    11     3       97.0
## 6 2009-12-15 TM Dilshan     160 FALSE      186        124    20     3      129. 
## # … with 4 more variables: Innings <dbl>, Participation <chr>,
## #   Opposition <chr>, Ground <chr>
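
The printed types above suggest that the integer columns come back from the CSV file as <dbl>. The following is a minimal sketch of that round trip, using a small made-up table (the players and scores are hypothetical, not taken from menODI) and a temporary file. It shows one way to keep integer columns as integers by declaring col_types in read_csv().

```r
library(readr)
library(tibble)

# Made-up rows for illustration only (not real cricket data)
scores <- tibble(
  Player = c("Player A", "Player B"),
  Runs   = c(45L, NA),
  NotOut = c(TRUE, FALSE)
)

# Write to a temporary file so nothing is left in the working directory
path <- tempfile(fileext = ".csv")
write_csv(scores, path)

# Default import: readr guesses types, and whole numbers become double
default_read <- read_csv(path, show_col_types = FALSE)

# Declaring the column types keeps Runs as an integer column
typed_read <- read_csv(
  path,
  col_types = cols(
    Player = col_character(),
    Runs   = col_integer(),
    NotOut = col_logical()
  )
)

class(default_read$Runs)
class(typed_read$Runs)
```

Whether the type difference matters depends on the analysis; for counts such as runs it is mostly a matter of keeping the imported data consistent with the original.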

Data wrangling

Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis.

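There are five dplyr functions that you will use to do the vast majority of data manipulations:

  • filter(): pick rows based on their values.
  • select(): pick columns by name.
  • arrange(): reorder rows.
  • mutate(): create new columns from existing ones.
  • summarise(): collapse many values into a summary, usually together with group_by().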

Pipe (%>%) Operator

  • The principal function provided by the magrittr package is %>%, or what’s called the “pipe” operator.
  • This operator will forward a value, or the result of an expression, into the next function call/expression.
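
As a small sketch of what the pipe does, the example below writes the same two-step operation twice, once as nested function calls and once with %>%. The innings table is made up for illustration.

```r
library(dplyr)

# Made-up innings for illustration only
innings <- tibble(
  Player = c("Player A", "Player B", "Player C"),
  Runs   = c(12L, 87L, 40L)
)

# Nested calls: read from the inside out
nested <- arrange(filter(innings, Runs > 20), desc(Runs))

# Piped calls: read from top to bottom; each result becomes
# the first argument of the next function
piped <- innings %>%
  filter(Runs > 20) %>%
  arrange(desc(Runs))

identical(nested, piped)
```

Both forms give the same result; the piped version simply lists the steps in the order they happen.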
menODI %>%
  filter(Date == "2022-06-19")
## # A tibble: 11 × 13
##    Date       Player       Runs NotOut Minutes BallsFaced Fours Sixes StrikeRate
##    <date>     <chr>       <int> <lgl>    <dbl>      <int> <int> <int>      <dbl>
##  1 2022-06-19 P Nissanka    137 FALSE      203        147    11     2       93.2
##  2 2022-06-19 BKG Mendis     87 TRUE       128         85     8     0      102. 
##  3 2022-06-19 N Dickwella    25 FALSE       33         26     5     0       96.2
##  4 2022-06-19 DM de Silva    25 FALSE       25         17     4     0      147. 
##  5 2022-06-19 KIC Asalan…    13 TRUE        26         12     0     1      108. 
##  6 2022-06-19 C Karunara…     0 TRUE         5          2     0     0        0  
##  7 2022-06-19 MD Shanaka      0 FALSE        3          2     0     0        0  
##  8 2022-06-19 DN Wellala…    NA FALSE       NA         NA    NA    NA       NA  
##  9 2022-06-19 PVD Chamee…    NA FALSE       NA         NA    NA    NA       NA  
## 10 2022-06-19 M Theeksha…    NA FALSE       NA         NA    NA    NA       NA  
## 11 2022-06-19 JDF Vander…    NA FALSE       NA         NA    NA    NA       NA  
## # … with 4 more variables: Innings <int>, Participation <chr>,
## #   Opposition <chr>, Ground <chr>
menODI %>%
  select(Date, Player, Runs, StrikeRate, NotOut)
## # A tibble: 9,637 × 5
##    Date       Player         Runs StrikeRate NotOut
##    <date>     <chr>         <int>      <dbl> <lgl> 
##  1 2000-10-29 ST Jayasuriya   189      117.  FALSE 
##  2 2013-07-02 WU Tharanga     174      109.  TRUE  
##  3 2013-07-20 KC Sangakkara   169      123.  FALSE 
##  4 2015-02-26 TM Dilshan      161      110.  TRUE  
##  5 2012-02-28 TM Dilshan      160       97.0 TRUE  
##  6 2009-12-15 TM Dilshan      160      129.  FALSE 
##  7 2006-07-04 ST Jayasuriya   157      151.  FALSE 
##  8 2006-07-01 ST Jayasuriya   152      154.  FALSE 
##  9 1997-05-17 ST Jayasuriya   151      126.  TRUE  
## 10 1996-03-06 PA de Silva     145      126.  FALSE 
## # … with 9,627 more rows
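# Note: mean(Runs) is NA for any player with a missing score, and
# arrange() places NA values last, so those players fall to the bottom.
# Use mean(Runs, na.rm = TRUE) to average only the innings actually batted.
menODI %>%
  group_by(Player) %>%
  summarise(Runs = mean(Runs), matches = n()) %>%
  arrange(desc(Runs))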
## # A tibble: 204 × 3
##    Player          Runs matches
##    <chr>          <dbl>   <int>
##  1 MG Vandort      48         1
##  2 SRD Wettimuny   45.3       3
##  3 KIC Asalanka    43        15
##  4 WIA Fernando    37.1      26
##  5 APB Tennekoon   34.2       4
##  6 ML Udawatte     28.6       9
##  7 P Nissanka      28.2      16
##  8 KNA Bandara     28.2       5
##  9 DA Gunawardene  28        61
## 10 MH Tissera      26         3
## # … with 194 more rows
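
Rows where a player did not bat have Runs recorded as NA, and mean() returns NA whenever any value is missing. The sketch below uses a made-up table to show how the grouped mean changes with na.rm = TRUE; the column names mean_default, mean_batted, innings_batted, and rows are illustrative choices.

```r
library(dplyr)

# Made-up innings; NA marks an innings where the player did not bat
innings <- tibble(
  Player = c("Player A", "Player A", "Player A", "Player B", "Player B"),
  Runs   = c(50L, 10L, NA, 20L, 40L)
)

by_player <- innings %>%
  group_by(Player) %>%
  summarise(
    mean_default   = mean(Runs),               # NA if any score is missing
    mean_batted    = mean(Runs, na.rm = TRUE), # ignores missing scores
    innings_batted = sum(!is.na(Runs)),        # innings with a score
    rows           = n()                       # all rows, batted or not
  )

by_player
```

Note that this is a mean of runs per innings batted, which is not the same as the conventional batting average (runs divided by dismissals).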
menODI %>% 
  filter(Player == "BKG Mendis") %>% 
  arrange(desc(Date))
## # A tibble: 86 × 13
##    Date       Player      Runs NotOut Minutes BallsFaced Fours Sixes StrikeRate
##    <date>     <chr>      <int> <lgl>    <dbl>      <int> <int> <int>      <dbl>
##  1 2022-06-21 BKG Mendis    14 FALSE       22         21     1     1       66.7
##  2 2022-06-19 BKG Mendis    87 TRUE       128         85     8     0      102. 
##  3 2022-06-16 BKG Mendis    36 FALSE       75         41     2     1       87.8
##  4 2022-06-14 BKG Mendis    86 TRUE       139         87     8     1       98.9
##  5 2022-01-21 BKG Mendis    36 FALSE       69         51     4     0       70.6
##  6 2022-01-18 BKG Mendis     7 FALSE       21          9     1     0       77.8
##  7 2022-01-16 BKG Mendis    26 FALSE       29         24     6     0      108. 
##  8 2021-05-28 BKG Mendis    22 FALSE       56         36     0     1       61.1
##  9 2021-05-25 BKG Mendis    15 FALSE       31         22     0     1       68.2
## 10 2021-05-23 BKG Mendis    24 FALSE       43         36     2     0       66.7
## # … with 76 more rows, and 4 more variables: Innings <int>,
## #   Participation <chr>, Opposition <chr>, Ground <chr>

Data Visualization

p <- menODI %>%
  filter(Opposition %in% c("Australia", "Bangladesh", "Pakistan")) %>%
  ggplot(aes(y = Runs, x = Date, col = Opposition)) +
  geom_point(alpha = 0.7) +
  geom_smooth() +
  ggtitle("Sri Lanka Men ODI: Runs per Innings")

print(p)

Sri Lanka’s average number of runs per innings against Bangladesh is higher than against Australia and Pakistan, even though it has gradually declined over time.

Sri Lanka Test fielding data

Next, we demonstrate some of the fielding data available, using Test match fielding from Sri Lankan men’s players.

SLfielding <- fetch_cricinfo("Test", "Men", "Fielding", 
                              country = "Sri Lanka")
head(SLfielding)
## # A tibble: 6 × 11
##   Player           Start   End Matches Innings Dismissals Caught CaughtFielder
##   <chr>            <int> <int>   <int>   <int>      <int>  <int>         <int>
## 1 DPMD Jayawardene  1997  2014     149     270        205    205           205
## 2 KC Sangakkara     2000  2015     134     248        202    182            51
## 3 HAPW Jayawardene  2000  2015      58     102        156    124             0
## 4 N Dickwella       2014  2022      49      88        150    126             1
## 5 HP Tillakaratne   1989  2004      83     141        124    122            89
## 6 RS Kaluwitharana  1992  2004      49      87        119     93             0
## # … with 3 more variables: CaughtBehind <int>, Stumped <int>,
## #   MaxDismissalsInnings <dbl>
colnames(SLfielding)
##  [1] "Player"               "Start"                "End"                 
##  [4] "Matches"              "Innings"              "Dismissals"          
##  [7] "Caught"               "CaughtFielder"        "CaughtBehind"        
## [10] "Stumped"              "MaxDismissalsInnings"

We can plot the number of dismissals against the number of matches for all Sri Lankan men’s Test players.

p1 <- SLfielding %>%
  ggplot(aes(x = Matches, y = Dismissals)) +
  geom_point() +
  ggtitle("Sri Lanka Men Test Fielding")

print(p1)

Because wicketkeepers typically have a lot more dismissals than other players, let’s show them in a different colour.

p2 <- SLfielding %>%
  mutate(wktkeeper = (CaughtBehind > 0) | (Stumped > 0)) %>%
  ggplot(aes(x = Matches, y = Dismissals, col = wktkeeper)) +
  geom_point() +
  ggtitle("Sri Lanka Men Test Fielding")

print(p2)

Two points stand out from the rest of the data and are worth investigating further.
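
One non-interactive way to look into such points is to filter on the plotted variable. The sketch below retypes the six rows printed by head(SLfielding) above, keeping only some of the columns, and selects the players with more than 200 dismissals. The caught_as_keeper column is an illustrative addition and assumes that Caught is the sum of catches taken as a fielder and catches taken behind the stumps.

```r
library(dplyr)
library(tibble)

# The six rows shown by head(SLfielding), retyped by hand
top_fielders <- tribble(
  ~Player,            ~Matches, ~Dismissals, ~Caught, ~CaughtFielder,
  "DPMD Jayawardene",      149,         205,     205,            205,
  "KC Sangakkara",         134,         202,     182,             51,
  "HAPW Jayawardene",       58,         156,     124,              0,
  "N Dickwella",            49,         150,     126,              1,
  "HP Tillakaratne",        83,         124,     122,             89,
  "RS Kaluwitharana",       49,         119,      93,              0
)

outliers <- top_fielders %>%
  filter(Dismissals > 200) %>%
  # Assumption: catches not taken as a fielder were taken as wicketkeeper
  mutate(caught_as_keeper = Caught - CaughtFielder)

outliers
```

With the full SLfielding table, the same filter() call would apply directly, and the CaughtBehind and Stumped columns could be inspected instead of the derived column.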

Interactive data visualization

Interactive data visualization is the use of tools and processes to create a visual representation of data that can be explored and analyzed directly within the visualization itself. This interaction can help to uncover insights that lead to better, data-driven decisions.

The plotly R package allows us to create interactive, publication-quality charts and graphs in R.

p3 <- SLfielding %>%
  mutate(wktkeeper = (CaughtBehind > 0) | (Stumped > 0)) %>%
  ggplot(aes(x = Matches, y = Dismissals, col = wktkeeper,
             text = Player)) +
  geom_point() +
  ggtitle("Sri Lanka Men Test Fielding")

library(plotly)
ggplotly(p3)

The wicketkeeper with just over 200 dismissals is Kumar Sangakkara (202). The other interesting point is the non-wicketkeeper with over 200 dismissals. This is Mahela Jayawardene, who took 205 catches during his career.
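
By default, the hover box in ggplotly() lists every mapped aesthetic. As a hedged sketch, the tooltip argument of ggplotly() can restrict it to the text aesthetic so that only the player name appears. The toy data frame below is made up so that the example runs on its own.

```r
library(ggplot2)
library(plotly)

# Made-up values for illustration only
toy <- data.frame(
  Player     = c("Player A", "Player B", "Player C"),
  Matches    = c(10, 50, 120),
  Dismissals = c(4, 60, 150)
)

p_toy <- ggplot(toy, aes(x = Matches, y = Dismissals, text = Player)) +
  geom_point() +
  ggtitle("Toy fielding data")

# tooltip = "text" shows only the value mapped to the text aesthetic
fig <- ggplotly(p_toy, tooltip = "text")
fig
```

A shorter tooltip can make dense scatter plots easier to read, at the cost of hiding the exact x and y values.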

Kusal Mendis’s ODI batting

Finally, consider the data for individual players. The fetch_player_data() function requires the player’s Cricinfo ID, which you can find on the Cricinfo website or by using the find_player_id() function. We’ll look at Kusal Mendis’s ODI results.

KMendis_id <- find_player_id("BKG Mendis")$ID
KMendis <- fetch_player_data(KMendis_id, "ODI") %>%
  mutate(NotOut = (Dismissal == "not out"))

head(KMendis)
## # A tibble: 6 × 14
##   Start_Date Innings Opposition Ground  Runs Mins  BF    X4s   X6s   SR    Pos  
##   <date>       <int> <chr>      <chr>  <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2016-06-16       1 Ireland    Dubli…    51 <NA>  59    8     0     86.44 3    
## 2 2016-06-18       1 Ireland    Dubli…     8 <NA>  7     1     0     114.… 8    
## 3 2016-06-21       1 England    Notti…    17 25    14    2     1     121.… 3    
## 4 2016-06-24       1 England    Birmi…     0 12    9     0     0     0.00  3    
## 5 2016-06-26       1 England    Brist…    53 83    66    5     1     80.30 3    
## 6 2016-06-29       1 England    The O…    77 76    64    13    0     120.… 3    
## # … with 3 more variables: Dismissal <chr>, Inns <chr>, NotOut <lgl>

Let’s plot his runs per innings on the vertical axis against time on the horizontal axis, with his batting average as a reference line.

# Compute batting average
KMave <- KMendis %>%
  filter(!is.na(Runs)) %>%
  summarise(Average = sum(Runs) / (n() - sum(NotOut))) %>%
  pull(Average)
names(KMave) <- paste("Average =", round(KMave, 2))
KMave
## Average = 31.5 
##           31.5
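
The calculation above divides total runs by the number of dismissals, that is, innings batted minus not-out innings. The sketch below wraps the same arithmetic in a small helper applied to made-up scores; batting_average() is a hypothetical name and not part of cricketdata.

```r
# Hypothetical helper: runs scored divided by number of dismissals
batting_average <- function(runs, not_out) {
  batted <- !is.na(runs)                # drop innings with no score
  dismissals <- sum(batted & !not_out)  # innings batted minus not outs
  if (dismissals == 0) {
    return(NA_real_)                    # undefined if never dismissed
  }
  sum(runs[batted]) / dismissals
}

# Made-up scores: three innings batted, one of them not out
runs    <- c(30, 50, NA, 10)
not_out <- c(FALSE, TRUE, FALSE, FALSE)

batting_average(runs, not_out)  # (30 + 50 + 10) / 2 = 45
```

Because not-out innings add runs without adding a dismissal, this average can be higher than the mean score per innings.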
# Plot ODI scores
ggplot(KMendis) +
  geom_hline(aes(yintercept = KMave), col = "gray") +
  geom_point(aes(x = Start_Date, y = Runs, col = NotOut)) +
  ggtitle("Kusal Mendis ODI Scores")

A noticeable gap is visible around 2021. In July 2021, Mendis was suspended from international cricket for one year. Sri Lanka Cricket agreed to lift the ban early, in January 2022, and Kusal Mendis is now back in the squad.

Keep Exploring! Happy Learning with R!!

Additional Resources

R for Data Science by Hadley Wickham and Garrett Grolemund

This is a great data science book for beginners interested in learning data science with R.

References

Rob Hyndman, Timothy Hyndman, Charles Gray, Sayani Gupta and Jacquie Tran (2022). cricketdata: International Cricket Data. R package version 0.1.1. https://CRAN.R-project.org/package=cricketdata