library(cricketdata)
library(tidyverse)
In this workshop we will use the cricketdata R package, by Rob Hyndman and his team, to get a better understanding of the concepts of exploration, visualization, and potential analyses.
This post was inspired by Rob Hyndman’s recent post on “The cricketdata package”.
There are four key functions in the cricketdata package:
fetch_cricinfo(): Fetch team data on international cricket matches provided by ESPNCricinfo.
fetch_player_data(): Fetch individual player data on international cricket matches provided by ESPNCricinfo.
find_player_id(): Search for the player ID on ESPNCricinfo.
fetch_cricsheet(): Fetch ball-by-ball, match and player data from Cricsheet.
This example shows Sri Lankan men’s ODI batting results by innings.
menODI <- fetch_cricinfo("ODI", "Men", "Batting", type = "innings", country = "Sri Lanka")
colnames(menODI)
## [1] "Date" "Player" "Runs" "NotOut"
## [5] "Minutes" "BallsFaced" "Fours" "Sixes"
## [9] "StrikeRate" "Innings" "Participation" "Opposition"
## [13] "Ground"
# Export Data
write_csv(menODI, "SLmenODI.csv")
# Import Data
data <- read_csv("SLmenODI.csv")
head(data)
## # A tibble: 6 × 13
## Date Player Runs NotOut Minutes BallsFaced Fours Sixes StrikeRate
## <date> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000-10-29 ST Jayasuri… 189 FALSE NA 161 21 4 117.
## 2 2013-07-02 WU Tharanga 174 TRUE 228 159 19 3 109.
## 3 2013-07-20 KC Sangakka… 169 FALSE 200 137 18 6 123.
## 4 2015-02-26 TM Dilshan 161 TRUE 221 146 22 0 110.
## 5 2012-02-28 TM Dilshan 160 TRUE 202 165 11 3 97.0
## 6 2009-12-15 TM Dilshan 160 FALSE 186 124 20 3 129.
## # … with 4 more variables: Innings <dbl>, Participation <chr>,
## # Opposition <chr>, Ground <chr>
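The imported tibble above prints its numeric columns as <dbl>, because read_csv() guesses column types when none are given. The following is a minimal sketch of pinning a type on import; the toy data frame and the names toy, path, guessed and typed are hypothetical and not part of the workshop data.

```r
library(readr)

# Hypothetical toy data with an integer column (not from cricketdata)
toy <- data.frame(Player = c("A", "B"), Runs = c(50L, 12L))

# Write to a temporary file so the sketch leaves nothing behind
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Default import: readr guesses the type of each column
guessed <- read_csv(path)

# Explicit import: column types are stated up front
typed <- read_csv(path, col_types = cols(
  Player = col_character(),
  Runs   = col_integer()
))

class(guessed$Runs)
class(typed$Runs)
```

Stating col_types is optional for a quick exploration, but it keeps a re-imported file consistent with the tibble that was exported.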
Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis.
There are five dplyr functions that you will use to do the vast majority of data manipulations:
filter(): pick observations by their values
select(): pick variables by their names
mutate(): create new variables with functions of existing variables
summarise(): collapse many values down to a single summary
arrange(): reorder the rows
The magrittr package provides %>%, the so-called “pipe” operator, which passes the result on its left as the first argument to the function on its right.
menODI %>%
  filter(Date == "2022-06-19")
## # A tibble: 11 × 13
## Date Player Runs NotOut Minutes BallsFaced Fours Sixes StrikeRate
## <date> <chr> <int> <lgl> <dbl> <int> <int> <int> <dbl>
## 1 2022-06-19 P Nissanka 137 FALSE 203 147 11 2 93.2
## 2 2022-06-19 BKG Mendis 87 TRUE 128 85 8 0 102.
## 3 2022-06-19 N Dickwella 25 FALSE 33 26 5 0 96.2
## 4 2022-06-19 DM de Silva 25 FALSE 25 17 4 0 147.
## 5 2022-06-19 KIC Asalan… 13 TRUE 26 12 0 1 108.
## 6 2022-06-19 C Karunara… 0 TRUE 5 2 0 0 0
## 7 2022-06-19 MD Shanaka 0 FALSE 3 2 0 0 0
## 8 2022-06-19 DN Wellala… NA FALSE NA NA NA NA NA
## 9 2022-06-19 PVD Chamee… NA FALSE NA NA NA NA NA
## 10 2022-06-19 M Theeksha… NA FALSE NA NA NA NA NA
## 11 2022-06-19 JDF Vander… NA FALSE NA NA NA NA NA
## # … with 4 more variables: Innings <int>, Participation <chr>,
## # Opposition <chr>, Ground <chr>
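To see what the pipe does, it can help to write the same steps both with and without it. This is a small sketch on a hypothetical three-row tibble (toy, piped and nested are illustrative names only), not on the cricket data.

```r
library(dplyr)

# Hypothetical toy data, not taken from cricketdata
toy <- tibble(
  Player = c("A", "B", "C"),
  Runs   = c(50L, 12L, 73L)
)

# Piped form: reads left to right, one step per line
piped <- toy %>%
  filter(Runs > 20) %>%
  arrange(desc(Runs))

# Nested form of the same steps: reads from the inside out
nested <- arrange(filter(toy, Runs > 20), desc(Runs))

# Both forms should give the same result
all.equal(piped, nested)
```

The piped form tends to be easier to read once more than two steps are chained, which is why it is used throughout this workshop.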
menODI %>%
select(Date, Player, Runs, StrikeRate, NotOut)
## # A tibble: 9,637 × 5
## Date Player Runs StrikeRate NotOut
## <date> <chr> <int> <dbl> <lgl>
## 1 2000-10-29 ST Jayasuriya 189 117. FALSE
## 2 2013-07-02 WU Tharanga 174 109. TRUE
## 3 2013-07-20 KC Sangakkara 169 123. FALSE
## 4 2015-02-26 TM Dilshan 161 110. TRUE
## 5 2012-02-28 TM Dilshan 160 97.0 TRUE
## 6 2009-12-15 TM Dilshan 160 129. FALSE
## 7 2006-07-04 ST Jayasuriya 157 151. FALSE
## 8 2006-07-01 ST Jayasuriya 152 154. FALSE
## 9 1997-05-17 ST Jayasuriya 151 126. TRUE
## 10 1996-03-06 PA de Silva 145 126. FALSE
## # … with 9,627 more rows
menODI %>%
group_by(Player) %>%
summarise(Runs = mean(Runs), matches = n()) %>%
arrange(desc(Runs))
## # A tibble: 204 × 3
## Player Runs matches
## <chr> <dbl> <int>
## 1 MG Vandort 48 1
## 2 SRD Wettimuny 45.3 3
## 3 KIC Asalanka 43 15
## 4 WIA Fernando 37.1 26
## 5 APB Tennekoon 34.2 4
## 6 ML Udawatte 28.6 9
## 7 P Nissanka 28.2 16
## 8 KNA Bandara 28.2 5
## 9 DA Gunawardene 28 61
## 10 MH Tissera 26 3
## # … with 194 more rows
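Two details of this summary are worth keeping in mind. First, mean() returns NA as soon as one value is missing, and arrange(desc()) places NA last, so any player with an innings recorded as NA (as in the 2022-06-19 rows shown earlier) would drop out of the top of the ranking. Second, n() counts rows, which here are innings rather than matches. The sketch below illustrates both points on hypothetical toy data; the names toy and res are illustrative.

```r
library(dplyr)

# Hypothetical toy innings; NA marks an innings where the player did not bat
toy <- tibble(
  Player = c("A", "A", "A", "B", "B"),
  Runs   = c(40L, NA, 20L, 10L, 30L)
)

res <- toy %>%
  group_by(Player) %>%
  summarise(
    mean_default = mean(Runs),                # NA if any score is missing
    mean_na_rm   = mean(Runs, na.rm = TRUE),  # ignores missing scores
    innings      = n()                        # counts rows, NA rows included
  )
res
```

Whether to drop missing scores depends on the question being asked, so treat na.rm = TRUE as one option rather than the required fix.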
menODI %>%
filter(Player == "BKG Mendis") %>%
arrange(desc(Date))
## # A tibble: 86 × 13
## Date Player Runs NotOut Minutes BallsFaced Fours Sixes StrikeRate
## <date> <chr> <int> <lgl> <dbl> <int> <int> <int> <dbl>
## 1 2022-06-21 BKG Mendis 14 FALSE 22 21 1 1 66.7
## 2 2022-06-19 BKG Mendis 87 TRUE 128 85 8 0 102.
## 3 2022-06-16 BKG Mendis 36 FALSE 75 41 2 1 87.8
## 4 2022-06-14 BKG Mendis 86 TRUE 139 87 8 1 98.9
## 5 2022-01-21 BKG Mendis 36 FALSE 69 51 4 0 70.6
## 6 2022-01-18 BKG Mendis 7 FALSE 21 9 1 0 77.8
## 7 2022-01-16 BKG Mendis 26 FALSE 29 24 6 0 108.
## 8 2021-05-28 BKG Mendis 22 FALSE 56 36 0 1 61.1
## 9 2021-05-25 BKG Mendis 15 FALSE 31 22 0 1 68.2
## 10 2021-05-23 BKG Mendis 24 FALSE 43 36 2 0 66.7
## # … with 76 more rows, and 4 more variables: Innings <int>,
## # Participation <chr>, Opposition <chr>, Ground <chr>
p <- menODI %>%
  filter(Opposition %in% c("Australia", "Bangladesh", "Pakistan")) %>%
  ggplot(aes(y = Runs, x = Date, col = Opposition)) +
  geom_point(alpha = 0.7) +
  geom_smooth() +
  ggtitle("Sri Lanka Men ODI: Runs per Innings")
print(p)
Sri Lanka’s runs per innings against Bangladesh are, on average, higher than those against Australia and Pakistan, although this performance has gradually declined over time.
Next, we demonstrate some of the fielding data available, using Test match fielding from Sri Lankan men’s players.
SLfielding <- fetch_cricinfo("Test", "Men", "Fielding",
country = "Sri Lanka")
head(SLfielding)
## # A tibble: 6 × 11
## Player Start End Matches Innings Dismissals Caught CaughtFielder
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 DPMD Jayawardene 1997 2014 149 270 205 205 205
## 2 KC Sangakkara 2000 2015 134 248 202 182 51
## 3 HAPW Jayawardene 2000 2015 58 102 156 124 0
## 4 N Dickwella 2014 2022 49 88 150 126 1
## 5 HP Tillakaratne 1989 2004 83 141 124 122 89
## 6 RS Kaluwitharana 1992 2004 49 87 119 93 0
## # … with 3 more variables: CaughtBehind <int>, Stumped <int>,
## # MaxDismissalsInnings <dbl>
colnames(SLfielding)
## [1] "Player" "Start" "End"
## [4] "Matches" "Innings" "Dismissals"
## [7] "Caught" "CaughtFielder" "CaughtBehind"
## [10] "Stumped" "MaxDismissalsInnings"
We can plot the number of dismissals against the number of matches for all Sri Lankan men’s Test players.
p1 <- SLfielding %>%
ggplot(aes(x = Matches, y = Dismissals)) +
geom_point() +
ggtitle("Sri Lanka Men Test Fielding")
print(p1)
Because wicket keepers typically have a lot more dismissals than other players, let’s show them in a different colour.
p2 <- SLfielding %>%
mutate(wktkeeper = (CaughtBehind > 0) | (Stumped > 0)) %>%
ggplot(aes(x = Matches, y = Dismissals, col = wktkeeper)) +
geom_point() +
ggtitle("Sri Lanka Men Test Fielding")
print(p2)
Two points stand out as outliers, and they are worth investigating further.
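One non-interactive way to look into such points is to filter on the dismissal count. The sketch below assumes the outliers are the players with more than 200 dismissals, and uses a hand-typed copy of three columns from the head(SLfielding) output above; fielding_head and outliers are hypothetical names. With the full data, SLfielding would take the place of fielding_head.

```r
library(dplyr)

# Hand-typed from the head(SLfielding) output above; only three columns kept
fielding_head <- tibble(
  Player     = c("DPMD Jayawardene", "KC Sangakkara", "HAPW Jayawardene",
                 "N Dickwella", "HP Tillakaratne", "RS Kaluwitharana"),
  Matches    = c(149L, 134L, 58L, 49L, 83L, 49L),
  Dismissals = c(205L, 202L, 156L, 150L, 124L, 119L)
)

# Keep only the players above the assumed threshold of 200 dismissals
outliers <- fielding_head %>%
  filter(Dismissals > 200) %>%
  arrange(desc(Dismissals))
outliers
```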
Interactive data visualization is the use of tools and processes to create a visual representation of data that can be explored and analyzed directly within the visualization itself. This interaction can help to uncover insights that lead to better, data-driven decisions.
The plotly R package allows us to create interactive, publication-quality charts in R.
p3 <- SLfielding %>%
  mutate(wktkeeper = (CaughtBehind > 0) | (Stumped > 0)) %>%
  ggplot(aes(x = Matches, y = Dismissals, col = wktkeeper,
             text = Player)) +
  geom_point() +
  ggtitle("Sri Lanka Men Test Fielding")
library(plotly)
ggplotly(p3)
The wicketkeeper just above 200 dismissals is Kumar Sangakkara, with 202. The other interesting point is the non-wicketkeeper with over 200 dismissals. This is Mahela Jayawardene, who took 205 catches during his career.
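The player names appear on hover because the plot maps Player to the text aesthetic. As a minimal sketch, the tooltip argument of ggplotly() can limit the hover label to that aesthetic alone. The toy data frame and the names toy, p and interactive_plot are hypothetical.

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data, not taken from cricketdata
toy <- data.frame(
  Player     = c("A", "B", "C"),
  Matches    = c(10, 50, 120),
  Dismissals = c(8, 60, 200)
)

# The text aesthetic is not drawn by ggplot2; ggplotly() uses it for hover labels
p <- ggplot(toy, aes(x = Matches, y = Dismissals, text = Player)) +
  geom_point()

# tooltip = "text" shows only the text aesthetic when hovering over a point
interactive_plot <- ggplotly(p, tooltip = "text")
interactive_plot
```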
Finally, consider the data for individual players. The fetch_player_data() function requires the player’s Cricinfo ID, which you can find on the ESPNCricinfo website or by using the find_player_id() function. We’ll look at Kusal Mendis’s ODI results.
KMendis_id <- find_player_id("BKG Mendis")$ID
KMendis <- fetch_player_data(KMendis_id, "ODI") %>%
mutate(NotOut = (Dismissal == "not out"))
head(KMendis)
## # A tibble: 6 × 14
## Start_Date Innings Opposition Ground Runs Mins BF X4s X6s SR Pos
## <date> <int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2016-06-16 1 Ireland Dubli… 51 <NA> 59 8 0 86.44 3
## 2 2016-06-18 1 Ireland Dubli… 8 <NA> 7 1 0 114.… 8
## 3 2016-06-21 1 England Notti… 17 25 14 2 1 121.… 3
## 4 2016-06-24 1 England Birmi… 0 12 9 0 0 0.00 3
## 5 2016-06-26 1 England Brist… 53 83 66 5 1 80.30 3
## 6 2016-06-29 1 England The O… 77 76 64 13 0 120.… 3
## # … with 3 more variables: Dismissal <chr>, Inns <chr>, NotOut <lgl>
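Several columns in this output, such as Mins and BF, print as <chr> rather than as numbers. If numeric summaries of them are needed, one option is to convert them with mutate() and across(). The sketch below uses a hand-typed copy of two columns from the rows printed above; km_head and km_numeric are hypothetical names, and the full data may contain non-numeric placeholders that as.numeric() would turn into NA with a warning.

```r
library(dplyr)

# Hand-typed from the head(KMendis) output above; stored as character there
km_head <- tibble(
  Mins = c(NA, NA, "25", "12", "83", "76"),
  BF   = c("59", "7", "14", "9", "66", "64")
)

# Convert both character columns to numeric in one step
km_numeric <- km_head %>%
  mutate(across(c(Mins, BF), as.numeric))
km_numeric
```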
Let’s plot his runs per innings on the vertical axis against time on the horizontal axis.
# Compute batting average
KMave <- KMendis %>%
filter(!is.na(Runs)) %>%
summarise(Average = sum(Runs) / (n() - sum(NotOut))) %>%
pull(Average)
names(KMave) <- paste("Average =", round(KMave, 2))
KMave
## Average = 31.5
## 31.5
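The batting average computed above divides total runs by the number of dismissals, not by the number of innings, which is why the not-out innings are subtracted in the denominator. A small worked sketch with hypothetical scores (the vectors runs and not_out are illustrative) shows how this differs from a plain mean.

```r
# Hypothetical scores for four innings; the first innings was not out
runs    <- c(40, 10, 25, 5)
not_out <- c(TRUE, FALSE, FALSE, FALSE)

# Plain mean: total runs divided by all four innings
naive_mean <- mean(runs)

# Batting average: total runs divided by the three dismissals
batting_average <- sum(runs) / (length(runs) - sum(not_out))

naive_mean
batting_average
```

Because not-out innings add runs without adding a dismissal, the batting average is never lower than the plain mean of the same scores.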
# Plot ODI scores
ggplot(KMendis) +
geom_hline(aes(yintercept = KMave), col="gray") +
geom_point(aes(x = Start_Date, y = Runs, col = NotOut)) +
ggtitle("Kusal Mendis ODI Scores")
Around 2021, a significant blank space is visible. In July 2021, Mendis was suspended from playing in international cricket for one year. Sri Lanka Cricket agreed to lift the ban early, removing the punishment in January 2022. Now Kusal Mendis is back in the squad.
Keep Exploring! Happy Learning with R!!
R for Data Science by Hadley Wickham and Garrett Grolemund
This is a great book for beginners interested in learning data science with R.
Rob Hyndman, Timothy Hyndman, Charles Gray, Sayani Gupta and Jacquie Tran (2022). cricketdata: International Cricket Data. R package version 0.1.1. https://CRAN.R-project.org/package=cricketdata