Penalty Kicks in Professional Soccer

Daniel Moser

View the Project on GitHub djmwa/penaltykicks

Document Type: Code & Analysis
Purpose: Data acquisition and initial exploration


Data Acquisition Strategy

As with most data-driven projects, the first step after framing the research question (see Introduction) is acquiring appropriate and sufficient data. I began with the English Premier League (EPL), on the assumption that it is the most followed and best-documented league and therefore the most likely to have accessible, structured historical data.

My hope was that a reliable EPL data source would also contain data for other leagues of interest. The official Premier League website does offer detailed match-level data, but penalty kick (PK) information appears only on individual match pages. With nearly 400 match pages per season, scraping it was impractical as a first approach.

After surveying multiple sources, I identified three primary sites that collectively offer:

- Season-level team and player statistics (FBRef)
- Match-level results going back nearly 30 years (Football-Data.co.uk)
- Game-level penalty kick data for 8 EPL seasons (EPL Review)

While the match-level PK dataset is limited, it provides a good foundation for initial analysis. I will continue looking for more comprehensive PK data, but this is a sufficient base to begin quantifying game-level PK impact and to test how well predictions based on season-level PK statistics agree with those based on game-level PK data.

Data Sources and Scripts

1. FBRef

Site: FBRef EPL Stats (2024–25)
Data: Season-level team and player statistics for all 12 leagues, including shooting, scoring, goalkeeping, passing, fouls, cards, and more.

Key Challenges:

Solutions:

FBRef Python Script (EPL)

Note: The biggest lift was handling ambiguous or non-unique table identifiers due to the tabbed layout and commented HTML. Once parsed correctly, looping through seasons and exporting structured data became straightforward.
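
To illustrate the commented-HTML problem, here is a minimal sketch that uses only Python's standard library to list table ids, including tables wrapped in HTML comments. It is not the project's actual script: the class name, the helper function, and the table ids in the sample are all made up for the example.

```python
from html.parser import HTMLParser


class TableIdCollector(HTMLParser):
    """Collect the id of every <table>, including tables hidden in comments."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs; the id may be absent.
            self.table_ids.append(dict(attrs).get("id"))

    def handle_comment(self, data):
        # Commented-out markup is skipped by a normal parse, so parse the
        # comment body separately and merge whatever tables it contains.
        if "<table" in data:
            inner = TableIdCollector()
            inner.feed(data)
            inner.close()
            self.table_ids.extend(inner.table_ids)


def find_table_ids(html):
    """Return table ids in document order (hypothetical helper)."""
    parser = TableIdCollector()
    parser.feed(html)
    parser.close()
    return parser.table_ids


# Hypothetical page fragment: one visible table and one inside a comment.
sample = """
<div><table id="stats_standard"><tr><td>1</td></tr></table></div>
<!-- <div><table id="stats_shooting"><tr><td>2</td></tr></table></div> -->
"""
print(find_table_ids(sample))  # ['stats_standard', 'stats_shooting']
```

Listing the ids first makes it easier to spot duplicate or missing identifiers before committing to a per-season loop.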

2. Football-Data.co.uk

Site: Football-Data.co.uk
Data: Match-level results for nearly 30 years, including goals, half-time scores, shots, fouls, and cards. Available for all 12 leagues.

Key Challenges:

Solutions:

Football-Data Python Script

Note: Adding fault tolerance to the download function was crucial, because initial runs either failed silently or halted mid-script. Logging failed downloads to a list let me retry them or complete them manually.
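
The error-list pattern described in this note could look roughly like the sketch below. It is an illustration under assumptions rather than the project's script: `fetch_bytes` and `download_all` are hypothetical names, and the URLs would come from whatever season and league list is being looped over.

```python
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=30):
    """Download one URL and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch_bytes):
    """Try every URL; return (results, failures) instead of stopping on error."""
    results = {}
    failures = []  # (url, error message) pairs to retry or handle by hand
    for url in urls:
        try:
            results[url] = fetch(url)
        except (urllib.error.URLError, OSError, ValueError) as exc:
            # Record the failure and keep going so one bad file
            # does not halt the whole run.
            failures.append((url, str(exc)))
    return results, failures
```

A second pass is then just `download_all([url for url, _ in failures])`, and anything still failing after that can be fetched manually. Passing `fetch` as a parameter also makes the loop easy to test without touching the network.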

3. EPL Review

Site: EPL Review
Data: Game-level penalty kick data (attempts and conversions) for 8 EPL seasons

Key Challenges:

Solutions:

EPL Review Python Script

Note: This site had the most brittle structure, with no predictable HTML tags or consistent formats. A row-count heuristic works for now but will need re-evaluation if the site structure changes.
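
The exact heuristic is not spelled out here, so the following is only one plausible shape for it: pick the single table whose row count falls in an expected range, and fail loudly otherwise. The function name, the row bounds, and the sample data are assumptions made for illustration.

```python
def pick_table_by_row_count(tables, min_rows, max_rows):
    """Return the single table whose row count is within [min_rows, max_rows].

    `tables` is a list of tables, each a list of rows. Raising when zero or
    several tables match turns a silent mis-scrape into a visible error if
    the page layout changes.
    """
    matches = [t for t in tables if min_rows <= len(t) <= max_rows]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly 1 matching table, found {len(matches)}"
        )
    return matches[0]


# Hypothetical parsed page: a 2-row navigation table and a 5-row data table.
nav_table = [["Home", "About"], ["Contact", "Archive"]]
pk_table = [["Team A", "Team B", "1", "1"]] * 5
chosen = pick_table_by_row_count([nav_table, pk_table], min_rows=4, max_rows=10)
print(len(chosen))  # 5
```

The strict "exactly one match" check is the design choice that matters: a looser rule would keep running after a layout change and quietly return the wrong table.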

Final Thoughts

This first pass at data collection prioritized completeness and reproducibility over elegance. Where site structure allowed, I leaned on reusable scripts and common patterns; where it didn’t, I opted for pragmatic one-off solutions.

Next steps will include:

I expect the next update will focus on initial trends identified during exploratory data analysis (EDA).

