Daniel Moser
Document Type: Code & Analysis
Purpose: Data dictionary and cleaning
Despite having a clear research question about the impact of penalty kicks on game outcomes, I made the strategic error of attempting to clean and process the entire available dataset first. This meant working with 79 separate stat-leader files (each covering a single stat and limited to the top 10 players), plus 474 features (many duplicated or irrelevant) spread across 24 files of team-level stats.
What I Did Wrong:
Problems This Identified:
The Real Issue: I was solving interesting technical problems that had no bearing on my actual research question.
Even though this phase failed to produce data relevant to my question, I identified several challenges (and ways to tackle them) that will be helpful if further questions of interest come up. I also gained valuable experience writing and debugging code for dataframe import and manipulation.
After recognizing this was a dead end, I stepped back and properly scoped the project around my original research question.
Key Realization: I had clear research objectives but allowed myself to get lost in the complexity and volume of the available data and in trying to wrangle it into a usable form. This is a classic case of letting the data drive the analysis rather than letting the research question drive the data collection.
What I Learned: Always start with your research question and work backwards to identify the minimum data needed. Data exploration should be targeted and purposeful, not exhaustive.
Refined Approach:
Data Requirements Identified:
EPL Penalty Data Issues:
football-data Issues:
Problem: Neither dataset had explicit match IDs, requiring creation of composite keys.
Solution: Used a combination of season + home team + away team as a unique identifier, leveraging the round-robin league structure where each team plays every other team exactly twice (home and away).
Validation: Cross-referenced team name formats between datasets to ensure proper matching.
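To make the keying concrete, here is a minimal sketch of the approach; the file names are hypothetical, and I assume both frames already expose Season, HomeTeam, and AwayTeam columns (the real datasets label some of these differently):

```python
import pandas as pd

# Hypothetical file names; the real project loads 8 season files per source.
results = pd.read_csv("football_data_2019_20.csv")    # match results
penalties = pd.read_csv("epl_penalties_2019_20.csv")  # penalty events


def add_match_key(df, season="Season", home="HomeTeam", away="AwayTeam"):
    """Build a composite match identifier from season + home team + away team."""
    out = df.copy()
    out["match_id"] = (
        out[season].astype(str).str.strip() + "_"
        + out[home].astype(str).str.strip() + "_"
        + out[away].astype(str).str.strip()
    )
    return out


results = add_match_key(results)
penalties = add_match_key(penalties)

# Validation: team names present in one source but not the other signal a
# formatting mismatch (e.g. "Man United" vs. "Manchester United").
unmatched = set(results["HomeTeam"]).symmetric_difference(penalties["HomeTeam"])
print("Team names needing reconciliation:", unmatched)
```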
Issue: Penalty kick data CSVs had no headers, causing pandas to treat the first row as column names.
Solution: Used column indexes instead of names, then manually assigned appropriate column names after data loading.
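A small sketch of the headerless read; the column positions and names here are illustrative rather than the actual penalty-file schema:

```python
import pandas as pd

# header=None stops pandas from promoting the first data row to column names.
# The positions and names below are illustrative, not the project's schema.
pk_columns = ["Date", "Match", "Taker", "Outcome"]
penalties = pd.read_csv("penalties_2019_20.csv", header=None, names=pk_columns)

# Equivalent two-step version: read by position, then assign names afterwards.
raw = pd.read_csv("penalties_2019_20.csv", header=None)
raw.columns = pk_columns
```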
Issue: One CSV file had an incorrect format due to a web scraping error.
Solution: Manual verification of each file, followed by copy-paste correction of the problematic file while maintaining structural consistency.
* Manual verification was viable because only 8 files needed to be checked
Issue: UnicodeDecodeError when processing files with international player names.
Solution: Implemented a fallback encoding strategy using Latin-1, with Windows-1252 as a backup.
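Roughly what the fallback loader looked like; the exact encoding order shown here is an assumption:

```python
import pandas as pd


def read_with_fallback(path, encodings=("utf-8", "latin-1", "cp1252")):
    """Try encodings in order so international player names don't abort the load."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with any of {encodings}")


penalties = read_with_fallback("penalties_2015_16.csv")  # hypothetical file name
```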
Issue: TypeError when processing match strings - missing values were being read as floats instead of strings.
Root Cause: Extra null rows in 2012-13 season data.
Solution: Added robust null value handling to prevent processing errors, rather than just fixing the single problematic file. This improved code resilience for future use.
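A sketch of that guard, assuming a hypothetical Match column formatted as "Home Team - Away Team"; because pandas reads empty cells as float NaN, the value is checked before any string methods run:

```python
import pandas as pd


def split_match(value):
    """Return (home, away) from a 'Home Team - Away Team' string, or (None, None)
    when the cell is missing; pandas reads empty cells as float NaN."""
    if pd.isna(value):
        return None, None
    home, _, away = str(value).partition(" - ")
    return home.strip(), away.strip()


df = pd.read_csv("penalties_2012_13.csv")
df = df.dropna(how="all")  # drop the fully blank rows that triggered the TypeError
parsed = df["Match"].map(split_match)
df["HomeTeam"] = parsed.str[0]
df["AwayTeam"] = parsed.str[1]
```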
Issue: Simple joins resulted in duplicate rows for games with multiple penalty kicks.
Solution: Aggregated penalty data before joining, creating separate columns for penalties awarded and penalties scored by each of the home and away teams (a minimal sketch follows).
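Here is a sketch of the aggregation on a toy long-format penalty table (one row per kick); the match_id, side, and Outcome names are assumptions carried over from the keying sketch above:

```python
import pandas as pd

# Toy long-format penalty table: one row per kick (column names are assumed).
penalties = pd.DataFrame({
    "match_id": ["2019-20_TeamA_TeamB", "2019-20_TeamA_TeamB", "2019-20_TeamC_TeamD"],
    "side": ["home", "away", "home"],
    "Outcome": ["Scored", "Missed", "Scored"],
})

# Count penalties awarded and scored per match and side, then pivot so every
# match becomes a single row with home_/away_ columns.
agg = (
    penalties
    .assign(scored=penalties["Outcome"].eq("Scored").astype(int))
    .groupby(["match_id", "side"])
    .agg(pk_awarded=("Outcome", "size"), pk_scored=("scored", "sum"))
    .unstack("side", fill_value=0)
)
agg.columns = [f"{side}_{stat}" for stat, side in agg.columns]  # e.g. home_pk_scored
agg = agg.reset_index()
print(agg)
```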
Created 8 season-specific files with standardized columns:
* Season: EPL season identifier
* HomeTeam: Home team name
* AwayTeam: Away team name
* HomeGoals: Total goals scored by home team
* AwayGoals: Total goals scored by away team
* Result: Game outcome (H/A/D)
* home_pk_scored: Penalties scored by home team
* home_pk_awarded: Penalties awarded to home team
* away_pk_scored: Penalties scored by away team
* away_pk_awarded: Penalties awarded to away team

The Python script and 2019-2020 data used to create these files can be found here, along with the 2019-2020 output file.
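For completeness, a rough sketch of the final merge that would produce these columns, continuing the results and agg frames from the sketches above; it assumes the raw football-data headers have already been renamed to HomeGoals, AwayGoals, and Result, and the output file name is hypothetical:

```python
# Continuing the `results` and `agg` frames from the earlier sketches.
final_cols = [
    "Season", "HomeTeam", "AwayTeam", "HomeGoals", "AwayGoals", "Result",
    "home_pk_scored", "home_pk_awarded", "away_pk_scored", "away_pk_awarded",
]

season_df = results.merge(agg, on="match_id", how="left")
# Matches with no penalty have no row in `agg`; treat those as zero kicks.
pk_cols = [c for c in final_cols if c.endswith(("_scored", "_awarded"))]
season_df[pk_cols] = season_df[pk_cols].fillna(0).astype(int)

season_df[final_cols].to_csv("epl_pk_2019_20.csv", index=False)
```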
Start with the Research Question: The biggest lesson was that I should have worked backwards from my research question to identify the minimum viable dataset. Getting distracted by comprehensive data cleaning was a significant time sink that didn’t advance my actual goals.
Data Exploration Should Be Targeted: While understanding your data is important, exploration should be purposeful and bounded by your analytical objectives, not driven by curiosity about everything that’s available.
Minimum Viable Data First: For future projects, I would identify the smallest dataset that can answer my core research question, prove the analysis works, then expand scope if needed.
Data Quality Assessment: Always verify data integrity before processing. The time spent on manual verification prevented downstream analytical errors.
Robust Error Handling: Building resilient code that handles edge cases (like encoding errors and null values) is more valuable than quick fixes.
Beware of Interesting Side Problems: Complex data structure problems can be intellectually engaging but may not be relevant to your research objectives. Stay focused on what matters for your analysis.