A Python-based data pipeline for football recruitment analytics using FBref data. This project processes player statistics, computes role-based scores, generates player shortlists, and identifies comparable players for specific tactical roles.
This project extracts, transforms, and analyzes player performance data from FBref to support football recruitment decisions. It evaluates players against specific tactical roles (Ball-Playing CB, Deep-Lying Playmaker, Winger Creator) using weighted scoring algorithms and provides actionable insights through:
- Role Scoring: Evaluates players against defined tactical roles with weighted metrics
- Player Shortlists: Generates ranked lists of top candidates per role
- Comparables Analysis: Identifies similar players using cosine similarity on scaled percentile features
- Tableau-Ready Exports: Produces dimension and fact tables optimized for visualization
- Multi-League Data Extraction: Supports multiple leagues and seasons via FBref
- Comprehensive Metrics: Processes standard, passing, defense, possession, shooting, and advanced stats
- Percentile Ranking: Contextualizes player performance within league/season/position groups
- Role-Based Analytics: Configurable role definitions with must-have criteria and weighted scoring
- Risk Flagging: Identifies potential concerns (age, playing time, error rates)
- Data Mart Architecture: Organized dimensional modeling for analytics consumption
- Python >= 3.12, < 3.13
- pip or compatible package manager
- Clone the repository:
git clone <repository-url>
cd recruitment-support-v1
- Install the package and dependencies:
pip install -e .
- Install development dependencies (optional):
pip install -e ".[dev]"

- soccerdata>=1.8.0 - FBref data extraction
- pandas>=2.2 - Data manipulation
- numpy>=2.0 - Numerical operations
- pyarrow>=16.0 - Parquet file support
- pyyaml>=6.0 - Configuration file parsing
- scikit-learn>=1.5 - Machine learning utilities for similarity calculations
- typer>=0.12 - CLI framework
- rich>=13.0 - Enhanced terminal output
The pipeline is configured via YAML files in the configs/ directory.
project:
  name: recruitment-support-fbref
  version: v1
fbref:
  leagues: ["ENG-Premier League"]
  seasons: ["2425"]
  data_dir: "data/soccerdata_cache/FBref"
  no_cache: false
  no_store: false
filters:
  min_minutes: 900
  league_scope: "ENG-Premier League"
  same_league_only: true
roles:
  role_defs_path: "configs/roles_v1.yaml"
  position_map_path: "configs/position_map.yaml"
exports:
  out_dir: "data/exports/tableau"

configs/roles_v1.yaml defines tactical roles with:
- Position buckets: Eligibility criteria (CB, DMCM, WIDE)
- Must-have filters: Minimum thresholds (minutes, key metrics)
- Weighted scoring: Relative importance of metrics per role
- Negative metrics: Metrics where lower values are better
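
To make the four elements above concrete, here is a minimal sketch of how a role definition could be turned into a score. It is not the code in src/rsfbref/analytics/roles.py: the metric names, weights, and thresholds in ROLE_BPCB are illustrative assumptions, as is the choice to invert a negative metric as 100 minus its percentile.

```python
from typing import Optional

# Hypothetical role definition. The keys mirror the fields described above,
# but the metrics, weights, and thresholds are illustrative only.
ROLE_BPCB = {
    "role_id": "BPCB",
    "position_bucket": "CB",
    "must_have": {"minutes": 900},
    "weights": {
        "pct_prog_passes_p90": 0.5,
        "pct_pass_cmp_pct": 0.3,
        "pct_errors_p90": 0.2,
    },
    "negative_metrics": ["pct_errors_p90"],
}


def score_role(player: dict, role: dict) -> Optional[float]:
    """Return a 0-100 weighted score, or None if the player is ineligible."""
    # Position bucket decides eligibility.
    if player["position_bucket"] != role["position_bucket"]:
        return None
    # Must-have filters are minimum thresholds.
    for metric, minimum in role["must_have"].items():
        if player.get(metric, 0) < minimum:
            return None
    total_weight = sum(role["weights"].values())
    weighted = 0.0
    for metric, weight in role["weights"].items():
        pct = player[metric]
        if metric in role["negative_metrics"]:
            pct = 100.0 - pct  # lower raw value is better, so invert (assumption)
        weighted += weight * pct
    return weighted / total_weight
```

A center-back with percentiles 80, 70, and 10 on these three metrics would score 0.5 × 80 + 0.3 × 70 + 0.2 × 90 = 79 under this sketch.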
Currently defined roles:
- BPCB (Ball-Playing CB): Center-back with strong distribution and progression
- DLP (Deep-Lying Playmaker): Midfielder focused on deep-lying creative passing
- WCR (Winger Creator): Wide attacker with dribbling and chance creation
The pipeline consists of four main scripts that should be run in sequence:
Extracts data from FBref, processes player statistics, and computes role scores.
python scripts/run_pipeline.py
What it does:
- Downloads/caches FBref data for specified leagues and seasons
- Flattens and merges multiple stat tables (standard, passing, defense, etc.)
- Applies data cleaning and filtering (minimum minutes, position buckets)
- Calculates canonical metrics (per-90 rates, percentages)
- Computes percentile rankings within league/season/position groups
- Applies role scoring algorithms based on configs/roles_v1.yaml
Output:
- data/intermediate/player_season_base.parquet - Raw merged data
- data/intermediate/player_season_clean.parquet - Cleaned and filtered data
- data/intermediate/player_season_scored.parquet - Data with percentiles and role scores
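
The percentile step can be sketched in plain Python as follows. This is an illustration, not the code in src/rsfbref/features/percentiles.py: it assumes an average-rank percentile (ties share the midpoint of their ranks, scaled to 0-100), and the function name add_percentiles is hypothetical. The real pipeline works on pandas DataFrames.

```python
from collections import defaultdict


def add_percentiles(rows, metric, group_keys=("league", "season", "position_bucket")):
    """Add pct_<metric> (0-100) to each row, ranked within its group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row)

    for members in groups.values():
        values = [m[metric] for m in members]
        n = len(values)
        for m in members:
            below = sum(v < m[metric] for v in values)
            equal = sum(v == m[metric] for v in values)
            # Average rank of this value within the group, divided by group size.
            m[f"pct_{metric}"] = 100.0 * (below + 0.5 * (equal + 1)) / n
    return rows
```

Because ranking happens inside each (league, season, position bucket) group, a center-back is only ever compared with other center-backs from the same league and season.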
Creates dimensional data marts and exports to Tableau format.
python scripts/build_marts.py
What it does:
- Generates unique player and team identifiers
- Builds dimension tables (dim_player, dim_team)
- Creates fact tables (fact_player_season, fact_role_profile_card)
- Exports all tables as both Parquet and CSV formats
Output:
- data/marts/dim_player.parquet & .csv
- data/marts/dim_team.parquet & .csv
- data/marts/fact_player_season.parquet & .csv
- data/marts/fact_role_profile_card.parquet & .csv
- CSV files also written to data/exports/tableau/
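
The identifiers are SHA1 hashes, so they stay stable across runs. A minimal sketch of that idea, assuming the hash is taken over a normalized, pipe-joined key; which fields the project actually hashes is not stated here, so the inputs and the helper name make_id are hypothetical:

```python
import hashlib


def make_id(*parts: str) -> str:
    """Build a deterministic identifier by hashing normalized key parts with SHA1."""
    key = "|".join(p.strip().lower() for p in parts)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# Illustrative inputs only; the real key fields are an assumption.
player_id = make_id("Example Player", "2000")
team_id = make_id("Example FC", "ENG-Premier League", "2425")
```

Hashing a natural key rather than using an auto-incrementing integer means re-running the pipeline yields the same IDs, which keeps joins in Tableau stable.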
Identifies similar players for each role using cosine similarity on percentile features.
python scripts/build_comparables.py [--top-n 10]
Options:
- --top-n: Number of comparables per player (default: 10)
What it does:
- For each eligible player in each role, finds most similar players
- Uses RobustScaler normalization and cosine distance
- Generates reason codes explaining why players are similar
- Filters to same league/season scope (v1)
Output:
- data/marts/fact_comparables.parquet
- data/exports/tableau/fact_comparables.csv
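
The similarity step can be illustrated without scikit-learn. The sketch below scales each feature by its median and interquartile range (the same idea as RobustScaler with default settings) and then ranks other players by cosine distance to an anchor. The function names and the input shape are assumptions for illustration, and reason codes are omitted.

```python
import math
from statistics import median, quantiles


def robust_scale(column):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(column, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # avoid division by zero for constant columns
    med = median(column)
    return [(x - med) / iqr for x in column]


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def top_comparables(features, anchor, top_n=10):
    """features: {player_id: [pct values]}. Returns [(player_id, distance)], closest first."""
    ids = list(features)
    # Scale column by column, then rebuild one vector per player.
    scaled_cols = [robust_scale(list(col)) for col in zip(*(features[i] for i in ids))]
    scaled = {pid: [col[k] for col in scaled_cols] for k, pid in enumerate(ids)}
    ranked = sorted(
        ((pid, cosine_distance(scaled[anchor], vec)) for pid, vec in scaled.items() if pid != anchor),
        key=lambda pair: pair[1],
    )
    return ranked[:top_n]
```

Cosine distance compares the shape of two profiles rather than their magnitude, so two players with the same strengths and weaknesses rank as close even if one has uniformly higher percentiles.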
Generates ranked shortlists of top candidates per role.
python scripts/build_shortlist.py [--top-n 50]
Options:
- --top-n: Number of players per role shortlist (default: 50)
What it does:
- Ranks eligible players by role score
- Computes sub-scores (progression, creation, defending, security, finishing)
- Adds risk flags (low minutes, high errors, age concerns)
- Includes evidence strings (key metrics with percentiles)
- Filters to top N per role
Output:
- data/marts/fact_shortlist.parquet
- data/exports/tableau/fact_shortlist.csv
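
A rough sketch of the ranking and risk-flag steps, in plain Python. The flag names and every threshold (1200 minutes, age 30, 80th percentile for errors) are assumptions chosen for illustration, not the values used in src/rsfbref/analytics/shortlist.py. Sub-scores and evidence strings are left out.

```python
def risk_flags(player, low_minutes=1200, max_age=30, high_errors_pct=80.0):
    """Return a list of risk labels. All thresholds are illustrative assumptions."""
    flags = []
    if player.get("minutes", 0) < low_minutes:
        flags.append("low_minutes")
    if player.get("age", 0) > max_age:
        flags.append("age_concern")
    if player.get("pct_errors_p90", 0.0) >= high_errors_pct:
        flags.append("high_errors")
    return flags


def build_shortlist(players, role_id, top_n=50):
    """Rank players with a non-null role score and keep the top N."""
    score_key = f"score_{role_id}"
    eligible = [p for p in players if p.get(score_key) is not None]
    ranked = sorted(eligible, key=lambda p: p[score_key], reverse=True)[:top_n]
    shortlist = []
    for rank, p in enumerate(ranked, start=1):
        flags = risk_flags(p)
        shortlist.append({
            "player_name": p["player_name"],
            "role_id": role_id,
            "rank": rank,
            "total_score": p[score_key],
            "risk_flags": ",".join(flags),  # comma-separated, as in fact_shortlist
            "risk_count": len(flags),
        })
    return shortlist
```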
recruitment-support-v1/
├── configs/ # Configuration files
│ ├── v1.yaml # Main pipeline configuration
│ ├── roles_v1.yaml # Role definitions and scoring weights
│ └── position_map.yaml # Position mapping rules
├── scripts/ # Executable pipeline scripts
│ ├── run_pipeline.py # Data extraction and scoring
│ ├── build_marts.py # Dimension and fact table creation
│ ├── build_comparables.py # Similarity analysis
│ └── build_shortlist.py # Shortlist generation
├── src/rsfbref/ # Main package code
│ ├── config.py # Configuration loading
│ ├── io/ # Data I/O
│ │ └── fbref_reader.py # FBref data extraction wrapper
│ ├── transform/ # Data transformation
│ │ ├── flatten.py # Column name flattening
│ │ ├── player_season.py # Stat table merging
│ │ └── clean_player_season.py # Data cleaning and feature engineering
│ ├── features/ # Feature engineering
│ │ └── percentiles.py # Percentile calculations
│ ├── analytics/ # Analytics algorithms
│ │ ├── roles.py # Role scoring logic
│ │ ├── comparables.py # Player similarity analysis
│ │ └── shortlist.py # Shortlist generation
│ ├── marts/ # Data mart builders
│ │ ├── build_dims.py # Dimension table creation
│ │ └── build_facts.py # Fact table creation
│ └── export/ # Export utilities
│ └── tableau.py # Tableau CSV export
├── data/ # Data directory (gitignored)
│ ├── intermediate/ # Intermediate processing files
│ ├── marts/ # Final data marts (Parquet)
│ ├── exports/ # Export files (CSV)
│ └── soccerdata_cache/ # FBref data cache
└── pyproject.toml # Package configuration
1. FBref Data Extraction
└─> Multiple stat tables (standard, passing, defense, etc.)
2. Data Transformation
├─> Flatten column names (MultiIndex → snake_case)
├─> Merge tables on (league, season, team, player)
└─> Clean and filter (position buckets, minimum minutes)
3. Feature Engineering
├─> Calculate per-90 metrics
├─> Compute percentages and rates
└─> Generate percentile rankings (by league/season/position)
4. Role Scoring
├─> Apply must-have filters
├─> Calculate weighted scores from percentiles
└─> Generate role eligibility flags
5. Analytics
├─> Build comparables (cosine similarity)
├─> Generate shortlists (ranked by score)
└─> Compute risk flags and sub-scores
6. Data Marts
├─> Create dimension tables (players, teams)
├─> Create fact tables (player seasons, role profiles, etc.)
└─> Export to Parquet and CSV formats
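
As an illustration of step 2, the merge on (league, season, team, player) can be sketched in plain Python. The project itself merges pandas DataFrames in src/rsfbref/transform/player_season.py; the outer-join behaviour and the function name merge_stat_tables here are assumptions.

```python
KEYS = ("league", "season", "team", "player")


def merge_stat_tables(*tables):
    """Outer-join lists of row dicts on KEYS; each table contributes its own columns."""
    merged = {}
    for table in tables:
        for row in table:
            key = tuple(row[k] for k in KEYS)
            # First sighting creates the record; later tables add columns to it.
            merged.setdefault(key, {}).update(row)
    return list(merged.values())
```

Keying on all four fields keeps a player who moved clubs mid-season as two separate rows, one per team.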
- pass_cmp_pct: Pass completion percentage
- passes_att_p90: Passes attempted per 90 minutes
- prog_passes_p90: Progressive passes per 90
- passes_final_third_p90: Passes into final third per 90
- long_pass_cmp_pct: Long pass completion percentage

- key_passes_p90: Key passes per 90
- xa_p90: Expected assists per 90
- sca_p90: Shot-creating actions per 90
- crosses_pa_p90: Crosses into penalty area per 90

- tkl_int_p90: Tackles + interceptions per 90
- clr_p90: Clearances per 90
- errors_p90: Errors leading to shots per 90
- aerial_win_pct: Aerial duel win percentage

- prog_carries_p90: Progressive carries per 90
- carries_pa_p90: Carries into penalty area per 90
- succ_takeons_p90: Successful take-ons per 90
- takeon_succ_pct: Take-on success percentage

- mis_dis_p90: Miscontrols + dispossessions per 90
- fouls_p90: Fouls committed per 90

- Per_90_Minutes_npxG: Non-penalty expected goals per 90
All metrics are also available as percentile ranks (pct_*) within league/season/position buckets.
Players are categorized into position buckets for role eligibility:
- GK: Goalkeepers (excluded from v1 roles)
- CB: Center-backs (for BPCB role)
- DMCM: Defensive/Central midfielders (for DLP role)
- WIDE: Wide forwards/wingers (for WCR role)
- OTHER: Players not fitting above categories
dim_player
- player_id: Unique player identifier (SHA1 hash)
- player_name: Player name
- nation: Nationality
- age: Age
- born: Birth date
- position_raw: Raw FBref position
- position_bucket: Categorized position
dim_team
- team_id: Unique team identifier (SHA1 hash)
- team_name: Team name
- league: League name
- season: Season identifier
fact_player_season
- All dimension keys (player_id, team_id, league, season)
- Playing time (minutes, nineties)
- All canonical metrics
- All percentile columns (pct_*)
- Role scores (score_BPCB, score_DLP, score_WCR)
fact_role_profile_card
- Long-format table for visualization
- One row per (player, team, season, role, KPI)
- Contains kpi_name, kpi_value, kpi_pct, role_score
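
The long format can be pictured as an unpivot of one wide player-season row into one record per KPI. A minimal sketch, assuming each KPI has a matching pct_ column and the role score sits in score_<role_id>; the function name and the example KPI list are hypothetical:

```python
def to_profile_card(row, role_id, kpis):
    """Expand one wide player-season row into one record per KPI for a role."""
    return [
        {
            "player_id": row["player_id"],
            "team_id": row["team_id"],
            "season": row["season"],
            "role_id": role_id,
            "kpi_name": kpi,
            "kpi_value": row[kpi],
            "kpi_pct": row[f"pct_{kpi}"],
            "role_score": row[f"score_{role_id}"],
        }
        for kpi in kpis
    ]
```

Tableau handles this shape well for radar-style or bar profile views, because the KPI name becomes a dimension that can be filtered per role.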
fact_comparables
- anchor_player_id, comparable_player_id
- role_id, distance, rank
- reason_1, reason_2, reason_3: Similarity explanations
fact_shortlist
- Top N players per role
- total_score: Overall role score
- sub_progression, sub_creation, sub_defending, etc.: Sub-scores
- risk_flags: Comma-separated risk indicators
- risk_count: Number of risk flags
- evidence_1 through evidence_5: Key metric strings
- Edit configs/roles_v1.yaml to add a role definition:
  - Define role_id, role_name, position_bucket
  - Set must_have criteria
  - Configure weights for scoring
  - Specify negative_metrics if applicable
- Update scripts/build_comparables.py and scripts/build_shortlist.py to include the new role_id in their loops
- Add role-specific features in src/rsfbref/analytics/comparables.py if needed
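
Before wiring a new role into the scripts, it can help to sanity-check the definition. The helper below is a hypothetical sketch, not part of the package: it assumes a role loads from YAML as a dict with the fields listed above, and the specific checks are suggestions.

```python
REQUIRED_KEYS = {"role_id", "role_name", "position_bucket", "must_have", "weights"}
KNOWN_BUCKETS = {"GK", "CB", "DMCM", "WIDE", "OTHER"}


def validate_role(role: dict) -> list:
    """Return a list of problems found in a role definition (empty if none)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - role.keys())]
    if role.get("position_bucket") not in KNOWN_BUCKETS:
        problems.append(f"unknown position_bucket: {role.get('position_bucket')}")
    weights = role.get("weights", {})
    if any(w <= 0 for w in weights.values()):
        problems.append("weights must be positive")
    # A negative metric only has an effect if it is also weighted.
    unweighted = sorted(set(role.get("negative_metrics", [])) - set(weights))
    if unweighted:
        problems.append(f"negative_metrics without weights: {unweighted}")
    return problems
```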
- Add the metric calculation in src/rsfbref/transform/clean_player_season.py
- Include the metric in the metric_cols list in scripts/run_pipeline.py
- Add it to the appropriate role definitions in configs/roles_v1.yaml
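
Most canonical metrics are per-90 rates or percentages, so a new metric usually reduces to one of two small calculations. A sketch in plain Python; the helper names and the raw column prog_passes are assumptions, and the real code operates on DataFrame columns:

```python
def per_90(total, minutes):
    """Scale a season total to a per-90-minutes rate; None when no minutes were played."""
    if not minutes:
        return None
    return total * 90.0 / minutes


def pct_of(successes, attempts):
    """Success rate as a percentage; None when there were no attempts."""
    if not attempts:
        return None
    return 100.0 * successes / attempts


# Hypothetical raw row: 30 progressive passes in 1800 minutes.
row = {"prog_passes": 30, "minutes": 1800}
row["prog_passes_p90"] = per_90(row["prog_passes"], row["minutes"])
```

Returning None instead of dividing by zero keeps players with no attempts out of the percentile ranking rather than assigning them a misleading 0.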
Modify configs/v1.yaml to:
- Change the leagues list
- Adjust the seasons list
- Update data_dir for the cache location
FBref data is cached locally in data/soccerdata_cache/FBref/ to avoid re-downloading. To refresh data:
- Set no_cache: true in the config to ignore cached files and re-download
- Delete the cache directory and re-run
- Set no_store: true to stop downloaded data from being written to the cache
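
For orientation, the fbref section of the config maps closely onto the arguments of the soccerdata FBref reader. The sketch below shows one way the mapping could look; it is not the code in src/rsfbref/io/fbref_reader.py, and the helper names fbref_kwargs and make_reader are hypothetical.

```python
from pathlib import Path


def fbref_kwargs(cfg: dict) -> dict:
    """Translate the fbref section of the config into reader keyword arguments."""
    fb = cfg["fbref"]
    return {
        "leagues": fb["leagues"],
        "seasons": fb["seasons"],
        "data_dir": Path(fb["data_dir"]),
        "no_cache": bool(fb.get("no_cache", False)),  # True: ignore cached files
        "no_store": bool(fb.get("no_store", False)),  # True: do not write new downloads
    }


def make_reader(cfg: dict):
    """Create the reader. soccerdata is imported lazily so fbref_kwargs has no dependencies."""
    import soccerdata as sd

    return sd.FBref(**fbref_kwargs(cfg))
```

Setting both flags to true gives a fully uncached run: nothing is read from disk and nothing is written back.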
Run tests with pytest:
pytest tests/