A reproducible data pipeline and CLI tool that predicts the probability of finding a seat on NYC subway trains at specific stations, lines, directions, and times.
Given a query (station, line, direction, date, hour), this tool returns:
P(seat_available_on_boarding) = probability that a boarding rider can immediately find a seat.
The model combines:
- Hourly ridership data (MTA Socrata API)
- Origin-destination flows (OD matrix)
- GTFS schedules (trains per hour, route patterns)
- Rolling stock specifications (cars per train, seats per car)
- Logistic seat availability model (queueing-inspired, calibrated for peak crowding)

Features:
- Data Ingestion: Automated pulls from NYC Open Data (Socrata API) and GTFS
- Feature Engineering: TPH calculation, directional demand splitting, seat supply modeling
- Probabilistic Model: Logistic curve with peak crowding adjustments
- CLI Interface: Simple commands for predictions and visualizations
- EDA Notebook: Comprehensive analysis and validation
- Configurable: YAML-based configuration for all parameters

Prerequisites:
- Python 3.8+
- pip

git clone <your-repo-url>
cd TrainSeat
pip install -r requirements.txt

For higher API rate limits, register for a free Socrata app token:
- Visit https://data.ny.gov/
- Create an account and request an app token
- Set environment variable:
export SOCRATA_APP_TOKEN="your_token_here"

Run the interactive program with ASCII art and step-by-step prompts:
Windows (PowerShell):
.\run_train_seat.bat

Windows (CMD):
run_train_seat.bat

Mac/Linux:
./run_train_seat.sh

Or directly:
python train_seat.py

You'll be guided through:
- 🎨 NYC train station ASCII art welcome
- 🚇 Select your subway line (1-7, A-Z, etc.)
- 🧭 Choose direction (Northbound/Southbound/Manhattan-bound/etc.)
- 🚉 Pick your station from a numbered list
- 📅 Enter date and time
- 🎯 Get detailed seat probability with interpretation!
Sample interaction:
╔═══════════════════════════════════════════════════════════════════════════╗
║ ███╗ ██╗██╗ ██╗ ██████╗ ███████╗██╗ ██╗██████╗ ██╗ ██╗ ║
║ ████╗ ██║╚██╗ ██╔╝██╔════╝ ██╔════╝██║ ██║██╔══██╗██║ ██║ ║
║ ██╔██╗ ██║ ╚████╔╝ ██║ ███████╗██║ ██║██████╔╝██║ █╗ ██║ ║
║ 🚇 SEAT AVAILABILITY PREDICTOR 🚇 ║
╚═══════════════════════════════════════════════════════════════════════════╝
🚇 SELECT YOUR SUBWAY LINE
==============================================================================
1. [ 1] 2. [ 2] 3. [ 3]
4. [ 4] 5. [ 5] 6. [ 6]
7. [ 7] 8. [ A] 9. [ B]
...
Select line number (1-24): 8
🧭 SELECT DIRECTION
1. Northbound / Uptown
2. Southbound / Downtown
...
Select direction (1-7): 1
🚉 SELECT YOUR STATION
==============================================================================
1. Inwood - 207 St
2. Dyckman St
3. 190 St
4. 181 St
...
20. Chambers St
Select station number (1-20): 10
📅 SELECT DATE AND TIME
Year (e.g., 2025): 2025
Month (1-12): 10
Day (1-31): 15
Hour (0-23): 8
🔄 Calculating seat availability...
🎯 SEAT AVAILABILITY PREDICTION
==============================================================================
🔴 PROBABILITY: 18.5% (LOW)
📊 DETAILED METRICS
──────────────────────────────────────────────────────────────────────────────
🚆 TRAIN INFORMATION:
• Station: 59 St - Columbus Circle
• Line: A
• Direction: N
• Date/Time: 2025-10-15T08:00:00
📈 CAPACITY METRICS:
• Trains per hour: 15.0
• Cars per train: 10
• Seats per car: 30
• Total seats per train: 300
👥 DEMAND METRICS:
• Expected boardings per train: 425.3
• Train load ratio: 1.42
• Hourly load ratio: 2.13
• Peak period: Yes ⚡
──────────────────────────────────────────────────────────────────────────────
💡 INTERPRETATION
──────────────────────────────────────────────────────────────────────────────
⚠ Low chance of finding a seat
⚠ Consider alternate route or time
⚠️ NOTICE: Demand exceeds seat capacity
💡 Most passengers will be standing
==============================================================================
🔄 Run another prediction? (y/n):
For automation and scripting, use the CLI tools:
Download GTFS and ridership data:
python cli.py fetch-gtfs
python cli.py fetch-ridership --start-date 2024-09-01 --end-date 2024-09-30

Predict seat availability for a specific query:
python cli.py prob \
--station "14 St-Union Sq" \
--line "4" \
--dir "N" \
--datetime "2025-10-01T08:00:00-04:00"

Visualize seat availability across all hours of a month. The CLI lists every stop on the line and lets you pick one:
python cli.py heatmap \
--line "L" \
--dir "Manhattan" \
--month "2025-10" \
--output "bedford_oct.png"

Interactive prompt:
Found 24 stops on line L (Manhattan):
============================================================
1. 8 Av
2. 6 Av
3. Union Sq - 14 St
4. 3 Av
5. 1 Av
6. Bedford Av
...
Select stop number: 6
Get structured output for programmatic use:
python cli.py prob \
--station "Times Sq-42 St" \
--line "1" \
--dir "S" \
--datetime "2025-10-01T18:00:00-04:00" \
--format json

| Command | Description |
|---|---|
| `prob` | Predict seat probability for a single query |
| `heatmap` | Generate hourly heatmap for a month |
| `fetch-ridership` | Download ridership data from Socrata |
| `fetch-gtfs` | Download GTFS static data |

Options for prob:

| Option | Required | Description | Example |
|---|---|---|---|
| `--station` | Yes | Station name or complex | "14 St-Union Sq" |
| `--line` | Yes | Subway line | "4", "L", "A" |
| `--dir` | Yes | Direction | N, S, E, W, Manhattan, Brooklyn |
| `--datetime` | Yes | ISO datetime | "2025-10-01T08:00:00-04:00" |
| `--format` | No | Output format (text or json) | `json` |
| `--config` | No | Path to config file | ./config/default.yaml |
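
The `--datetime` value is an ISO 8601 timestamp, optionally with a UTC offset. As a rough illustration of how such a value can be broken into the pieces an hourly model needs, here is a minimal sketch using only the standard library. The function name `parse_query_datetime` and its return shape are assumptions for this example, not part of the TrainSeat code.

```python
from datetime import datetime


def parse_query_datetime(value):
    """Split an ISO 8601 string into (date, hour, is_weekday).

    Hypothetical helper: the real CLI may parse the option differently.
    """
    dt = datetime.fromisoformat(value)  # accepts offsets such as -04:00
    return dt.date().isoformat(), dt.hour, dt.weekday() < 5


print(parse_query_datetime("2025-10-01T08:00:00-04:00"))
# ('2025-10-01', 8, True)
```

The hour is taken as written in the string (local wall-clock time), which matches the hourly granularity of the ridership data.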

Options for heatmap:

| Option | Required | Description | Example |
|---|---|---|---|
| `--line` | Yes | Subway line | "L", "4", "A" |
| `--dir` | Yes | Direction | N, S, Manhattan, Brooklyn |
| `--month` | Yes | Month in YYYY-MM format | "2025-10" |
| `--output` | Yes | Output PNG file path | "output.png" |
| `--config` | No | Path to config file | ./config/default.yaml |

Note: The heatmap command now interactively prompts you to select a station from all stops on the specified line, preventing typos and ensuring valid station names.
All data sources are public and non-personal.
2020-2024 Dataset:
- Dataset ID: `wujg-7c2s`
- Fields: `station_complex_id`, `date`, `hour`, `entries`, `exits`, `payment_type`
- URL: https://data.ny.gov/Transportation/MTA-Subway-Hourly-Ridership-2020-2024/wujg-7c2s
2025+ Dataset:
- Check catalog.data.gov for "MTA Subway Hourly Ridership: Beginning 2025"
- URL: https://catalog.data.gov/dataset/mta-subway-hourly-ridership-beginning-2025
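
For orientation, a request against the 2020-2024 hourly ridership dataset can be sketched with the standard library alone. This is not how `src/ingest.py` does it (the project lists sodapy as a dependency). The function name `build_ridership_request` is hypothetical, and any `$where` filter is left to the caller because the exact column names should be checked against the dataset page. The `/resource/<id>.json` endpoint and the `X-App-Token` header are standard Socrata SODA conventions.

```python
import os
import urllib.parse
import urllib.request

SOCRATA_DOMAIN = "data.ny.gov"
HOURLY_RIDERSHIP_ID = "wujg-7c2s"  # 2020-2024 hourly ridership dataset


def build_ridership_request(limit=1000, offset=0, where=None):
    """Build (but do not send) a paged SODA request.

    Hypothetical helper for illustration only.
    """
    params = {"$limit": limit, "$offset": offset}
    if where:
        params["$where"] = where  # SoQL filter supplied by the caller
    url = (
        f"https://{SOCRATA_DOMAIN}/resource/{HOURLY_RIDERSHIP_ID}.json?"
        + urllib.parse.urlencode(params)
    )
    headers = {}
    token = os.environ.get("SOCRATA_APP_TOKEN")
    if token:
        headers["X-App-Token"] = token  # raises the rate limit when set
    return urllib.request.Request(url, headers=headers)
```

Sending it is then a matter of `urllib.request.urlopen(request)` followed by `json.load` on the response. Without a token the request still works, subject to the lower anonymous rate limit.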
2024 Dataset:
- Dataset ID: `jsu2-fbtj`
- Fields: `year`, `month`, `day_of_week`, `hour`, `origin_station`, `destination_station`, `est_trips`
- URL: https://data.ny.gov/Transportation/MTA-Subway-Origin-Destination-Ridership-Estimate-2/jsu2-fbtj
- Documentation: https://www.mta.info/article/introducing-subway-origin-destination-ridership-dataset
- Source: MTA Developers
- Files: `routes.txt`, `trips.txt`, `stop_times.txt`, `stops.txt`, `calendar.txt`, `shapes.txt`
- URL: https://data.ny.gov/Transportation/MTA-General-Transit-Feed-Specification-GTFS-Static/fgm6-ccue
- Documentation: https://www.mta.info/developers
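
Trains per hour (TPH) can be derived from `stop_times.txt` by counting scheduled departures at a stop in each clock hour. The sketch below shows the idea on a tiny inline sample. It assumes the rows have already been filtered to one route, direction, and service day (the joins against `trips.txt` and `calendar.txt` are omitted), and the function name, stop IDs, and sample rows are illustrative rather than taken from `src/features.py`.

```python
import csv
import io
from collections import Counter


def trains_per_hour(stop_times_csv, stop_id):
    """Count scheduled departures per clock hour at one stop."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(stop_times_csv)):
        if row["stop_id"] != stop_id:
            continue
        # GTFS allows hours >= 24 for trips that run past midnight.
        hour = int(row["departure_time"].split(":")[0]) % 24
        counts[hour] += 1
    return dict(counts)


# Illustrative rows only; not real schedule data.
SAMPLE = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
t1,08:00:00,08:00:30,A24N,10
t2,08:04:00,08:04:30,A24N,10
t3,08:08:00,08:08:30,A24N,10
t4,25:10:00,25:10:30,A24N,10
t5,08:02:00,08:02:30,A25N,11
"""

print(trains_per_hour(SAMPLE, "A24N"))  # {8: 3, 1: 1}
```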
MTA Subway Guideline Revisions (July 14, 2025):
- R143/R160/R179 cars: ~42 seats/car (guideline 53 pax/car off-peak)
- R211 cars: ~30 seats/car (open gangway design)
- URL: https://www.mta.info/document/179601
Rolling Stock References:
- NYC Subway Rolling Stock: https://en.wikipedia.org/wiki/New_York_City_Subway_rolling_stock
- R160 Specifications: https://en.wikipedia.org/wiki/R160_(New_York_City_Subway_car)
1. INGEST
├─ Socrata: Hourly ridership + OD matrix
└─ GTFS: Routes, trips, stop_times, stops
2. FEATURES
├─ TPH (trains per hour) from GTFS schedules
├─ Seat supply = TPH × cars_per_train × seats_per_car
├─ Boarding demand (directional split via OD matrix)
└─ Train load ratio = boardings_per_train / seats_per_train
3. MODEL
├─ P(seat) = sigmoid(a × (b - train_load_ratio_adj))
├─ Peak crowding adjustment: train_load_ratio × (1 + k × headway_cv)
└─ Output: probability + intermediate metrics
4. OUTPUT
└─ CLI, JSON, or visualizations
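
The FEATURES step reduces to a few lines of arithmetic. The following sketch uses hypothetical function names and assumes the directional hourly boardings are already known. The hourly figure is back-computed from the sample interaction (425.3 boardings per train at 15 trains per hour) purely for illustration.

```python
def seat_supply_per_hour(tph, cars_per_train, seats_per_car):
    """Seats passing the station per hour in one direction."""
    return tph * cars_per_train * seats_per_car


def train_load_ratio(hourly_boardings, tph, cars_per_train, seats_per_car):
    """Boardings per train divided by seats per train."""
    if tph <= 0:
        raise ValueError("tph must be positive")
    boardings_per_train = hourly_boardings / tph
    seats_per_train = cars_per_train * seats_per_car
    return boardings_per_train / seats_per_train


supply = seat_supply_per_hour(tph=15, cars_per_train=10, seats_per_car=30)
ratio = train_load_ratio(6379.5, tph=15, cars_per_train=10, seats_per_car=30)
print(supply)           # 4500
print(round(ratio, 2))  # 1.42
```

A ratio above 1.0 means that boardings at this station alone exceed the seats on each train, before any riders already on board are counted.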
Logistic function:
P(seat) = 1 / (1 + exp(-a × (b - load_ratio_adj)))
Where:
- `a` = logistic slope (default: 6.0)
- `b` = threshold load ratio at P=0.5 (default: 1.0)
- `load_ratio_adj` = train load ratio with peak crowding adjustment
Peak crowding adjustment:
load_ratio_adj = (boardings_per_train / seats_per_train) × (1 + k × headway_cv)
Where:
- `k` = peak crowding factor (default: 0.15)
- `headway_cv` = coefficient of variation for headways (default: 0.25 in peaks)
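
Put together, the two formulas above fit in one small function. This is a sketch under the stated defaults, not a copy of `src/model.py`. The name `seat_probability` is an assumption, and the off-peak value of `headway_cv` is not given above, so the caller must supply it.

```python
import math


def seat_probability(load_ratio, a=6.0, b=1.0, k=0.15, headway_cv=0.25):
    """Logistic seat probability with the peak crowding adjustment."""
    # Uneven headways bunch riders onto some trains, which inflates
    # the effective load ratio.
    load_ratio_adj = load_ratio * (1.0 + k * headway_cv)
    return 1.0 / (1.0 + math.exp(-a * (b - load_ratio_adj)))


# At the threshold with perfectly even headways the result is exactly 0.5.
print(seat_probability(1.0, headway_cv=0.0))  # 0.5
```

The probability falls monotonically as the load ratio rises, and a larger `a` makes the transition around `b` sharper.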
All parameters are configurable in config/default.yaml:
rolling_stock:
  default_seats_per_car: 42
  line_overrides:
    cars_per_train:
      "7": 11
      "G": 5
    seats_per_car:
      "A": 30  # R211
      "C": 30  # R211
model:
  logistic_slope: 6.0
  logistic_threshold: 1.0
  peak_crowding_factor: 0.15
  default_headway_cv: 0.25

TrainSeat v1.0 is a DIRECTIONAL ESTIMATOR, not a ground-truth predictor.
- ✅ Compare relative crowding (8 AM vs 9 AM, Line A vs Line C)
- ✅ Identify high/low seat availability periods
- ✅ Provide guidance for commute planning
- ❌ Predict specific train arrivals (hourly aggregation only)
- ❌ Account for upstream alighting (main source of seats)
- ❌ Adjust for real-time delays/cancellations
- ❌ Provide calibrated probabilities (no ground truth validation)
Read TrainSeat_Explain_v2.txt for full technical details, honest assessment, and v2.0 roadmap.
TrainSeat/
├── src/
│ ├── __init__.py
│ ├── ingest.py # Socrata + GTFS data ingestion
│ ├── features.py # TPH, seat supply, demand splitting
│ ├── model.py # Logistic seat probability model
│ ├── api.py # Main prediction API
│ └── utils.py # Helpers (logging, mapping, formatting)
├── config/
│ └── default.yaml # Configuration (rolling stock, model params)
├── data_cache/
│ ├── raw/ # Cached Socrata data (Parquet)
│ ├── gtfs/ # GTFS static files
│ └── mappings/ # station_complex_map.csv
├── notebooks/
│ └── eda.ipynb # Exploratory data analysis & validation
├── cli.py # CLI interface
├── requirements.txt # Python dependencies
└── README.md # This file
- Historical demand is representative: Uses recent monthly data as proxy for future demand
- OD matrix directional split: When OD data is unavailable, falls back to 50/50 or terminal heuristics
- GTFS schedules are accurate: Actual service may vary due to delays, construction, etc.
- Uniform loading: Assumes riders board evenly across train cars (not considering end-car effects)
- Seated vs standing: Model only predicts seat availability, not overall crowding/capacity
- No mid-route alighting: Load ratio based on boardings at query station, not cumulative load
- Real-time data: Does not use GTFS-RT for actual train locations or delays
- Special events: Does not account for games, concerts, protests, etc.
- Weather: No weather-based demand adjustments
- Ridership trends: Uses historical averages; does not forecast long-term trends
- Station-specific issues: Cannot detect station closures, service changes, etc.
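
The 50/50 fallback in the directional split assumption can be illustrated as follows. The function names are hypothetical, and the terminal heuristics mentioned above are not modeled here because their details are not described in this README.

```python
def directional_share(od_trips_in_direction, od_trips_total, fallback=0.5):
    """Share of a station's boardings heading in the queried direction.

    Falls back to an even split when OD data is missing or empty.
    """
    if od_trips_in_direction is None or not od_trips_total:
        return fallback
    return od_trips_in_direction / od_trips_total


def directional_boardings(hourly_entries, share):
    """Split station entries into boardings for one direction."""
    return hourly_entries * share


print(directional_share(300, 400))      # 0.75
print(directional_share(None, None))    # 0.5
print(directional_boardings(1200, directional_share(300, 400)))  # 900.0
```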
All assumptions can be adjusted via config/default.yaml or by editing source files:
- Rolling stock specs (cars/train, seats/car)
- Model hyperparameters (logistic slope, threshold, peak factor)
- Demand smoothing (EWMA alpha)
- Fallback directional splits
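
As an illustration of how per-line rolling stock overrides might be resolved, the sketch below mirrors the rolling stock settings from `config/default.yaml` as a plain dictionary. It assumes `line_overrides` is nested under `rolling_stock`. The helper names and the `default_cars` value of 10 are assumptions, since no default for cars per train appears in the configuration excerpt.

```python
CONFIG = {
    "rolling_stock": {
        "default_seats_per_car": 42,
        "line_overrides": {
            "cars_per_train": {"7": 11, "G": 5},
            "seats_per_car": {"A": 30, "C": 30},
        },
    },
}


def seats_per_car(config, line):
    """Per-line seat count, falling back to the configured default."""
    rolling_stock = config["rolling_stock"]
    overrides = rolling_stock.get("line_overrides", {}).get("seats_per_car", {})
    return overrides.get(line, rolling_stock["default_seats_per_car"])


def cars_per_train(config, line, default_cars=10):
    """Per-line train length; default_cars is a hypothetical fallback."""
    overrides = config["rolling_stock"].get("line_overrides", {})
    return overrides.get("cars_per_train", {}).get(line, default_cars)


print(seats_per_car(CONFIG, "A"), seats_per_car(CONFIG, "4"))    # 30 42
print(cars_per_train(CONFIG, "G"), cars_per_train(CONFIG, "A"))  # 5 10
```

With PyYAML installed, the same dictionary could be obtained from the file with `yaml.safe_load`. Line keys are quoted strings in the YAML, so lookups should pass `"7"` rather than `7`.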
Run the Jupyter notebook for detailed analysis:
cd notebooks
jupyter notebook eda.ipynb

Notebook contents:
- Load and explore ridership data
- Analyze GTFS schedules (TPH by line/hour)
- Visualize seat probability patterns
- Sanity checks (overnight high, AM peak low)
- Sensitivity analysis on model parameters
- Load ratio vs probability curves
pytest tests/

- Modularity: Each component (ingest, features, model) is independent
- Caching: All expensive API calls cache to `data_cache/`
- Logging: Configurable via `--log-level` (DEBUG, INFO, WARNING, ERROR)
- Error handling: Graceful fallbacks for missing data
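
The caching idea can be sketched as a small read-through helper. This is an illustration only. The project caches Socrata data as Parquet, whereas this example writes JSON to stay within the standard library, and `cached_fetch` and `fake_fetch` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def cached_fetch(cache_path, fetch):
    """Return cached data if present; otherwise call fetch() and store it."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data


calls = []


def fake_fetch():
    """Stand-in for an expensive API call."""
    calls.append(1)
    return [{"hour": 8, "entries": 120}]


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw" / "ridership.json"
    first = cached_fetch(target, fake_fetch)
    second = cached_fetch(target, fake_fetch)  # served from disk

print(len(calls))  # 1
```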
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-model`)
- Commit your changes (`git commit -am 'Add new model'`)
- Push to the branch (`git push origin feature/new-model`)
- Create a Pull Request
1. Socrata rate limits
Error: 429 Too Many Requests
Solution: Register for a free app token and set the SOCRATA_APP_TOKEN environment variable.
2. Missing station mapping
Warning: No mapping found for station: XYZ
Solution: Add manual override to data_cache/mappings/station_complex_map.csv.
3. GTFS download fails
Error: HTTPSConnectionPool
Solution: Check network connection; GTFS endpoint may be temporarily down. Retry or use cached data.
4. No ridership data for date
Warning: No ridership data for station X at hour Y
Solution: Fetch a broader date range with fetch-ridership, or check that the station name matches the one used in the Socrata dataset.
MIT License - see LICENSE file for details.
Data Sources:
- MTA Open Data. (2024). MTA Subway Hourly Ridership: 2020-2024. Retrieved from https://data.ny.gov/Transportation/MTA-Subway-Hourly-Ridership-2020-2024/wujg-7c2s
- MTA Open Data. (2024). MTA Subway Origin-Destination Ridership Estimate: 2024. Retrieved from https://data.ny.gov/Transportation/MTA-Subway-Origin-Destination-Ridership-Estimate-2/jsu2-fbtj
- MTA Developers. (2025). GTFS Static Data. Retrieved from https://www.mta.info/developers
- MTA. (2025). Subway Guideline Revisions. Retrieved from https://www.mta.info/document/179601
Model References:
- Queueing theory for transit capacity analysis
- MTA published load guidelines (53 pax/car off-peak, 200+ peak)
For questions, issues, or contributions, please open an issue on GitHub.
Built with: Python, pandas, numpy, sodapy, matplotlib, seaborn, pyyaml, click
Last updated: 2025-09-26