MTA Subway Seat Availability Predictor

A reproducible data pipeline and CLI tool that predicts the probability of finding a seat on NYC subway trains at specific stations, lines, directions, and times.

Overview

Given a query (station, line, direction, date, hour), this tool returns:

P(seat_available_on_boarding) = probability that a boarding rider can immediately find a seat.

The model combines:

Hourly ridership data (MTA Socrata API)
Origin-destination flows (OD matrix)
GTFS schedules (trains per hour, route patterns)
Rolling stock specifications (cars per train, seats per car)
Logistic seat availability model (queueing-inspired, adjusted for peak crowding)

Features

Data Ingestion: Automated pulls from NYC Open Data (Socrata API) and GTFS
Feature Engineering: TPH calculation, directional demand splitting, seat supply modeling
Probabilistic Model: Logistic curve with peak crowding adjustments
CLI Interface: Simple commands for predictions and visualizations
EDA Notebook: Exploratory analysis and sanity checks
Configurable: YAML-based configuration for all parameters

Installation

Prerequisites

Python 3.8+
pip

Setup

git clone <your-repo-url>
cd TrainSeat

pip install -r requirements.txt

Optional: Socrata App Token

For higher API rate limits, register for a free Socrata app token:

Visit https://data.ny.gov/
Create an account and request an app token
Set environment variable:

export SOCRATA_APP_TOKEN="your_token_here"
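A minimal sketch of how a script might pick up this token, assuming plain HTTP access to Socrata rather than the project's own ingestion code. Socrata accepts the token in the X-App-Token request header; the function name socrata_headers is hypothetical.

```python
import os

def socrata_headers(env=None):
    """Return HTTP headers for a Socrata request, adding the app token if set."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/json"}
    token = env.get("SOCRATA_APP_TOKEN", "").strip()
    if token:
        # Socrata reads the app token from the X-App-Token header.
        headers["X-App-Token"] = token
    return headers

print(socrata_headers({"SOCRATA_APP_TOKEN": "your_token_here"}))
```

Without the variable set, the function returns headers with no token, which matches the unauthenticated (lower rate limit) case.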

Quick Start

🚇 Interactive Mode (Recommended)

Run the interactive program with ASCII art and step-by-step prompts:

Windows (PowerShell):

.\run_train_seat.bat

Windows (CMD):

run_train_seat.bat

Mac/Linux:

./run_train_seat.sh

Or directly:

python train_seat.py

You'll be guided through:

🎨 NYC train station ASCII art welcome
🚇 Select your subway line (1-7, A-Z, etc.)
🧭 Choose direction (Northbound/Southbound/Manhattan-bound/etc.)
🚉 Pick your station from a numbered list
📅 Enter date and time
🎯 Get detailed seat probability with interpretation!

Sample interaction:

╔═══════════════════════════════════════════════════════════════════════════╗
║   ███╗   ██╗██╗   ██╗ ██████╗    ███████╗██╗   ██╗██████╗ ██╗    ██╗   ║
║   ████╗  ██║╚██╗ ██╔╝██╔════╝    ██╔════╝██║   ██║██╔══██╗██║    ██║   ║
║   ██╔██╗ ██║ ╚████╔╝ ██║         ███████╗██║   ██║██████╔╝██║ █╗ ██║   ║
║              🚇  SEAT AVAILABILITY PREDICTOR  🚇                          ║
╚═══════════════════════════════════════════════════════════════════════════╝

🚇  SELECT YOUR SUBWAY LINE
==============================================================================
 1. [  1]     2. [  2]     3. [  3]
 4. [  4]     5. [  5]     6. [  6]
 7. [  7]     8. [  A]     9. [  B]
...

Select line number (1-24): 8

🧭  SELECT DIRECTION
1. Northbound / Uptown
2. Southbound / Downtown
...

Select direction (1-7): 1

🚉  SELECT YOUR STATION
==============================================================================
 1. Inwood - 207 St
 2. Dyckman St
 3. 190 St
 4. 181 St
...
20. Chambers St

Select station number (1-20): 10

📅  SELECT DATE AND TIME
Year (e.g., 2025): 2025
Month (1-12): 10
Day (1-31): 15
Hour (0-23): 8

🔄 Calculating seat availability...

🎯  SEAT AVAILABILITY PREDICTION
==============================================================================

🔴  PROBABILITY: 18.5% (LOW)

📊  DETAILED METRICS
──────────────────────────────────────────────────────────────────────────────
🚆  TRAIN INFORMATION:
    • Station: 59 St - Columbus Circle
    • Line: A
    • Direction: N
    • Date/Time: 2025-10-15T08:00:00

📈  CAPACITY METRICS:
    • Trains per hour: 15.0
    • Cars per train: 10
    • Seats per car: 30
    • Total seats per train: 300

👥  DEMAND METRICS:
    • Expected boardings per train: 425.3
    • Train load ratio: 1.42
    • Hourly load ratio: 2.13
    • Peak period: Yes ⚡

──────────────────────────────────────────────────────────────────────────────
💡  INTERPRETATION
──────────────────────────────────────────────────────────────────────────────
    ⚠ Low chance of finding a seat
    ⚠ Consider alternate route or time

    ⚠️  NOTICE: Demand exceeds seat capacity
    💡 Most passengers will be standing

==============================================================================

🔄 Run another prediction? (y/n):

🔧 Advanced: CLI Mode

For automation and scripting, use the CLI tools:

1. Fetch Data

Download GTFS and ridership data:

python cli.py fetch-gtfs

python cli.py fetch-ridership --start-date 2024-09-01 --end-date 2024-09-30
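For orientation, the request behind fetch-ridership might look roughly like the sketch below. It builds a SODA query URL for the 2020-2024 hourly ridership dataset (ID wujg-7c2s, listed under Data Sources) using the standard SoQL parameters $where, $limit, and $offset. The function name, the page size, and the timestamp column name are assumptions and are not taken from the project's code; check the dataset's API documentation for the actual column names.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.ny.gov/resource/wujg-7c2s.json"

def ridership_query_url(start_date, end_date,
                        timestamp_field="transit_timestamp",  # assumed column name
                        limit=50000, offset=0):
    """Build a SODA URL for rows between start_date and end_date (inclusive)."""
    where = (f"{timestamp_field} between '{start_date}T00:00:00' "
             f"and '{end_date}T23:59:59'")
    # $limit and $offset allow paging through large result sets.
    params = {"$where": where, "$limit": limit, "$offset": offset}
    return f"{BASE_URL}?{urlencode(params)}"

print(ridership_query_url("2024-09-01", "2024-09-30"))
```

Paging would repeat the call with offset increased by limit until a page comes back with fewer rows than requested.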

2. Single Prediction

Predict seat availability for a specific query:

python cli.py prob \
  --station "14 St-Union Sq" \
  --line "4" \
  --dir "N" \
  --datetime "2025-10-01T08:00:00-04:00"
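The --datetime value is an ISO 8601 timestamp with a UTC offset. As an illustration of how such a value breaks down into the date and hour that a query uses, here is a small sketch with the standard library; the function name is hypothetical and the project's own parsing may differ.

```python
from datetime import datetime

def split_query_datetime(value):
    """Split an ISO 8601 string into (date, hour, weekday), with Monday = 0."""
    dt = datetime.fromisoformat(value)
    return dt.date().isoformat(), dt.hour, dt.weekday()

print(split_query_datetime("2025-10-01T08:00:00-04:00"))  # ('2025-10-01', 8, 2)
```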

3. Generate Heatmap

Visualize seat availability across all hours of a month. The CLI will show you all stops on the line and let you pick:

python cli.py heatmap \
  --line "L" \
  --dir "Manhattan" \
  --month "2025-10" \
  --output "bedford_oct.png"

Interactive prompt:

Found 24 stops on line L (Manhattan):
============================================================
 1. 8 Av
 2. 6 Av
 3. Union Sq - 14 St
 4. 3 Av
 5. 1 Av
 6. Bedford Av
...

Select stop number: 6
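One plausible layout for such a heatmap is a day-by-hour grid. The sketch below enumerates the (date, hour) cells for a --month value; the grid shape and the function name are assumptions, since the plotting code itself is not shown here.

```python
import calendar
from datetime import date

def month_hour_cells(month):
    """List every (date, hour) pair in a month given as 'YYYY-MM'."""
    year, mon = (int(part) for part in month.split("-"))
    days_in_month = calendar.monthrange(year, mon)[1]
    return [(date(year, mon, day), hour)
            for day in range(1, days_in_month + 1)
            for hour in range(24)]

cells = month_hour_cells("2025-10")
print(len(cells))  # 31 days x 24 hours = 744 cells
```

Each cell would then be filled with one seat probability for the selected stop, line, and direction.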

4. JSON Output

Get structured output for programmatic use:

python cli.py prob \
  --station "Times Sq-42 St" \
  --line "1" \
  --dir "S" \
  --datetime "2025-10-01T18:00:00-04:00" \
  --format json
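For scripting, the same call can be driven from Python. The sketch below only assembles the documented flags into an argument list and shows where json.loads would consume the output. The JSON field names are not documented here, so it makes no assumption about them; it does assume that the command writes nothing but JSON to standard output. The helper names are hypothetical.

```python
import json
import subprocess
import sys

def build_prob_command(station, line, direction, when):
    """Assemble the argument list for `cli.py prob` with JSON output."""
    return [sys.executable, "cli.py", "prob",
            "--station", station, "--line", line, "--dir", direction,
            "--datetime", when, "--format", "json"]

def run_prob(station, line, direction, when):
    """Run the CLI from the repository root and decode its JSON output."""
    result = subprocess.run(build_prob_command(station, line, direction, when),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

cmd = build_prob_command("Times Sq-42 St", "1", "S", "2025-10-01T18:00:00-04:00")
print(" ".join(cmd[1:]))
```

Passing the arguments as a list avoids shell quoting problems with station names that contain spaces.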

CLI Reference

Commands

| Command | Description |
| --- | --- |
| `prob` | Predict seat probability for a single query |
| `heatmap` | Generate hourly heatmap for a month |
| `fetch-ridership` | Download ridership data from Socrata |
| `fetch-gtfs` | Download GTFS static data |

Options for `prob`

| Option | Required | Description | Example |
| --- | --- | --- | --- |
| `--station` | Yes | Station name or complex | `"14 St-Union Sq"` |
| `--line` | Yes | Subway line | `"4"`, `"L"`, `"A"` |
| `--dir` | Yes | Direction | `N`, `S`, `E`, `W`, `Manhattan`, `Brooklyn` |
| `--datetime` | Yes | ISO datetime | `"2025-10-01T08:00:00-04:00"` |
| `--format` | No | Output format (`text` or `json`) | `json` |
| `--config` | No | Path to config file | `./config/default.yaml` |

Options for `heatmap`

| Option | Required | Description | Example |
| --- | --- | --- | --- |
| `--line` | Yes | Subway line | `"L"`, `"4"`, `"A"` |
| `--dir` | Yes | Direction | `N`, `S`, `Manhattan`, `Brooklyn` |
| `--month` | Yes | Month in YYYY-MM format | `"2025-10"` |
| `--output` | Yes | Output PNG file path | `"output.png"` |
| `--config` | No | Path to config file | `./config/default.yaml` |

Note: The heatmap command takes no --station option. It prompts you to select a station from the list of stops on the specified line, which prevents typos and guarantees a valid station name.

Data Sources

All data sources are public and non-personal.

1. MTA Subway Hourly Ridership

2020-2024 Dataset:

Dataset ID: wujg-7c2s
Fields: station_complex_id, date, hour, entries, exits, payment_type
URL: https://data.ny.gov/Transportation/MTA-Subway-Hourly-Ridership-2020-2024/wujg-7c2s

2025+ Dataset:

Check catalog.data.gov for "MTA Subway Hourly Ridership: Beginning 2025"
URL: https://catalog.data.gov/dataset/mta-subway-hourly-ridership-beginning-2025

2. MTA Subway Origin-Destination Ridership

2024 Dataset:

Dataset ID: jsu2-fbtj
Fields: year, month, day_of_week, hour, origin_station, destination_station, est_trips
URL: https://data.ny.gov/Transportation/MTA-Subway-Origin-Destination-Ridership-Estimate-2/jsu2-fbtj
Documentation: https://www.mta.info/article/introducing-subway-origin-destination-ridership-dataset

3. GTFS Static (Schedules & Routes)

Source: MTA Developers
Files: routes.txt, trips.txt, stop_times.txt, stops.txt, calendar.txt, shapes.txt
URL: https://data.ny.gov/Transportation/MTA-General-Transit-Feed-Specification-GTFS-Static/fgm6-ccue
Documentation: https://www.mta.info/developers
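Trains per hour (TPH) can be read off stop_times.txt by counting scheduled departures at a stop in each hour. The sketch below does this with the standard library on an in-memory sample. stop_id, trip_id, and departure_time are standard GTFS fields, while the sample rows and the function name are made up for illustration. It ignores service calendars (calendar.txt), which a real calculation would need in order to count only trips running on the query date.

```python
import csv
import io
from collections import Counter

def trains_per_hour(stop_times_file, stop_id):
    """Count scheduled departures per clock hour at one stop."""
    counts = Counter()
    for row in csv.DictReader(stop_times_file):
        if row["stop_id"] != stop_id:
            continue
        # GTFS times may exceed 24:00:00 for trips running past midnight.
        hour = int(row["departure_time"].split(":")[0]) % 24
        counts[hour] += 1
    return dict(counts)

sample = io.StringIO(
    "trip_id,stop_id,departure_time\n"
    "t1,A24N,08:02:00\n"
    "t2,A24N,08:10:30\n"
    "t3,A24N,09:01:00\n"
    "t4,A24S,08:05:00\n"
    "t5,A24N,24:15:00\n"
)
print(trains_per_hour(sample, "A24N"))  # {8: 2, 9: 1, 0: 1}
```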

4. Seating Capacity References

MTA Subway Guideline Revisions (July 14, 2025):

R143/R160/R179 cars: ~42 seats/car (guideline 53 pax/car off-peak)
R211 cars: ~30 seats/car (open gangway design)
URL: https://www.mta.info/document/179601

Rolling Stock References:

NYC Subway Rolling Stock: https://en.wikipedia.org/wiki/New_York_City_Subway_rolling_stock
R160 Specifications: https://en.wikipedia.org/wiki/R160_(New_York_City_Subway_car)

Model Architecture

Pipeline Flow

1. INGEST
   ├─ Socrata: Hourly ridership + OD matrix
   └─ GTFS: Routes, trips, stop_times, stops

2. FEATURES
   ├─ TPH (trains per hour) from GTFS schedules
   ├─ Seat supply = TPH × cars_per_train × seats_per_car
   ├─ Boarding demand (directional split via OD matrix)
   └─ Train load ratio = boardings_per_train / seats_per_train

3. MODEL
   ├─ P(seat) = sigmoid(a × (b - train_load_ratio_adj))
   ├─ Peak crowding adjustment: train_load_ratio × (1 + k × headway_cv)
   └─ Output: probability + intermediate metrics

4. OUTPUT
   └─ CLI, JSON, or visualizations
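The FEATURES step is plain arithmetic. A minimal sketch using the capacity figures from the sample interaction (15 trains per hour, 10 cars, 30 seats per car); the hourly boardings figure and the function names are illustrative assumptions, and boardings are assumed to spread evenly across the trains in the hour.

```python
def seat_supply(tph, cars_per_train, seats_per_car):
    """Seats offered per hour in one direction."""
    return tph * cars_per_train * seats_per_car

def train_load_ratio(hourly_boardings, tph, cars_per_train, seats_per_car):
    """Boardings per train divided by seats per train."""
    boardings_per_train = hourly_boardings / tph
    seats_per_train = cars_per_train * seats_per_car
    return boardings_per_train / seats_per_train

print(seat_supply(15, 10, 30))             # 4500 seats per hour
print(train_load_ratio(4500, 15, 10, 30))  # 1.0: one boarding rider per seat
```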

Seat Probability Model

Logistic function:

P(seat) = 1 / (1 + exp(-a × (b - load_ratio_adj)))

Where:

a = logistic slope (default: 6.0)
b = threshold load ratio at P=0.5 (default: 1.0)
load_ratio_adj = train load ratio with peak crowding adjustment

Peak crowding adjustment:

load_ratio_adj = (boardings_per_train / seats_per_train) × (1 + k × headway_cv)

Where:

k = peak crowding factor (default: 0.15)
headway_cv = coefficient of variation for headways (default: 0.25 in peaks)
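Put together, the two formulas above translate into a few lines of Python. This is a sketch with the documented defaults, not the contents of src/model.py; the function names are hypothetical.

```python
import math

def adjusted_load_ratio(boardings_per_train, seats_per_train, k=0.15, headway_cv=0.25):
    """Train load ratio scaled by the peak crowding adjustment."""
    return (boardings_per_train / seats_per_train) * (1 + k * headway_cv)

def seat_probability(load_ratio_adj, a=6.0, b=1.0):
    """Logistic seat probability: 0.5 at load_ratio_adj == b, falling as load rises."""
    return 1.0 / (1.0 + math.exp(-a * (b - load_ratio_adj)))

print(seat_probability(1.0))  # 0.5 at the threshold load ratio
print(round(seat_probability(adjusted_load_ratio(150, 300)), 3))
```

A larger slope a makes the drop around the threshold b steeper; with the default k and headway_cv, the adjustment raises the load ratio by 3.75%.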

Configuration

All parameters are configurable in config/default.yaml:

rolling_stock:
  default_seats_per_car: 42
  line_overrides:
    cars_per_train:
      "7": 11
      "G": 5
    seats_per_car:
      "A": 30  # R211
      "C": 30  # R211

model:
  logistic_slope: 6.0
  logistic_threshold: 1.0
  peak_crowding_factor: 0.15
  default_headway_cv: 0.25
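Resolving per-line values from this configuration could look like the sketch below. The dictionary mirrors the rolling_stock section above as it would appear after loading the YAML (for example with yaml.safe_load). The lookup helper and the fallback of 10 cars per train are assumptions, since the excerpt does not show a default for cars per train.

```python
CONFIG = {
    "rolling_stock": {
        "default_seats_per_car": 42,
        "line_overrides": {
            "cars_per_train": {"7": 11, "G": 5},
            "seats_per_car": {"A": 30, "C": 30},
        },
    },
}

def seats_per_train(line, config=CONFIG, default_cars_per_train=10):
    """Seats on one train of the given line, applying any per-line overrides."""
    stock = config["rolling_stock"]
    overrides = stock.get("line_overrides", {})
    # default_cars_per_train is an assumed fallback, not a documented setting.
    cars = overrides.get("cars_per_train", {}).get(line, default_cars_per_train)
    seats = overrides.get("seats_per_car", {}).get(line, stock["default_seats_per_car"])
    return cars * seats

print(seats_per_train("A"))  # 10 cars x 30 seats = 300
print(seats_per_train("G"))  # 5 cars x 42 seats = 210
```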

⚠️ Important: Model Limitations

TrainSeat v1.0 is a DIRECTIONAL ESTIMATOR, not a ground-truth predictor.

What it CAN do:

✅ Compare relative crowding (8 AM vs 9 AM, Line A vs Line C)
✅ Identify high/low seat availability periods
✅ Provide guidance for commute planning

What it CANNOT do:

❌ Predict specific train arrivals (hourly aggregation only)
❌ Account for upstream alighting (main source of seats)
❌ Adjust for real-time delays/cancellations
❌ Provide calibrated probabilities (no ground truth validation)

Read TrainSeat_Explain_v2.txt for full technical details, honest assessment, and v2.0 roadmap.

Project Structure

TrainSeat/
├── src/
│   ├── __init__.py
│   ├── ingest.py          # Socrata + GTFS data ingestion
│   ├── features.py         # TPH, seat supply, demand splitting
│   ├── model.py            # Logistic seat probability model
│   ├── api.py              # Main prediction API
│   └── utils.py            # Helpers (logging, mapping, formatting)
├── config/
│   └── default.yaml        # Configuration (rolling stock, model params)
├── data_cache/
│   ├── raw/                # Cached Socrata data (Parquet)
│   ├── gtfs/               # GTFS static files
│   └── mappings/           # station_complex_map.csv
├── notebooks/
│   └── eda.ipynb           # Exploratory data analysis & validation
├── cli.py                  # CLI interface
├── requirements.txt        # Python dependencies
└── README.md               # This file

Assumptions & Limitations

Assumptions

Historical demand is representative: Uses recent monthly data as proxy for future demand
OD matrix directional split: When OD data is unavailable, falls back to 50/50 or terminal heuristics
GTFS schedules are accurate: Actual service may vary due to delays, construction, etc.
Uniform loading: Assumes riders board evenly across train cars (not considering end-car effects)
Seated vs standing: Model only predicts seat availability, not overall crowding/capacity
No mid-route alighting: Load ratio is based only on boardings at the query station; riders already on board and riders alighting along the route are not counted
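The directional split with its 50/50 fallback can be sketched as follows. How the project decides which destinations lie in the travel direction is not described here, so the set of downstream stations is passed in explicitly; all names and figures are hypothetical.

```python
def directional_share(od_trips, downstream_stations, fallback=0.5):
    """Fraction of trips from a station that head in the query direction.

    od_trips maps destination station to estimated trips; when no OD data
    is available, the fallback split is returned instead.
    """
    total = sum(od_trips.values())
    if total <= 0:
        return fallback
    toward = sum(trips for dest, trips in od_trips.items()
                 if dest in downstream_stations)
    return toward / total

od = {"125 St": 120.0, "Canal St": 60.0, "Chambers St": 20.0}
print(directional_share(od, {"125 St"}))  # 0.6
print(directional_share({}, {"125 St"}))  # 0.5 (fallback)
```

Multiplying station boardings for the hour by this share gives the directional boarding demand used in the load ratio.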

Limitations

Real-time data: Does not use GTFS-RT for actual train locations or delays
Special events: Does not account for games, concerts, protests, etc.
Weather: No weather-based demand adjustments
Ridership trends: Uses historical averages; does not forecast long-term trends
Station-specific issues: Cannot detect station closures, service changes, etc.

Configurable Parameters

All assumptions can be adjusted via config/default.yaml or by editing source files:

Rolling stock specs (cars/train, seats/car)
Model hyperparameters (logistic slope, threshold, peak factor)
Demand smoothing (EWMA alpha)
Fallback directional splits
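For reference, EWMA smoothing blends each new observation with the running average: s_t = alpha * x_t + (1 - alpha) * s_(t-1). A small sketch; the alpha value and the ridership figures are illustrative, not the project's defaults.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed

print(ewma([100, 200, 200], alpha=0.5))  # [100.0, 150.0, 175.0]
```

A higher alpha tracks recent ridership more closely; a lower alpha gives a smoother, slower-moving estimate.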

EDA & Validation

Run the Jupyter notebook for detailed analysis:

cd notebooks
jupyter notebook eda.ipynb

Notebook contents:

Load and explore ridership data
Analyze GTFS schedules (TPH by line/hour)
Visualize seat probability patterns
Sanity checks (overnight high, AM peak low)
Sensitivity analysis on model parameters
Load ratio vs probability curves

Development

Running Tests (Future)

pytest tests/

Code Structure

Modularity: Each component (ingest, features, model) is independent
Caching: All expensive API calls cache to data_cache/
Logging: Configurable via --log-level (DEBUG, INFO, WARNING, ERROR)
Error handling: Graceful fallbacks for missing data

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/new-model)
Commit your changes (git commit -am 'Add new model')
Push to the branch (git push origin feature/new-model)
Create a Pull Request

Troubleshooting

Common Issues

1. Socrata rate limits

Error: 429 Too Many Requests

Solution: Register for a free app token and set the SOCRATA_APP_TOKEN environment variable.

2. Missing station mapping

Warning: No mapping found for station: XYZ

Solution: Add a manual override to data_cache/mappings/station_complex_map.csv.
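A lookup against this file might be sketched as below. The column names (station_name, station_complex_id) and the sample row are hypothetical; match them to the header of your actual station_complex_map.csv.

```python
import csv
import io

def load_station_map(csv_file):
    """Map lower-cased station names to complex IDs (hypothetical columns)."""
    return {row["station_name"].strip().lower(): row["station_complex_id"].strip()
            for row in csv.DictReader(csv_file)}

# In practice: open("data_cache/mappings/station_complex_map.csv", newline="")
sample = io.StringIO("station_name,station_complex_id\nExample St,999\n")
mapping = load_station_map(sample)
print(mapping.get("example st"))       # 999
print(mapping.get("unknown station"))  # None
```

Normalizing case and whitespace on both sides of the lookup reduces the number of manual overrides needed.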

3. GTFS download fails

Error: HTTPSConnectionPool

Solution: Check network connection; GTFS endpoint may be temporarily down. Retry or use cached data.

4. No ridership data for date

Warning: No ridership data for station X at hour Y

Solution: Fetch a broader date range with fetch-ridership, or check that the station name matches the name used in the Socrata dataset.

License

MIT License - see LICENSE file for details.

Citations

Data Sources:

MTA Open Data. (2024). MTA Subway Hourly Ridership: 2020-2024. Retrieved from https://data.ny.gov/Transportation/MTA-Subway-Hourly-Ridership-2020-2024/wujg-7c2s
MTA Open Data. (2024). MTA Subway Origin-Destination Ridership Estimate: 2024. Retrieved from https://data.ny.gov/Transportation/MTA-Subway-Origin-Destination-Ridership-Estimate-2/jsu2-fbtj
MTA Developers. (2025). GTFS Static Data. Retrieved from https://www.mta.info/developers
MTA. (2025). Subway Guideline Revisions. Retrieved from https://www.mta.info/document/179601

Model References:

Queueing theory for transit capacity analysis
MTA published load guidelines (53 pax/car off-peak, 200+ peak)

Contact

For questions, issues, or contributions, please open an issue on GitHub.

Built with: Python, pandas, numpy, sodapy, matplotlib, seaborn, pyyaml, click

Last updated: 2025-09-26
