MichalKononenko2/TrainSeat


MTA Subway Seat Availability Predictor

A reproducible data pipeline and CLI tool that predicts the probability of finding a seat on NYC subway trains at specific stations, lines, directions, and times.

Overview

Given a query (station, line, direction, date, hour), this tool returns:

P(seat_available_on_boarding) = probability that a boarding rider can immediately find a seat.

The model combines:

  • Hourly ridership data (MTA Socrata API)
  • Origin-destination flows (OD matrix)
  • GTFS schedules (trains per hour, route patterns)
  • Rolling stock specifications (cars per train, seats per car)
  • Logistic seat availability model (queueing-inspired, with a peak crowding adjustment)

Features

  • Data Ingestion: Automated pulls from New York State Open Data (data.ny.gov, Socrata API) and GTFS
  • Feature Engineering: TPH calculation, directional demand splitting, seat supply modeling
  • Probabilistic Model: Logistic curve with peak crowding adjustments
  • CLI Interface: Simple commands for predictions and visualizations
  • EDA Notebook: Exploratory analysis, sanity checks, and sensitivity analysis
  • Configurable: YAML-based configuration for all parameters
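
As a rough illustration of the TPH calculation listed above, the sketch below counts scheduled departures per clock hour at one stop from GTFS-style departure_time strings. The function name and the sample times are hypothetical and are not taken from src/features.py. The only GTFS behavior relied on is that times may run past 24:00:00 for trips that continue after midnight of the service day.

```python
from collections import Counter


def trains_per_hour(departure_times):
    """Count departures per clock hour from GTFS-style 'HH:MM:SS' strings."""
    counts = Counter()
    for t in departure_times:
        # GTFS allows hours >= 24 for trips that run past midnight
        hour = int(t.split(":")[0]) % 24
        counts[hour] += 1
    return dict(counts)


# Hypothetical departures at one stop in one direction
sample = ["08:02:00", "08:06:30", "08:10:00", "09:01:00", "24:15:00"]
tph = trains_per_hour(sample)
print(tph)
```

A real pipeline would first filter stop_times to the service day, route, and direction of interest before counting.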

Installation

Prerequisites

  • Python 3.8+
  • pip

Setup

git clone <your-repo-url>
cd TrainSeat

pip install -r requirements.txt

Optional: Socrata App Token

For higher API rate limits, register for a free Socrata app token:

  1. Visit https://data.ny.gov/
  2. Create an account and request an app token
  3. Set environment variable:
export SOCRATA_APP_TOKEN="your_token_here"
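
For readers curious how the token is typically used, here is a minimal sketch that reads SOCRATA_APP_TOKEN from the environment and attaches it to a Socrata request through the X-App-Token header. It only builds the request and does not send it. The dataset ID is a placeholder and the helper name is hypothetical; the project's own ingestion code may work differently (for example through sodapy).

```python
import os
import urllib.parse
import urllib.request

DATASET_ID = "xxxx-xxxx"  # placeholder: substitute the real Socrata dataset ID


def build_request(dataset_id, limit=5, token=None):
    """Build (but do not send) a Socrata request, adding the token if present."""
    query = urllib.parse.urlencode({"$limit": limit})
    url = f"https://data.ny.gov/resource/{dataset_id}.json?{query}"
    headers = {"X-App-Token": token} if token else {}
    return urllib.request.Request(url, headers=headers)


req = build_request(DATASET_ID, token=os.environ.get("SOCRATA_APP_TOKEN"))
print(req.full_url)
```

Without a token the request is still valid, but it is subject to the lower anonymous rate limit.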

Quick Start

🚇 Interactive Mode (Recommended)

Run the interactive program with ASCII art and step-by-step prompts:

Windows (PowerShell):

.\run_train_seat.bat

Windows (CMD):

run_train_seat.bat

Mac/Linux:

./run_train_seat.sh

Or directly:

python train_seat.py

You'll be guided through:

  1. 🎨 NYC train station ASCII art welcome
  2. 🚇 Select your subway line (1-7, A-Z, etc.)
  3. 🧭 Choose direction (Northbound/Southbound/Manhattan-bound/etc.)
  4. 🚉 Pick your station from a numbered list
  5. 📅 Enter date and time
  6. 🎯 Get detailed seat probability with interpretation!

Sample interaction:

╔═══════════════════════════════════════════════════════════════════════════╗
║   ███╗   ██╗██╗   ██╗ ██████╗    ███████╗██╗   ██╗██████╗ ██╗    ██╗   ║
║   ████╗  ██║╚██╗ ██╔╝██╔════╝    ██╔════╝██║   ██║██╔══██╗██║    ██║   ║
║   ██╔██╗ ██║ ╚████╔╝ ██║         ███████╗██║   ██║██████╔╝██║ █╗ ██║   ║
║              🚇  SEAT AVAILABILITY PREDICTOR  🚇                          ║
╚═══════════════════════════════════════════════════════════════════════════╝

🚇  SELECT YOUR SUBWAY LINE
==============================================================================
 1. [  1]     2. [  2]     3. [  3]
 4. [  4]     5. [  5]     6. [  6]
 7. [  7]     8. [  A]     9. [  B]
...

Select line number (1-24): 8

🧭  SELECT DIRECTION
1. Northbound / Uptown
2. Southbound / Downtown
...

Select direction (1-7): 1

🚉  SELECT YOUR STATION
==============================================================================
 1. Inwood - 207 St
 2. Dyckman St
 3. 190 St
 4. 181 St
...
20. Chambers St

Select station number (1-20): 10

📅  SELECT DATE AND TIME
Year (e.g., 2025): 2025
Month (1-12): 10
Day (1-31): 15
Hour (0-23): 8

🔄 Calculating seat availability...

🎯  SEAT AVAILABILITY PREDICTION
==============================================================================

🔴  PROBABILITY: 18.5% (LOW)

📊  DETAILED METRICS
──────────────────────────────────────────────────────────────────────────────
🚆  TRAIN INFORMATION:
    • Station: 59 St - Columbus Circle
    • Line: A
    • Direction: N
    • Date/Time: 2025-10-15T08:00:00

📈  CAPACITY METRICS:
    • Trains per hour: 15.0
    • Cars per train: 10
    • Seats per car: 30
    • Total seats per train: 300

👥  DEMAND METRICS:
    • Expected boardings per train: 425.3
    • Train load ratio: 1.42
    • Hourly load ratio: 2.13
    • Peak period: Yes ⚡

──────────────────────────────────────────────────────────────────────────────
💡  INTERPRETATION
──────────────────────────────────────────────────────────────────────────────
    ⚠ Low chance of finding a seat
    ⚠ Consider alternate route or time

    ⚠️  NOTICE: Demand exceeds seat capacity
    💡 Most passengers will be standing

==============================================================================

🔄 Run another prediction? (y/n):

🔧 Advanced: CLI Mode

For automation and scripting, use the CLI tools:

1. Fetch Data

Download GTFS and ridership data:

python cli.py fetch-gtfs

python cli.py fetch-ridership --start-date 2024-09-01 --end-date 2024-09-30

2. Single Prediction

Predict seat availability for a specific query:

python cli.py prob \
  --station "14 St-Union Sq" \
  --line "4" \
  --dir "N" \
  --datetime "2025-10-01T08:00:00-04:00"

3. Generate Heatmap

Visualize seat availability across all hours of a month. The CLI will show you all stops on the line and let you pick:

python cli.py heatmap \
  --line "L" \
  --dir "Manhattan" \
  --month "2025-10" \
  --output "bedford_oct.png"

Interactive prompt:

Found 24 stops on line L (Manhattan):
============================================================
 1. 8 Av
 2. 6 Av
 3. Union Sq - 14 St
 4. 3 Av
 5. 1 Av
 6. Bedford Av
...

Select stop number: 6
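
Conceptually, the heatmap covers one cell per day and hour of the chosen month. The sketch below builds such a grid with the standard library only. The predict callback and its placeholder values are hypothetical stand-ins for the model call, not the project's API, and no plotting is done here.

```python
import calendar


def month_hour_grid(month, predict):
    """Return one row per day and one column per hour for a 'YYYY-MM' month."""
    year, mon = (int(part) for part in month.split("-"))
    days_in_month = calendar.monthrange(year, mon)[1]
    return [
        [predict(year, mon, day, hour) for hour in range(24)]
        for day in range(1, days_in_month + 1)
    ]


def dummy_predict(year, mon, day, hour):
    # Placeholder values only: lower in assumed rush hours, not real model output
    return 0.2 if hour in (7, 8, 9, 17, 18) else 0.8


grid = month_hour_grid("2025-10", dummy_predict)
print(len(grid), len(grid[0]))
```

A grid in this shape can be handed directly to a plotting library such as matplotlib or seaborn.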

4. JSON Output

Get structured output for programmatic use:

python cli.py prob \
  --station "Times Sq-42 St" \
  --line "1" \
  --dir "S" \
  --datetime "2025-10-01T18:00:00-04:00" \
  --format json
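
For scripting, the same command can be assembled and run from Python. This is a sketch under two assumptions: that cli.py is in the working directory, and that the JSON format prints a single JSON document to stdout. The field names in the output are not documented here, so the sketch returns the parsed object as-is. Both helper names are hypothetical.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone


def build_prob_command(station, line, direction, when):
    """Assemble the argv list for the prob command with JSON output."""
    return [
        "python", "cli.py", "prob",
        "--station", station,
        "--line", line,
        "--dir", direction,
        "--datetime", when.isoformat(),
        "--format", "json",
    ]


def run_prob(cmd):
    """Run the CLI and parse stdout (assumes it prints one JSON document)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Fixed -04:00 offset, matching the datetime used in the example above
eastern_daylight = timezone(timedelta(hours=-4))
when = datetime(2025, 10, 1, 18, 0, tzinfo=eastern_daylight)
cmd = build_prob_command("Times Sq-42 St", "1", "S", when)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with station names that contain spaces.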

CLI Reference

Commands

Command          Description
prob             Predict seat probability for a single query
heatmap          Generate hourly heatmap for a month
fetch-ridership  Download ridership data from Socrata
fetch-gtfs       Download GTFS static data

Options for prob

Option      Required  Description                   Example
--station   Yes       Station name or complex       "14 St-Union Sq"
--line      Yes       Subway line                   "4", "L", "A"
--dir       Yes       Direction                     N, S, E, W, Manhattan, Brooklyn
--datetime  Yes       ISO datetime                  "2025-10-01T08:00:00-04:00"
--format    No        Output format (text or json)  json
--config    No        Path to config file           ./config/default.yaml

Options for heatmap

Option    Required  Description              Example
--line    Yes       Subway line              "L", "4", "A"
--dir     Yes       Direction                N, S, Manhattan, Brooklyn
--month   Yes       Month in YYYY-MM format  "2025-10"
--output  Yes       Output PNG file path     "output.png"
--config  No        Path to config file      ./config/default.yaml

Note: The heatmap command takes no --station option. It prompts you to select a station from the stops on the specified line, which prevents typos and guarantees a valid station name.

Data Sources

All data sources are public and non-personal.

1. MTA Subway Hourly Ridership

2020-2024 Dataset:

2025+ Dataset:

2. MTA Subway Origin-Destination Ridership

2024 Dataset:

3. GTFS Static (Schedules & Routes)

4. Seating Capacity References

MTA Subway Guideline Revisions (July 14, 2025):

Rolling Stock References:

Model Architecture

Pipeline Flow

1. INGEST
   ├─ Socrata: Hourly ridership + OD matrix
   └─ GTFS: Routes, trips, stop_times, stops

2. FEATURES
   ├─ TPH (trains per hour) from GTFS schedules
   ├─ Seat supply = TPH × cars_per_train × seats_per_car
   ├─ Boarding demand (directional split via OD matrix)
   └─ Train load ratio = boardings_per_train / seats_per_train

3. MODEL
   ├─ P(seat) = sigmoid(a × (b - train_load_ratio_adj))
   ├─ Peak crowding adjustment: train_load_ratio × (1 + k × headway_cv)
   └─ Output: probability + intermediate metrics

4. OUTPUT
   └─ CLI, JSON, or visualizations
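
The arithmetic in step 2 can be sketched as follows. The capacity inputs mirror the sample output earlier in this README (15 trains per hour, 10 cars, 30 seats per car); the hourly boardings figure is a made-up number for illustration. Dividing directional boardings by TPH to get boardings per train is an assumption about how the pipeline spreads demand across trains.

```python
def seat_features(tph, cars_per_train, seats_per_car, boardings_per_hour):
    """Compute seat supply and train load ratio for one station-hour."""
    seats_per_train = cars_per_train * seats_per_car
    seat_supply = tph * seats_per_train  # seats passing the station per hour
    # Assumption: directional boardings are spread evenly across trains
    boardings_per_train = boardings_per_hour / tph
    return {
        "seats_per_train": seats_per_train,
        "seat_supply": seat_supply,
        "boardings_per_train": boardings_per_train,
        "train_load_ratio": boardings_per_train / seats_per_train,
    }


# Hypothetical demand of 6000 directional boardings in the hour
features = seat_features(tph=15, cars_per_train=10, seats_per_car=30,
                         boardings_per_hour=6000)
print(features)
```

A train load ratio above 1.0 means the riders boarding at this station alone outnumber the seats on a train.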

Seat Probability Model

Logistic function:

P(seat) = 1 / (1 + exp(-a × (b - load_ratio_adj)))

Where:

  • a = logistic slope (default: 6.0)
  • b = threshold load ratio at P=0.5 (default: 1.0)
  • load_ratio_adj = train load ratio with peak crowding adjustment

Peak crowding adjustment:

load_ratio_adj = (boardings_per_train / seats_per_train) × (1 + k × headway_cv)

Where:

  • k = peak crowding factor (default: 0.15)
  • headway_cv = coefficient of variation for headways (default: 0.25 in peaks)
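
The two formulas above translate into a few lines of Python. This is a minimal sketch using the documented defaults, not a copy of src/model.py; the function name is hypothetical, and headway_cv is passed explicitly because the off-peak value is not stated here.

```python
import math


def seat_probability(load_ratio, a=6.0, b=1.0, k=0.15, headway_cv=0.25):
    """Logistic seat probability with the peak crowding adjustment."""
    load_ratio_adj = load_ratio * (1.0 + k * headway_cv)
    return 1.0 / (1.0 + math.exp(-a * (b - load_ratio_adj)))


for ratio in (0.5, 1.0, 1.5):
    print(ratio, round(seat_probability(ratio), 3))
```

With headway_cv set to 0 the adjustment disappears and a load ratio equal to b gives exactly 0.5. The sketch is meant for realistic load ratios; extremely large values would overflow math.exp.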

Configuration

All parameters are configurable in config/default.yaml:

rolling_stock:
  default_seats_per_car: 42
  line_overrides:
    cars_per_train:
      "7": 11
      "G": 5
    seats_per_car:
      "A": 30  # R211
      "C": 30  # R211

model:
  logistic_slope: 6.0
  logistic_threshold: 1.0
  peak_crowding_factor: 0.15
  default_headway_cv: 0.25
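
The override lookup implied by this layout can be sketched as below. The dictionary mirrors the YAML above as it would look after parsing (for example with yaml.safe_load); note that quoted keys such as "7" stay strings. The helper name is hypothetical, and the project may resolve overrides differently.

```python
# Mirrors the rolling_stock section of config/default.yaml after parsing
CONFIG = {
    "rolling_stock": {
        "default_seats_per_car": 42,
        "line_overrides": {
            "cars_per_train": {"7": 11, "G": 5},
            "seats_per_car": {"A": 30, "C": 30},
        },
    },
}


def seats_per_car(config, line):
    """Return the per-line seat override, falling back to the default."""
    stock = config["rolling_stock"]
    overrides = stock.get("line_overrides", {}).get("seats_per_car", {})
    return overrides.get(line, stock["default_seats_per_car"])


print(seats_per_car(CONFIG, "A"), seats_per_car(CONFIG, "4"))
```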

⚠️ Important: Model Limitations

TrainSeat v1.0 is a DIRECTIONAL ESTIMATOR, not a ground-truth predictor.

What it CAN do:

  • ✅ Compare relative crowding (8 AM vs 9 AM, Line A vs Line C)
  • ✅ Identify high/low seat availability periods
  • ✅ Provide guidance for commute planning

What it CANNOT do:

  • ❌ Predict specific train arrivals (hourly aggregation only)
  • ❌ Account for upstream alighting (main source of seats)
  • ❌ Adjust for real-time delays/cancellations
  • ❌ Provide calibrated probabilities (no ground truth validation)

Read TrainSeat_Explain_v2.txt for full technical details, honest assessment, and v2.0 roadmap.


Project Structure

TrainSeat/
├── src/
│   ├── __init__.py
│   ├── ingest.py          # Socrata + GTFS data ingestion
│   ├── features.py         # TPH, seat supply, demand splitting
│   ├── model.py            # Logistic seat probability model
│   ├── api.py              # Main prediction API
│   └── utils.py            # Helpers (logging, mapping, formatting)
├── config/
│   └── default.yaml        # Configuration (rolling stock, model params)
├── data_cache/
│   ├── raw/                # Cached Socrata data (Parquet)
│   ├── gtfs/               # GTFS static files
│   └── mappings/           # station_complex_map.csv
├── notebooks/
│   └── eda.ipynb           # Exploratory data analysis & validation
├── cli.py                  # CLI interface
├── requirements.txt        # Python dependencies
└── README.md               # This file

Assumptions & Limitations

Assumptions

  1. Historical demand is representative: Uses recent monthly data as proxy for future demand
  2. OD matrix directional split: When OD data is unavailable, falls back to 50/50 or terminal heuristics
  3. GTFS schedules are accurate: Actual service may vary due to delays, construction, etc.
  4. Uniform loading: Assumes riders board evenly across train cars (not considering end-car effects)
  5. Seated vs standing: Model only predicts seat availability, not overall crowding/capacity
  6. No mid-route alighting: Load ratio based on boardings at query station, not cumulative load
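
Assumption 2 can be illustrated with a small sketch: when OD counts exist, the share of boardings heading in the queried direction comes from them, and otherwise the split falls back to 50/50. The function and argument names are hypothetical, and the terminal heuristics mentioned above are not modeled.

```python
def directional_share(od_toward, od_away, fallback=0.5):
    """Share of boardings in the queried direction, or the fallback if no OD data."""
    total = od_toward + od_away
    if total <= 0:
        return fallback  # no OD data: assume an even split
    return od_toward / total


def directional_boardings(station_boardings, od_toward=0, od_away=0):
    """Scale total station boardings by the directional share."""
    return station_boardings * directional_share(od_toward, od_away)


print(directional_boardings(1000))            # no OD data available
print(directional_boardings(1000, 300, 100))  # OD data favors this direction
```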

Limitations

  • Real-time data: Does not use GTFS-RT for actual train locations or delays
  • Special events: Does not account for games, concerts, protests, etc.
  • Weather: No weather-based demand adjustments
  • Ridership trends: Uses historical averages; does not forecast long-term trends
  • Station-specific issues: Cannot detect station closures, service changes, etc.

Configurable Parameters

All assumptions can be adjusted via config/default.yaml or by editing source files:

  • Rolling stock specs (cars/train, seats/car)
  • Model hyperparameters (logistic slope, threshold, peak factor)
  • Demand smoothing (EWMA alpha)
  • Fallback directional splits
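
As an example of what the demand smoothing parameter controls, here is a generic exponentially weighted moving average. It is a textbook EWMA, assumed rather than confirmed to match the project's smoothing, and the alpha value and sample series are arbitrary.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average; higher alpha tracks recent data more."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        prev = smoothed[-1] if smoothed else x  # seed with the first observation
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed


# Hypothetical hourly boardings for the same hour on three consecutive days
smoothed = ewma([100, 200, 100], alpha=0.5)
print(smoothed)
```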

EDA & Validation

Run the Jupyter notebook for detailed analysis:

cd notebooks
jupyter notebook eda.ipynb

Notebook contents:

  1. Load and explore ridership data
  2. Analyze GTFS schedules (TPH by line/hour)
  3. Visualize seat probability patterns
  4. Sanity checks (overnight high, AM peak low)
  5. Sensitivity analysis on model parameters
  6. Load ratio vs probability curves
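
One way to read the load ratio versus probability curves is to invert the logistic function and ask which adjusted load ratio yields a target probability. The sketch below does that for a few slopes. It is an illustration derived from the documented formula, not code from the notebook, and the function name is hypothetical.

```python
import math


def load_ratio_for_probability(p, a=6.0, b=1.0):
    """Invert P = 1 / (1 + exp(-a * (b - x))) for the adjusted load ratio x."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return b - math.log(p / (1.0 - p)) / a


# Sensitivity to the slope: steeper curves reach P = 0.9 closer to the threshold b
for slope in (3.0, 6.0, 12.0):
    print(slope, round(load_ratio_for_probability(0.9, a=slope), 3))
```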

Development

Running Tests (Future)

pytest tests/

Code Structure

  • Modularity: Each component (ingest, features, model) is independent
  • Caching: All expensive API calls cache to data_cache/
  • Logging: Configurable via --log-level (DEBUG, INFO, WARNING, ERROR)
  • Error handling: Graceful fallbacks for missing data

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-model)
  3. Commit your changes (git commit -am 'Add new model')
  4. Push to the branch (git push origin feature/new-model)
  5. Create a Pull Request

Troubleshooting

Common Issues

1. Socrata rate limits

Error: 429 Too Many Requests

Solution: Register for a free app token and set the SOCRATA_APP_TOKEN environment variable.
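
If rate limits still occur in your own scripts, retrying with exponential backoff is a common complement to the token. The sketch below is generic: RateLimited is a hypothetical stand-in for whatever exception the fetch code raises on a 429, and nothing here is taken from the project's ingestion module. The demo injects a fake fetch and a recording sleep function so it runs instantly.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by the fetch code."""


def retry_with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), doubling the wait after each rate-limit failure."""
    for attempt in range(retries):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)


# Demo: a fake fetch that fails twice, with sleep replaced by a recorder
delays = []
attempts = {"n": 0}


def flaky_fetch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited("429 Too Many Requests")
    return "ok"


result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```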

2. Missing station mapping

Warning: No mapping found for station: XYZ

Solution: Add a manual override to data_cache/mappings/station_complex_map.csv.

3. GTFS download fails

Error: HTTPSConnectionPool

Solution: Check network connection; GTFS endpoint may be temporarily down. Retry or use cached data.

4. No ridership data for date

Warning: No ridership data for station X at hour Y

Solution: Fetch a broader date range with fetch-ridership, or check that the station name matches the name used in the Socrata dataset.

License

MIT License - see LICENSE file for details.

Citations

Data Sources:

Model References:

  • Queueing theory for transit capacity analysis
  • MTA published load guidelines (53 pax/car off-peak, 200+ peak)

Contact

For questions, issues, or contributions, please open an issue on GitHub.


Built with: Python, pandas, numpy, sodapy, matplotlib, seaborn, pyyaml, click

Last updated: 2025-09-26

About

Predicting the probability you can get a seat on a NYC Subway
