Skip to content

Latest commit

 

History

History
97 lines (67 loc) · 4.13 KB

File metadata and controls

97 lines (67 loc) · 4.13 KB

HeapFile.java — Multi-Page Table Storage

File: src/main/java/com/minipostgres/storage/HeapFile.java


What Problem Does This Solve?

A single 4KB page holds maybe 50-100 small tuples. Real tables have millions of rows spanning thousands of pages. HeapFile manages this multi-page structure — it abstracts away page boundaries so the upper layers just say "insert this tuple" without worrying about which page has space.

The name "heap" comes from the fact that tuples are stored in unordered fashion — they go wherever there's free space. This is in contrast to a clustered index where tuples are physically sorted.


Thought Process

Why "Heap" and Not Sorted?

If tuples were sorted by primary key, every insert might require shifting data across multiple pages — catastrophically expensive. A heap file simply finds the first page with enough space and drops the tuple there. This makes inserts O(1) amortized.

The tradeoff: reads require scanning all pages to find a specific tuple (unless you have an index). That's why we build B+ Trees on top.

Insert Strategy: First-Fit

public RecordId insertTuple(Tuple tuple) {
    byte[] tupleData = tuple.serialize();
    for (int i = 0; i < pageCount; i++) {
        Page page = bufferPool.getPage(getFileName(), i);
        int slotId = page.insertTuple(tupleData);
        if (slotId != -1) {
            bufferPool.markDirty(getFileName(), i);
            return new RecordId(i, slotId);
        }
    }
    // Allocate new page...
}

We scan existing pages for one with enough free space ("first-fit"). If none found, we allocate a new page at the end. This is simple and works well for append-heavy workloads. A production DB might maintain a Free Space Map to avoid scanning all pages — but first-fit is correct and illustrative.

Why Return a RecordId?

Every insert returns a RecordId(pageId, slotId) — the physical address of the tuple. This is what gets stored in B+ Tree index entries. When a query finds a matching key in the index, it gets the RecordId and does a direct lookup via getTuple(RecordId) — O(1) page access instead of scanning.

The TupleWithId Wrapper

public static class TupleWithId {
    private final RecordId recordId;
    private final Tuple tuple;
}

Sequential scans need both the tuple data (for predicate evaluation) and the RecordId (to build index entries, or for the caller to reference). TupleWithId bundles them together.

Why an Iterator Pattern?

public Iterator<TupleWithId> iterator() { ... }

An iterator is memory-efficient — it processes one tuple at a time instead of loading all tuples into a list. For a table with millions of rows, materializing everything would blow up memory. The HeapFileIterator lazily fetches pages as needed through the BufferPool.

Routing Through BufferPool

HeapFile never touches disk directly. Every page access goes through bufferPool.getPage(), which either returns a cached page or reads from disk. This means:

  • Repeated scans of the same table reuse cached pages
  • The BufferPool controls memory usage across all tables
  • Dirty pages are automatically tracked

How It Connects to Other Components

HeapFile ◀──schema──── Schema     (knows tuple format)
HeapFile ──page ops──▶ BufferPool (all I/O goes through cache)
HeapFile ──RecordId──▶ BTree      (index entries point here)
HeapFile ──iterator──▶ SeqScan    (full table scan)
HeapFile ──getTuple──▶ IndexScan  (point lookup by RecordId)
HeapFile ──registered──▶ Catalog  (metadata)

File Layout on Disk

students.dat
┌────────┬────────┬────────┬────────┬───
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │...
│ 4096 B │ 4096 B │ 4096 B │ 4096 B │
└────────┴────────┴────────┴────────┴───

Each page is exactly 4096 bytes. Page N starts at byte offset N × 4096. This makes random access trivial — RandomAccessFile.seek(pageId * PAGE_SIZE).