Skip to content

Add read cache for zip archive entries in XLSX reader#4825

Draft
kemo wants to merge 5 commits into
PHPOffice:masterfrom
kemo:perf/xlsx-zip-read-cache
Draft

Add read cache for zip archive entries in XLSX reader#4825
kemo wants to merge 5 commits into
PHPOffice:masterfrom
kemo:perf/xlsx-zip-read-cache

Conversation

@kemo
Copy link
Copy Markdown
Contributor

@kemo kemo commented Mar 10, 2026

Summary

  • Adds an in-memory cache to getFromZipArchive() in the XLSX reader to avoid re-reading the same zip entries
  • The method is called 17+ times during parsing with up to 3 fallback getFromName() attempts each — files like styles.xml and theme.xml may be decompressed multiple times
  • Cache is cleared at the start of each entry point (canRead, listWorksheetNames, listWorksheetInfo, loadSpreadsheetFromFile)

Changes

  • src/PhpSpreadsheet/Reader/Xlsx.php: Added $zipCache array, cache check in getFromZipArchive(), clearZipCache() calls at entry points
  • tests/PhpSpreadsheetTests/Reader/Xlsx/ZipReadCacheTest.php: 5 tests covering correctness, repeated loads, different files, list operations
  • tests/PhpSpreadsheetTests/Benchmark/XlsxZipReadCacheBenchmark.php: Benchmark with complex multi-sheet XLSX

Test plan

  • Verify XLSX files load correctly with the cache
  • Verify cache is properly cleared between separate load operations
  • Verify listWorksheetNames() and listWorksheetInfo() work correctly
  • Run benchmark: vendor/bin/phpunit --group benchmark --filter XlsxZipReadCacheBenchmark --stderr

@kemo
Copy link
Copy Markdown
Contributor Author

kemo commented Mar 11, 2026

Benchmark Results (xlsx-zip-read-cache)

Targeted benchmark: read XLSX files with varying zip archive complexity (single and multi-sheet).

Benchmark Master Branch Delta
issue.3654c (1.7MB) 353ms 360ms +2%
issue.4248 (696K) 53ms 52ms ~same
issue.3654 (558K) 38ms 38ms ~same
issue.2362 (253K) 1,035ms 1,061ms +3%
27template (243K) 4.7ms 4.5ms ~same
Generated 10 sheets (0.3MB) 2,162ms 2,124ms -2%
Generated 25 sheets (0.7MB) 5,525ms 5,533ms ~same

No meaningful difference. The zip read cache doesn't help when each archive entry is read only once during a load. The benefit would show in scenarios where the same zip entries are accessed multiple times (e.g. repeated partial reads or multi-pass processing).

@oleibman
Copy link
Copy Markdown
Collaborator

Thank you for the summaries. For purposes of prioritization, I am going to put the PRs with less demonstrable improvement, including this one, in draft status for now. This does not mean that I will not return to them.

@oleibman oleibman marked this pull request as draft March 11, 2026 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants