Streaming GTFS stop times and shapes #6754
leonardehrenfried wants to merge 9 commits into opentripplanner:dev-2.x
optionsome
left a comment
I guess another option would be to add some sort of streaming reader mode to the OBA library and read the rows through it, but we would probably need to do it in a slightly more generic way, which might lead to more memory consumption.
dao.setPackShapePoints(true);
dao.setPackStopTimes(true);
They instruct OBA to use a more compact way of representing these entities. But if we do the streaming approach, it is no longer necessary.
I had the same idea. The problem is that streaming the entities gives up the referential integrity checks in the library: for example, StopTime.trip is no longer a full Trip but a trip id, which the consumer has to resolve. This means that we need a new data model. So with a new way of reading data and a new data model, there isn't much left of OBA. Also, now that I've maintained OBA for a while, I see that there is a huge amount of complicated indirection in there, which to me doesn't make a lot of sense. My favourite solution is this: we create a new module in this repo where we develop a new streaming library. Once we are satisfied with it, we can consider moving it to another repo in either the OBA or the OTP org.
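To make the referential integrity point concrete, here is a rough sketch of what resolving a trip id on the consumer side could look like. All type and method names below are hypothetical and are not taken from OBA or OTP; the sketch only assumes that trips are read into a map before the stop times are streamed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; this is neither the OBA nor the OTP data model.
class TripResolutionSketch {

  record Trip(String id, String routeId) {}

  // A streamed row only carries the raw trip_id string from stop_times.txt.
  record StreamedStopTime(String tripId, String stopId, int stopSequence) {}

  // What the consumer holds after it has resolved the reference itself.
  record ResolvedStopTime(Trip trip, String stopId, int stopSequence) {}

  // Looks up the trip by id. The null check stands in for the referential
  // integrity check that a relational library would otherwise perform.
  static ResolvedStopTime resolve(StreamedStopTime row, Map<String, Trip> tripsById) {
    Trip trip = tripsById.get(row.tripId());
    if (trip == null) {
      throw new IllegalArgumentException("Unknown trip_id in stop time: " + row.tripId());
    }
    return new ResolvedStopTime(trip, row.stopId(), row.stopSequence());
  }

  public static void main(String[] args) {
    Map<String, Trip> tripsById = new HashMap<>();
    tripsById.put("t1", new Trip("t1", "r1"));
    ResolvedStopTime resolved = resolve(new StreamedStopTime("t1", "s1", 1), tripsById);
    System.out.println(resolved);
  }
}
```

In this sketch the trips still have to fit in memory, but they are typically far fewer than the stop times that reference them.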
Summary
This is a proof of concept for more efficient processing of stop times and shapes: rather than reading all of them into a huge list/array, they are streamed off the CSV source line by line.
This yields large memory savings: in a typical graph build you can save 30-40%.
Combined with #6752 this saves about 60% of memory.
The downside is that we now have two ways of reading GTFS data: one streaming and one from the OBA library.
We need to discuss the various trade-offs, and therefore this is a draft. (It also depends on a PR that isn't merged yet.)
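For readers unfamiliar with the approach, the line-by-line streaming idea can be sketched roughly as follows. This is not the code in this PR: the class, record and method names are made up for illustration, only three columns are modelled, and the parsing assumes plain comma-separated values without quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names come from OTP or the OBA library.
class StopTimeStreamSketch {

  // Only three columns are modelled to keep the example short.
  record StopTimeRow(String tripId, String stopId, int stopSequence) {}

  // Parses the header once, then passes each row to the consumer as soon as it is read.
  // The reader never collects rows, so memory use does not grow with the file size.
  static int streamStopTimes(BufferedReader reader, Consumer<StopTimeRow> consumer) {
    try {
      String headerLine = reader.readLine();
      if (headerLine == null) {
        return 0;
      }
      List<String> header = Arrays.asList(headerLine.split(",", -1));
      int tripIdx = requireColumn(header, "trip_id");
      int stopIdx = requireColumn(header, "stop_id");
      int seqIdx = requireColumn(header, "stop_sequence");

      int count = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.isBlank()) {
          continue;
        }
        String[] cols = line.split(",", -1);
        consumer.accept(
          new StopTimeRow(cols[tripIdx], cols[stopIdx], Integer.parseInt(cols[seqIdx].trim()))
        );
        count++;
      }
      return count;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private static int requireColumn(List<String> header, String name) {
    int idx = header.indexOf(name);
    if (idx < 0) {
      throw new IllegalArgumentException("Missing column: " + name);
    }
    return idx;
  }

  public static void main(String[] args) {
    String csv =
      "trip_id,arrival_time,departure_time,stop_id,stop_sequence\n" +
      "t1,08:00:00,08:00:00,s1,1\n" +
      "t1,08:05:00,08:05:00,s2,2\n" +
      "t2,09:00:00,09:00:00,s1,1\n";
    // The consumer aggregates on the fly instead of keeping every row.
    Map<String, Integer> stopsPerTrip = new HashMap<>();
    int rows = streamStopTimes(
      new BufferedReader(new StringReader(csv)),
      row -> stopsPerTrip.merge(row.tripId(), 1, Integer::sum)
    );
    System.out.println(rows + " rows, stops per trip: " + stopsPerTrip);
  }
}
```

The callback style means the caller decides what to retain, which is where the memory saving would come from in a design like this.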
cc @tkalvas @abyrd @jessicaKoehnke