Increased memory usage with dlt > 1.3.0 and filesystem module during extract phase #2221
Comments
@trin94 thanks for this report! We didn't touch that code path. Could you tell me whether there are any differences in the fsspec and adlfs dependencies between your 1.3.0 and 1.4.1 environments?
Another thing you could try is to set parallelism to 1 and the queueing strategy to fifo.
Does this reduce memory consumption? (Only one resource will be extracted at a time.)
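For reference, a minimal sketch of how that could look in config.toml, assuming the standard dlt extract settings (workers, max_parallel_items, next_item_mode); the exact values suggested above were not captured in this thread:

```toml
[extract]
# a single extraction worker, one item at a time
workers = 1
max_parallel_items = 1
# queue items in plain FIFO order instead of round robin
next_item_mode = "fifo"
```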
Hi @rudolfix, thank you very much for your response!
No, there are no differences. We upgraded all dependencies to their most recent versions.
With the suggested settings, unfortunately, this did not help.
@trin94 thanks, this is very helpful (and worrying!). We'll investigate this with high priority.
@trin94 when doing a code diff, I can see only one thing that was added in 1.4.1: some filesystem configuration values are now applied automatically as defaults. You could try to disable this change with config, i.e. https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#adding-additional-configuration
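Per the linked page, extra fsspec arguments can be passed through config as JSON-encoded strings; a rough sketch, where the specific key shown (the fsspec listings-cache switch) is only an illustrative assumption, and for the filesystem source the analogous [sources.filesystem] section would presumably be the place to put it:

```toml
[destination.filesystem]
# extra arguments forwarded to the fsspec filesystem constructor, JSON-encoded
kwargs = '{"use_listings_cache": false}'
```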
@trin94 I will try to reproduce this here with a smaller generated dataset. Could you confirm that the pipeline fails during the extract phase and not during one of the other phases, such as normalization or loading? You could just run pipeline.extract() in the container to make sure. Also, could you tell me how large the dataset you are loading is? I will try to reproduce the increased memory usage on a smaller dataset I create here.
@trin94 Can you run your pipeline locally (only the extract phase) with all the same settings, but with psutil installed in the environment? You will then get a memory printout in your logs. Can you confirm that you see the problem there too? I have now extracted a 1.2 GB jsonl dataset I generated locally and see no memory increase. I will test doing the same from an Azure bucket next.
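A minimal sketch of such an extract-only run; the bucket URL, glob, and pipeline name are placeholders, and progress="log" is one way to get the periodic memory printout once psutil is installed:

```python
import dlt
from dlt.sources.filesystem import filesystem, read_jsonl

# placeholder bucket URL and glob -- adjust to the real source
files = filesystem(bucket_url="az://my-container/data", file_glob="**/*.jsonl")

pipeline = dlt.pipeline(
    pipeline_name="fs_extract_memcheck",
    destination="postgres",
    progress="log",  # with psutil installed, the log collector also reports memory and CPU
)

# run only the extract phase; normalize and load are not executed
pipeline.extract(files | read_jsonl())
```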
Hey @sh-rp, Thank you for looking into this!
Yes, it's the extract phase. We've been using …
It's around …
Locally, I don't see any RAM being eaten up either. I tested it on a subset (…).
I will check it later (maybe even tomorrow) and report back whether it fixes the issue. Thank you very much! Best regards
I also did a test with the same files on Azure and likewise cannot see an increased or increasing memory footprint. For reference, the script to create the files:
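The actual generator script was not captured in this transcript; as a rough stand-in, something along these lines produces a comparable pile of small jsonl files (file count, record shape, and sizes are assumptions):

```python
import json
import os
import random
import string

OUT_DIR = "test_data"  # placeholder output directory
os.makedirs(OUT_DIR, exist_ok=True)

def random_record() -> dict:
    # arbitrary nested record; the real files have wildly different schemas
    return {
        "id": "".join(random.choices(string.ascii_lowercase, k=12)),
        "value": random.random(),
        "payload": {"text": "".join(random.choices(string.ascii_letters, k=500))},
    }

for file_idx in range(1_000):  # illustrative file count
    path = os.path.join(OUT_DIR, f"part_{file_idx:05d}.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        for _ in range(500):  # keeps each file well under 1 MB
            fh.write(json.dumps(random_record()) + "\n")
```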
And the simplified pipeline script:
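The simplified pipeline script is likewise missing here; a hypothetical reconstruction of a filesystem-to-Postgres run would look roughly like this (all names and URLs are placeholders):

```python
import dlt
from dlt.sources.filesystem import filesystem, read_jsonl

# placeholder Azure container; a local directory path works as well
files = filesystem(bucket_url="az://test-container/test_data", file_glob="**/*.jsonl")

pipeline = dlt.pipeline(
    pipeline_name="fs_memory_repro",
    destination="postgres",
    dataset_name="repro",
)

load_info = pipeline.run(files | read_jsonl())
print(load_info)
```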
Hey,
Unfortunately, this did not reduce memory consumption.
I just checked one of our test files: a single line in our jsonl file (picked at random) amounts to roughly …
I've attached full logs of all dependencies that were installed into our Docker containers.
I also ran with …
Alright, some more questions and asks to help me reproduce this:
Hey @sh-rp,
There are common fields, but overall they are wildly different.
Yes. I've attached one. If you need more (and different) files, please tell me 🙂
OK, I see. Per https://dlthub.com/docs/reference/performance#use-built-in-json-parser, you can try switching to simplejson.
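A sketch of switching parsers, assuming the DLT_USE_JSON environment variable described on the linked performance page (verify the exact variable name there before relying on it):

```python
import os

# the parser is chosen when dlt's json module is first imported,
# so the variable must be set before importing dlt
os.environ["DLT_USE_JSON"] = "simplejson"

import dlt  # noqa: E402
```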
Hi @trin94, I'm experiencing a similar issue. I will try to provide more details at the end of the week. Do you also experience a significant processing speed difference? In my case, I'm using filesystem on Azure Blob Storage as both source and destination. I also see memory growth, but additionally a significant speed difference (version …).
Hi,
I've tried setting …
I can confirm that processing is noticeably slower on newer versions, but I cannot tell whether it's connected to the increased memory consumption we're observing.
@trin94 as I mentioned, we could not identify any obvious problems when comparing diffs of 1.3.0 and 1.4.1. If you are still willing to investigate, you could remove the incremental in here: …
and run a full load as a test. We changed how the incremental is instantiated and when boundary deduplication happens, but that should rather decrease memory consumption. If you could run it, that would be very helpful.
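The code the maintainer points at is not shown in this transcript; as an illustration, an incremental hint on the filesystem resource is typically applied as below and can simply be left out for the full-load test (bucket URL and cursor column are assumptions):

```python
import dlt
from dlt.sources.filesystem import filesystem, read_jsonl

files = filesystem(bucket_url="az://my-container/data", file_glob="**/*.jsonl")

# Incremental loading would normally be enabled like this; leaving it out
# (or commenting it out) forces a full load for the test.
# files.apply_hints(incremental=dlt.sources.incremental("modification_date"))

pipeline = dlt.pipeline(pipeline_name="fs_full_load_test", destination="postgres")
pipeline.run(files | read_jsonl())
```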
@trin94 also, please double-check whether the problem happens in …
Hi @rudolfix,
That had no effect, unfortunately.
That's exactly what we do 😃 Thank you so much for looking into this! Upgrading isn't a priority right now, so we're fine with v1.3.0.
dlt version
Describe the problem
We're observing memory issues when upgrading from dlt 1.3.0 to dlt 1.5.0.
We use the filesystem module to load data (less than 50.000k files, each file < 10 MB, most of them < 1 MB) into a Postgres instance.
With 1.3.0, we observed < 1 GB of memory usage during the extract phase (the extraction phase takes 16 minutes).
With 1.4.1 and 1.5.0, our container job gets killed because it exceeded 8 GB of memory usage after 30 minutes.
The only difference between these runs is the dlt version.
Expected behavior
Memory usage stays roughly the same (~ 1 GB), but at least below 2 GB, during the extraction phase.
Steps to reproduce
Probably, a similar file set needs to be used. Unfortunately, I cannot share the files as it's production data.
Operating system
Linux
Runtime environment
Other
Python version
3.12
dlt data source
filesystem
dlt destination
Postgres
Other deployment details
We're using a Container App Job on Azure.
Additional information
Python:
Config:
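The actual config was not captured here; for illustration, a typical filesystem-source setup for Azure in config.toml/secrets.toml looks roughly like this (all names and values are placeholders):

```toml
[sources.filesystem]
bucket_url = "az://my-container/data"
file_glob = "**/*.jsonl"

[sources.filesystem.credentials]
azure_storage_account_name = "my_account"
azure_storage_account_key = "..."
```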
Kinda off-topic: … Python 3.12 that we're using …