[help] up-to-date checking is slow for large file targets: is this normal/unavoidable? #1416
Help
Hi! First of all, this package is fantastic and powerful and has made my projects safer and more reproducible as intended, so thank you 😊 I have a question.

I'm running a functional MRI research study. Right now I have 15 participants' data, and each participant has 5 raw MRI data files that are about 2 GB each. Each raw MRI file gets processed into another MRI file of similar size. I'm tracking each of these MRI files as its own target, and they're all declared with `format = "file"` (simplified example below). Ultimately, my main analysis targets currently depend on 15 x 5 x 2 = 150 upstream data targets of about 2 GB each (as well as many other, much smaller upstream targets). The study is in progress, so the number of targets is slowly increasing as we collect more data!

When I change the code for a downstream analysis target and all of the upstream MRI data targets are already up to date, `tar_make()` still spends a long time checking that those large file targets are up to date before it reaches the outdated work. Is there anything I can do to speed up the up-to-date checking, or is that a fact of life with target files this large? Thank you for your help!

Potentially relevant settings and other details:
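For illustration, each pair of MRI files looks roughly like this in the pipeline (a simplified sketch with made-up target names, paths, and a made-up `preprocess_mri()` function):

```r
library(targets)

# Simplified sketch: one ~2 GB raw MRI file tracked as a file target, and one
# processed MRI file produced from it. Target names, file paths, and
# preprocess_mri() are hypothetical placeholders.
list(
  tar_target(
    raw_sub01_run1,
    "data/raw/sub-01_run-1_bold.nii.gz",  # returns the path to an existing file
    format = "file"
  ),
  tar_target(
    proc_sub01_run1,
    preprocess_mri(raw_sub01_run1),  # writes the processed file and returns its path
    format = "file"
  )
)
```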
Replies: 1 comment 1 reply
The overhead is definitely avoidable if you do one of the following:

1. In `targets` version 1.7.1, set `format = "file_fast"` where you currently have `format = "file"` (see the sketch after this list).
2. Upgrade to the latest version of `targets` (1.10.0). You may have to run a slow `tar_make()` at first, but afterward the pipeline should trust timestamps and run much faster if everything is up to date.
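A minimal sketch of option 1, assuming a hypothetical target name and file path:

```r
library(targets)

# With targets >= 1.7.1, format = "file_fast" relies on file time stamps rather
# than re-hashing the full 2 GB file to decide whether the target is up to date.
tar_target(
  raw_sub01_run1,                       # hypothetical target name
  "data/raw/sub-01_run-1_bold.nii.gz",  # hypothetical file path
  format = "file_fast"
)
```

Option 2 should require no changes to the pipeline code itself, only an upgrade of the `targets` package.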