Large input taking days to process #1285

@ammaraziz

Hi again,

I am running MiCall v7.17.1 through Docker. The command I used:

micall folder --project_code HIVB --skip trim.censor -s

The sequencing experiment was hybridisation using the TWIST HIV library, run on a P1 NextSeq, paired-end 150 bp. Most samples generated plenty of data (around 5 million paired-end reads on average, unprocessed). Samples with read numbers in that range finished within hours.

But two samples with nearly triple the number of reads (15 million) are still being processed ~4 days later.

The machine I am running MiCall on has plenty of RAM (2 TB) and CPUs (256).

The output (just one sample):

2025-04-24 07:13:42,554[INFO]micall.link_samples(): Pairing files /data/20_XXXXXX_S20_R1_001.fastq.gz and /data/20_XXXXXX_S20_R2_001.fastq.gz.
2025-04-24 07:13:43,460[INFO]micall.drivers.sample.process(): Processing Sample 20_XXXXXX (12 of 24) ('/data/20_XXXXXX_S20_R1_001.fastq.gz').
2025-04-24 08:35:44,843[INFO]micall.drivers.sample.process(): Running fastq_g2p on Sample 20_XXXXXX (12 of 24).
2025-04-24 11:41:52,503[INFO]micall.drivers.sample.run_mapping(): Running prelim_map on Sample 20_XXXXXX (12 of 24).
2025-04-24 19:49:56,932[INFO]micall.drivers.sample.run_mapping(): Running remap on Sample 20_XXXXXX (12 of 24).
2025-04-25 15:48:24,382[INFO]micall.drivers.sample.process_post_assembly(): Running sam2aln on Sample 20_XXXXXX (12 of 24).
2025-04-25 16:47:30,930[INFO]micall.drivers.sample.process_post_assembly(): Running aln2counts on Sample 20_XXXXXX (12 of 24).
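
To see where the time goes, here is a rough Python sketch that turns the log lines above into elapsed time per step. It assumes that the gap between two consecutive log messages is the time spent in the earlier step, and that every line follows the layout `<timestamp>[LEVEL]<logger>(): <message>`. The names `stage_durations`, `LINE_RE` and `STEP_RE` are made up for this example and are not part of MiCall.

```python
import re
from datetime import datetime

# Log lines copied verbatim from the output above.
LOG = """\
2025-04-24 07:13:42,554[INFO]micall.link_samples(): Pairing files /data/20_XXXXXX_S20_R1_001.fastq.gz and /data/20_XXXXXX_S20_R2_001.fastq.gz.
2025-04-24 07:13:43,460[INFO]micall.drivers.sample.process(): Processing Sample 20_XXXXXX (12 of 24) ('/data/20_XXXXXX_S20_R1_001.fastq.gz').
2025-04-24 08:35:44,843[INFO]micall.drivers.sample.process(): Running fastq_g2p on Sample 20_XXXXXX (12 of 24).
2025-04-24 11:41:52,503[INFO]micall.drivers.sample.run_mapping(): Running prelim_map on Sample 20_XXXXXX (12 of 24).
2025-04-24 19:49:56,932[INFO]micall.drivers.sample.run_mapping(): Running remap on Sample 20_XXXXXX (12 of 24).
2025-04-25 15:48:24,382[INFO]micall.drivers.sample.process_post_assembly(): Running sam2aln on Sample 20_XXXXXX (12 of 24).
2025-04-25 16:47:30,930[INFO]micall.drivers.sample.process_post_assembly(): Running aln2counts on Sample 20_XXXXXX (12 of 24).
"""

# Assumed line layout: "<timestamp>[LEVEL]<logger>(): <message>".
LINE_RE = re.compile(
    r"^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d{3})\[\w+\]\S+\(\): (.*)$"
)
# Messages of the form "Running <step> on Sample ..." name the step.
STEP_RE = re.compile(r"^Running (\S+) on ")


def stage_durations(log_text):
    """Return (label, timedelta) for every log line that has a successor.

    The last line has no end time yet, so it is left out.
    """
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a log line
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        message = match.group(2)
        step = STEP_RE.match(message)
        # Fall back to the first word ("Pairing", "Processing") otherwise.
        label = step.group(1) if step else message.split()[0]
        events.append((label, when))
    return [
        (label, end - start)
        for (label, start), (_, end) in zip(events, events[1:])
    ]


for label, elapsed in stage_durations(LOG):
    print(f"{label:12s} {elapsed}")
```

On this log the longest completed gap is remap (just under 20 hours), followed by prelim_map (about 8 hours). aln2counts has no end time because it is the step still running.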

To be clear, the pipeline hasn't crashed; it is still processing the data. In the output directory, there is a scratch folder containing temporary processing folders. For sample number 20 above, the folder contains 2674 (and still growing) nuc_read_counts*.csv files.

The issue is the large number of input reads, but what's interesting is that the input is 'only' 3 times larger than the files which took ~4 hours to complete.

I could subsample the reads to a reasonable amount, but I think the issue still remains that something is triggering MiCall to go super slow. Possibly related to #214?
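
For the subsampling workaround, a minimal Python sketch might look like the following. It is not a MiCall feature: `subsample_pairs` and `read_records` are hypothetical helpers, and the sketch assumes gzipped FASTQ files with four lines per record and R1/R2 in the same read order. Each pair is kept with probability `fraction`, so the output size is approximate rather than exact.

```python
import gzip
import random
from itertools import islice, zip_longest


def read_records(handle):
    """Yield FASTQ records as tuples of 4 lines (assumes 4-line records)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`; return pairs kept.

    R1 and R2 are read in lockstep, so mates always stay together.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = 0
    with gzip.open(r1_in, "rt") as in1, gzip.open(r2_in, "rt") as in2, \
            gzip.open(r1_out, "wt") as out1, gzip.open(r2_out, "wt") as out2:
        pairs = zip_longest(read_records(in1), read_records(in2))
        for rec1, rec2 in pairs:
            if rec1 is None or rec2 is None:
                raise ValueError("R1 and R2 contain different numbers of reads")
            if rng.random() < fraction:
                out1.writelines(rec1)
                out2.writelines(rec2)
                kept += 1
    return kept


# Example call (output file names are placeholders): keep roughly a third
# of the pairs, i.e. about 5 million out of 15 million.
# subsample_pairs("/data/20_XXXXXX_S20_R1_001.fastq.gz",
#                 "/data/20_XXXXXX_S20_R2_001.fastq.gz",
#                 "sub_R1.fastq.gz", "sub_R2.fastq.gz", fraction=1 / 3)
```

This would only hide the slowdown rather than explain it, which is why I have not done it yet.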
