Large input taking days to process #1285

Hi again,

I am running MiCall version `v7.17.1` through Docker. The command I used:

```
micall folder --project_code HIVB --skip trim.censor -s
```

The sequencing experiment used hybridisation with the TWIST HIV library, run on a P1 NextSeq with paired-end 150 bp reads. Most of the samples generated a lot of data (something like 5 million paired-end reads on average, unprocessed). These samples, with reasonable read numbers, finished within hours.

But two samples with nearly triple the number of reads (15 million) are still being processed ~4 days later.

The machine I am running MiCall on has plenty of RAM (2 TB) and CPUs (256).


The output (just one sample):

```
2025-04-24 07:13:42,554[INFO]micall.link_samples(): Pairing files /data/20_XXXXXX_S20_R1_001.fastq.gz and /data/20_XXXXXX_S20_R2_001.fastq.gz.
2025-04-24 07:13:43,460[INFO]micall.drivers.sample.process(): Processing Sample 20_XXXXXX (12 of 24) ('/data/20_XXXXXX_S20_R1_001.fastq.gz').
2025-04-24 08:35:44,843[INFO]micall.drivers.sample.process(): Running fastq_g2p on Sample 20_XXXXXX (12 of 24).
2025-04-24 11:41:52,503[INFO]micall.drivers.sample.run_mapping(): Running prelim_map on Sample 20_XXXXXX (12 of 24).
2025-04-24 19:49:56,932[INFO]micall.drivers.sample.run_mapping(): Running remap on Sample 20_XXXXXX (12 of 24).
2025-04-25 15:48:24,382[INFO]micall.drivers.sample.process_post_assembly(): Running sam2aln on Sample 20_XXXXXX (12 of 24).
2025-04-25 16:47:30,930[INFO]micall.drivers.sample.process_post_assembly(): Running aln2counts on Sample 20_XXXXXX (12 of 24).
```
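In case it helps with triage, here is a rough sketch for turning the timestamps above into per-stage durations. It assumes each log line marks the start of a stage and that a stage ends when the next line is logged; `stage_durations` is a helper name I made up, not part of MiCall. On this log it gives roughly 8 hours for `prelim_map` and roughly 20 hours for `remap`, while `aln2counts` (started 2025-04-25 16:47) is the stage that is still running.

```python
import re
from datetime import datetime

# Log lines copied from the output above.
LOG = """\
2025-04-24 07:13:42,554[INFO]micall.link_samples(): Pairing files /data/20_XXXXXX_S20_R1_001.fastq.gz and /data/20_XXXXXX_S20_R2_001.fastq.gz.
2025-04-24 07:13:43,460[INFO]micall.drivers.sample.process(): Processing Sample 20_XXXXXX (12 of 24) ('/data/20_XXXXXX_S20_R1_001.fastq.gz').
2025-04-24 08:35:44,843[INFO]micall.drivers.sample.process(): Running fastq_g2p on Sample 20_XXXXXX (12 of 24).
2025-04-24 11:41:52,503[INFO]micall.drivers.sample.run_mapping(): Running prelim_map on Sample 20_XXXXXX (12 of 24).
2025-04-24 19:49:56,932[INFO]micall.drivers.sample.run_mapping(): Running remap on Sample 20_XXXXXX (12 of 24).
2025-04-25 15:48:24,382[INFO]micall.drivers.sample.process_post_assembly(): Running sam2aln on Sample 20_XXXXXX (12 of 24).
2025-04-25 16:47:30,930[INFO]micall.drivers.sample.process_post_assembly(): Running aln2counts on Sample 20_XXXXXX (12 of 24).
"""

# timestamp , milliseconds [LEVEL] logger(): message
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\[\w+\]\S+: (.*)$"
)


def stage_durations(log_text):
    """Return (message, elapsed) pairs for every line except the last.

    Assumption: a stage lasts from its own log line until the next one.
    Milliseconds are dropped to keep the arithmetic simple.
    """
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return [
        (message, next_stamp - stamp)
        for (stamp, message), (next_stamp, _) in zip(events, events[1:])
    ]


for message, elapsed in stage_durations(LOG):
    print(elapsed, message)
```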

To be clear, the pipeline hasn't crashed; it is still processing the data. In the output directory, there is a `scratch` folder containing temporary processing folders. For sample number 20 above, the folder contains 2674 (and still growing) `nuc_read_counts*.csv` files.
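For anyone who wants to watch the same thing on their own run, this is a small sketch of how the file count can be checked. The function name and the example path are hypothetical, and I am assuming the files may sit in subfolders of `scratch`, hence the recursive glob.

```python
from pathlib import Path


def count_nuc_read_counts(scratch_dir):
    """Count nuc_read_counts*.csv files anywhere under scratch_dir."""
    return sum(1 for _ in Path(scratch_dir).rglob("nuc_read_counts*.csv"))


# Example (hypothetical path, adjust to the real output directory):
# print(count_nuc_read_counts("/path/to/output/scratch"))
```

Running it a few minutes apart shows whether the count is still climbing.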

The issue is the large number of input reads, but what's interesting is that the input is 'only' 3 times larger than the files that took ~4 hours to complete.
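To put a rough number on that, here is a back-of-the-envelope sketch. It assumes runtime grows like `reads ** k` (a simplification, the real pipeline has several stages that may scale differently) and uses the approximate figures from this issue: 5 vs 15 million reads, ~4 hours vs ~4 days. Since the large samples have not finished, the result is only a lower bound on `k`.

```python
import math


def scaling_exponent(read_ratio, time_ratio):
    """Exponent k such that time_ratio == read_ratio ** k."""
    return math.log(time_ratio) / math.log(read_ratio)


read_ratio = 15 / 5          # ~15 million vs ~5 million paired-end reads
time_ratio = (4 * 24) / 4    # ~4 days (so far) vs ~4 hours

print(round(scaling_exponent(read_ratio, time_ratio), 2))  # 2.89
```

So the slowdown looks closer to cubic than linear in the number of reads, if the power-law assumption holds at all.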

I could subsample the reads to a reasonable amount, but I think the issue still remains that something is triggering MiCall to go super slow. Possibly related to https://github.com/cfe-lab/MiCall/issues/214 ?
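For completeness, this is roughly how I would do the subsampling as a workaround. It is a minimal sketch, not anything from MiCall: the function names are mine, it assumes standard 4-line FASTQ records, and it assumes R1 and R2 list the same reads in the same order so that pairs stay in sync.

```python
import gzip
import random
from itertools import islice


def read_records(handle):
    """Yield FASTQ records as tuples of 4 lines (assumes 4-line records)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def count_records(path):
    """First pass: count the records in a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in read_records(handle))


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, n_pairs, seed=0):
    """Write a random subset of n_pairs read pairs; return pairs written.

    Assumes R1 and R2 hold the same reads in the same order.
    """
    total = count_records(r1_in)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    keep = set(rng.sample(range(total), min(n_pairs, total)))
    written = 0
    with gzip.open(r1_in, "rt") as in1, gzip.open(r2_in, "rt") as in2, \
            gzip.open(r1_out, "wt") as out1, gzip.open(r2_out, "wt") as out2:
        pairs = zip(read_records(in1), read_records(in2))
        for index, (rec1, rec2) in enumerate(pairs):
            if index in keep:
                out1.writelines(rec1)
                out2.writelines(rec2)
                written += 1
    return written
```

It reads R1 twice (once to count, once to write) so that only the chosen indices are held in memory rather than the reads themselves.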
