
Merge telescope-wise data from the same OB #2916

Open: TjarkMiener wants to merge 13 commits into main from combine_tel_events

@TjarkMiener
Member

This PR adds a new merge strategy for combining telescope-wise data from the same OB. It also adds a static method to SubarrayDescription that merges several subarrays into one.


closes #1625

Base automatically changed from mondata_merger to main January 13, 2026 17:02
@maxnoe
Member

maxnoe commented Jan 14, 2026

I'm not sure that we need the merger of the subarray. The individual telescope data stream has the subarray_id, which points to the full subarray. It's not yet fully clear how we then obtain the instrument description from that, but in the production case, likely from files on CVMFS exported from data in SOSS.

For now, ctapipe_io_zfits has resource files bundled to create it.

In any case, I don't really foresee the need to combine "per-telescope" subarray descriptions, at least not for observed CTAO data.

Do you have a use case that requires merging subarrays?

@TjarkMiener
Member Author

> I'm not sure that we need the merger of the subarray. The individual telescope data stream has the subarray_id, which points to the full subarray. It's not yet fully clear how we then obtain the instrument description from that, but in the production case, likely from files on CVMFS exported from data in SOSS.
>
> For now, ctapipe_io_zfits has resource files bundled to create it.
>
> In any case, I don't really foresee the need to combine "per-telescope" subarray descriptions, at least not for observed CTAO data.
>
> Do you have a use case that requires merging subarrays?

Thanks for the feedback, that makes sense for the standard CTAO production workflow. The motivation for including subarray merging here is mainly to support a few non-standard but actively used workflows rather than the core CTAO analysis chain.

In calibpipe, we work with telescope-wise monitoring outputs that we want to merge at a later stage; this could be handled by a dedicated calibpipe tool, but since ctapipe-merge already provides a generic merging interface and now supports different merge strategies, it feels convenient to support this pattern there as well. The required changes are relatively minor, with the main added complexity being the handling of the trigger tables in _flush().

A second motivation comes from CTLearn, where some stereo deep learning models operate on lower-level data and combine information from multiple telescopes early in the analysis chain. While it is possible to redesign the readers to explicitly handle multiple telescope-wise files, this quickly becomes more complex, whereas allowing merged telescope-wise inputs simplifies the workflow. In addition, the static helper SubarrayDescription.merge_subarrays() is needed in CTLearn for one of our best-performing models, where predictions are made on multiple telescope-pair subarrays and then combined.

I agree that this functionality is likely not needed for standard observed CTAO data processing, but with relatively small, opt-in changes we can support these additional use cases without impacting the default workflow, while improving convenience for downstream tools.

TjarkMiener marked this pull request as ready for review January 15, 2026 10:30
@Voutsi

Voutsi commented Jan 15, 2026

> Do you have a use case that requires merging subarrays?

@maxnoe, I would like to add here the use case of telescope cross calibration. There we want to process all the data from all subarrays of a given night, and depending on the case, we might also want to process data from more than one night. Reading @TjarkMiener's reply, if it is straightforward to add functionality to ctapipe that can merge DL2 data from different OBs and merge the subarrays, that would be great.

Comment thread src/ctapipe/instrument/tests/test_subarray.py Outdated
@maxnoe
Member

maxnoe commented Jan 15, 2026

Ok, thanks, these are fine use cases and merging the subarray data should be straightforward.

However, we should make sure that the default for merging telescopes is merging within the same OB, which should require the same subarray.

Comment on lines +719 to +722
    if self.subarray != subarray:
        raise CannotMerge(
            f"Subarrays do not match for file: {other.filename}"
        )
Member Author

@Voutsi the default merge strategy will check if the subarrays match between different observation blocks. For your use case you would need to disable that check? Should we add another merge strategy for this then?

@TjarkMiener, yes, for my use case I need to merge different subarrays. But I am not sure I need a new strategy. Wouldn't the existing strategy "events-multiple-obs" do the job?

Member

This needs much more work.

Thinking about it, you just cannot simply combine subarrays, there are multiple arrays in the data that are arrays of length n_telescopes. You need to change these values to be consistent over all events, e.g. the tels_with_trigger in the trigger subarray trigger table.
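
To illustrate the point about length-n_telescopes arrays, here is a minimal sketch in plain Python. It is not ctapipe code: `remap_trigger_mask` and the telescope lists are hypothetical, and it assumes that a mask entry at position i refers to the i-th telescope ID of the subarray the event was recorded with.

```python
def remap_trigger_mask(mask, tel_ids, merged_tel_ids):
    """Re-express a per-event boolean telescope mask on a larger telescope list.

    mask[i] refers to tel_ids[i]; the returned mask refers to merged_tel_ids.
    (Hypothetical helper for illustration, not part of ctapipe.)
    """
    if len(mask) != len(tel_ids):
        raise ValueError("mask length must equal the number of telescopes")
    triggered = {tel_id for tel_id, flag in zip(tel_ids, mask) if flag}
    missing = triggered - set(merged_tel_ids)
    if missing:
        raise ValueError(f"telescopes not in merged subarray: {sorted(missing)}")
    return [tel_id in triggered for tel_id in merged_tel_ids]


# Hypothetical subarrays: A has telescopes (1, 2), B has telescopes (3, 4).
merged = [1, 2, 3, 4]
mask_a = remap_trigger_mask([True, False], [1, 2], merged)
mask_b = remap_trigger_mask([False, True], [3, 4], merged)
```

Without such a remapping, a length-2 mask written for subarray A would be misread against the four-telescope merged subarray, so every stored mask would have to be rewritten consistently for all events.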

Member

> You need to change these values to be consistent over all events, e.g. the tels_with_trigger in the trigger subarray trigger table.

Yes, you are missing a test that checks that sub.tel_ids_to_mask() is consistent for the original and merged subarrays, which of course will fail. It's unfortunately a pretty strong underlying assumption that the telescope list doesn't change.

Maybe another reason we need a format validation tool, since checking the consistency of event fields like tels_with_trigger or similar is quite high-level.

Do you really need to merge subarrays though? I would just refactor the problem into two steps/tools:

  1. tool that collects the statistics you need for each telescope and writes them to a file, inside of which are data indexed by the telescope type name or tel_id, or whatever you need. That tool should be able to run over multiple DL2/Event input files, perhaps just appending to the initial file so you end up with one "merged" stats file.
  2. tool that reads this intermediate stats file to compute inter/intra-telescope calibration.
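
The two-step split proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON layout, the function names `update_stats_file` and `mean_per_telescope`, and the sample values are all hypothetical and not taken from ctapipe or calibpipe.

```python
import json
import os
import tempfile


def update_stats_file(path, samples):
    """Step 1: fold (tel_id, value) samples from one input file into the stats file."""
    stats = {}
    if os.path.exists(path):
        with open(path) as f:
            stats = json.load(f)
    for tel_id, value in samples:
        # JSON object keys are strings, so index by str(tel_id).
        entry = stats.setdefault(str(tel_id), {"n": 0, "sum": 0.0})
        entry["n"] += 1
        entry["sum"] += value
    with open(path, "w") as f:
        json.dump(stats, f)


def mean_per_telescope(path):
    """Step 2: read the intermediate file and derive a per-telescope quantity."""
    with open(path) as f:
        stats = json.load(f)
    return {int(tel_id): s["sum"] / s["n"] for tel_id, s in stats.items()}


# Two hypothetical runs with different subarrays; no event files are merged.
stats_path = os.path.join(tempfile.mkdtemp(), "tel_stats.json")
update_stats_file(stats_path, [(1, 10.0), (2, 20.0)])
update_stats_file(stats_path, [(2, 30.0), (3, 40.0)])
means = mean_per_telescope(stats_path)
```

Because the intermediate file is keyed by tel_id, the telescope list may differ from run to run without touching any per-event arrays.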

Member Author

> Thinking about it, you just cannot simply combine subarrays, there are multiple arrays in the data that are arrays of length n_telescopes. You need to change these values to be consistent over all events, e.g. the tels_with_trigger in the trigger subarray trigger table.

Yes, we need to run _flush() for this use case as well. I added a new merge strategy 'events-multiple-obs-different-subarrays' for it. I also exclude DL2 subarray data from the merging, since it is not really needed and it contains arrays of length n_telescopes. It is also not a simple merge, since it would involve some logic for how to combine those predictions.

Member

I think before we continue here with the implementation, we need a better formulation of the different use cases and what we expect as input and output data for each of them.

This looks now like a footgun for people to end up with files that violate assumptions of e.g. the HDF5EventSource and the TableLoader, i.e. of our data model.

@maxnoe
Member

maxnoe commented Jan 19, 2026

> @maxnoe, I would like to add here the use case of telescope cross calibration. There we want to process all the data from all subarrays of a given night, and depending on the case, we might also want to process data from more than one night.

I don't really see why this would require you to merge input files for that. Why not work on multiple files and load into memory from them as needed?

@TjarkMiener
Member Author

@maxnoe Sorry for the long silence, I slightly misinterpreted your comment about the multiple files. At first, I thought that by multiple files you meant files per tel_id, which the cross-calibration tool would then have to digest (but then the crucial stereo impact reconstruction would be missing). Now, I think you mean that the cross calibration should just be able to digest per obs_id stereo DL2 files and keep track of the SubarrayDescription per obs_id since they can differ. So indeed, we do not need the merge strategy events-multiple-obs-different-subarrays, and the last two commits should be removed. The overall pseudo-workflow would look like this:

  1. DL0 per tel_id will be processed via ctapipe-process to DL1b for each OB.
  2. DL1b per tel_id will be merged in each OB via ctapipe-merge with combine-telescope-events.
  3. Merged DL1b files for each OB will be processed via ctapipe-process enabling --write-showers (or this recompute flag of DL2).
  4. ctapipe-apply-models tool to get the DL2 telescope and subarray predictions from ML reco. This would lead to a DL2 file for each obs_id following the same structure as @Voutsi's test simulation data.
  5. cross-calibration tool reads multiple files per obs_id with (maybe) different subarrays and does the merging in memory
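
Step 5 of the workflow above (in-memory merging while tracking the subarray per obs_id) might look roughly like the sketch below. `ObsData` and `load_all` are hypothetical stand-ins for reading one DL2 file per obs_id; none of these names come from ctapipe, and the row layout is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ObsData:
    """Hypothetical stand-in for the content of one per-obs_id DL2 file."""

    obs_id: int
    tel_ids: tuple  # telescopes of the subarray used in this OB
    rows: list  # (event_id, tel_id, value) records


def load_all(observations):
    """Combine rows in memory while keeping one subarray record per obs_id."""
    subarrays = {}
    table = []
    for obs in observations:
        if obs.obs_id in subarrays:
            raise ValueError(f"duplicate obs_id {obs.obs_id}")
        subarrays[obs.obs_id] = obs.tel_ids
        for event_id, tel_id, value in obs.rows:
            if tel_id not in obs.tel_ids:
                raise ValueError(f"tel_id {tel_id} not in subarray of obs {obs.obs_id}")
            # Prefix each row with obs_id so its subarray can always be looked up.
            table.append((obs.obs_id, event_id, tel_id, value))
    return subarrays, table


# Two hypothetical OBs observed with different subarrays.
obs_a = ObsData(obs_id=1, tel_ids=(1, 2), rows=[(100, 1, 0.5), (100, 2, 0.7)])
obs_b = ObsData(obs_id=2, tel_ids=(2, 3), rows=[(100, 3, 0.9)])
subarrays, table = load_all([obs_a, obs_b])
```

Since every row carries its obs_id, no per-event array has to be rewritten and the files on disk stay untouched.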

Besides excluding DL2 subarray data from merging with the combine-telescope-events strategy, as done in 2a3f0b8, we also need to exclude DL2 telescope information such as the impact. Then, combine-telescope-events only produces files that do not violate the assumptions of the data model.

Can you please confirm so we can go ahead with the implementation, @maxnoe @kosack?

@maxnoe
Member

maxnoe commented Jan 21, 2026

> Now, I think you mean that the cross calibration should just be able to digest per obs_id stereo DL2 files and keep track of the SubarrayDescription per obs_id since they can differ.

Yes, indeed.

@maxnoe
Member

maxnoe commented Jan 21, 2026

> DL1b per tel_id will be merged in each OB via ctapipe-merge with combine-telescope-events.
> Merged DL1b files for each OB will be processed via ctapipe-process enabling --write-showers (or this recompute flag of DL2).

An alternative approach here would be not to merge, but read multiple per-telescope DL1 files during the DL1 to DL2 step. This might be the more flexible approach and it doesn't require an extra step and copying around data.
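
As a rough illustration of reading per-telescope inputs side by side instead of merging them on disk, the sketch below groups rows from several telescope streams by event ID. It assumes each stream is already sorted by event_id and yields (event_id, tel_id, payload) tuples; `iter_array_events` is a hypothetical name, not an existing ctapipe reader.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def iter_array_events(*telescope_streams):
    """Yield (event_id, {tel_id: payload}) from per-telescope streams.

    Each stream must be sorted by event_id and yield
    (event_id, tel_id, payload) tuples.
    """
    # heapq.merge lazily interleaves the sorted streams without copying them.
    merged = heapq.merge(*telescope_streams, key=itemgetter(0))
    for event_id, group in groupby(merged, key=itemgetter(0)):
        yield event_id, {tel_id: payload for _, tel_id, payload in group}


# Hypothetical per-telescope inputs for tel_id 1 and tel_id 2.
tel1 = [(1, 1, "a"), (3, 1, "b")]
tel2 = [(1, 2, "c"), (2, 2, "d")]
events = list(iter_array_events(tel1, tel2))
```

The streams are consumed lazily, so no merged intermediate file is written and no extra copy of the data is kept.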

TjarkMiener force-pushed the combine_tel_events branch 2 times, most recently from 8cc1b2a to 5b8e7d0 January 21, 2026 12:44
TjarkMiener requested a review from kosack January 21, 2026 12:57
@TjarkMiener
Member Author

I have implemented the suggestions and discussion points above. This PR is now ready for review! @Voutsi @maxnoe @mexanick @kosack

@maxnoe
Member

maxnoe commented Jan 26, 2026

The docs build is fixed in main, please update

@ctao-sonarqube

Quality Gate failed

Failed conditions
3 New issues


@TjarkMiener
Member Author

> The docs build is fixed in main, please update

Thanks @maxnoe, it is passing now. Regarding the SonarQube analysis, can you please have a look and let me know whether the issues are acceptable or whether I should resolve them?

@maxnoe
Member

maxnoe commented Jan 26, 2026

> A second motivation comes from CTLearn, where some stereo deep learning models operate on lower-level data and combine information from multiple telescopes early in the analysis chain. While it is possible to redesign the readers to explicitly handle multiple telescope-wise files, this quickly becomes more complex, whereas allowing merged telescope-wise inputs simplifies the workflow.

I don't think adding a "foot-gun" level feature to ctapipe, where we might create broken subarray descriptions and non-matching trigger and dl2 tables is warranted for that.

Where do these telescope-wise files come from? Why do they need to be merged? Why merge at all?

I think we need to solve the case of per-telescope DL1 input files, and not by just trying to merge them together. This only results in unnecessary data copies and possibly inconsistent files.

@TjarkMiener
Member Author

> where we might create broken subarray descriptions and non-matching trigger and dl2 tables

I don’t think this is possible with the current implementation. Can you think of a specific misuse of the component that would violate the assumptions of the data model? DL2 is explicitly excluded when using --combine-telescope-data for this reason. The feature is also restricted to the single-OB case, and we keep track of which telescope IDs are merged. The process fails as soon as duplicated tel_ids are detected. Both the subarray and telescope trigger tables are handled consistently, and the shower simulation table is updated accordingly.
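
The two guards described here (single OB only, no duplicated tel_ids) can be illustrated with a small stand-alone sketch. This is not the code from the PR: `combine_telescope_inputs` and the input tuples are hypothetical, and the real implementation raises its own exception types rather than ValueError.

```python
def combine_telescope_inputs(inputs):
    """Validate telescope-wise inputs before combining them.

    inputs: iterable of (filename, obs_id, tel_ids) tuples (hypothetical layout).
    Returns the sorted list of combined telescope IDs.
    """
    obs_ids = set()
    seen = {}  # tel_id -> filename that first provided it
    for filename, obs_id, tel_ids in inputs:
        obs_ids.add(obs_id)
        if len(obs_ids) > 1:
            raise ValueError(f"{filename}: inputs span several OBs {sorted(obs_ids)}")
        for tel_id in tel_ids:
            if tel_id in seen:
                raise ValueError(
                    f"{filename}: tel_id {tel_id} already provided by {seen[tel_id]}"
                )
            seen[tel_id] = filename
    return sorted(seen)


# Hypothetical per-telescope files from the same OB.
combined = combine_telescope_inputs(
    [("tel1.h5", 7, [1]), ("tel2.h5", 7, [2]), ("tel3.h5", 7, [3])]
)
```

Failing on the first duplicate, rather than silently overwriting, keeps a partially combined output from ever being written.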



Development

Successfully merging this pull request may close these issues.

Horizontal merging of DL1 files

5 participants