Refactor DL2EventLoader to use FeatureGenerator by kosack · Pull Request #2919 · cta-observatory/ctapipe

kosack · 2026-01-19T13:00:13Z

This fixes #2918 and was triggered by it. Configuration variables must be traitlets, not complex classes without text serialization. More explicitly, the option output_table_schema was a List[astropy.table.Column], which is not allowed (List is a traitlet, but Column is not). This PR removes output_table_schema from the config and makes it an instance variable that can be set in the constructor.
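
To illustrate why a list of Column objects cannot live in the configuration, here is a minimal sketch that uses `json` as a stand-in for the text serialization applied when a config is recorded. `FakeColumn` is a hypothetical placeholder for astropy.table.Column, and `is_text_serializable` is a helper invented for this example; neither is part of ctapipe or traitlets.

```python
import json


class FakeColumn:
    """Hypothetical stand-in for a complex object such as astropy.table.Column."""

    def __init__(self, name, unit):
        self.name = name
        self.unit = unit


def is_text_serializable(value):
    """Return True if `value` survives a round trip through JSON text."""
    try:
        return json.loads(json.dumps(value)) == value
    except TypeError:
        # json.dumps raises TypeError for objects it cannot encode
        return False


# A list of plain strings (the analogue of List(Unicode)) serializes fine.
column_names = ["obs_id", "event_id", "reco_energy"]

# A list of complex objects (the analogue of List[Column]) does not.
column_schema = [FakeColumn("reco_energy", "TeV")]

print(is_text_serializable(column_names))
print(is_text_serializable(column_schema))
```

The same reasoning is why a plain list of column names is a safe config option while a list of column objects is not.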

However, while removing this configuration option it became clear that the API for this class could be improved significantly, in a way that removes the need for output_table_schema altogether.

Refactoring:

Refactor DL2EventPreprocessor to use FeatureGenerator and do both event-selection and final column selection. This is a large API change, but simplifies the code significantly. The new version will have a feature_set option that pre-configures the class for different common use cases, reducing the need for complex configuration files (which are still allowed).
- Implement simulation use case (the original use case for this class)
- Implement observation use case (processing DL2 subarray events)
- Implement inter/intra telescope calibration use cases, at least allowing for per-telescope DL2 data. Since I'm not very familiar with this UC, and it can be achieved by using feature_set=custom and specifying everything in a config file, it may not be necessary.
Add class irf.EventWeighter to do event weighting, which should not be something done in io.
Refactor DL2EventLoader to use the new DL2EventPreprocessor
Refactor IRFTool and OptimizeTool to use the new classes.

In the end, this refactoring will create a much simpler workflow, but I think perhaps it doesn't all belong in IO, since only the reading and stacking of the chunks is really an "I/O" operation. The workflow in this refactored version will be, for example for an IRF production tool:

DL2EventPreprocessor:

(raw dl2 table) → DL2FeatureGenerator → QualityQuery → [select output columns] → (processed dl2 table)

Which means DL2EventLoader is just:

(input file) → TableLoader → [loop over chunks] → DL2EventPreprocessor → [merge chunks] → (processed dl2 table)

For IRF processing, a final step of

(processed dl2 table) → RadialEventWeighter → (processed dl2 table with weights)

is used to split events into FOV bins and do spectral re-weighting.
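
The chain of steps above can be sketched with plain Python dictionaries standing in for tables. This is only an illustration of the data flow (feature generation, quality selection, column selection, then merging of chunks); all function names here are hypothetical and do not reproduce the ctapipe classes.

```python
def generate_features(table):
    """Add derived columns (here only a rename-style copy)."""
    out = dict(table)
    out["reco_energy"] = list(table["ExtraTreesRegressor_energy"])
    return out


def quality_query(table):
    """Keep only rows with a positive reconstructed energy."""
    keep = [i for i, energy in enumerate(table["reco_energy"]) if energy > 0]
    return {name: [col[i] for i in keep] for name, col in table.items()}


def select_columns(table, names):
    """Keep only the requested output columns, in the requested order."""
    return {name: table[name] for name in names}


def preprocess(table, output_columns=("event_id", "reco_energy")):
    """(raw table) -> features -> quality cuts -> column selection."""
    return select_columns(quality_query(generate_features(table)), output_columns)


def merge_chunks(chunks):
    """Concatenate processed chunks column by column."""
    merged = {}
    for chunk in chunks:
        for name, col in chunk.items():
            merged.setdefault(name, []).extend(col)
    return merged


def load_events(chunks):
    """Loop over chunks, preprocess each one, and merge the results."""
    return merge_chunks(preprocess(chunk) for chunk in chunks)


chunks = [
    {"event_id": [1, 2], "ExtraTreesRegressor_energy": [1.5, -1.0]},
    {"event_id": [3], "ExtraTreesRegressor_energy": [0.7]},
]
events = load_events(chunks)
print(events)
```

Only `load_events` touches the chunk iteration, which is the part that is really I/O; everything inside `preprocess` is independent of where the table came from.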

configuration variables must be traitlets, not complex classes without text serialization. This removes it from the config and makes it a package variable. Fixes #2918

The fixture `test_config` was too generic a name, so renamed to `dl2_event_loader_config`

kosack · 2026-01-19T14:53:36Z

Looking deeper at this code, it's somewhat fragile, and confusing. I think it may need some restructuring to support the use case of adding new columns. Having to specify an output schema is a bit heavy - it would be better to just have a list of columns to take from the input files and carry over to the reduced table, right? I can refactor this if that sounds ok. In that case, I would remove the whole table schema object and interface, and just replace it with two config options:

columns_to_write: List[Unicode]
columns_to_rename (as before)

So you would configure it like:

DL2EventPreprocessor:
    columns_to_rename:
        ExtraTreesRegressor_energy: reco_energy
        ExtraTreesClassifier_prediction: gh_score
        HillasReconstructor_alt: reco_alt
        HillasReconstructor_az: reco_az
    # These must be columns that exist in the input, or the name of renamed columns.
    columns_to_write:
        - obs_id
        - event_id
        - true_energy
        - true_az
        - true_alt
        - reco_energy
        - reco_az
        - reco_alt
        - pointing_az
        - pointing_alt
        - gh_score

I can think of even more simplifications, but that would be the basic one. In that case, it's super easy to configure for different cases. If you want to include some other column like ExtraTreesRegressor_tel_energy in the final output, you would just add it to columns_to_write.
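
A minimal sketch of how these two options could behave, assuming that renamed columns are made available alongside the input columns and that unknown names in columns_to_write are an error. `reduce_table` is a hypothetical helper written for this example, with dictionaries standing in for tables.

```python
def reduce_table(table, columns_to_rename, columns_to_write):
    """Apply the renames, then keep only the columns listed for writing."""
    available = dict(table)
    for old_name, new_name in columns_to_rename.items():
        available[new_name] = table[old_name]

    # Every requested column must be an input column or a renamed one.
    missing = [name for name in columns_to_write if name not in available]
    if missing:
        raise KeyError(f"unknown columns: {missing}")

    return {name: available[name] for name in columns_to_write}


table = {
    "event_id": [1, 2],
    "ExtraTreesRegressor_energy": [1.5, 0.7],
    "ExtraTreesRegressor_tel_energy": [1.4, 0.8],
}
reduced = reduce_table(
    table,
    columns_to_rename={"ExtraTreesRegressor_energy": "reco_energy"},
    columns_to_write=["event_id", "reco_energy", "ExtraTreesRegressor_tel_energy"],
)
print(list(reduced))
```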

kosack · 2026-01-19T15:02:13Z

I'm also thinking a bit forward to a missing interface: event-wise algorithm selection, so I might also add an abstract interface for that and a default implementation that selects a single algorithm. That would replace the whole "column renaming" interface with something more flexible, since that is what the renaming is doing: selecting which column to use for the "final" DL3 output columns. But that could be in a future PR.

maxnoe · 2026-01-19T15:51:54Z

Yes, this should never have been a traitlet of type List(Column), at least not without defining a traitlet wrapper for Column.

kosack · 2026-01-20T14:03:46Z

Actually I have an even better API than my suggestion above: what the DL2EventPreprocessor does is nearly exactly what we have in the ML training tools, so it would be best to use exactly the same API!:

renamed columns are just new features, so we can just use a FeatureGenerator
output columns are the final list of features.

DL2EventPreprocessor:
    FeatureGenerator: 
        features:
            # the first few are just renamings
            - ["reco_energy", "ExtraTreesRegressor_energy"]
            - ["gh_score", "ExtraTreesClassifier_prediction"]
            - ["reco_alt", "HillasReconstructor_alt"]
            - ["reco_az", "HillasReconstructor_az"]
            # can even get rid of the hard-coded computed columns, since we can do math
            - ["theta", "angular_separation(reco_az, reco_alt, pointing_az, pointing_alt)"]
            
    features:
        - obs_id
        - event_id
        - true_energy
        - true_az
        - true_alt
        - reco_energy
        - reco_az
        - reco_alt
        - pointing_az
        - pointing_alt
        - gh_score
        
    QualityQuery:
        quality_criteria:
            - ...

This config then looks almost identical to e.g. train_energy_regressor.yaml. The code can be refactored to be also very similar to the code in the training tool, making it much simpler.
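
To make the idea of "renamings are just features" concrete, here is a scalar sketch in which each [name, expression] pair is evaluated in order, so later expressions can use earlier features. It assumes a restricted `eval` and a haversine-based `angular_separation` written for this example (angles in radians); the real FeatureGenerator works on table columns with units and may be implemented differently.

```python
import math


def angular_separation(lon1, lat1, lon2, lat2):
    """Great-circle separation in radians (haversine formula)."""
    sin_dlat = math.sin((lat2 - lat1) / 2)
    sin_dlon = math.sin((lon2 - lon1) / 2)
    a = sin_dlat**2 + math.cos(lat1) * math.cos(lat2) * sin_dlon**2
    return 2 * math.asin(math.sqrt(a))


def generate_features(row, features):
    """Evaluate (name, expression) pairs in order on a single event."""
    namespace = dict(row)
    # Only the listed functions are visible to the expressions.
    functions = {"angular_separation": angular_separation, "__builtins__": {}}
    for name, expression in features:
        namespace[name] = eval(expression, functions, namespace)
    return namespace


features = [
    # plain renamings
    ("reco_az", "HillasReconstructor_az"),
    ("reco_alt", "HillasReconstructor_alt"),
    # a computed column that uses the renamed ones
    ("theta", "angular_separation(reco_az, reco_alt, pointing_az, pointing_alt)"),
]

row = {
    "HillasReconstructor_az": 0.3,
    "HillasReconstructor_alt": 1.2,
    "pointing_az": 0.3,
    "pointing_alt": 1.3,
}
event = generate_features(row, features)
print(event["theta"])
```

Because the features are evaluated in list order, the renamings have to appear before any expression that refers to the new names.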

LukasBeiske · 2026-01-20T16:04:21Z

I think that's a very nice idea. Using a FeatureGenerator would also help with the somewhat ugly calculation of the event multiplicity in #2789.

kosack · 2026-01-22T17:16:19Z

I've currently replaced all the generated columns in the old implementation with:

FeatureGenerator:
  features:
    - [reco_energy    , "RandomForestRegressor_energy"]
    - [reco_alt       , "HillasReconstructor_alt"]
    - [reco_az        , "HillasReconstructor_az"]
    - [gh_score       , "RandomForestClassifier_prediction"]
    - [theta          , "angular_separation(reco_az, reco_alt, true_az, true_alt)"]
    - [reco_fov_coord , "altaz_to_fov(reco_az, reco_alt, subarray_pointing_lon, subarray_pointing_lat)"]
    - [reco_fov_lon   , "reco_fov_coord[:,0]"]
    - [reco_fov_lat   , "reco_fov_coord[:,1]"]
    - [true_fov_coord , "altaz_to_fov(true_az, true_alt, subarray_pointing_lon, subarray_pointing_lat)"]
    - [true_fov_lon   , "true_fov_coord[:,0]"]
    - [true_fov_lat   , "true_fov_coord[:,1]"]
    - [true_fov_offset, "angular_separation(reco_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"]
    - [reco_fov_offset, "angular_separation(true_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"]

- gammaness_classifier -> gammaness_reconstructor (to be consistent)
- added columns for obs_id, event_id in output of simulation
- fixed typo in multiplicity calc

The metadata is now copied to the new table, not shallow copied, since new fields may be added to the new table

maxnoe · 2026-01-24T14:48:10Z

+
+    target_spectrum_name = traits.UseEnum(
+        Spectra,
+        default_value=Spectra.CRAB_HEGRA,


By allowing passing in a callable and restricting to UseEnum with a default here, we end up in the situation where we store the target spectrum CRAB_HEGRA in the config, but actually the passed in callable was used.

Good point. I could just restrict this to pre-defined spectra and not allow an arbitrary one to be passed in. The ability to use arbitrary spectra was mostly for testing. It might also be useful for calibration or non-gamma-ray studies, but those could be supported later by adding more spectra to the enum. For testing I could just add a Spectra.FLAT option to test the no-op case.

Instead of the enum, we could make traitlets wrappers for the spectra themselves.

spectrum_cls: PowerLaw
PowerLaw:
    index: 2.0
    normalization:
        value: 5e-10
        unit: "m^-2 s^-1 TeV^-1"

To still allow setting pre-configured ones, you could add a spectrum_name with default None that updates the values from pre-defined ones.

That would be a nice separate PR I think. I'll open an issue. For now I'll just restrict to the pre-defined names, since that is what was there before.
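
As a rough sketch of the suggestion in this thread, the following shows a plain-data spectrum description plus a spectrum_name that defaults to None and, when set, overrides the configured values from a preset table. `PowerLawConfig`, `PRESETS`, and `resolve_spectrum` are hypothetical names, and the FLAT numbers are placeholders rather than real spectra; the point is only that the object that gets recorded is the one that is actually used.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PowerLawConfig:
    """Plain-data description of a power-law spectrum (hypothetical)."""

    index: float
    normalization: float
    unit: str = "m^-2 s^-1 TeV^-1"


# Hypothetical preset table. FLAT mirrors the no-op test case mentioned
# above; its numbers are placeholders chosen for this example.
PRESETS = {"FLAT": {"index": 0.0, "normalization": 1.0}}


def resolve_spectrum(configured, spectrum_name=None):
    """Return the spectrum that is actually used, so it can be recorded."""
    if spectrum_name is None:
        return configured
    # A named preset updates the configured values.
    return replace(configured, **PRESETS[spectrum_name])


configured = PowerLawConfig(index=2.0, normalization=5e-10)
print(resolve_spectrum(configured))
print(resolve_spectrum(configured, "FLAT"))
```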

maxnoe · 2026-01-24T14:49:11Z

+        source_spectrum = self.source_spectrum
+        if self.is_diffuse:
+            source_spectrum = source_spectrum.integrate_cone(
+                0 * u.deg, self.fov_offset_max


We need a fov_offset_min here, e.g. to compute sensitivity in fov offset bands.

That is already what the RadialEventWeighter implementation does, but I suppose this Simple implementation with no offset binning is really just the same with only one bin, so maybe this class is not necessary. The old code supported a case with no offset binning, which is where this comes from, but it's not clear whether it was ever used. I have to check why it was there.
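
To show why a fov_offset_min matters, here is a small sketch of the geometry, assuming an isotropic diffuse flux so that integrating over a region reduces to multiplying by its solid angle. The helper names are made up for this example and angles are in radians; with an inner angle of zero the annulus reduces to the single-cone case.

```python
import math


def cone_solid_angle(half_angle):
    """Solid angle (sr) of a cone with the given half-opening angle (rad)."""
    return 2 * math.pi * (1 - math.cos(half_angle))


def annulus_solid_angle(inner, outer):
    """Solid angle (sr) between two cones, i.e. one FOV offset band."""
    if not 0 <= inner <= outer:
        raise ValueError("need 0 <= inner <= outer")
    return cone_solid_angle(outer) - cone_solid_angle(inner)


# Two offset bands, 0-1 deg and 1-2 deg, together cover the 2 deg cone.
one_deg, two_deg = math.radians(1.0), math.radians(2.0)
inner_band = annulus_solid_angle(0.0, one_deg)
outer_band = annulus_solid_angle(one_deg, two_deg)
print(inner_band, outer_band, cone_solid_angle(two_deg))
```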

maxnoe · 2026-01-24T14:50:58Z

- [true_fov_offset, "angular_separation(reco_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"]
- [reco_fov_offset, "angular_separation(true_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"]

Names and used variables here are not consistent!
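
The inconsistency can be checked mechanically. The sketch below assumes the convention that a feature named true_* should only use true_* inputs and a feature named reco_* only reco_* inputs, which is presumably what was intended; `mismatched_inputs` is a hypothetical helper, and the corrected expressions are a guess at the fix, not the change actually committed.

```python
import re

features = [
    ("true_fov_offset", "angular_separation(reco_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"),
    ("reco_fov_offset", "angular_separation(true_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"),
]


def mismatched_inputs(name, expression):
    """List true_*/reco_* variables whose prefix differs from the feature's."""
    prefix = name.split("_", 1)[0]  # "true" or "reco"
    inputs = re.findall(r"\b(?:true|reco)_\w+", expression)
    return [var for var in inputs if not var.startswith(prefix + "_")]


for name, expression in features:
    print(name, mismatched_inputs(name, expression))

# Presumed correction: each offset uses the coordinates of the same kind.
corrected = [
    ("true_fov_offset", "angular_separation(true_fov_lon, true_fov_lat, 0*u.deg, 0*u.deg)"),
    ("reco_fov_offset", "angular_separation(reco_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"),
]
```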

Weighting should not be in ``ctapipe.io``, as it's really IRF specific. So I moved it to ctapipe.irf and created a EventWeighter class hierarchy for the different methods.

kosack · 2026-01-26T16:48:23Z

In refactoring this, I think we no longer really need DL2EventLoader at all. The DL2EventPreprocessor class is sufficient. Most of the complexity was to support several different use cases: IRF production, observed DL2 production, and inter/intra telescope calibration. With just DL2EventPreprocessor, I can reproduce what was done with DL2EventLoader using for example the following few lines:

from astropy.table import QTable, vstack
from ctapipe.io import TableLoader, EventPreprocessor

# DL2FILE is the path to the input DL2 file
loader = TableLoader(DL2FILE, dl2=True, simulated=True, observation_info=True)
preprocess = EventPreprocessor(feature_set="dl2_simulation")

# preprocess each chunk, then stack the results into one table
events = vstack(
    [
        preprocess(QTable(c.data))
        for c in loader.read_subarray_events_chunked(chunk_size=100_000)
    ]
)

So I think it's clearer to just have these lines explicitly inside any Tool that needs them, rather than wrapping them with yet another Component. For the IRF code, for example, I would call this but add a call to RadialEventWeighter. For other use cases, you in any case need different options to TableLoader and DL2EventPreprocessor, so those should just be done explicitly for each use case.

Even better, for tools like inter-telescope calibration, where you probably don't even need to vstack all chunks and can operate at the chunk level, it's even more efficient.
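
A minimal sketch of the chunk-level alternative, with lists of energies standing in for table chunks and a hypothetical `preprocess` step: a quantity such as a mean is accumulated while iterating, so no merged table is ever held in memory.

```python
def preprocess(chunk):
    """Hypothetical per-chunk preprocessing: drop non-positive energies."""
    return [energy for energy in chunk if energy > 0]


def running_mean(chunks):
    """Accumulate a mean over chunks without stacking them."""
    total, count = 0.0, 0
    for chunk in chunks:
        selected = preprocess(chunk)
        total += sum(selected)
        count += len(selected)
    return total / count if count else float("nan")


# An iterator stands in for a chunked reader; each chunk is seen only once.
chunks = iter([[1.0, -1.0, 3.0], [2.0]])
mean = running_mean(chunks)
print(mean)
```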

kosack · 2026-01-27T08:35:25Z

Since this refactoring is getting large, and I don't want to break all the working code, I will split this into a few PRs: one that just does the minimal fix here to remove the bad config trait, and others that add the new implementation (EventWeighter, EventPreprocessor). Then I can slowly replace code that uses the DL2EventLoader with this new code.

That also makes it easier for @Voutsi and @mdebony to use the new code in calibration and DL3 production, without breaking the IRF code, and the new classes are much more flexible for those use cases.

I would suggest that we deprecate the current DL2EventLoader, which mixes too many things together. It was fine as an internal part of the IRF module, but now that it is being used in a more general way, it is trying to do too much and mixes science algorithms and IO.

The new classes I have created should make the calibration code very easy to implement without tight coupling to the IRF code. Later, I'll also refactor the IRF module to use them as well, but that will be a second step. I started that here, but found it was too large a change for one PR, and while the refactoring will make the code more maintainable, it will also likely introduce bugs temporarily that I want to avoid in the short term.

kosack added 2 commits January 19, 2026 13:57

Remove DL2EventLoader.output_table_schema config

6a990fc

configuration variables must be traitlets, not complex classes without text serialization. This removes it from the config and makes it a package variable. Fixes #2918

add changelog

3940d93

kosack force-pushed the remove_columns_config branch from 73d1e87 to 3940d93 Compare January 19, 2026 13:06

kosack added 5 commits January 19, 2026 14:59

make output_table_schema a constructor option

d29f64c

add typehint

94dd6df

removed column schema from test_config and renamed

bad000e

The fixture `test_config` was too generic a name, so renamed to `dl2_event_loader_config`

removed redundant code

eb52694

added required columns for default renaming scheme

40b968e

This was referenced Jan 20, 2026

Provenance crashes because it cannot serialise Astropy columns #2918

Open

Fix FeatureGenerator gives wrong units #2921

Merged

kosack mentioned this pull request Jan 20, 2026

Allow specifying column metadata for features in FeatureGenerator #2922

Open

kosack added 6 commits January 21, 2026 17:34

added altaz_to_fov helper function

e4cf0f9

added test for altaz_to_fov

6bef7a0

started refactoring DL2EventPreprocessor

2072b7b

updated docstring

f023614

add test

b93bab8

improved computation of fov offsets

29d0a46

kosack changed the title ~~Remove DL2EventLoader.output_table_schema config~~ Refactor DL2EventLoader to use FeatureGenerator Jan 22, 2026

kosack marked this pull request as draft January 22, 2026 16:59

kosack added 5 commits January 22, 2026 19:04

cleaned up feature_set and removed old code

57d3a52

improved some tests and sanity checks

d6e8b29

minor cleanup of tests

a16609c

renamed parameter, update simulation feature_set

bcf428b

- gammaness_classifier -> gammaness_reconstructor (to be consistent)
- added columns for obs_id, event_id in output of simulation
- fixed typo in multiplicity calc

added test for using different reconstructors

238d619

fix bug in _shallow_copy_table that lost metadata

949cd82

The metadata is now copied to the new table, not shallow copied, since new fields may be added to the new table

maxnoe reviewed Jan 24, 2026

maxnoe reviewed Jan 26, 2026

Comment thread src/ctapipe/irf/event_weighter.py Outdated

Refactor event weighting into irf.EventWeighter

6af3bc3

Weighting should not be in ``ctapipe.io``, as it's really IRF specific. So I moved it to ctapipe.irf and created a EventWeighter class hierarchy for the different methods.

kosack force-pushed the remove_columns_config branch from bd51492 to 6af3bc3 Compare January 26, 2026 13:14

added Phi binning (for future impl)

4684eb6

kosack added 4 commits January 26, 2026 18:00

renamed DL2EventPreprocessor to EventPreprocessor

9072a3c

rename test

ac38fc2

renamed to event_preprocessor.py

63a5d89

rename dl2_simulation feature set to dl2_irf

441fe50

Since that is the real use case

This was referenced Jan 27, 2026

Feature: EventWeighter #2927

Open

Feature: EventPreprocessor #2928

Merged
