Refactor DL2EventLoader to use FeatureGenerator #2919

Draft
kosack wants to merge 25 commits into main from remove_columns_config

@kosack
Member

@kosack kosack commented Jan 19, 2026

This fixes #2918, which also prompted it. Configuration variables must be traitlets, not complex classes without text serialization. This PR removes the offending option from the config and makes it an instance variable that can be set in the constructor. More explicitly, the option output_table_schema was a List[astropy.table.Column], which is not allowed (List is a traitlet, but Column is not).
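
As a rough illustration of why a list of arbitrary objects cannot live in a text-serialized configuration, here is a minimal standard-library sketch. `FakeColumn` is a hypothetical stand-in for `astropy.table.Column`, and `json.dumps` is only assumed to behave like the serializer that records the configuration; this is not the actual ctapipe code path.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeColumn:
    """Hypothetical stand-in for a complex object such as astropy.table.Column."""

    name: str
    unit: str


def try_serialize(config: dict) -> str:
    """Return the JSON text, or a short error description if serialization fails."""
    try:
        return json.dumps(config)
    except TypeError as err:
        return f"not serializable: {err}"


# A list of plain strings has an obvious text representation.
ok = try_serialize({"columns_to_write": ["obs_id", "event_id"]})

# A list of arbitrary objects does not, so the serializer gives up.
bad = try_serialize({"output_table_schema": [FakeColumn("obs_id", "")]})

print(ok)
print(bad)
```

Options made of strings, numbers, and lists of those serialize cleanly, which is why keeping the schema out of the configuration sidesteps the problem.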

However, when removing this configuration option it was clear that the API for this class could be significantly improved in such a way as to remove the need for the output_table_schema at all.

Refactoring:

  • Refactor DL2EventPreprocessor to use FeatureGenerator and do both event selection and final column selection. This is a large API change, but simplifies the code significantly. The new version will have a feature_set option that pre-configures the class for different common use cases, reducing the need for complex configuration files (which are still allowed).
    • Implement simulation use case (the original use case for this class)
    • Implement observation use case (processing DL2 subarray events)
    • Implement inter/intra telescope calibration use cases, at least allowing for per-telescope DL2 data. Since I'm not very familiar with this use case, and it can be achieved by using feature_set=custom and specifying everything in a config file, it may not be necessary.
  • Add class irf.EventWeighter to do event weighting, which should not be something done in io.
  • Refactor DL2EventLoader to use the new DL2EventPreprocessor
  • Refactor IRFTool and OptimizeTool to use the new classes.

In the end, this refactoring will create a much simpler workflow, but I think perhaps not all of it belongs in IO, since only the reading and stacking of the chunks is really an "I/O" operation. The workflow in this refactored version will be, for example for an IRF production tool:

DL2EventPreprocessor:

(raw dl2 table) → DL2FeatureGenerator → QualityQuery → [select output columns] → (processed dl2 table)

Which means DL2EventLoader is just:

(input file) → TableLoader → [loop over chunks] → DL2EventPreprocessor → [merge chunks] → (processed dl2 table)

For IRF processing, a final step of

(processed dl2 table) → RadialEventWeighter → (processed dl2 table with weights)

is used to split events into FOV bins and do spectral re-weighting.
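
The two flows above can be sketched with plain Python structures. This is only an illustration under stated assumptions: a "table" is a dict of column lists standing in for an astropy table, every function name below is hypothetical, and the features are plain callables rather than the expression strings used by the real FeatureGenerator.

```python
# A "table" is column name -> list of values (stand-in for an astropy table).
# All names in this sketch are hypothetical.

FEATURES = {
    "reco_energy": lambda row: row["RandomForestRegressor_energy"],
    "gh_score": lambda row: row["RandomForestClassifier_prediction"],
}
CRITERIA = [lambda row: row["gh_score"] > 0.5]
COLUMNS = ["obs_id", "event_id", "reco_energy", "gh_score"]


def generate_features(table, features):
    """Append one column per feature; later features may use earlier ones."""
    out = {name: list(values) for name, values in table.items()}
    n_rows = len(next(iter(out.values()), []))
    for name, func in features.items():
        out[name] = [func({c: v[i] for c, v in out.items()}) for i in range(n_rows)]
    return out


def quality_query(table, criteria):
    """Keep only the rows that pass every criterion."""
    n_rows = len(next(iter(table.values()), []))
    keep = [
        i
        for i in range(n_rows)
        if all(passes({c: v[i] for c, v in table.items()}) for passes in criteria)
    ]
    return {c: [v[i] for i in keep] for c, v in table.items()}


def select_columns(table, columns):
    """Reduce the table to the requested output columns."""
    return {c: table[c] for c in columns}


def preprocess(chunk):
    """(raw table) -> features -> quality query -> column selection."""
    return select_columns(
        quality_query(generate_features(chunk, FEATURES), CRITERIA), COLUMNS
    )


def load_events(chunks):
    """Loop over chunks, preprocess each one, and merge the results."""
    merged = {c: [] for c in COLUMNS}
    for chunk in chunks:
        processed = preprocess(chunk)
        for c in COLUMNS:
            merged[c].extend(processed[c])
    return merged


chunks = [
    {
        "obs_id": [1, 1],
        "event_id": [10, 11],
        "RandomForestRegressor_energy": [0.5, 2.0],
        "RandomForestClassifier_prediction": [0.9, 0.2],
    },
    {
        "obs_id": [1],
        "event_id": [12],
        "RandomForestRegressor_energy": [3.0],
        "RandomForestClassifier_prediction": [0.8],
    },
]
events = load_events(chunks)
print(events)
```

Only the chunk loop and the merge in `load_events` are really I/O; everything inside `preprocess` is table manipulation, which matches the concern above about what belongs in the io module.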

configuration variables must be traitlets, not complex classes without text
serialization. This removes it from the config and makes it a package
variable.  Fixes #2918
@kosack kosack force-pushed the remove_columns_config branch from 73d1e87 to 3940d93 Compare January 19, 2026 13:06
@kosack
Member Author

kosack commented Jan 19, 2026

Looking deeper at this code, it's somewhat fragile, and confusing. I think it may need some restructuring to support the use case of adding new columns. Having to specify an output schema is a bit heavy - it would be better to just have a list of columns to take from the input files and carry over to the reduced table, right? I can refactor this if that sounds ok. In that case, I would remove the whole table schema object and interface, and just replace it with two config options:

  • columns_to_write: List[Unicode]
  • columns_to_rename (as before)

So you would configure it like:

DL2EventPreprocessor:
    columns_to_rename:
        ExtraTreesRegressor_energy: reco_energy
        ExtraTreesClassifier_prediction: gh_score
        HillasReconstructor_alt: reco_alt
        HillasReconstructor_az: reco_az
    # These must be columns that exist in the input, or the name of renamed columns.
    columns_to_write:
        - obs_id
        - event_id
        - true_energy
        - true_az
        - true_alt
        - reco_energy
        - reco_az
        - reco_alt
        - pointing_az
        - pointing_alt
        - gh_score

I can think of even more simplifications, but that would be the basic one. In that case, it's super easy to configure for different cases. If you want to include some other column like ExtraTreesRegressor_tel_energy in the final output, you would just add it to columns_to_write.

@kosack
Member Author

kosack commented Jan 19, 2026

I'm also thinking ahead to a missing interface: event-wise algorithm selection. I might add an abstract interface for that, plus a default implementation that selects a single algorithm. That would replace the whole "column renaming" interface with something more flexible, since that is what the renaming is really doing: selecting which column to use for the "final" DL3 output columns. But that could be in a future PR.

@maxnoe
Member

maxnoe commented Jan 19, 2026

Yes, this should never have been a traitlet of type List(Column), at least not without defining a traitlet wrapper for Column.

@kosack
Member Author

kosack commented Jan 20, 2026

Actually I have an even better API than my suggestion above: what the DL2EventPreprocessor does is nearly exactly what we have in the ML training tools, so it would be best to use exactly the same API!:

  • renamed columns are just new features, so we can just use a FeatureGenerator
  • output columns are the final list of features.

DL2EventPreprocessor:
    FeatureGenerator: 
        features:
            # the first few are just renamings
            - ["reco_energy", "ExtraTreesRegressor_energy"]
            - ["gh_score", "ExtraTreesClassifier_prediction"]
            - ["reco_alt", "HillasReconstructor_alt"]
            - ["reco_az", "HillasReconstructor_az"]
            # can even get rid of the hard-coded computed columns, since we can do math
            - ["theta", "angular_separation(reco_az, reco_alt, pointing_az, pointing_alt)"]
            
    features:
        - obs_id
        - event_id
        - true_energy
        - true_az
        - true_alt
        - reco_energy
        - reco_az
        - reco_alt
        - pointing_az
        - pointing_alt
        - gh_score
        
    QualityQuery:
        quality_criteria:
            - ...

This config then looks almost identical to e.g. train_energy_regressor.yaml. The code can be refactored to be also very similar to the code in the training tool, making it much simpler.
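
To make the "renamings are just features, and we can do math" idea concrete, here is a small sketch of expression-based feature generation. It is not how ctapipe's FeatureGenerator is implemented: the `generate_features` helper, the `ALLOWED` namespace, and the pure-Python `angular_separation` (degrees in, degrees out) are all assumptions for illustration, and `eval` is only acceptable here because the expressions come from a trusted config.

```python
import math


def angular_separation(lon1, lat1, lon2, lat2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    lon1, lat1, lon2, lat2 = (math.radians(x) for x in (lon1, lat1, lon2, lat2))
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


# Functions that expressions are allowed to call (hypothetical whitelist).
ALLOWED = {"angular_separation": angular_separation, "sqrt": math.sqrt}


def generate_features(table, features):
    """Evaluate each (name, expression) pair row by row and append the result.

    Features are applied in order, so later expressions can refer to
    columns created by earlier ones.
    """
    out = {name: list(values) for name, values in table.items()}
    n_rows = len(next(iter(out.values()), []))
    for name, expression in features:
        code = compile(expression, "<feature>", "eval")
        out[name] = [
            eval(code, {"__builtins__": {}}, {**ALLOWED, **{c: v[i] for c, v in out.items()}})
            for i in range(n_rows)
        ]
    return out


table = {
    "HillasReconstructor_alt": [70.0, 71.0],
    "HillasReconstructor_az": [180.0, 180.0],
    "pointing_alt": [70.0, 70.0],
    "pointing_az": [180.0, 180.0],
}
features = [
    ("reco_alt", "HillasReconstructor_alt"),  # a pure renaming
    ("reco_az", "HillasReconstructor_az"),
    ("theta", "angular_separation(reco_az, reco_alt, pointing_az, pointing_alt)"),
]
result = generate_features(table, features)
print(result["theta"])
```

A renaming is simply an expression that consists of a single column name, so one mechanism covers both cases.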

@LukasBeiske
Contributor

I think that's a very nice idea. Using a FeatureGenerator would also help with the somewhat ugly calculation of the event multiplicity in #2789.

@kosack kosack changed the title Remove DL2EventLoader.output_table_schema config Refactor DL2EventLoader to use FeatureGenerator Jan 22, 2026
@kosack kosack marked this pull request as draft January 22, 2026 16:59
@kosack
Member Author

kosack commented Jan 22, 2026

I've now replaced all the generated columns in the old implementation with:

FeatureGenerator:
  features:
    - [reco_energy    , "RandomForestRegressor_energy"]
    - [reco_alt       , "HillasReconstructor_alt"]
    - [reco_az        , "HillasReconstructor_az"]
    - [gh_score       , "RandomForestClassifier_prediction"]
    - [theta          , "angular_separation(reco_az, reco_alt, true_az, true_alt)"]
    - [reco_fov_coord , "altaz_to_fov(reco_az, reco_alt, subarray_pointing_lon, subarray_pointing_lat)"]
    - [reco_fov_lon   , "reco_fov_coord[:,0]"]
    - [reco_fov_lat   , "reco_fov_coord[:,1]"]
    - [true_fov_coord , "altaz_to_fov(true_az, true_alt, subarray_pointing_lon, subarray_pointing_lat)"]
    - [true_fov_lon   , "true_fov_coord[:,0]"]
    - [true_fov_lat   , "true_fov_coord[:,1]"]
    - [true_fov_offset, "angular_separation(reco_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"]
    - [reco_fov_offset, "angular_separation(true_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"]

- gammaness_classifier -> gammaness_reconstructor (to be consistent)
- added columns for obs_id,event_id in output f simulation
- fixed typo in multiplicity calc
The metadata is now copied to the new table, not shallow copied, since
new fields may be added to the new table

target_spectrum_name = traits.UseEnum(
    Spectra,
    default_value=Spectra.CRAB_HEGRA,
Member

By allowing a callable to be passed in while also restricting the trait to UseEnum with a default, we end up in a situation where the config stores the target spectrum CRAB_HEGRA even though the passed-in callable was actually used.

Member Author

Good point. I could just restrict this to pre-defined spectra and not allow an arbitrary one to be passed in. The ability to use arbitrary spectra was mostly for testing. It might also be useful for calibration or non-gamma-ray studies, but those could be supported later by adding more spectra to the enum. For testing I could just add a Spectra.FLAT option to test the no-op case.
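
A minimal sketch of the enum-only approach, to show how the recorded configuration and the spectrum actually used stay in sync. Everything here is an assumption for illustration: `WeighterSketch` is a hypothetical class, the power-law numbers are placeholders rather than real spectral parameters, and FLAT is interpreted as "leave all weights at 1".

```python
from enum import Enum


class Spectra(Enum):
    """Named target spectra; FLAT is the hypothetical no-op member."""

    CRAB_HEGRA = "crab_hegra"
    FLAT = "flat"


def power_law(normalization, index, e_ref=1.0):
    """Return f(E) = normalization * (E / e_ref) ** index."""
    return lambda energy: normalization * (energy / e_ref) ** index


# Placeholder numbers only, not the real spectral parameters.
TARGET_SPECTRA = {
    Spectra.CRAB_HEGRA: power_law(1.0, -2.6),
    Spectra.FLAT: None,  # assumption: FLAT means "no re-weighting"
}


class WeighterSketch:
    """Hypothetical weighter that accepts only a named spectrum, never a callable."""

    def __init__(self, target_spectrum_name=Spectra.CRAB_HEGRA):
        self.target_spectrum_name = Spectra(target_spectrum_name)

    def recorded_config(self):
        """What would be stored in the config: always the spectrum in use."""
        return {"target_spectrum_name": self.target_spectrum_name.name}

    def weights(self, energies, simulated_spectrum):
        """Weight each event by target(E) / simulated(E)."""
        target = TARGET_SPECTRA[self.target_spectrum_name]
        if target is None:
            return [1.0 for _ in energies]
        return [target(e) / simulated_spectrum(e) for e in energies]


simulated = power_law(1.0, -2.0)
energies = [1.0, 10.0]

flat = WeighterSketch(Spectra.FLAT)
crab = WeighterSketch()

flat_weights = flat.weights(energies, simulated)
crab_weights = crab.weights(energies, simulated)
print(flat.recorded_config(), flat_weights)
print(crab.recorded_config(), crab_weights)
```

Because there is no second way to supply a spectrum, the stored name cannot disagree with what was applied.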

Member

Instead of the enum, we could make traitlets wrappers for the spectra themselves.

spectrum_cls: PowerLaw
PowerLaw:
   index: 2.0
   normalization:
      value: 5e-10
      unit: "m^-2 s^-1 TeV^-1"

Member

To still allow setting pre-configured ones, you could add a spectrum_name with default None that updates the values from pre-defined ones.

Member Author

That would be a nice separate PR I think. I'll open an issue. For now I'll just restrict to the pre-defined names, since that is what was there before.

source_spectrum = self.source_spectrum
if self.is_diffuse:
    source_spectrum = source_spectrum.integrate_cone(
        0 * u.deg, self.fov_offset_max
Member

We need a fov_offset_min here, e.g. to compute sensitivity in fov offset bands.

Member Author

That is what the RadialEventWeighter implementation already does, but I suppose this Simple implementation with no offset binning is really just the same thing with only one bin, so maybe this class is not necessary. The old code supported a case with no offset binning, which is where this comes from, but it's not clear if it was ever used. I have to check why it was there.
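
For reference, the geometric factor behind integrating a diffuse flux over an offset band is the solid angle between an inner and an outer cone, Ω = 2π (cos θ_min − cos θ_max). The sketch below uses plain floats in degrees and hypothetical function names; it only illustrates why a single bin starting at 0 is the special case of the banded version.

```python
import math


def cone_solid_angle(half_angle_deg):
    """Solid angle (sr) of a cone with the given half-opening angle in degrees."""
    return 2 * math.pi * (1 - math.cos(math.radians(half_angle_deg)))


def annulus_solid_angle(fov_offset_min_deg, fov_offset_max_deg):
    """Solid angle (sr) between two cones: 2*pi*(cos(min) - cos(max))."""
    if not 0 <= fov_offset_min_deg <= fov_offset_max_deg:
        raise ValueError("require 0 <= fov_offset_min <= fov_offset_max")
    return cone_solid_angle(fov_offset_max_deg) - cone_solid_angle(fov_offset_min_deg)


# Offset bands of 1 degree width out to 3 degrees.
edges = [0.0, 1.0, 2.0, 3.0]
bands = [annulus_solid_angle(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]

# With fov_offset_min = 0 the band is just the full cone.
full_cone = annulus_solid_angle(0.0, 3.0)
print(bands, full_cone)
```

The bands add up to the full cone, so the no-binning case is recovered by using one band from 0 to fov_offset_max.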

@maxnoe
Member

maxnoe commented Jan 24, 2026

- [true_fov_offset, "angular_separation(reco_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"]
- [reco_fov_offset, "angular_separation(true_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"]

Names and used variables here are not consistent!

Comment thread src/ctapipe/irf/event_weighter.py Outdated
Weighting should not be in ``ctapipe.io``, as it's really IRF specific. So I
moved it to ctapipe.irf and created an EventWeighter class hierarchy for the
different methods.
@kosack kosack force-pushed the remove_columns_config branch from bd51492 to 6af3bc3 Compare January 26, 2026 13:14
@kosack
Member Author

kosack commented Jan 26, 2026

In refactoring this, I think we no longer really need DL2EventLoader at all. The DL2EventPreprocessor class is sufficient. Most of the complexity was to support several different use cases: IRF production, observed DL2 production, and inter/intra telescope calibration. With just DL2EventPreprocessor, I can reproduce what was done with DL2EventLoader using for example the following few lines:

from astropy.table import QTable, vstack
from ctapipe.io import TableLoader, EventPreprocessor

# DL2FILE is the path to the input DL2 file
loader = TableLoader(DL2FILE, dl2=True, simulated=True, observation_info=True)
preprocess = EventPreprocessor(feature_set="dl2_simulation")

# preprocess each chunk, then stack the results into one table
events = vstack(
    [
        preprocess(QTable(c.data))
        for c in loader.read_subarray_events_chunked(chunk_size=100_000)
    ]
)

So I think it's clearer to just have these lines explicitly inside any Tool that needs them, rather than wrapping them with yet another Component. For the IRF code, for example, I would call this but add a call to RadialEventWeighter. For other use cases, you need different options for TableLoader and DL2EventPreprocessor anyway, so those should just be set explicitly for each use case.

Even better, for tools like inter-telescope calibration, where you probably don't even need to vstack all chunks and can operate at the chunk level, this is even more efficient.
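
As a sketch of what operating at the chunk level could look like, the example below keeps running sums per telescope instead of stacking all chunks in memory. The column names `tel_id` and `reco_energy`, the `per_telescope_mean` helper, and the choice of a per-telescope mean as the quantity of interest are all assumptions for illustration; chunks are plain dicts of lists standing in for preprocessed tables.

```python
from collections import defaultdict


def per_telescope_mean(chunks, key="tel_id", value="reco_energy"):
    """Accumulate a per-telescope mean chunk by chunk, without stacking."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chunk in chunks:  # chunks may be a lazy generator
        for tel, val in zip(chunk[key], chunk[value]):
            sums[tel] += val
            counts[tel] += 1
    return {tel: sums[tel] / counts[tel] for tel in sums}


def fake_chunks():
    """Hypothetical stand-in for a chunked reader yielding preprocessed tables."""
    yield {"tel_id": [1, 2], "reco_energy": [1.0, 3.0]}
    yield {"tel_id": [1], "reco_energy": [2.0]}


means = per_telescope_mean(fake_chunks())
print(means)
```

Only the accumulators stay in memory, so the footprint is set by the chunk size rather than by the full event list.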

@kosack
Member Author

kosack commented Jan 27, 2026

Since this refactoring is getting large, and I don't want to break all the working code, I will split this into a few PRs: one that just does the minimal fix here to remove the bad config trait, and others that add the new implementation (EventWeighter, EventPreprocessor). Then I can slowly replace code that uses the DL2EventLoader with this new code.

That also makes it easier for @Voutsi and @mdebony to use the new code in calibration and DL3 production, without breaking the IRF code, and the new classes are much more flexible for those use cases.

I would suggest that we deprecate the current DL2EventLoader, which mixes too many things together. It was fine as an internal part of the IRF module, but now that it is being used in a more general way, it is trying to do too much and mixes science algorithms and IO.

The new classes I have created should make the calibration code very easy to implement without tight coupling to the IRF code. Later, I'll also refactor the IRF module to use them, but that will be a second step. I started that here, but found it was too large a change for one PR. While the refactoring will make the code more maintainable, it will also likely introduce bugs temporarily, which I want to avoid in the short term.

This was referenced Jan 27, 2026
Development

Successfully merging this pull request may close these issues.

Provenance crashes because it cannot serialise Astropy columns

3 participants