Refactor DL2EventLoader to use FeatureGenerator#2919
Conversation
configuration variables must be traitlets, not complex classes without text serialization. This removes it from the config and makes it a package variable. Fixes #2918
73d1e87 to
3940d93
Compare
The fixture `test_config` was too generic a name, so renamed to `dl2_event_loader_config`
|
Looking deeper at this code, it's somewhat fragile, and confusing. I think it may need some restructuring to support the use case of adding new columns. Having to specify an output schema is a bit heavy - it would be better to just have a list of columns to take from the input files and carry over to the reduced table, right? I can refactor this if that sounds ok. In that case, I would remove the whole table schema object and interface, and just replace it with two config options:
So you would configure it like: DL2EventPreprocessor:
columns_to_rename:
ExtraTreesRegressor_energy: reco_energy
ExtraTreesClassifier_prediction: gh_score
HillasReconstructor_alt: reco_alt
HillasReconstructor_az: reco_az
# These must be columns that exist in the input, or the name of renamed columns.
columns_to_write:
- obs_id
- event_id
- true_energy
- true_az
- true_alt
- reco_energy
- reco_az
- reco_alt
- pointing_az
- pointing_alt
- gh_scoreI can think of even more simplifications, but that would be the basic one. In that case, it's super easy to configure for different cases. If you want to include some olther column like |
|
I'm also thinking a bit forward to a missing interface: event-wise algorithm selection, so I might also add an abstract interface for that and a default implementation that selects a single algorithm. That would replace the whole "column renaming" interface, with somthing more flexiable, since that is what the renaming is doing: selection which column to use for the "final" DL3 output columns. But that could be in a future PR. |
|
Yes, this should never have been a traitlet of type |
|
Actually I have an even better API than my suggestion above: what the
DL2EventPreprocessor:
FeatureGenerator:
features:
# the first few are just renamings
- ["reco_energy", "ExtraTreesRegressor_energy"]
- ["gh_score", "ExtraTreesClassifier_prediction"]
- ["reco_alt", "HillasReconstructor_alt"]
- ["reco_az", "HillasReconstructor_az"]
# can even get rid of the hard-coded computed columns, since we can do math
- ["theta", "angular_separation(reco_az, reco_alt, pointing_az, pointing_alt)"]
features:
- obs_id
- event_id
- true_energy
- true_az
- true_alt
- reco_energy
- reco_az
- reco_alt
- pointing_az
- pointing_alt
- gh_score
QualityQuery:
quality_criteria:
- ...This config then looks almost identical to e.g. train_energy_regressor.yaml. The code can be refactored to be also very similar to the code in the training tool, making it much simpler. |
I think that's a very nice idea. Using a FeatureGenerator would also help with the somewhat ugly calculation of the event multiplicity in #2789. |
|
I've currently replaced all the genearted columns in the old implementation with : FeatureGenerator:
features:
- [reco_energy , "RandomForestRegressor_energy"
- [reco_alt , "HillasReconstructor_alt"
- [reco_az , "HillasReconstructor_az"
- [gh_score , "RandomForestClassifier_prediction"
- [theta , "angular_separation(reco_az, reco_alt, true_az, true_alt)"
- [reco_fov_coord , "altaz_to_fov(reco_az, reco_alt, subarray_pointing_lon, subarray_pointing_lat)"
- [reco_fov_lon , "reco_fov_coord[:,0]"
- [reco_fov_lat , "reco_fov_coord[:,1]"
- [true_fov_coord , "altaz_to_fov(true_az, true_alt, subarray_pointing_lon, subarray_pointing_lat)"
- [true_fov_lon , "true_fov_coord[:,0]"
- [true_fov_lat , "true_fov_coord[:,1]"
- [true_fov_offset, "angular_separation(reco_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)"
- [reco_fov_offset, "angular_separation(true_fov_lon, reco_fov_lat, 0*u.deg, 0*u.deg)" |
- gammaness_classifier -> gammaness_reconstructor (to be consistent) - added columns for obs_id,event_id in output f simulation - fixed typo in multiplicity calc
The metadata is now copied to the new table, not shallow copied, since new fields may be added to the new table
|
|
||
| target_spectrum_name = traits.UseEnum( | ||
| Spectra, | ||
| default_value=Spectra.CRAB_HEGRA, |
There was a problem hiding this comment.
By allowing passing in a callable and restricting to UseEnum with a default here, we end up in the situation where we store the target spectrum CRAB_HEGRA in the config, but actually the passed in callable was used.
There was a problem hiding this comment.
Good point. I could just restrict this to pre-defined spectra and not allow an arbitrary one to be passed in. The ability to use abitrary spectra was mostly for testing, but perhaps would be useful for calibration or non-gamma-ray studies, but those could be supported later by adding more spectra to the enum. For testing I could just add a Spectra.FLAT option to test the no-op case.
There was a problem hiding this comment.
Instead of the enum, we could make traitlets wrappers for the spectra themselves.
spectrum_cls: PowerLaw
PowerLaw:
index: 2.0
normalization:
value: 5e-10
unit: "m^-2 s^-1 TeV^-1"
There was a problem hiding this comment.
to still allow setting pre-configured ones, you could add a spectrum_name with default None that updates the values from pre-defined ones.
There was a problem hiding this comment.
That would be a nice separate PR I think. I'll open an issue. For now I'll just restrict to the pre-defined names, since that is what was there before.
| source_spectrum = self.source_spectrum | ||
| if self.is_diffuse: | ||
| source_spectrum = source_spectrum.integrate_cone( | ||
| 0 * u.deg, self.fov_offset_max |
There was a problem hiding this comment.
We need a fov_offset_min here, e.g. to compute sensitivity in fov offset bands.
There was a problem hiding this comment.
That was what is in the RadialEventWeighter implementation already,but I suppose this Simple implementation with no offset-binning is really just the same with only one bin, so maybe this class is not necessary. The old code supported a case with no offset binning, which is where this comes from, but it's not clear if it was ever used - have to check why it was there.
Names and used variables here are not consistent! |
Weighting should not be in ``ctapipe.io``, as it's really IRF specific. So I moved it to ctapipe.irf and created a EventWeighter class hierarchy for the different methods.
bd51492 to
6af3bc3
Compare
|
In refactoring this, I think we no longer really need from ctapipe.io import TableLoader, EventPreprocessor
from astropy.table import vstack
loader = TableLoader(DL2FILE, dl2=True, simulated=True, observation_info=True)
preprocess = EventPreprocessor(feature_set="dl2_simulation")
events = vstack(
[
preprocess(QTable(c.data))
for c in loader.read_subarray_events_chunked(chunk_size=100_000)
]
)So I think it's more clear to just have these lines explicitly inside any Tool that needs them, rather than wrapping them with yet another Even better for tools like inter-telescope calibration, where you probably don't even need to vstack all chunks and can opreate at the chunk level, it's even more efficient. |
|
Since this refactoring is getting large, and I don't want to break all the working code, I will split this into a few PRs, one of which will just do the minimal fix here to remove the bad config trait, and others to add the new implementation (EventWeigter, EventPreprocessor). Then I can slowly replace code that uses the DL2EventLoader with this new code. That also makes it easier for @Voutsi and @mdebony to use the new code in calibration and DL3 production, without breaking the IRF code, and the new classes are much more flexible for those use cases. I would suggest that we deprecate the current In the new classes I have created, they should make the calibration code very easy to implement without tight coupling to the IRF code. Later, I'll also refactor the IRF module to use them as well, but that will be a second step. I started that here, but found it was too large a change for one PR, and while the refactoring will make the code more maintainable, it will also likely introduce bugs temporarily that I want to avoid in the short term. |
This fixes #2918 , and was triggered by it. Configuration variables must be traitlets, not complex classes without text serialization. This removes it from the config and makes it an instance variable that can be set in the constructor. More explicitly, in the code the option
output_table_schemawas aList[astropy.table.Column], which is not allowed (Listis a traitlet, butColumnis not).However, when removing this configuration option it was clear that the API for this class could be significantly improved in such a way as to remove the need for the output_table_schema at all.
Refactoring:
DL2EventPreprocessorto useFeatureGeneratorand do both event-selection and final column selection. This is a large API change, but simplifies the code significantly. The new version will have afeature_setoption that pre-configures the class for different common use cases, simplifing the need for complex configuration files (which are still however allowed).simulationuse case (the original use case for this class)observationuse case (processing DL2 subarray events)feature_set=customand specifying everything in a config file, it may not be necessary.irf.EventWeighterto do event weighting, which should not be something done inio.DL2EventLoaderto use the newDL2EventPreprocessorIn the end, this refactoring will create much simpler workflow, but I think perhaps it doesn't all belong in IO, since only the reading and stacking of the chunks is really an "I/O" operation. THe workflow in this refactored version will be for example for an IRF production tool :
DL2EventPreProcessor:Which means
DL2EventLoaderis just:For IRF processing, a final step of
Is used to split events into FOV bins and do spectral re-weighting.