added HistogramsAggregator by TjarkMiener · Pull Request #2996 · cta-observatory/ctapipe

TjarkMiener · 2026-04-20T10:46:15Z

closes #577

maxnoe · 2026-04-20T10:57:44Z

Note that this kind of histogram monitoring is for example also needed for the cleaning algorithm implemented in EventDisplay (#1857), that uses the pedestal distribution (not just mean or std) in the cleaning.

maxnoe · 2026-04-20T11:01:23Z

I think for these histograms to be useful, we need to ensure that each chunk is using the same binning. I.e. make the bin edges (or n_bins / min / max) configuration options.

see also inline comment about using scikit hist.

TjarkMiener · 2026-04-20T13:22:48Z

I think for these histograms to be useful, we need to ensure that each chunk is using the same binning. I.e. make the bin edges (or n_bins / min / max) configuration options.

yes, correct! We should ensure the same binnng.

see also inline comment about using scikit hist.

I was unaware of the scikit hist. Do you think we need anything else then np.linspace(self.range[0], self.range[1], self.n_bins + 1)?

mexanick · 2026-04-24T11:19:23Z

+
+
+# -------------------------------------------------------------------
+# Inspect one pixel histogram in both chunks and both gain channels


please add a better comment describing that you have two chunks by design, otherwise the line

for chunk_index, ax in enumerate(axes):

looks very weird

mexanick · 2026-04-24T11:23:47Z


    @abstractmethod
-    def _add_result_columns(self, data, masked_elements_of_sample, results_dict):
+    def _add_result_columns(


please do not introduce a breaking change here. Instead of passing metadata as a mandatory argument here, you can e.g. add it as an object attribute.

okay sounds good

mexanick · 2026-04-24T11:24:10Z

+        data,
+        masked_elements_of_sample,
+        results_dict,
+        metadata,


Suggested change

metadata,

mexanick · 2026-04-24T11:24:59Z

        results_dict : dict
-            Dictionary to which statistic columns should be added.
+            Dictionary to which statistic or histogram columns should be added.
+        metadata : dict


Suggested change

metadata : dict

mexanick · 2026-04-24T11:25:08Z

-            Dictionary to which statistic columns should be added.
+            Dictionary to which statistic or histogram columns should be added.
+        metadata : dict
+            Shared metadata container that can be mutated by subclasses.


Suggested change

Shared metadata container that can be mutated by subclasses.

mexanick · 2026-04-24T11:25:22Z

+        data,
+        masked_elements_of_sample,
+        results_dict,
+        metadata,


Suggested change

metadata,

mexanick · 2026-04-24T11:26:47Z

        Set units for statistics columns that inherit from the input data.

-        For StatisticsAggregator, the mean, median, and std columns
+        For StatisticsAggregator, the mean, median, std, and histogram columns


why do you always add histogram, why don't we only compute it when needed?

maxnoe · 2026-04-24T15:16:39Z

+            },
+            "PixelStatisticsCalculator": {
+                "stats_aggregator_type": [
+                    ("type", "*", "HistogramAggregator"),


Do we support multiple aggregators? I.e. can I compute stats and histograms or multiple histograms in one run of the process tool?

no, we only support one at a time and merge them later with the merger tool. It is more motivated that every aggregation needs distinct configuration (see e.g. confgis for the pixel stats here)

maxnoe · 2026-04-24T15:39:55Z

I tried running on images of pedestals using this config:

PixelStatisticsCalculatorTool:
  input_column_name: "image"

PixelStatisticsCalculator:
  stats_aggregator_type: [
    ["type", "*", "HistogramAggregator"],
  ]

HistogramAggregator:
  chunking_type: SizeChunking
  axis_definition:
    class_name: Regular
    bins: 100
    start: -25.0
    stop: 25.0
    name: "pixel_intensity"

SizeChunking:
  chunk_size: 100

and got this error:

❯ ctapipe-calculate-pixel-statistics -i lst_pedestals.h5 -o test.h5 -c hist.yaml --overwrite
2026-04-24 17:37:35,088 ERROR [ctapipe.ctapipe-calculate-pixel-statistics] (tool.run): Caught unexpected exception: The axis name value could not be found
Traceback (most recent call last):
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/core/tool.py", line 437, in run
    self.start()
    ~~~~~~~~~~^^
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/tools/calculate_pixel_stats.py", line 177, in start
    aggregated_stats = self.stats_calculator.first_pass(
        table=dl1_table,
        tel_id=tel_id,
        col_name=self.input_column_name,
    )
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/monitoring/calculator.py", line 160, in first_pass
    aggregated_stats = aggregator(
        table=table,
        masked_elements_of_sample=masked_pixels_of_sample,
        col_name=col_name,
    )
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/monitoring/aggregator.py", line 353, in __call__
    self._add_result_columns(
    ~~~~~~~~~~~~~~~~~~~~~~~~^
        chunk[col_name].data,
        ^^^^^^^^^^^^^^^^^^^^^
        masked_elements_of_sample,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
        results,
        ^^^^^^^^
    )
    ^
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/monitoring/aggregator.py", line 442, in _add_result_columns
    histograms = self.compute_histograms(data, masked_elements_of_sample)
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/monitoring/aggregator.py", line 496, in compute_histograms
    hist_object.fill(value=values, pixel=pix)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/maxnoe/.local/conda/envs/cta-dev/lib/python3.13/site-packages/hist/basehist.py", line 319, in fill
    self._name_to_index(k) if isinstance(k, str) else k: v  # type: ignore[redundant-expr]
    ~~~~~~~~~~~~~~~~~~~^^^
  File "/home/maxnoe/.local/conda/envs/cta-dev/lib/python3.13/site-packages/hist/basehist.py", line 246, in _name_to_index
    raise ValueError(f"The axis name {name} could not be found")
ValueError: The axis name value could not be found

any idea?

The input file was produced from:

$  curl -LO https://minio-cta.zeuthen.desy.de/dpps-testdata-public/data/calibpipe-test-data/camera-calibration-test-data/pedestals_LST_dark.simtel.gz
$ ctapipe-process -i pedestals_LST_dark.simtel.gz -o lst_pedestals.h5 --write-images --SimTelEventSource.skip_calibration_events=False --overwrite --no-write-parameters --progress

maxnoe · 2026-04-24T15:48:37Z

I needed to set the axis name to "value".

If this is a hard requirement and it doesn't work with arbitrary user provided values, it should not be needed in the configuration.

Results look good though:

maxnoe · 2026-04-24T15:49:19Z

I notice that I get min/mean/std etc. values in the table containing the histogram data. But I didn't specify a statisticsaggregator.

TjarkMiener · 2026-04-27T09:22:54Z

I notive that I get min/mean/std etc. values in the table containing the histogram data. But I didn't specify a statisticsaggregator.

The values are calculated based on the histograms by weighting the bin counts. Median is then the bin where the cumulative distribution is passing 50%. Let me find the code snippet and ping you there.

TjarkMiener · 2026-04-27T09:25:45Z

+        # Compute the mean and std from histogram counts.
+        weighted_sum = np.sum(centers_expanded * hist_counts, axis=0)
+        with np.errstate(divide="ignore", invalid="ignore"):
+            mean = weighted_sum / counts_sum
+
+        sq_diff = (centers_expanded - mean[np.newaxis, ...]) ** 2
+        variance_num = np.sum(sq_diff * hist_counts, axis=0)
+        with np.errstate(divide="ignore", invalid="ignore"):
+            variance = variance_num / counts_sum
+        std = np.sqrt(variance)
+
+        # Compute the median from histogram counts via the cumulative distribution.
+        cdf = np.cumsum(hist_counts, axis=0)
+        cdf_denominator = cdf[-1, ...]
+        with np.errstate(divide="ignore", invalid="ignore"):
+            cdf = np.divide(
+                cdf,
+                cdf_denominator[np.newaxis, ...],
+                out=np.zeros_like(cdf, dtype=float),
+                where=cdf_denominator[np.newaxis, ...] != 0,
+            )
+        median_idx = np.argmax(cdf >= 0.5, axis=0)
+        median = centers[median_idx]
+
+        # Mark elements with no valid entries as NaN.
+        invalid = counts_sum == 0
+        mean = np.where(invalid, np.nan, mean)
+        std = np.where(invalid, np.nan, std)
+        median = np.where(invalid, np.nan, median)


@maxnoe this is the code snippet to calc the values from the histo data

I am opposed to adding this adh-hoc, histogram based computed statistics when we have access to the actual values which could be used directly.

What is the benefit of computing values based after the loss of resolution due to the histogramming?

Why can users not just use the statistics aggregator and the histogram aggregator at the same time to get stats and histograms?

Why implement these stats manually here when scikit-hist already has this functionality?

https://hist.readthedocs.io/en/latest/user-guide/accumulators.html#weightedmean

sure, we can remove this caluclation, but then need to make sure that the stats pix tool only runs the first_pass.

…so stats value from the histo Use as a normal stats aggregator and add testing for the pix stat comp and tool

bump minior data format version of the two histo related cols in the stat agg

before a breaking change was introduced for dealing with the metadata. But we can do it without breaking changes as done in this commit.

also add a new monitoring group to the hdf5 data format

…cription from low resolution histo bins

…a or pixels

…nt from the camera or pixels

…een e.g. gain channels can be observed in the mean values of the histograms

ctao-sonarqube · 2026-04-27T16:08:04Z

Quality Gate failed

Failed conditions
1 New issue

See analysis details on SonarQube

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

kosack · 2026-04-28T11:19:53Z

    std = Field(None, "standard deviation of the chunk distribution")


+class ChunkHistogramsContainer(ChunkContainer):


I'd name this singular ChunkHistogramContainer to match how we name others. It only stores one "histogram" (even if it's one with categories per pixel and channel)

kosack · 2026-04-28T11:21:13Z

+    """Container for histograms of the chunk distribution"""
+
+    histogram = Field(None, "histogram of the chunk distribution")
+


I would be nice to have a helper function to convert this Container back to a Hist object. Here is an example:

def hist_from_container( cont: ChunkHistogramsContainer, axis_names=["pedestal", "channel", "pixel"] ) -> Hist: """Returns a Hist constructed from a stored ChunkHistogramsContainer.""" bin_edges = cont.meta["bin_edges"] axes = [hist.axis.Variable(edges=bin_edges, name=axis_names[0])] # the rest of the dimensions for name, n_bins in zip(axis_names[1:], cont.histogram.shape[1:]): if n_bins == 2: axes.append(hist.axis.IntCategory(categories=np.arange(2), name=name)) else: axes.append( hist.axis.Regular(bins=n_bins, start=0, stop=n_bins - 1, name=name) ) h = Hist(*axes) h[...] = cont.histogram[...] return h

Then I can do things like:

with HDF5TableReader("stats.h5") as reader: for i, container in enumerate( reader.read( table_name="/dl1/monitoring/telescope/calibration/camera/pixel_histograms/sky_pedestal_image/tel_001", containers=ChunkHistogramsContainer, prefixes=[""], ) ): h = hist_from_container(container) fig, ax = plt.subplots(1, 3, figsize=(10, 3), layout="constrained") fig.suptitle(f"Chunk {i}") h[:, 0, :].plot(ax=ax[0], norm="log") h[:, 1, :].plot(ax=ax[1], norm="log") h.integrate("pixel").stack("channel").plot(ax=ax[2], legend=True) ax[0].set_title(f"Channel 0") ax[1].set_title(f"Channel 1") ax[2].set_title("Integral oval all pixels")

Ideally, you could also store the axis names (["pedestal", "channel_id", "pixel_id"]) in the metadata, that way making one of these plots is trivial from any file.

What we did in datapipe-testbench is to serialize all the necessary Hist info in the metadata: the axis names, the axis types (e.g. Regular, Variable, Category), units, etc. That would work well here as well (and in the future I could use this class directly).

For column-wise access, you could also do a similar hist_from_table() that would then give back a 4d histogram, with an added time dimension (maybe use the mean-time of the chunk) . Then you can easily do many plots like "plot pixel 3's pedestal over time"

One more nice demo of using this if it's a Hist:

fig, ax = plt.subplots(1,2, figsize=(10,3)) disp1 = CameraDisplay(geom, image=h[:,0,:].profile("pedestal").values(), ax=ax[0]) disp2 = CameraDisplay(geom, image=h[:,0,:].profile("pedestal").variances(), ax=ax[1]) ax[0].set_title(ax[0].get_title() + " Pedestal") ax[1].set_title(ax[0].get_title() + " Pedestal Variance") disp1.add_colorbar() disp2.add_colorbar()

There, you don't even need the stats, however, as @maxnoe pointed out above, computing the mean and variance from a histogram isn't as precise as doing it from the events.

kosack · 2026-04-28T11:22:17Z

+bin_edges = result[0].meta["bin_edges"]
+h = Hist(
+    hist.axis.Regular(len(bin_edges) - 1, bin_edges[0], bin_edges[-1], name="value")
+)


See prev comment: a helper function to do this would be nice.

kosack · 2026-04-28T11:58:44Z

+            histogram=hist_counts,
+            meta={
+                "bin_edges": hist_object.axes[0].edges,
+                "bin_centers": hist_object.axes[0].centers,


would be nice to add some sort of title here as well, so we know what was being aggregated. When making a plot, I would want to know the axes names as well (e.g. which one is pixel_id, channel_id). this could be stored.

kosack · 2026-04-28T12:36:41Z

Since you are using Hist, consider keeping track of overflow and underflow values: it lets you store them as extra bins in the histogram, for example, or you can write them in the metadata. That would be very useful for cases where values fall outside the range.

maxnoe reviewed Apr 20, 2026

View reviewed changes

Comment thread src/ctapipe/containers.py Outdated

maxnoe reviewed Apr 20, 2026

View reviewed changes

Comment thread src/ctapipe/monitoring/aggregator.py

maxnoe reviewed Apr 20, 2026

View reviewed changes