Skip to content

added HistogramsAggregator#2996

Open
TjarkMiener wants to merge 30 commits into
container_maintenancefrom
histo_agg
Open

added HistogramsAggregator#2996
TjarkMiener wants to merge 30 commits into
container_maintenancefrom
histo_agg

Conversation

@TjarkMiener
Copy link
Copy Markdown
Member

@TjarkMiener TjarkMiener commented Apr 20, 2026

closes #577

Comment thread src/ctapipe/containers.py Outdated
Comment thread src/ctapipe/monitoring/aggregator.py
@maxnoe
Copy link
Copy Markdown
Member

maxnoe commented Apr 20, 2026

Note that this kind of histogram monitoring is for example also needed for the cleaning algorithm implemented in EventDisplay (#1857), that uses the pedestal distribution (not just mean or std) in the cleaning.

@maxnoe
Copy link
Copy Markdown
Member

maxnoe commented Apr 20, 2026

I think for these histograms to be useful, we need to ensure that each chunk is using the same binning. I.e. make the bin edges (or n_bins / min / max) configuration options.

see also inline comment about using scikit hist.

@TjarkMiener
Copy link
Copy Markdown
Member Author

I think for these histograms to be useful, we need to ensure that each chunk is using the same binning. I.e. make the bin edges (or n_bins / min / max) configuration options.

yes, correct! We should ensure the same binnng.

see also inline comment about using scikit hist.

I was unaware of the scikit hist. Do you think we need anything else then np.linspace(self.range[0], self.range[1], self.n_bins + 1)?

Comment thread src/ctapipe/monitoring/aggregator.py Outdated
Comment thread src/ctapipe/monitoring/aggregator.py Outdated
Comment thread src/ctapipe/monitoring/aggregator.py Outdated
@TjarkMiener TjarkMiener marked this pull request as ready for review April 21, 2026 08:03
Comment thread src/ctapipe/monitoring/aggregator.py Outdated
Comment thread src/ctapipe/io/datawriter.py Outdated
@TjarkMiener TjarkMiener requested a review from maxnoe April 23, 2026 14:23
Comment thread src/ctapipe/monitoring/aggregator.py Outdated
Comment thread src/ctapipe/monitoring/aggregator.py Outdated
@TjarkMiener TjarkMiener requested a review from maxnoe April 23, 2026 19:45


# -------------------------------------------------------------------
# Inspect one pixel histogram in both chunks and both gain channels
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a better comment describing that you have two chunks by design, otherwise the line

for chunk_index, ax in enumerate(axes):

looks very weird


@abstractmethod
def _add_result_columns(self, data, masked_elements_of_sample, results_dict):
def _add_result_columns(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please do not introduce a breaking change here. Instead of passing metadata as a mandatory argument here, you can e.g. add it as an object attribute.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay sounds good

Comment thread src/ctapipe/monitoring/aggregator.py Outdated
data,
masked_elements_of_sample,
results_dict,
metadata,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metadata,

Comment thread src/ctapipe/monitoring/aggregator.py Outdated
results_dict : dict
Dictionary to which statistic columns should be added.
Dictionary to which statistic or histogram columns should be added.
metadata : dict
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metadata : dict

Comment thread src/ctapipe/monitoring/aggregator.py Outdated
Dictionary to which statistic columns should be added.
Dictionary to which statistic or histogram columns should be added.
metadata : dict
Shared metadata container that can be mutated by subclasses.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Shared metadata container that can be mutated by subclasses.

Comment thread src/ctapipe/monitoring/aggregator.py Outdated
data,
masked_elements_of_sample,
results_dict,
metadata,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metadata,

Set units for statistics columns that inherit from the input data.

For StatisticsAggregator, the mean, median, and std columns
For StatisticsAggregator, the mean, median, std, and histogram columns
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you always add histogram, why don't we only compute it when needed?

Comment thread src/ctapipe/containers.py Outdated
Comment thread src/ctapipe/tools/calculate_pixel_stats.py Outdated
Comment thread src/ctapipe/tools/tests/test_calculate_pixel_stats.py
},
"PixelStatisticsCalculator": {
"stats_aggregator_type": [
("type", "*", "HistogramAggregator"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we support multiple aggregators? I.e. can I compute stats and histograms or multiple histograms in one run of the process tool?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we only support one at a time and merge them later with the merger tool. It is more motivated that every aggregation needs distinct configuration (see e.g. confgis for the pixel stats here)

@maxnoe
Copy link
Copy Markdown
Member

maxnoe commented Apr 24, 2026

I tried running on images of pedestals using this config:

PixelStatisticsCalculatorTool:
  input_column_name: "image"

PixelStatisticsCalculator:
  stats_aggregator_type: [
    ["type", "*", "HistogramAggregator"],
  ]

HistogramAggregator:
  chunking_type: SizeChunking
  axis_definition:
    class_name: Regular
    bins: 100
    start: -25.0
    stop: 25.0
    name: "pixel_intensity"

SizeChunking:
  chunk_size: 100

and got this error:

❯ ctapipe-calculate-pixel-statistics -i lst_pedestals.h5 -o test.h5 -c hist.yaml --overwrite
2026-04-24 17:37:35,088 ERROR [ctapipe.ctapipe-calculate-pixel-statistics] (tool.run): Caught unexpected exception: The axis name value could not be found
Traceback (most recent call last):
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/core/tool.py", line 437, in run
    self.start()
    ~~~~~~~~~~^^
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/tools/calculate_pixel_stats.py", line 177, in start
    aggregated_stats = self.stats_calculator.first_pass(
        table=dl1_table,
        tel_id=tel_id,
        col_name=self.input_column_name,
    )
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/monitoring/calculator.py", line 160, in first_pass
    aggregated_stats = aggregator(
        table=table,
        masked_elements_of_sample=masked_pixels_of_sample,
        col_name=col_name,
    )
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/monitoring/aggregator.py", line 353, in __call__
    self._add_result_columns(
    ~~~~~~~~~~~~~~~~~~~~~~~~^
        chunk[col_name].data,
        ^^^^^^^^^^^^^^^^^^^^^
        masked_elements_of_sample,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
        results,
        ^^^^^^^^
    )
    ^
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/monitoring/aggregator.py", line 442, in _add_result_columns
    histograms = self.compute_histograms(data, masked_elements_of_sample)
  File "/home/maxnoe/CTAO/ctapipe/src/ctapipe/monitoring/aggregator.py", line 496, in compute_histograms
    hist_object.fill(value=values, pixel=pix)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/maxnoe/.local/conda/envs/cta-dev/lib/python3.13/site-packages/hist/basehist.py", line 319, in fill
    self._name_to_index(k) if isinstance(k, str) else k: v  # type: ignore[redundant-expr]
    ~~~~~~~~~~~~~~~~~~~^^^
  File "/home/maxnoe/.local/conda/envs/cta-dev/lib/python3.13/site-packages/hist/basehist.py", line 246, in _name_to_index
    raise ValueError(f"The axis name {name} could not be found")
ValueError: The axis name value could not be found

any idea?

The input file was produced from:

$  curl -LO https://minio-cta.zeuthen.desy.de/dpps-testdata-public/data/calibpipe-test-data/camera-calibration-test-data/pedestals_LST_dark.simtel.gz
$ ctapipe-process -i pedestals_LST_dark.simtel.gz -o lst_pedestals.h5 --write-images --SimTelEventSource.skip_calibration_events=False --overwrite --no-write-parameters --progress

@maxnoe
Copy link
Copy Markdown
Member

maxnoe commented Apr 24, 2026

I needed to set the axis name to "value".

If this is a hard requirement and it doesn't work with arbitrary user provided values, it should not be needed in the configuration.

Results look good though:

pedestal_hists

@maxnoe
Copy link
Copy Markdown
Member

maxnoe commented Apr 24, 2026

I notice that I get min/mean/std etc. values in the table containing the histogram data. But I didn't specify a statisticsaggregator.

@TjarkMiener
Copy link
Copy Markdown
Member Author

I notive that I get min/mean/std etc. values in the table containing the histogram data. But I didn't specify a statisticsaggregator.

The values are calculated based on the histograms by weighting the bin counts. Median is then the bin where the cumulative distribution is passing 50%. Let me find the code snippet and ping you there.

Comment thread src/ctapipe/monitoring/aggregator.py Outdated
Comment on lines +520 to +548
# Compute the mean and std from histogram counts.
weighted_sum = np.sum(centers_expanded * hist_counts, axis=0)
with np.errstate(divide="ignore", invalid="ignore"):
mean = weighted_sum / counts_sum

sq_diff = (centers_expanded - mean[np.newaxis, ...]) ** 2
variance_num = np.sum(sq_diff * hist_counts, axis=0)
with np.errstate(divide="ignore", invalid="ignore"):
variance = variance_num / counts_sum
std = np.sqrt(variance)

# Compute the median from histogram counts via the cumulative distribution.
cdf = np.cumsum(hist_counts, axis=0)
cdf_denominator = cdf[-1, ...]
with np.errstate(divide="ignore", invalid="ignore"):
cdf = np.divide(
cdf,
cdf_denominator[np.newaxis, ...],
out=np.zeros_like(cdf, dtype=float),
where=cdf_denominator[np.newaxis, ...] != 0,
)
median_idx = np.argmax(cdf >= 0.5, axis=0)
median = centers[median_idx]

# Mark elements with no valid entries as NaN.
invalid = counts_sum == 0
mean = np.where(invalid, np.nan, mean)
std = np.where(invalid, np.nan, std)
median = np.where(invalid, np.nan, median)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxnoe this is the code snippet to calc the values from the histo data

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am opposed to adding this adh-hoc, histogram based computed statistics when we have access to the actual values which could be used directly.

What is the benefit of computing values based after the loss of resolution due to the histogramming?

Why can users not just use the statistics aggregator and the histogram aggregator at the same time to get stats and histograms?

Why implement these stats manually here when scikit-hist already has this functionality?

https://hist.readthedocs.io/en/latest/user-guide/accumulators.html#weightedmean

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, we can remove this caluclation, but then need to make sure that the stats pix tool only runs the first_pass.

@TjarkMiener TjarkMiener requested a review from mexanick April 27, 2026 09:32
Comment thread src/ctapipe/monitoring/aggregator.py
TjarkMiener and others added 23 commits April 27, 2026 16:36
…so stats value from the histo

Use as a normal stats aggregator and add testing for the pix stat comp and tool
bump minior data format version of the two histo related cols in the stat agg
before a breaking change was introduced for dealing with the metadata. But we can do it without breaking changes as done in this commit.
also add a new monitoring group to the hdf5 data format
…een e.g. gain channels can be observed in the mean values of the histograms
@TjarkMiener TjarkMiener requested review from maxnoe and mexanick April 27, 2026 15:51
@ctao-sonarqube
Copy link
Copy Markdown

Quality Gate failed Quality Gate failed

Failed conditions
1 New issue

See analysis details on SonarQube

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE SonarQube for IDE

Comment thread src/ctapipe/containers.py
std = Field(None, "standard deviation of the chunk distribution")


class ChunkHistogramsContainer(ChunkContainer):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd name this singular ChunkHistogramContainer to match how we name others. It only stores one "histogram" (even if it's one with categories per pixel and channel)

Comment thread src/ctapipe/containers.py
"""Container for histograms of the chunk distribution"""

histogram = Field(None, "histogram of the chunk distribution")

Copy link
Copy Markdown
Member

@kosack kosack Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be nice to have a helper function to convert this Container back to a Hist object. Here is an example:

def hist_from_container(
    cont: ChunkHistogramsContainer, axis_names=["pedestal", "channel", "pixel"]
) -> Hist:
    """Returns a Hist constructed from a stored ChunkHistogramsContainer."""

    bin_edges = cont.meta["bin_edges"]
    axes = [hist.axis.Variable(edges=bin_edges, name=axis_names[0])]

    # the rest of the dimensions
    for name, n_bins in zip(axis_names[1:], cont.histogram.shape[1:]):
        if n_bins == 2:
            axes.append(hist.axis.IntCategory(categories=np.arange(2), name=name))
        else:
            axes.append(
                hist.axis.Regular(bins=n_bins, start=0, stop=n_bins - 1, name=name)
            )

    h = Hist(*axes)
    h[...] = cont.histogram[...]
    return h

Then I can do things like:

with HDF5TableReader("stats.h5") as reader:
    for i, container in enumerate(
        reader.read(
            table_name="/dl1/monitoring/telescope/calibration/camera/pixel_histograms/sky_pedestal_image/tel_001",
            containers=ChunkHistogramsContainer,
            prefixes=[""],
        )
    ):
        h = hist_from_container(container)

        fig, ax = plt.subplots(1, 3, figsize=(10, 3), layout="constrained")
        fig.suptitle(f"Chunk {i}")
        h[:, 0, :].plot(ax=ax[0], norm="log")
        h[:, 1, :].plot(ax=ax[1], norm="log")
        h.integrate("pixel").stack("channel").plot(ax=ax[2], legend=True)

        ax[0].set_title(f"Channel 0")
        ax[1].set_title(f"Channel 1")
        ax[2].set_title("Integral oval all pixels")
image

Ideally, you could also store the axis names (["pedestal", "channel_id", "pixel_id"]) in the metadata, that way making one of these plots is trivial from any file.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we did in datapipe-testbench is to serialize all the necessary Hist info in the metadata: the axis names, the axis types (e.g. Regular, Variable, Category), units, etc. That would work well here as well (and in the future I could use this class directly).

Copy link
Copy Markdown
Member

@kosack kosack Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For column-wise access, you could also do a similar hist_from_table() that would then give back a 4d histogram, with an added time dimension (maybe use the mean-time of the chunk) . Then you can easily do many plots like "plot pixel 3's pedestal over time"

Copy link
Copy Markdown
Member

@kosack kosack Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more nice demo of using this if it's a Hist:

fig, ax = plt.subplots(1,2, figsize=(10,3))
disp1 = CameraDisplay(geom, image=h[:,0,:].profile("pedestal").values(), ax=ax[0])
disp2 = CameraDisplay(geom, image=h[:,0,:].profile("pedestal").variances(), ax=ax[1])

ax[0].set_title(ax[0].get_title() +  " Pedestal")
ax[1].set_title(ax[0].get_title() +  " Pedestal Variance")

disp1.add_colorbar()
disp2.add_colorbar()
image

There, you don't even need the stats, however, as @maxnoe pointed out above, computing the mean and variance from a histogram isn't as precise as doing it from the events.

bin_edges = result[0].meta["bin_edges"]
h = Hist(
hist.axis.Regular(len(bin_edges) - 1, bin_edges[0], bin_edges[-1], name="value")
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See prev comment: a helper function to do this would be nice.

histogram=hist_counts,
meta={
"bin_edges": hist_object.axes[0].edges,
"bin_centers": hist_object.axes[0].centers,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to add some sort of title here as well, so we know what was being aggregated. When making a plot, I would want to know the axes names as well (e.g. which one is pixel_id, channel_id). this could be stored.

@kosack
Copy link
Copy Markdown
Member

kosack commented Apr 28, 2026

Since you are using Hist, consider keeping track of overflow and underflow values: it lets you store them as extra bins in the histogram, for example, or you can write them in the metadata. That would be very useful for cases where values fall outside the range.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pixel-wise aggregation of histograms

4 participants