added HistogramsAggregator #2996
Note that this kind of histogram monitoring is for example also needed for the cleaning algorithm implemented in EventDisplay (#1857), that uses the pedestal distribution (not just mean or std) in the cleaning. |
|
I think for these histograms to be useful, we need to ensure that each chunk is using the same binning. I.e. make the bin edges (or n_bins / min / max) configuration options. see also inline comment about using scikit hist. |
yes, correct! We should ensure the same binning.
I was unaware of the scikit hist. Do you think we need anything else then |
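To illustrate the binning point, here is a minimal sketch in plain Python of why shared bin edges matter. The helper `fill_counts` and the edge values are made up for illustration and are not part of ctapipe or scikit-hist; the point is only that with identical edges, per-chunk counts can be merged by a bin-wise sum.

```python
from bisect import bisect_right

# Hypothetical configuration option: one set of edges shared by every chunk.
BIN_EDGES = [0.0, 1.0, 2.0, 3.0, 4.0]


def fill_counts(values, edges):
    """Count values into half-open bins [edges[i], edges[i+1]).

    Values outside [edges[0], edges[-1]) are dropped.
    """
    counts = [0] * (len(edges) - 1)
    for value in values:
        if edges[0] <= value < edges[-1]:
            counts[bisect_right(edges, value) - 1] += 1
    return counts


chunk_a = [0.2, 1.5, 1.7, 3.9]
chunk_b = [0.9, 2.1, 2.2, 2.3]

counts_a = fill_counts(chunk_a, BIN_EDGES)
counts_b = fill_counts(chunk_b, BIN_EDGES)

# Same edges in both chunks, so merging is a bin-wise sum.
merged = [a + b for a, b in zip(counts_a, counts_b)]
```

If each chunk chose its own edges, this sum would no longer be meaningful without rebinning.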
|
|
||
|
|
||
| # ------------------------------------------------------------------- | ||
| # Inspect one pixel histogram in both chunks and both gain channels |
please add a better comment describing that you have two chunks by design, otherwise the line
for chunk_index, ax in enumerate(axes):
looks very weird
|
|
||
| @abstractmethod | ||
| def _add_result_columns(self, data, masked_elements_of_sample, results_dict): | ||
| def _add_result_columns( |
please do not introduce a breaking change here. Instead of passing metadata as a mandatory argument here, you can e.g. add it as an object attribute.
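A minimal sketch of the suggestion, assuming the metadata is kept on the instance so that the abstract method keeps its three-argument signature. `BaseAggregator`, `ToyHistogramAggregator`, and the `metadata` attribute name are hypothetical stand-ins, not the actual ctapipe classes.

```python
from abc import ABC, abstractmethod


class BaseAggregator(ABC):
    """Hypothetical stand-in for the aggregator base class."""

    def __init__(self):
        # Shared metadata lives on the instance, so existing subclasses
        # that implement the three-argument method keep working.
        self.metadata = {}

    @abstractmethod
    def _add_result_columns(self, data, masked_elements_of_sample, results_dict):
        """Add columns to results_dict; subclasses may also fill self.metadata."""


class ToyHistogramAggregator(BaseAggregator):
    def _add_result_columns(self, data, masked_elements_of_sample, results_dict):
        # Count the unmasked entries and record the (made-up) bin edges.
        results_dict["n_entries"] = len(data) - sum(masked_elements_of_sample)
        self.metadata["bin_edges"] = [0.0, 1.0, 2.0]


aggregator = ToyHistogramAggregator()
results = {}
aggregator._add_result_columns([0.5, 1.5, 1.7], [False, False, True], results)
```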
| data, | ||
| masked_elements_of_sample, | ||
| results_dict, | ||
| metadata, |
| metadata, |
| results_dict : dict | ||
| Dictionary to which statistic columns should be added. | ||
| Dictionary to which statistic or histogram columns should be added. | ||
| metadata : dict |
| metadata : dict |
| Dictionary to which statistic columns should be added. | ||
| Dictionary to which statistic or histogram columns should be added. | ||
| metadata : dict | ||
| Shared metadata container that can be mutated by subclasses. |
| Shared metadata container that can be mutated by subclasses. |
| data, | ||
| masked_elements_of_sample, | ||
| results_dict, | ||
| metadata, |
| metadata, |
| Set units for statistics columns that inherit from the input data. | ||
|
|
||
| For StatisticsAggregator, the mean, median, and std columns | ||
| For StatisticsAggregator, the mean, median, std, and histogram columns |
why do you always add histogram, why don't we only compute it when needed?
| }, | ||
| "PixelStatisticsCalculator": { | ||
| "stats_aggregator_type": [ | ||
| ("type", "*", "HistogramAggregator"), |
Do we support multiple aggregators? I.e. can I compute stats and histograms or multiple histograms in one run of the process tool?
No, we only support one at a time and merge them later with the merger tool. The motivation is that every aggregation needs its own distinct configuration (see e.g. the configs for the pixel stats here).
|
I tried running on images of pedestals using this config: and got this error: any idea? The input file was produced from: |
|
I notice that I get min/mean/std etc. values in the table containing the histogram data. But I didn't specify a statisticsaggregator. |
The values are calculated from the histograms by weighting the bin centers with the bin counts. The median is then the center of the bin where the cumulative distribution first reaches 50%. Let me find the code snippet and ping you there. |
| # Compute the mean and std from histogram counts. | ||
| weighted_sum = np.sum(centers_expanded * hist_counts, axis=0) | ||
| with np.errstate(divide="ignore", invalid="ignore"): | ||
| mean = weighted_sum / counts_sum | ||
|
|
||
| sq_diff = (centers_expanded - mean[np.newaxis, ...]) ** 2 | ||
| variance_num = np.sum(sq_diff * hist_counts, axis=0) | ||
| with np.errstate(divide="ignore", invalid="ignore"): | ||
| variance = variance_num / counts_sum | ||
| std = np.sqrt(variance) | ||
|
|
||
| # Compute the median from histogram counts via the cumulative distribution. | ||
| cdf = np.cumsum(hist_counts, axis=0) | ||
| cdf_denominator = cdf[-1, ...] | ||
| with np.errstate(divide="ignore", invalid="ignore"): | ||
| cdf = np.divide( | ||
| cdf, | ||
| cdf_denominator[np.newaxis, ...], | ||
| out=np.zeros_like(cdf, dtype=float), | ||
| where=cdf_denominator[np.newaxis, ...] != 0, | ||
| ) | ||
| median_idx = np.argmax(cdf >= 0.5, axis=0) | ||
| median = centers[median_idx] | ||
|
|
||
| # Mark elements with no valid entries as NaN. | ||
| invalid = counts_sum == 0 | ||
| mean = np.where(invalid, np.nan, mean) | ||
| std = np.where(invalid, np.nan, std) | ||
| median = np.where(invalid, np.nan, median) |
@maxnoe this is the code snippet to calc the values from the histo data
I am opposed to adding these ad-hoc, histogram-based statistics when we have access to the actual values, which could be used directly.
What is the benefit of computing values after the loss of resolution due to the histogramming?
Why can users not just use the statistics aggregator and the histogram aggregator at the same time to get stats and histograms?
Why implement these stats manually here when scikit-hist already has this functionality?
https://hist.readthedocs.io/en/latest/user-guide/accumulators.html#weightedmean
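To make the resolution argument concrete, here is a small sketch in plain Python with made-up values and a deliberately coarse binning. It compares the exact mean with the mean recovered from bin centers and counts; the difference is bounded by half a bin width, but it is generally not zero.

```python
from statistics import fmean

values = [0.1, 0.4, 1.2, 1.3, 2.9]  # made-up sample
edges = [0.0, 1.0, 2.0, 3.0]  # bin width 1.0, chosen for illustration
width = edges[1] - edges[0]
centers = [low + width / 2 for low in edges[:-1]]

# Histogram the values (all lie inside the range by construction).
counts = [0] * len(centers)
for value in values:
    counts[int((value - edges[0]) // width)] += 1

# Mean from the raw values versus mean from the binned representation.
exact_mean = fmean(values)
hist_mean = sum(c * n for c, n in zip(centers, counts)) / sum(counts)
```

Each value is replaced by its bin center, so the histogram-based mean can be off by up to `width / 2`; finer binning reduces the bias but does not remove it.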
Sure, we can remove this calculation, but then we need to make sure that the stats pix tool only runs the first_pass.
…so stats value from the histo Use as a normal stats aggregator and add testing for the pix stat comp and tool
bump minor data format version of the two histo related cols in the stat agg
before a breaking change was introduced for dealing with the metadata. But we can do it without breaking changes as done in this commit.
also add a new monitoring group to the hdf5 data format
…cription from low resolution histo bins
…nt from the camera or pixels
…een e.g. gain channels can be observed in the mean values of the histograms
|
| std = Field(None, "standard deviation of the chunk distribution") | ||
|
|
||
|
|
||
| class ChunkHistogramsContainer(ChunkContainer): |
I'd name this singular ChunkHistogramContainer to match how we name others. It only stores one "histogram" (even if it's one with categories per pixel and channel)
| """Container for histograms of the chunk distribution""" | ||
|
|
||
| histogram = Field(None, "histogram of the chunk distribution") | ||
|
|
It would be nice to have a helper function to convert this Container back to a Hist object. Here is an example:

import hist
import numpy as np
from hist import Hist


def hist_from_container(
    cont: ChunkHistogramsContainer, axis_names=["pedestal", "channel", "pixel"]
) -> Hist:
    """Returns a Hist constructed from a stored ChunkHistogramsContainer."""
    bin_edges = cont.meta["bin_edges"]
    axes = [hist.axis.Variable(edges=bin_edges, name=axis_names[0])]
    # the rest of the dimensions
    for name, n_bins in zip(axis_names[1:], cont.histogram.shape[1:]):
        if n_bins == 2:
            axes.append(hist.axis.IntCategory(categories=np.arange(2), name=name))
        else:
            axes.append(
                hist.axis.Regular(bins=n_bins, start=0, stop=n_bins - 1, name=name)
            )
    h = Hist(*axes)
    h[...] = cont.histogram[...]
    return h

Then I can do things like:

import matplotlib.pyplot as plt
from ctapipe.io import HDF5TableReader

with HDF5TableReader("stats.h5") as reader:
    for i, container in enumerate(
        reader.read(
            table_name="/dl1/monitoring/telescope/calibration/camera/pixel_histograms/sky_pedestal_image/tel_001",
            containers=ChunkHistogramsContainer,
            prefixes=[""],
        )
    ):
        h = hist_from_container(container)
        fig, ax = plt.subplots(1, 3, figsize=(10, 3), layout="constrained")
        fig.suptitle(f"Chunk {i}")
        h[:, 0, :].plot(ax=ax[0], norm="log")
        h[:, 1, :].plot(ax=ax[1], norm="log")
        h.integrate("pixel").stack("channel").plot(ax=ax[2], legend=True)
        ax[0].set_title("Channel 0")
        ax[1].set_title("Channel 1")
        ax[2].set_title("Integral over all pixels")

Ideally, you could also store the axis names (["pedestal", "channel_id", "pixel_id"]) in the metadata, that way making one of these plots is trivial from any file.
What we did in datapipe-testbench is to serialize all the necessary Hist info in the metadata: the axis names, the axis types (e.g. Regular, Variable, Category), units, etc. That would work well here as well (and in the future I could use this class directly).
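As a rough sketch of that idea, the axis descriptions could be stored as a JSON string in the table metadata and read back later. The key names (`hist_axes`, `name`, `type`, and so on) and the axis values below are illustrative assumptions; this is not the datapipe-testbench format or an existing ctapipe schema.

```python
import json


def axes_to_meta(axes):
    """Serialize a list of axis descriptions into a JSON string for table metadata."""
    return json.dumps(axes)


def axes_from_meta(text):
    """Inverse of axes_to_meta: recover the list of axis descriptions."""
    return json.loads(text)


# Hypothetical axis descriptions, one dict per histogram dimension.
axes = [
    {"name": "pedestal", "type": "Variable", "edges": [0.0, 1.0, 2.5]},
    {"name": "channel_id", "type": "Category", "categories": [0, 1]},
    {"name": "pixel_id", "type": "Regular", "bins": 4, "start": 0, "stop": 4},
]

meta = {"hist_axes": axes_to_meta(axes)}
restored = axes_from_meta(meta["hist_axes"])
axis_names = [axis["name"] for axis in restored]
```

A reader could then rebuild the matching axis objects from `restored` without having to guess names or types from the array shape.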
For column-wise access, you could also do a similar hist_from_table() that would then give back a 4d histogram, with an added time dimension (maybe use the mean-time of the chunk). Then you can easily do many plots like "plot pixel 3's pedestal over time".
One more nice demo of using this if it's a Hist:
fig, ax = plt.subplots(1,2, figsize=(10,3))
disp1 = CameraDisplay(geom, image=h[:,0,:].profile("pedestal").values(), ax=ax[0])
disp2 = CameraDisplay(geom, image=h[:,0,:].profile("pedestal").variances(), ax=ax[1])
ax[0].set_title(ax[0].get_title() + " Pedestal")
ax[1].set_title(ax[1].get_title() + " Pedestal Variance")
disp1.add_colorbar()
disp2.add_colorbar()
There you don't even need the stats. However, as @maxnoe pointed out above, computing the mean and variance from a histogram isn't as precise as computing them from the events.
| bin_edges = result[0].meta["bin_edges"] | ||
| h = Hist( | ||
| hist.axis.Regular(len(bin_edges) - 1, bin_edges[0], bin_edges[-1], name="value") | ||
| ) |
See prev comment: a helper function to do this would be nice.
| histogram=hist_counts, | ||
| meta={ | ||
| "bin_edges": hist_object.axes[0].edges, | ||
| "bin_centers": hist_object.axes[0].centers, |
It would be nice to add some sort of title here as well, so we know what was being aggregated. When making a plot, I would want to know the axis names as well (e.g. which one is pixel_id, channel_id). This could be stored.
|
Since you are using |





closes #577