Histogram Diagnostic

Description

The Histogram diagnostic is a set of tools for computing and visualizing histograms or probability density functions (PDFs) of climate variables. It supports comparative analysis between a target dataset (typically a climate model) and a reference dataset, commonly an observational or reanalysis product such as ERA5.

Histogram provides tools to plot:

Raw histograms (counts per bin)
Normalized PDFs (probability density functions)
Multi-model comparisons with reference data overlay

Classes

There is one class for the analysis and one for the plotting:

Histogram: retrieves the data and computes histograms or PDFs over specified regions. It handles latitudinal weighting, bin configuration, and regional selection. Results are saved as class attributes and as NetCDF files.
PlotHistogram: provides methods for plotting histograms and PDFs. It generates plots with optional logarithmic scales, smoothing, and customizable axis limits.
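
To illustrate what latitudinal weighting does, here is a minimal NumPy sketch. It is not the diagnostic's actual implementation. It assumes a regular latitude-longitude grid, where cell area scales with the cosine of latitude. The grid, the random field and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical 2-degree regular grid; the field is a random stand-in for real data.
lat = np.arange(-89.0, 90.0, 2.0)   # cell-centre latitudes in degrees
lon = np.arange(0.0, 360.0, 2.0)    # cell-centre longitudes in degrees
rng = np.random.default_rng(0)
field = rng.gamma(shape=2.0, scale=1.5, size=(lat.size, lon.size))

# Assumption: cell area is proportional to cos(latitude) on a regular grid.
weights = np.cos(np.deg2rad(lat))[:, np.newaxis] * np.ones((1, lon.size))

# Unweighted histogram: every grid cell counts as 1.
unweighted, edges = np.histogram(field, bins=100)

# Weighted histogram: every grid cell contributes its area weight instead.
weighted, _ = np.histogram(field, bins=edges, weights=weights)
```

With weighting, each bin holds the summed area weight of the grid cells that fall in it rather than a plain cell count, so high-latitude cells contribute less to the distribution.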

Note

The diagnostic computes histograms over the entire temporal period specified (no seasonal decomposition).

File structure

The diagnostic is located in the aqua/diagnostics/histogram/ directory, which contains both the source code and the command line interface (CLI) script.
A template configuration file is available at aqua/diagnostics/templates/diagnostics/config-histogram.yaml
Region definitions are available in aqua/diagnostics/config/definitions/regions.yaml
Notebooks are available in the notebooks/diagnostics/histogram/ directory and contain examples of how to use the diagnostic.

Input variables and datasets

The diagnostic works with climate variables on regular latitude-longitude grids. Variables typically used with this diagnostic include:

2t (2 metre temperature)
tprate (total precipitation rate)
sst (sea surface temperature)

It also supports derived variables using EvaluateFormula syntax (e.g., 2t - 273.15 for temperature in °C).

Basic usage

A working example of this diagnostic is provided in the demo notebook. The basic structure of the analysis is as follows:

from aqua.diagnostics import Histogram, PlotHistogram

hist_dataset = Histogram(
    catalog='climatedt-phase1',
    model='ICON',
    exp='historical-1990',
    source='lra-r100-monthly',
    startdate='1990-01-01',
    enddate='1999-12-31',
    bins=100,
    weighted=True,
    loglevel='INFO'
)

hist_obs = Histogram(
    catalog='obs',
    model='ERA5',
    exp='era5',
    source='monthly',
    startdate='1990-01-01',
    enddate='1999-12-31',
    bins=100,
    weighted=True,
    loglevel='INFO'
)

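# Retrieve the data and compute the PDFs (density=True), converting to mm/day
hist_dataset.run(var='tprate', units='mm/day', density=True)
hist_obs.run(var='tprate', units='mm/day', density=True)

# Plot the model PDF with the reference PDF overlaid
plot = PlotHistogram(
    data=[hist_dataset.histogram_data],
    ref_data=hist_obs.histogram_data,
    loglevel='INFO'
)

# Logarithmic y-axis, linear x-axis, no smoothing
plot.run(ylogscale=True, xlogscale=False, smooth=False)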

Note

The start and end dates and the reference dataset can be customized. Unless specified otherwise, plots are saved in PNG and PDF format in the current working directory.

CLI usage

The diagnostic can be run from the command line interface (CLI) with the following commands:

cd $AQUA/aqua/diagnostics/histogram
python cli_histogram.py --config <path_to_config_file>

The CLI accepts the following optional arguments:

--config, -c: Path to the configuration file.
--nworkers, -n: Number of workers to use for parallel processing.
--cluster: Cluster to use for parallel processing. By default a local cluster is used.
--loglevel, -l: Logging level. Default is WARNING.
--catalog: Catalog to use for the analysis. Can be defined in the config file.
--model: Model to analyse. Can be defined in the config file.
--exp: Experiment to analyse. Can be defined in the config file.
--source: Source to analyse. Can be defined in the config file.
--outputdir: Output directory for the plots.
--startdate: Start date for the analysis.
--enddate: End date for the analysis.
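
The argument list above can be sketched with the standard library's argparse module. This is an illustrative reconstruction from the documented flags only, not the code of cli_histogram.py. The function name build_parser, the argument types and the None defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(description="Histogram diagnostic CLI (sketch)")
    parser.add_argument("--config", "-c", type=str, help="Path to the configuration file")
    parser.add_argument("--nworkers", "-n", type=int, help="Number of workers for parallel processing")
    parser.add_argument("--cluster", type=str, help="Cluster to use; a local one is used by default")
    parser.add_argument("--loglevel", "-l", type=str, default="WARNING", help="Logging level")
    # These options can also be defined in the config file, so they default to None here.
    for name in ("catalog", "model", "exp", "source", "outputdir", "startdate", "enddate"):
        parser.add_argument(f"--{name}", type=str, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["-c", "config-histogram.yaml", "-n", "4", "--model", "ICON"])
```

In this sketch, options left unset on the command line stay None, which lets the caller fall back to the values in the configuration file.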

Configuration file structure

The configuration file is a YAML file that specifies the datasets to analyse or use as reference, the output directory and the diagnostic settings. Most of the settings are common to all diagnostics (see Diagnostics configuration files). Only the settings specific to the histogram diagnostic are described here.

histogram: a block (nested in the diagnostics block) containing options for the Histogram diagnostic. Variable-specific parameters override the defaults.
- run: enable/disable the diagnostic.
- diagnostic_name: name of the diagnostic. histogram by default.
- variables: list of variables to analyse with their regions.
- formulae: list of formulae to compute new variables from existing ones.
- bins: number of bins for histogram computation.
- range: range for histogram bins as [min, max], or null for auto.
- weighted: use latitudinal weights to account for grid cell area.
- density: if true, compute probability density function (PDF) instead of counts.
- box_brd: apply box boundaries for region selection.
- xlogscale / ylogscale: use logarithmic scale for x/y axes in plots.
- smooth: apply smoothing to histogram.
- smooth_window: window size for smoothing.

histogram:
    run: true
    diagnostic_name: 'histogram'
    bins: 100
    range: null
    weighted: true
    density: true
    box_brd: true
    xlogscale: false
    ylogscale: true
    smooth: false
    smooth_window: 5
    variables:
      - name: '2t'
        regions: [null, 'tropics']
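
The rule that variable-specific parameters override the block defaults can be sketched as a simple dictionary merge. This is a minimal illustration, not the diagnostic's own parsing code. The helper resolve_settings, the second variable entry and the choice of which keys may be overridden are all assumptions.

```python
# The histogram block from the example above, written as a plain dict.
histogram_block = {
    "run": True,
    "diagnostic_name": "histogram",
    "bins": 100,
    "range": None,
    "weighted": True,
    "density": True,
    "box_brd": True,
    "xlogscale": False,
    "ylogscale": True,
    "smooth": False,
    "smooth_window": 5,
    "variables": [
        {"name": "2t", "regions": [None, "tropics"]},
        # Hypothetical entry that overrides two of the block defaults.
        {"name": "tprate", "regions": [None], "bins": 200, "xlogscale": True},
    ],
}

# Keys that describe the block itself rather than per-variable settings (assumed).
NON_DEFAULT_KEYS = ("run", "diagnostic_name", "variables", "formulae")


def resolve_settings(block: dict, variable: dict) -> dict:
    """Merge block defaults with one variable entry; the variable entry wins."""
    defaults = {k: v for k, v in block.items() if k not in NON_DEFAULT_KEYS}
    return {**defaults, **variable}


resolved = [resolve_settings(histogram_block, v) for v in histogram_block["variables"]]
```

Here the 2t entry inherits every default, while the hypothetical tprate entry ends up with 200 bins and a logarithmic x-axis.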

Output

The diagnostic produces the following outputs:

Histogram/PDF line plots
Multi-model comparisons with reference data
Optional smoothing and custom axis limits
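
The smoothing algorithm is not specified in this document. As one plausible reading of the smooth and smooth_window options, the sketch below applies a centred moving average with NumPy. The function moving_average and the sample values are hypothetical.

```python
import numpy as np


def moving_average(values, window: int = 5) -> np.ndarray:
    """Centred moving average; near the edges only the available points are averaged."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window)
    # Sum of the values inside each window (zero-padded at the edges).
    summed = np.convolve(values, kernel, mode="same")
    # Number of real samples inside each window, to undo the zero padding.
    counts = np.convolve(np.ones_like(values), kernel, mode="same")
    return summed / counts


# Hypothetical noisy PDF values, one per bin.
pdf = np.array([0.0, 0.1, 0.4, 0.2, 0.9, 0.3, 0.1, 0.0])
smoothed = moving_average(pdf, window=5)
```

A larger window gives a smoother curve but flattens narrow peaks, which matters for the tails that a logarithmic y-axis makes visible.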

Plots are saved in both PDF and PNG format. Data outputs are saved as NetCDF files.

Observations

The default reference dataset is ERA5 reanalysis, provided by ECMWF.

Other common reference datasets include MSWEP (Multi-Source Weighted-Ensemble Precipitation) and BERKELEY-EARTH (Berkeley Earth Surface Temperature).

Custom reference datasets can be configured in the configuration file.

Available demo notebooks

Notebooks are stored in notebooks/diagnostics/histogram:

histogram.ipynb

Authors and contributors

This diagnostic is maintained by Marco Cadau (@mcadau, marco.cadau@polito.it). Contributions are welcome: please open an issue or a pull request. For questions or suggestions, contact the AQUA team or the maintainer.

Detailed API

This section provides a detailed reference for the Application Programming Interface (API) of the histogram diagnostic, generated from the function docstrings.

class aqua.diagnostics.histogram.Histogram(model: str, exp: str, source: str, catalog: str = None, regrid: str = None, startdate: str = None, enddate: str = None, region: str = None, lon_limits: list = None, lat_limits: list = None, regions_file_path: str = None, bins: int = 100, range: tuple = None, weighted: bool = True, diagnostic_name: str = 'histogram', loglevel: str = 'WARNING')

Bases: Diagnostic

Class to compute histograms and probability density functions (PDFs) of a variable over a specified region. Retrieves data from the catalog, computes histograms/PDFs for the entire period, and saves results to NetCDF files.

Initialize the Histogram diagnostic class.

Parameters:

model (str) – Model to be used for data retrieval.
exp (str) – Experiment to be used for data retrieval.
source (str) – Source to be used for data retrieval.
catalog (str, optional) – Catalog for data retrieval.
regrid (str, optional) – Regridding method.
startdate (str, optional) – Start date of data to retrieve.
enddate (str, optional) – End date of data to retrieve.
region (str, optional) – Region for data retrieval.
lon_limits (list, optional) – Longitude limits of region.
lat_limits (list, optional) – Latitude limits of region.
regions_file_path (str, optional) – Path to regions file.
bins (int, optional) – Number of bins for histogram. Default 100.
range (tuple, optional) – Range for histogram bins (min, max).
weighted (bool, optional) – Use latitudinal weights. Default True.
diagnostic_name (str, optional) – Name of diagnostic. Default ‘histogram’.
loglevel (str, optional) – Log level.

MINIMUM_MONTHS_REQUIRED = 12

compute_histogram(box_brd: bool = True, density: bool = True)

Compute histogram of the data for the entire period.

Parameters:

box_brd (bool) – Include box boundaries in area selection.
density (bool) – If True, returns PDF normalized to integrate to 1.
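
The statement that a PDF is "normalized to integrate to 1" can be checked with a short NumPy sketch. It illustrates the normalization itself and does not reproduce the code of compute_histogram. The sample data are random placeholders.

```python
import numpy as np

# Hypothetical sample standing in for a temperature field in kelvin.
rng = np.random.default_rng(42)
sample = rng.normal(loc=288.0, scale=10.0, size=10_000)

# Raw histogram: number of samples per bin.
counts, edges = np.histogram(sample, bins=100)
widths = np.diff(edges)

# PDF: divide the counts by the total count and by the bin width.
pdf = counts / (counts.sum() * widths)

# The integral of the PDF over all bins (sum of value times width) is 1.
integral = np.sum(pdf * widths)

# NumPy's density=True applies the same normalization.
pdf_np, _ = np.histogram(sample, bins=edges, density=True)
```

Because the PDF is divided by the bin width, its values carry the inverse units of the variable, and curves computed with different bin counts remain comparable.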

retrieve(var: str, formula: bool = False, long_name: str = None, units: str = None, standard_name: str = None, reader_kwargs: dict = {})

Retrieve data for the specified variable using the parent Diagnostic class.

Parameters:

var (str) – Variable to retrieve.
formula (bool) – Whether to use formula for variable.
long_name (str) – Long name of variable.
units (str) – Units of variable.
standard_name (str) – Standard name of variable.
reader_kwargs (dict) – Additional Reader kwargs.

run(var: str, formula: bool = False, long_name: str = None, units: str = None, standard_name: str = None, box_brd: bool = True, density: bool = True, outputdir: str = './', rebuild: bool = True, reader_kwargs: dict = {})

Run all steps for histogram computation.

Parameters:

var (str) – Variable to retrieve and compute.
formula (bool) – Use formula for variable.
long_name (str) – Long name of variable.
units (str) – Units of variable.
standard_name (str) – Standard name of variable.
box_brd (bool) – Include box boundaries.
density (bool) – Return PDF (normalized) instead of counts.
outputdir (str) – Output directory.
rebuild (bool) – Rebuild existing files.
reader_kwargs (dict) – Additional Reader kwargs.

save_netcdf(outputdir: str = './', rebuild: bool = True)

Save histogram data to a NetCDF file.

Parameters:

outputdir (str) – Output directory.
rebuild (bool) – Rebuild if file exists.

class aqua.diagnostics.histogram.PlotHistogram(data=None, ref_data=None, diagnostic_name='histogram', density=True, loglevel: str = 'WARNING')

Bases: object

Class for plotting Histogram diagnostics. Provides methods to plot histogram/PDF data with customizable labels, titles, and styling options.

Initialize the PlotHistogram class.

Parameters:

data – List of histogram DataArrays to plot, or single DataArray.
ref_data – Reference histogram DataArray.
diagnostic_name (str) – Name of the diagnostic. Default is ‘histogram’.
density (bool) – Whether data represents PDF (True) or counts (False).
loglevel (str) – Logging level. Default is ‘WARNING’.

get_data_info(): Extract metadata from data arrays.

plot(data_labels=None, ref_label=None, title=None, style=None, xlogscale=False, ylogscale=True, xmax=None, xmin=None, ymax=None, ymin=None, smooth=False, smooth_window=5, labelsize=None)

Plot histogram data.

Parameters:

data_labels (list, optional) – Labels for the data.
ref_label (str, optional) – Label for the reference data.
title (str, optional) – Title for the plot.
style (str, optional) – Plotting style.
xlogscale (bool) – Use log scale for x-axis.
ylogscale (bool) – Use log scale for y-axis.
xmax (float, optional) – Maximum x value.
xmin (float, optional) – Minimum x value.
ymax (float, optional) – Maximum y value.
ymin (float, optional) – Minimum y value.
smooth (bool) – Apply smoothing to data.
smooth_window (int) – Window size for smoothing.
labelsize (int, optional) – Font size for labels.

Returns:

Matplotlib figure and axes objects.

Return type:

tuple

run(outputdir='./', rebuild=True, dpi=300, style=None, format: str | list = ['png', 'pdf', 'svg'], xlogscale=False, ylogscale=True, xmax=None, xmin=None, ymax=None, ymin=None, smooth=False, smooth_window=5, labelsize=None, show=False)

Run the complete plotting workflow.

Parameters:

outputdir (str) – Output directory to save the plot.
rebuild (bool) – If True, rebuild the plot even if it already exists.
dpi (int) – Dots per inch for the plot.
style (str) – Plotting style.
format (str or list) – Format(s) to save the figure. Default is SAVE_FORMAT.
xlogscale (bool) – Use log scale for x-axis.
ylogscale (bool) – Use log scale for y-axis.
xmax (float, optional) – Maximum x value.
xmin (float, optional) – Minimum x value.
ymax (float, optional) – Maximum y value.
ymin (float, optional) – Minimum y value.
smooth (bool) – Apply smoothing to data.
smooth_window (int) – Window size for smoothing.
labelsize (int, optional) – Font size for labels.
show (bool) – If True, display the plot interactively.

save_plot(fig, description: str = None, rebuild: bool = True, outputdir: str = './', dpi: int = 300, format: str | list = ['png', 'pdf', 'svg'])

Save the plot to a file.

Parameters:

fig (matplotlib.figure.Figure) – Figure object.
description (str) – Description of the plot.
rebuild (bool) – If True, rebuild the plot even if it already exists.
outputdir (str) – Output directory to save the plot.
dpi (int) – Dots per inch for the plot.
format (str or list) – Format(s) to save the figure. Default is SAVE_FORMAT.
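
The interplay of the format and rebuild parameters can be sketched as follows. This is a hedged illustration of the documented behaviour, not the body of save_plot: it assumes that format accepts either a string or a list and that existing files are skipped when rebuild is False. The helper plan_outputs and the file name histogram are hypothetical.

```python
import tempfile
from pathlib import Path


def plan_outputs(outputdir, basename, formats, rebuild=True):
    """Return the paths that would be written for the requested format(s)."""
    # Accept a single format string as well as a list of formats.
    if isinstance(formats, str):
        formats = [formats]
    targets = [Path(outputdir) / f"{basename}.{fmt}" for fmt in formats]
    # With rebuild=False, files that already exist are left untouched.
    return [path for path in targets if rebuild or not path.exists()]


with tempfile.TemporaryDirectory() as tmp:
    # Pretend that a PNG from an earlier run is already present.
    (Path(tmp) / "histogram.png").touch()
    names_keep = [p.name for p in plan_outputs(tmp, "histogram", ["png", "pdf"], rebuild=False)]
    names_all = [p.name for p in plan_outputs(tmp, "histogram", "png", rebuild=True)]
```

In this sketch, rebuild=False writes only the missing PDF, while rebuild=True overwrites the existing PNG.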

set_data_labels(): Set the data labels for the plot.

set_description(): Set the description for the plot.

set_ref_label(): Set the reference label for the plot.

set_title(): Set the title for the plot.