.. _histogram:

Histogram Diagnostic
====================

Description
-----------

The **Histogram** diagnostic is a set of tools for computing and visualizing histograms or probability density functions (PDFs) of climate variables.
It supports comparative analysis between a target dataset (typically a climate model) and a reference dataset, commonly an observational or reanalysis product such as ERA5.

Histogram provides tools to plot:

- Raw histograms (counts per bin)
- Normalized PDFs (probability density functions)
- Multi-model comparisons with reference data overlay

Classes
-------

There is one class for the analysis and one for the plotting:

* **Histogram**: retrieves the data and computes histograms or PDFs over specified regions.
  It handles latitudinal weighting, bin configuration, and regional selection.
  Results are saved as class attributes and as NetCDF files.

* **PlotHistogram**: provides methods for plotting histograms and PDFs.
  It generates plots with optional logarithmic scales, smoothing, and customizable axis limits.

.. note::

    The diagnostic computes histograms over the **entire temporal period** specified
    (no seasonal decomposition).

File structure
--------------

* The diagnostic is located in the ``aqua/diagnostics/histogram/`` directory, which contains both the source code and the command line interface (CLI) script.
* A template configuration file is available at ``aqua/diagnostics/templates/diagnostics/config-histogram.yaml``
* Region definitions are available in ``aqua/diagnostics/config/tools/histogram/definitions/regions.yaml``
* Notebooks are available in the ``notebooks/diagnostics/histogram/`` directory and contain examples of how to use the diagnostic.

Input variables and datasets
----------------------------

The diagnostic works with climate variables on regular latitude-longitude grids:
Some of the variables that are typically used in this diagnostic are:

* ``2t`` (2 metre temperature)
* ``tprate`` (total precipitation rate)
* ``sst`` (sea surface temperature)

It also supports derived variables using ``EvaluateFormula`` syntax (e.g., ``2t - 273.15`` for temperature in °C).

Basic usage
-----------

The basic usage of this diagnostic is explained with a working example in the notebook.
The basic structure of the analysis is the following:

.. code-block:: python

    from aqua.diagnostics import Histogram, PlotHistogram

    hist_dataset = Histogram(
        catalog='climatedt-phase1',
        model='ICON',
        exp='historical-1990',
        source='lra-r100-monthly',
        startdate='1990-01-01',
        enddate='1999-12-31',
        bins=100,
        weighted=True,
        loglevel='INFO'
    )

    hist_obs = Histogram(
        catalog='obs',
        model='ERA5',
        exp='era5',
        source='monthly',
        startdate='1990-01-01',
        enddate='1999-12-31',
        bins=100,
        weighted=True,
        loglevel='INFO'
    )

    hist_dataset.run(var='tprate', units='mm/day', density=True)
    hist_obs.run(var='tprate', units='mm/day', density=True)

    plot = PlotHistogram(
        data=[hist_dataset.histogram_data],
        ref_data=hist_obs.histogram_data,
        loglevel='INFO'
    )

    plot.run(ylogscale=True, xlogscale=False, smooth=False)

.. note::

    Start/end dates and reference dataset can be customized.
    If not specified otherwise, plots will be saved in PNG and PDF format in the current working directory.

CLI usage
---------

The diagnostic can be run from the command line interface (CLI) by running the following command:

.. code-block:: bash

    cd $AQUA/aqua/diagnostics/histogram
    python cli_histogram.py --config <path_to_config_file>

Additionally, the CLI can be run with the following optional arguments:

- ``--config``, ``-c``: Path to the configuration file.
- ``--nworkers``, ``-n``: Number of workers to use for parallel processing.
- ``--cluster``: Cluster to use for parallel processing. By default a local cluster is used.
- ``--loglevel``, ``-l``: Logging level. Default is ``WARNING``.
- ``--catalog``: Catalog to use for the analysis. Can be defined in the config file.
- ``--model``: Model to analyse. Can be defined in the config file.
- ``--exp``: Experiment to analyse. Can be defined in the config file.
- ``--source``: Source to analyse. Can be defined in the config file.
- ``--outputdir``: Output directory for the plots.
- ``--startdate``: Start date for the analysis.
- ``--enddate``: End date for the analysis.

Configuration file structure
----------------------------

The configuration file is a YAML file that contains the details on the dataset to analyse or use as reference, the output directory and the diagnostic settings.
Most of the settings are common to all the diagnostics (see :ref:`diagnostics-configuration-files`).
Here we describe only the specific settings for the histogram diagnostic.

* ``histogram``: a block (nested in the ``diagnostics`` block) containing options for the Histogram diagnostic.
  Variable-specific parameters override the defaults.

    * ``run``: enable/disable the diagnostic.
    * ``diagnostic_name``: name of the diagnostic. ``histogram`` by default.
    * ``variables``: list of variables to analyse with their regions.
    * ``formulae``: list of formulae to compute new variables from existing ones.
    * ``bins``: number of bins for histogram computation.
    * ``range``: range for histogram bins as [min, max], or null for auto.
    * ``weighted``: use latitudinal weights to account for grid cell area.
    * ``density``: if true, compute probability density function (PDF) instead of counts.
    * ``box_brd``: apply box boundaries for region selection.
    * ``xlogscale`` / ``ylogscale``: use logarithmic scale for x/y axes in plots.
    * ``smooth``: apply smoothing to histogram.
    * ``smooth_window``: window size for smoothing.

.. code-block:: yaml

    histogram:
        run: true
        diagnostic_name: 'histogram'
        bins: 100
        range: null
        weighted: true
        density: true
        box_brd: true
        xlogscale: false
        ylogscale: true
        smooth: false
        smooth_window: 5
        variables:
          - name: '2t'
            regions: [null, 'tropics']

Output
------

The diagnostic produces the following outputs:

* Histogram/PDF line plots
* Multi-model comparisons with reference data
* Optional smoothing and custom axis limits

Plots are saved in both PDF and PNG format.
Data outputs are saved as NetCDF files.

Observations
------------------

The default reference dataset is ERA5 reanalysis, provided by ECMWF.

Other common reference datasets include MSWEP (Multi-Source Weighted-Ensemble Precipitation) and BERKELEY-EARTH (Berkeley Earth Surface Temperature).

Custom reference datasets can be configured in the configuration file.

.. Example Plots
.. -------------
..
.. .. figure:: figures/histogram.histogram_pdf.climatedt-phase1.ICON.historical-1990.r1.tprate.png
..    :align: center
..    :width: 100%
..
..    Probability density function (PDF) of precipitation rate (mm/day) for the global region,
..    showing ICON model output compared to ERA5 reference data.


Available demo notebooks
------------------------

Notebooks are stored in ``notebooks/diagnostics/histogram``:

- `histogram.ipynb <https://github.com/DestinE-Climate-DT/AQUA-diagnostics/tree/main/notebooks/diagnostics/histogram/histogram.ipynb>`_

Authors and contributors
------------------------

This diagnostic is maintained by Marco Cadau (`@mcadau <https://github.com/mcadau>`_, `marco.cadau@polito.it <mailto:marco.cadau@polito.it>`_).
Contributions are welcome — please open an issue or a pull request.
For questions or suggestions, contact the AQUA team or the maintainer.

Detailed API
------------

This section provides a detailed reference for the Application Programming Interface (API) of the ``histogram`` diagnostic,
generated from the function docstrings.

.. automodule:: aqua.diagnostics.histogram
    :members:
    :undoc-members:
    :show-inheritance: