From 52541569f4eaf25cb164ba1e0e12f6f5064f0148 Mon Sep 17 00:00:00 2001
From: sschmidt23 <sschmidt@physics.ucdavis.edu>
Date: Wed, 5 Feb 2025 16:59:00 -0800
Subject: [PATCH 1/2] add nb and add pz-rail-dnf to deps

---
 examples/estimation_examples/DNF_Demo.ipynb | 481 ++++++++++++++++++++
 pyproject.toml                              |   1 +
 rail_packages.yml                           |   1 +
 3 files changed, 483 insertions(+)
 create mode 100644 examples/estimation_examples/DNF_Demo.ipynb
diff --git a/examples/estimation_examples/DNF_Demo.ipynb b/examples/estimation_examples/DNF_Demo.ipynb
new file mode 100644
index 0000000..3637aa4
--- /dev/null
+++ b/examples/estimation_examples/DNF_Demo.ipynb
@@ -0,0 +1,481 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# RAIL's DNF implementation example\n",
+    "\n",
+    "**Authors**: Laura Toribio San Cipriano, Sam Schmidt and Juan De Vicente<br>\n",
+    "**Last successfully run**: Feb 05, 2025\n",
+    "\n",
+    "This notebook demonstrates some of the features of the LSSTDESC `RAIL` version of the DNF estimator; see **[De Vicente et al. (2016)](https://arxiv.org/abs/1511.07623)** for more details on the algorithm.<br>\n",
+    "\n",
+    "DNF (Directional Neighbourhood Fitting) is a nearest-neighbor approach for photometric redshift estimation developed at CIEMAT (Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas) in Madrid. DNF computes the photo-z hyperplane that best fits the directional neighbourhood of a photometric galaxy in the training sample.\n",
+    "\n",
+    "The current version of the code for `RAIL` consists of a training stage, `DNFInformer`, and an estimation stage, `DNFEstimator`. `DNFInformer` is a class that preprocesses the photometric data, handles missing or non-detected values, and trains a first basic k-Nearest Neighbors regressor for redshift prediction. `DNFEstimator` calculates photometric redshifts based on an enhancement of nearest-neighbor techniques. The class supports three main metrics for redshift estimation: ENF, ANF, or DNF.\n",
+    "\n",
+    "- **ENF**: Euclidean neighbourhood. It's a common distance metric used in kNN (k-Nearest Neighbors) for photometric redshift prediction.\n",
+    "- **ANF**: uses normalized inner product for more accurate photo-z predictions. It is particularly **recommended** when working with datasets containing more than four filters.\n",
+    "- **DNF**: combines Euclidean and angular metrics, improving accuracy, especially for larger neighborhoods, and maintaining proportionality in observable content.\n",
+    "\n",
+    "\n",
+    "### `DNFInformer`\n",
+    "\n",
+    "The `DNFInformer` class processes a training dataset and produces a model file containing the computed magnitudes, colors, and their associated errors for the dataset. This model is then utilized in the `DNFEstimator` stage for photometric redshift estimation. Missing photometric detections (non-detections) are handled by replacing them with a configurable placeholder value, or optionally ignoring them during model training.\n",
+    "\n",
+    "The configurable parameters for `DNFInformer` include:\n",
+    "\n",
+    "- `bands`: List of band names expected in the input dataset.\n",
+    "- `err_bands`: List of magnitude error column names corresponding to the bands.\n",
+    "- `redshift_col`: String indicating the name of the redshift column in the input data.\n",
+    "- `mag_limits`: Dictionary with band names as keys and floats representing the acceptable magnitude range for each band.\n",
+    "- `nondetect_val`: Float or np.nan, the value indicating a non-detection, which will be replaced by the values in mag_limits.\n",
+    "- `nondetect_replace`: Boolean; if True, non-detections are replaced with the corresponding values in `mag_limits`. If False, non-detections are ignored during the neighbor-finding process.\n",
+    "\n",
+    "\n",
+    "### `DNFEstimator`\n",
+    "\n",
+    "The `DNFEstimator` class uses the model generated by `DNFInformer` to compute photometric redshifts and their PDFs for new datasets. It identifies the nearest neighbors from the training data using one of several distance metrics and estimates redshifts based on these neighbors.\n",
+    "\n",
+    "The configurable parameters for `DNFEstimator` include:\n",
+    "\n",
+    "- `bands`, `err_bands`, `redshift_col`, `nondetect_val`, `mag_limits`: As described for `DNFInformer`.\n",
+    "- `selection_mode`: Integer indicating the method for neighbor selection:\n",
+    "    * `0`: Euclidean Neighbourhood Fitting (ENF).\n",
+    "    * `1`: Angular Neighbourhood Fitting (ANF).\n",
+    "    * `2`: Directional Neighbourhood Fitting (DNF).\n",
+    "- `zmin`, `zmax`, `nzbins`: The minimum and maximum redshift (floats) and the number of bins (integer) of the grid used to estimate the PDFs.\n",
+    "- `pdf_estimation`: Boolean; if True, computes a probability density function (PDF) for the redshift of each object.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "#%matplotlib inline "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import rail\n",
+    "import qp\n",
+    "from rail.core.data import TableHandle\n",
+    "from rail.core.stage import RailStage"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "DS = RailStage.data_store\n",
+    "DS.__class__.allow_overwrite = True"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Training the informer"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can configure DNF by setting options in a dictionary when initializing an instance of our `DNFInformer` stage. Any parameters not explicitly defined will use their default values."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dnf_dict = dict(zmin=0.0, zmax=3.0, nzbins=301, hdf5_groupname='photometry')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will begin by training the algorithm. To do this, we instantiate a RAIL object with a call to the base class.<br>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from rail.estimation.algos.dnf import DNFInformer, DNFEstimator\n",
+    "pz_train = DNFInformer.make_stage(name='inform_DNF', model='demo_DNF_model.pkl', **dnf_dict)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, let's load our training data, which is stored in hdf5 format.  We'll load it into the Data Store so that the ceci stages are able to access it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from rail.utils.path_utils import RAILDIR\n",
+    "trainFile = os.path.join(RAILDIR, 'rail/examples_data/testdata/test_dc2_training_9816.hdf5')\n",
+    "testFile = os.path.join(RAILDIR, 'rail/examples_data/testdata/test_dc2_validation_9816.hdf5')\n",
+    "training_data = DS.read_file(\"training_data\", TableHandle, trainFile)\n",
+    "test_data = DS.read_file(\"test_data\", TableHandle, testFile)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The inform stage of DNF transforms magnitudes into colors, corrects undetected values in the training data, and saves them as a model dictionary. This dictionary is then stored in a pickle file specified by the `model` keyword above, in this case `demo_DNF_model.pkl`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "pz_train.inform(training_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Run DNF"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, we can configure the main photo-z stage and run our algorithm on the data to generate basic photo-z estimates. Keep in mind that we are loading the trained model obtained from the inform stage using the statement `model=pz_train.get_handle('model')`. We will set `nondetect_replace` to `True` to replace non-detection magnitudes with their 1-sigma limits and utilize all colors.\n",
+    "\n",
+    "DNF provides three methods for selecting the distance metric: Euclidean (\"ENF,\" set with `selection_mode` of `0`), Angular (\"ANF,\" set with `selection_mode = 1`, which is the default for this stage), and Directional (\"DNF,\" set with `selection_mode = 2`).\n",
+    "\n",
+    "For our first example, we will set `selection_mode` to `1`, using the angular distance:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "pz = DNFEstimator.make_stage(name='DNF_estimate', hdf5_groupname='photometry',\n",
+    "                        model=pz_train.get_handle('model'),\n",
+    "                        selection_mode=1,\n",
+    "                        nondetect_replace=True)\n",
+    "results = pz.estimate(test_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "DNF calculates its own point estimate, `DNF_Z`, which is stored in the qp Ensemble `ancil` data. DNF also calculates a second point estimate, `DNF_ZN`.\n",
+    "\n",
+    "- `DNF_Z` represents the photometric redshift for each galaxy, computed as the weighted average or hyperplane fit (depending on the option selected) for a set of neighbors determined by a specific metric (ENF, ANF, DNF), with outliers removed.\n",
+    "\n",
+    "- `DNF_ZN` represents the photometric redshift using only the closest neighbor. It is mainly used for computing the redshift distributions.\n",
+    "  \n",
+    "Let's plot that versus the true redshift.  We can also compute the PDF mode for each object and plot that as well:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "zdnf = results().ancil['DNF_Z'].flatten()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "zn_dnf = results().ancil['DNF_ZN'].flatten()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "zgrid = np.linspace(0,3,301)\n",
+    "zmode = results().mode(zgrid).flatten()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "zmode"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's plot the redshift mode against the true redshifts to see how they look:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(8,8))\n",
+    "plt.scatter(test_data()['photometry']['redshift'],zmode,s=1,c='k',label='DNF mode')\n",
+    "plt.plot([0,3],[0,3],'r--');\n",
+    "plt.xlabel(\"true redshift\")\n",
+    "plt.ylabel(\"DNF photo-z mode\")\n",
+    "plt.ylim(0,3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(8,8))\n",
+    "plt.scatter(test_data()['photometry']['redshift'], zdnf, s=1, c='k')\n",
+    "plt.plot([0,3],[0,3], 'r--');\n",
+    "plt.xlabel(\"true redshift\")\n",
+    "plt.ylabel(\"DNF_Z\")\n",
+    "plt.ylim(0,3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(8,8))\n",
+    "plt.scatter(test_data()['photometry']['redshift'], zn_dnf, s=1, c='k')\n",
+    "plt.plot([0,3],[0,3], 'r--');\n",
+    "plt.xlabel(\"true redshift\")\n",
+    "plt.ylabel(\"DNF_ZN\")\n",
+    "plt.ylim(0,3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Plotting PDFs\n",
+    "\n",
+    "In addition to point estimates, we can also plot a few of the full PDFs produced by DNF using the `plot_native` method of the qp Ensemble that we've created as `results`.  We can specify which PDF to plot with the `key` argument to `plot_native`. Let's plot four of them, the 5th, 1380th, 14481st, and 18871st:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, axs = plt.subplots(2, 2, figsize=(12,8))\n",
+    "whichgals = [4, 1379, 14480, 18870]\n",
+    "for ax, which in zip(axs.flat, whichgals):\n",
+    "    ax.set_xlim(0,3)\n",
+    "    results().plot_native(key=which, axes=ax)\n",
+    "    ax.set_xlabel(\"redshift\")\n",
+    "    ax.set_ylabel(\"p(z)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Other distance metrics\n",
+    "\n",
+    "Besides ANF, which we used above, there are options for ENF and DNF.\n",
+    "\n",
+    "Let's run our estimator using `selection_mode=0` for the Euclidean distance, and compare both the mode results and PDF results:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "pz2 = DNFEstimator.make_stage(name='DNF_estimate2', hdf5_groupname='photometry',\n",
+    "                        model=pz_train.get_handle('model'),\n",
+    "                        selection_mode=0,\n",
+    "                        nondetect_replace=True)\n",
+    "results2 = pz2.estimate(test_data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "zdnf2 = results2().ancil['DNF_Z'].flatten()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "zgrid = np.linspace(0,3,301)\n",
+    "zmode2 = results2().mode(zgrid).flatten()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(8,8))\n",
+    "plt.scatter(test_data()['photometry']['redshift'],zmode2,s=1,c='k',label='DNF mode')\n",
+    "plt.plot([0,3],[0,3],'r--');\n",
+    "plt.xlabel(\"true redshift\")\n",
+    "plt.ylabel(\"DNF photo-z mode\")\n",
+    "plt.ylim(0,3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(8,8))\n",
+    "plt.scatter(test_data()['photometry']['redshift'], zdnf2, s=1, c='k')\n",
+    "plt.plot([0,3],[0,3], 'r--');\n",
+    "plt.xlabel(\"true redshift\")\n",
+    "plt.ylabel(\"DNF_Z\")\n",
+    "plt.ylim(0,3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's directly compare the \"angular\" and \"Euclidean\" distance estimates on the same axes:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(8,8))\n",
+    "plt.scatter(test_data()['photometry']['redshift'], zdnf, s=2, c='k', label=\"angular\")\n",
+    "plt.scatter(test_data()['photometry']['redshift'], zdnf2, s=1, c='r', label=\"Euclidean\")\n",
+    "plt.legend(loc='upper left', fontsize=10)\n",
+    "plt.plot([0,3],[0,3], 'm--');\n",
+    "plt.xlabel(\"true redshift\")\n",
+    "plt.ylabel(\"DNF_Z\")\n",
+    "plt.ylim(0,3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(8,8))\n",
+    "plt.scatter(test_data()['photometry']['redshift'], zmode, s=2, c='k')\n",
+    "plt.scatter(test_data()['photometry']['redshift'], zmode2, s=1, c='r')\n",
+    "plt.plot([0,3],[0,3], 'm--');\n",
+    "plt.xlabel(\"true redshift\")\n",
+    "plt.ylabel(\"DNF photo-z mode\")\n",
+    "plt.ylim(0,3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, let's directly compare the same PDFs that we plotted above:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, axs = plt.subplots(2, 2, figsize=(12,8))\n",
+    "whichgals = [4, 1379, 14480, 18870]\n",
+    "for ax, which in zip(axs.flat, whichgals):\n",
+    "    ax.set_xlim(0,3)\n",
+    "    results().plot_native(key=which, axes=ax, label=\"angular\")\n",
+    "    results2().plot_native(key=which, axes=ax, label=\"Euclidean\")\n",
+    "    ax.set_xlabel(\"redshift\")\n",
+    "    ax.set_ylabel(\"p(z)\")\n",
+    "ax.legend(loc='upper left', fontsize=12)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/pyproject.toml b/pyproject.toml
index aed4fe7..33e9f29 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -26,6 +26,7 @@ algos = [
     "pz-rail-astro-tools",
     "pz-rail-bpz",
     "pz-rail-cmnn",
+    "pz-rail-dnf",
     "pz-rail-dsps",
     "pz-rail-flexzboost",
     "pz-rail-fsps",
diff --git a/rail_packages.yml b/rail_packages.yml
index 98d55fe..049210b 100644
--- a/rail_packages.yml
+++ b/rail_packages.yml
@@ -3,6 +3,7 @@ rail_astro_tools: pz-rail-astro-tools
 rail_bpz: pz-rail-bpz
 rail_cmnn: pz-rail-cmnn
 rail_delight: pz-rail_delight
+rail_dnf: pz-rail-dnf
 rail_dsps: pz-rail-dsps
 rail_flexzboost: pz-rail-flexzboost
 rail_fsps: pz-rail-fsps

From cf9fa58190adb6bae69d3ac164ebe8dd6a08c94e Mon Sep 17 00:00:00 2001
From: sschmidt23 <sschmidt@physics.ucdavis.edu>
Date: Wed, 5 Feb 2025 17:20:58 -0800
Subject: [PATCH 2/2] add DNF notebook, uncomment lines in SpecSelector NB

---
 .../creation_examples/example_SOMSpecSelector.ipynb    | 10 +++++-----
 examples/estimation_examples/DNF_Demo.ipynb            |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/examples/creation_examples/example_SOMSpecSelector.ipynb b/examples/creation_examples/example_SOMSpecSelector.ipynb
index 45cec0c..5065830 100644
--- a/examples/creation_examples/example_SOMSpecSelector.ipynb
+++ b/examples/creation_examples/example_SOMSpecSelector.ipynb
@@ -7,9 +7,9 @@
    "source": [
     "# SOMSpecSelector Demo\n",
     "\n",
-    "Author: Sam Schmidt\n",
+    "**Author**: Sam Schmidt\n",
     "\n",
-    "Last successfully run: Jan 7, 2025\n",
+    "**Last successfully run**: Feb 5, 2025\n",
     "\n",
     "This is a short demo of the use of the SOM-based degrader `SOMSpecSelector` that is designed to select a subset of an input galaxy sample via SOM classification such that they match the properties of a reference sample, e.g. to make mock spectroscopic selections for a training set.  \n",
     "\n",
@@ -302,9 +302,9 @@
     "\n",
     "#UNCOMMENT THESE LINES TO GRAB THE LARGER DATA FILES!\n",
     "\n",
-    "#if not os.path.exists(training_file):\n",
-    "#  os.system('curl -O https://portal.nersc.gov/cfs/lsst/PZ/romandesc_specdeep.tar')\n",
-    "#!tar -xvf romandesc_specdeep.tar"
+    "if not os.path.exists(training_file):\n",
+    "  os.system('curl -O https://portal.nersc.gov/cfs/lsst/PZ/romandesc_specdeep.tar')\n",
+    "!tar -xvf romandesc_specdeep.tar"
    ]
   },
   {
diff --git a/examples/estimation_examples/DNF_Demo.ipynb b/examples/estimation_examples/DNF_Demo.ipynb
index 3637aa4..58df8bd 100644
--- a/examples/estimation_examples/DNF_Demo.ipynb
+++ b/examples/estimation_examples/DNF_Demo.ipynb
@@ -473,7 +473,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.16"
+   "version": "3.10.13"
   }
  },
  "nbformat": 4,