- The Problem
- The Goal
- State of the Art
- Data Privacy Notice
- License
- Model Architecture
- Future Directions & Proof of Concept
One of the major bottlenecks in developing water disaggregation methods for flow trace analysis is the lack of publicly available data. Household water consumption data is highly sensitive information, making it extremely difficult for researchers to access real-world datasets required for analysis and method development.
There is a significant need for a robust tool capable of simulating realistic residential water usage data. Such a tool would allow researchers to freely develop, test, and benchmark methods for water end-use disaggregation without relying on restricted or private datasets.
This repository serves as a proof-of-concept for using a 1D Conditional Diffusion Model to generate highly realistic, synthetic residential water usage data. By modeling the complex diurnal patterns and weather-driven irrigation events of individual households, we aim to provide a foundation for a generative tool that researchers can use for end-use disaggregation.
Currently, the state-of-the-art simulator for this domain is STREaM by Cominola et al. (2016). Our approach explores modern deep generative models (specifically, Denoising Diffusion Probabilistic Models) as a novel alternative to create similar simulation tools.
For security and privacy reasons, none of the real-world AMI data used to train the model is included in this repository. However, all of the code required to build the features, process the windows, train the model, and generate the synthetic samples is provided here.
This project is licensed under the Apache License 2.0. This means the code is fully open-source and includes an explicit patent grant, but it is provided "as is" without any warranties, and the creators cannot be held legally liable for how the software is used.
This section details the complete end-to-end architecture of the 1D Conditional Diffusion Model designed for synthetic residential water usage generation. It covers the data processing pipeline, the dual-track mathematical solutions for zero-inflated data, the dual-stream network architecture, and the fundamental diffusion equations.
The model operates on high-resolution Advanced Metering Infrastructure (AMI) data, sampled at 15-minute intervals (96 intervals per 24-hour period).
The model is conditioned on two explicit streams of information to separate human diurnal behavior from stochastic climate-driven behavior (like irrigation).
- Indoor/Behavioral Stream (`c_in`, 12 features)
  - Temporal Cyclicals (`month_sin/cos`, `dow_sin/cos`, `hour_sin/cos`): Bounded `[-1, 1]` trigonometric encodings that allow the model to learn the cyclical nature of time seamlessly.
  - Lagged Usage (`usage_15m`, `usage_30m`, `usage_1h`, `usage_6h`, `usage_24h`, `usage_48h`): Log-normalized historical usage allowing the model to condition on recent and mid-range household activity states.
- Outdoor/Climate Stream (`c_out`, 10 features)
  - Current Climate: `temp_c`, `precip_mm`, `snow_cm`, `snow_flag`.
  - Lagged Climate: `temp_1h`, `temp_24h`, `temp_48h`.
  - Accumulated Climate Variables:
    - `precip_3d`: 3-day rolling sum of precipitation.
    - `snow_24h`: 24-hour rolling sum of snowfall.
    - `gdd_7d`: 7-day rolling sum of Growing Degree Days.
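The cyclical encodings can be illustrated with a minimal sketch. The function names, the use of a fractional hour, and the zero-based month are assumptions made for illustration; the repository's feature builder may differ in these details.

```python
import math
from datetime import datetime
from typing import Dict, Tuple


def cyclical(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def temporal_features(ts: datetime) -> Dict[str, float]:
    """Build the six cyclical calendar features for one timestamp."""
    # Assumption: fractional hour, so each 15-minute step gets its own angle.
    hour = ts.hour + ts.minute / 60.0
    month_sin, month_cos = cyclical(ts.month - 1, 12)  # months as 0..11
    dow_sin, dow_cos = cyclical(ts.weekday(), 7)       # Monday = 0
    hour_sin, hour_cos = cyclical(hour, 24)
    return {
        "month_sin": month_sin, "month_cos": month_cos,
        "dow_sin": dow_sin, "dow_cos": dow_cos,
        "hour_sin": hour_sin, "hour_cos": hour_cos,
    }


feats = temporal_features(datetime(2023, 7, 1, 6, 0))
```

Encoding each calendar field as a sine/cosine pair keeps 23:45 and 00:00 close together in feature space, which a raw integer hour would not.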
The pipeline extracts sequences using disjoint daily windows across 2023 and 2024. This ensures each training sample represents a physically consistent 24-hour profile and prevents temporal leakage that would occur with overlapping strides.
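Disjoint daily windowing can be sketched as follows. The function name is hypothetical, and the sketch assumes the series starts at midnight and has no gaps, which the real pipeline may need to handle explicitly.

```python
from typing import List, Sequence

STEPS_PER_DAY = 96  # 24 hours at 15-minute resolution


def disjoint_daily_windows(
    series: Sequence[float], steps: int = STEPS_PER_DAY
) -> List[List[float]]:
    """Split a series into non-overlapping windows of `steps` readings.

    Assumes the series starts at midnight; a trailing partial day is dropped.
    """
    n_days = len(series) // steps
    return [list(series[d * steps:(d + 1) * steps]) for d in range(n_days)]


# Three full days plus ten leftover readings that get discarded.
readings = [float(i) for i in range(3 * STEPS_PER_DAY + 10)]
windows = disjoint_daily_windows(readings)
```

Because the stride equals the window length, no reading appears in more than one sample.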
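Standard generative models fail on raw water usage because the data is zero-inflated and heavily right-skewed. A house uses exactly zero water during many 15-minute intervals, while the remaining intervals range from small draws to occasional very large spikes.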
To solve this mathematically, the Dataloader applies a Hurdle Transformation, splitting the single water usage track into a 2-channel tensor:
- Occurrence Mask ($m$): A binary mask with values in $\{0.0, 1.0\}$ indicating whether any flow occurred.
- Log-Magnitude ($v$): A strictly continuous value representing the absolute volume, log-transformed and then standardized with normalization statistics $\mu$ and $\sigma$:
  - $v_{raw} = \log(x + 1)$
  - $v_{norm} = \frac{v_{raw} - \mu}{\sigma}$
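The hurdle transformation can be sketched in a few lines of plain Python. The function name is hypothetical, and `mu` and `sigma` are passed in as illustrative values because the source does not say which subset of the data they are computed over.

```python
import math
from typing import List, Sequence, Tuple


def hurdle_transform(
    usage: Sequence[float], mu: float, sigma: float
) -> Tuple[List[float], List[float]]:
    """Split raw usage into an occurrence mask and a normalized log-magnitude."""
    # Channel 0: 1.0 where any flow occurred, else 0.0.
    mask = [1.0 if x > 0.0 else 0.0 for x in usage]
    # Channel 1: v_raw = log(x + 1), then standardize with (mu, sigma).
    v_norm = [(math.log(x + 1.0) - mu) / sigma for x in usage]
    return mask, v_norm


# Illustrative values only; real statistics come from the training data.
mask, v_norm = hurdle_transform([0.0, 0.0, 2.5, 0.0, 12.0], mu=1.0, sigma=0.5)
```

Stacking `mask` and `v_norm` along a channel axis gives the 2-channel tensor the model is trained on.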
The backbone is a Dual-Stream 1D U-Net utilizing FiLM (Feature-wise Linear Modulation) conditioning.
graph TD
X["Noisy Input x_t<br>(2 Channels: Mask, Mag)"] --> U1["UNet_In"]
X --> U2["UNet_Out"]
T["Time Embedding<br>t"] -.-> U1
T -.-> U2
C_IN["c_in<br>(Behavioral/Time)"] --> E_IN["Dense Encoder"]
C_OUT["c_out<br>(Climate/Weather)"] --> E_OUT["Dense Encoder"]
E_IN --> CAT["Concat Full Context"]
E_OUT --> CAT
CAT -.->|"FiLM Injection"| U1
CAT -.->|"FiLM Injection"| U2
U1 --> SUM((+))
U2 --> SUM
SUM --> OUT["Predicted Noise ε_θ<br>(2 Channels)"]
The architecture conceptually separates indoor routines from outdoor irrigation by using two parallel networks. Crucially, both networks receive the full combined context (calendar and weather).
- `UNet_in` focuses on baseline diurnal routines.
- `UNet_out` focuses on massive irrigation spikes.

The final noise prediction is simply $\epsilon_\theta = \epsilon_{in} + \epsilon_{out}$.
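Instead of concatenating conditions to the input, the encoded vectors are projected to per-channel scale and shift parameters $(\gamma, \beta)$ that modulate the feature maps $h$ inside each ResBlock: $h' = \gamma \odot h + \beta$.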
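A dependency-free sketch of FiLM conditioning is shown here, assuming a single dense projection from the context vector to the scale and shift parameters. The helper names, weights, and toy shapes are all hypothetical; the actual model would use learned layers inside each ResBlock.

```python
from typing import List


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Dense layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(h: List[List[float]], gamma: List[float], beta: List[float]) -> List[List[float]]:
    """Scale and shift each channel of h (channels x length) by its own (gamma, beta)."""
    return [[g * v + b for v in channel] for channel, g, b in zip(h, gamma, beta)]


# Toy example: 2 feature channels, 3 time steps, context vector of size 2.
context = [1.0, -1.0]
# Hypothetical projection whose output is [gamma_0, gamma_1, beta_0, beta_1].
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
bias = [1.0, 1.0, 0.0, 2.0]

n_channels = 2
params = linear(context, weight, bias)
gamma, beta = params[:n_channels], params[n_channels:]

h = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
out = film(h, gamma, beta)  # [[2.0, 4.0, 6.0], [2.0, 2.0, 2.0]]
```

Here the context doubles the first channel and replaces the second with a constant, which shows how conditioning can amplify or suppress individual feature channels without changing the input tensor's shape.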
Internal 1D U-Net Tensor Mathematics:
graph TD
%% Styling
classDef input fill:#fcf4cd,stroke:#333,stroke-width:1px;
classDef block fill:#cde9ce,stroke:#333,stroke-width:1px;
classDef op fill:#ffffff,stroke:#333,stroke-width:1px;
classDef skip stroke:#333,stroke-width:2px,stroke-dasharray: 5 5;
subgraph Conditioning ["Conditioning Context"]
T["Time Embedding<br>(1, 64)"]:::input
C["Covariates (c_in + c_out)<br>(1, 128)"]:::input
CatCond(("Concat")):::op
Context["Combined Context c<br>(1, 192)"]:::block
T --> CatCond
C --> CatCond
CatCond --> Context
end
subgraph Encoder ["Encoder (Contracting Path)"]
X["x_t: Noisy Input<br>(1, 2, 96)"]:::input
RB1["ResBlock 1<br>+ FiLM Injection<br>Out: (1, 64, 96)"]:::block
Pool1["AvgPool1d<br>Out: (1, 64, 48)"]:::op
RB2["ResBlock 2<br>+ FiLM Injection<br>Out: (1, 128, 48)"]:::block
Pool2["AvgPool1d<br>Out: (1, 128, 24)"]:::op
end
subgraph Bottleneck ["Bottleneck"]
RB3["ResBlock 3<br>+ FiLM Injection<br>Out: (1, 128, 24)"]:::block
end
subgraph Decoder ["Decoder (Expansive Path)"]
Up1["Upsample 1d<br>Out: (1, 128, 48)"]:::op
Cat1(("Concat")):::op
RB4["ResBlock 4<br>+ FiLM Injection<br>Out: (1, 64, 48)"]:::block
Up2["Upsample 1d<br>Out: (1, 64, 96)"]:::op
Cat2(("Concat")):::op
RB5["ResBlock 5<br>+ FiLM Injection<br>Out: (1, 64, 96)"]:::block
end
subgraph Output ["Output Head"]
FinalConv["Conv1d<br>Out: (1, 2, 96)"]:::block
Eps["Predicted Noise ε_θ<br>(1, 2, 96)"]:::input
end
%% Main Data Flow
X --> RB1
RB1 --> Pool1
Pool1 --> RB2
RB2 --> Pool2
Pool2 --> RB3
RB3 --> Up1
Up1 --> Cat1
Cat1 -->|"(1, 256, 48)"| RB4
RB4 --> Up2
Up2 --> Cat2
Cat2 -->|"(1, 128, 96)"| RB5
RB5 --> FinalConv
FinalConv --> Eps
%% Skip Connections
RB2 -.->|"Skip Connection<br>(1, 128, 48)"| Cat1
RB1 -.->|"Skip Connection<br>(1, 64, 96)"| Cat2
%% Conditioning Flow
Context -.->|"FiLM parameters (γ, β)"| RB1
Context -.->|"FiLM parameters (γ, β)"| RB2
Context -.->|"FiLM parameters (γ, β)"| RB3
Context -.->|"FiLM parameters (γ, β)"| RB4
Context -.->|"FiLM parameters (γ, β)"| RB5
The model operates under the framework of Denoising Diffusion Probabilistic Models (DDPM).
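We define a fixed variance schedule $\beta_1, \dots, \beta_T$, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
The forward process corrupts the true 2-channel data $x_0$ by adding Gaussian noise over $T$ steps: $q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$
This yields the closed-form sampling step: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$.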
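The forward process can be sketched in plain Python. The linear schedule, its endpoints, and `T = 1000` are common DDPM defaults assumed here for illustration; the source does not state which schedule the repository uses. The 2-channel window is treated as a flat list for brevity.

```python
import math
import random
from typing import List, Sequence, Tuple


def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Assumed schedule: T variances spaced evenly between beta_start and beta_end."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alphas(betas: Sequence[float]) -> List[float]:
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(
    x0: Sequence[float], t: int, alpha_bars: Sequence[float], rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Draw x_t from q(x_t | x_0) in one shot; t is 1-indexed. Returns (x_t, eps)."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return x_t, eps


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x0 = [1.0, 0.0, -0.5, 2.0]
x_t, eps = q_sample(x0, 500, alpha_bars, rng)
```

The closed form matters for training: any timestep can be sampled directly from `x0` without simulating the preceding steps.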
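The network learns to reverse the process by predicting the injected noise $\epsilon$ from the noisy sample, the timestep, and both conditioning streams: $\epsilon_\theta(x_t, t, c_{in}, c_{out}) \approx \epsilon$.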
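For reference, the standard simplified DDPM training objective for a noise-predicting network is written out here. Whether this repository uses exactly this loss (for example, without per-channel weighting) is an assumption; $\bar{\alpha}_t$ denotes the cumulative product $\prod_{s=1}^{t} (1 - \beta_s)$ of the variance schedule.

```latex
\mathcal{L}_{\text{simple}}
  = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
    \left[ \left\| \epsilon - \epsilon_\theta\!\left(x_t, t, c_{in}, c_{out}\right) \right\|^2 \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```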
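During sampling, we start with pure Gaussian noise $x_T \sim \mathcal{N}(0, I)$ and iteratively denoise it, stepping from $t = T$ down to $t = 1$ and using the network's noise prediction at each step.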
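A minimal sketch of ancestral sampling follows, assuming the standard DDPM update with $\sigma_t^2 = \beta_t$. The `eps_model` callable stands in for the trained network (conditioning omitted), and `zero_eps` is a placeholder used only so the loop can run; none of these names come from the repository.

```python
import math
import random
from typing import Callable, List, Sequence

NoiseModel = Callable[[List[float], int], List[float]]


def p_sample_step(
    x_t: List[float], t: int, betas: Sequence[float], alpha_bars: Sequence[float],
    eps_model: NoiseModel, rng: random.Random,
) -> List[float]:
    """One reverse step x_t -> x_{t-1}; t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    eps_hat = eps_model(x_t, t)
    coef = beta_t / math.sqrt(1.0 - alpha_bars[t - 1])
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]
    if t == 1:
        return mean  # no noise is added on the final step
    sigma_t = math.sqrt(beta_t)  # assumed variance choice: sigma_t^2 = beta_t
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]


def sample(length: int, betas: Sequence[float], eps_model: NoiseModel, rng: random.Random) -> List[float]:
    """Start from x_T ~ N(0, I) and denoise down to x_0 (flattened window)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for t in range(len(betas), 0, -1):
        x = p_sample_step(x, t, betas, alpha_bars, eps_model, rng)
    return x


def zero_eps(x_t: List[float], t: int) -> List[float]:
    """Placeholder noise predictor; a trained network would go here."""
    return [0.0 for _ in x_t]


betas = [1e-4 + i * (0.02 - 1e-4) / 99 for i in range(100)]
generated = sample(2 * 96, betas, zero_eps, random.Random(0))
```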
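Once we reach $t = 0$, the generated 2-channel sample $\hat{x}_0$ is decoded back to physical units by inverting the hurdle transformation: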
graph LR
A["x_0_hat: 2-Channels"] --> B["Channel 0: Mask m"]
A --> C["Channel 1: Magnitude v"]
B --> D{"m > 0.5?"}
C --> E["gallons = exp(v*σ + μ) - 1"]
D -- Yes --> F["Output = gallons"]
D -- No --> G["Output = 0.000"]
E --> F
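Mathematically, the final generated water usage $\hat{y}$ (in gallons) is: $\hat{y} = \mathbb{1}[m > 0.5] \cdot \left(\exp(v \sigma + \mu) - 1\right)$, where $m$ and $v$ are the mask and magnitude channels of $\hat{x}_0$.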
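The decoding step can be sketched directly from the diagram above. The function name and the example values of `mu` and `sigma` are illustrative assumptions; in practice they must be the same statistics used to normalize the training data.

```python
import math
from typing import List, Sequence


def decode_hurdle(
    mask: Sequence[float], v_norm: Sequence[float],
    mu: float, sigma: float, threshold: float = 0.5,
) -> List[float]:
    """Invert the hurdle transform: gate the de-normalized magnitude by the mask."""
    out = []
    for m, v in zip(mask, v_norm):
        if m > threshold:
            # Undo the standardization, then undo log(x + 1).
            out.append(math.exp(v * sigma + mu) - 1.0)
        else:
            out.append(0.0)
    return out


# Illustrative generated channels for three intervals.
gallons = decode_hurdle([0.9, 0.2, 0.51], [-2.0, 3.0, 0.0], mu=1.0, sigma=0.5)
```

The second interval shows why the mask matters: its magnitude channel is large, but the output is forced to exactly zero because the mask falls below the threshold.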
This repository is primarily a proof of concept demonstrating the feasibility of using 1D Conditional Diffusion Models for residential water usage simulation.
While the current implementation operates on 15-minute intervals (the standard for most Advanced Metering Infrastructure), the architectural framework is designed to be scale-invariant. Ideally, for high-fidelity end-use disaggregation research, this model would be trained on sub-minute temporal resolution data (e.g., 1-second or 5-second intervals). At higher resolutions, the diffusion model would be better equipped to capture the fine-grained "signatures" of individual fixtures, such as the distinct flow characteristics of a specific dishwasher cycle or the unique pressure-drop signature of a shower.