Add ml cloud fraction closure #4185

Julians42 · 2025-12-22T04:15:07Z

Purpose

Adds a machine-learning cloud fraction closure. It gives a 42% error reduction in SCM calibration to LES relative to the quadrature approach.

To-do

Change the NN weights to the online-calibrated weights (currently offline-calibrated) by choosing the best particle.
Performance was really slow in the diagnostic case that I tested: either my allocation request on HPC was bad, or performance / allocations are really bad. It was fine in a previous branch, j/mlcf (prognostic EDMFX in Costa's aiwq setup in the coupler ran at equal speed to quadrature), before rebasing to Anna's version (although I don't think any changes she made would affect performance here).
There is stuff to clean up, but I want to get it running in CI before tomorrow.

Content

I have read and checked the items on the review checklist.

tapios

LGTM (modulo the Slack discussion on how to store/apply the NN weights). Would be good for @szy21 to have a quick look too before merging.

tapios · 2025-12-22T18:25:22Z

src/cache/cloud_fraction.jl

+    # distance to saturation in temperature space 
+    θli = p.scratch.ᶜtemp_scalar_2
+    θli .= TD.liquid_ice_pottemp.(thermo_params, ᶜts)
+    delta_θli = FT(0.1)


Add some code comments to explain what's happening here: Approximate gradient dq_sat/dθli by one-sided finite difference.

Wouldn't it make more sense to take a negative delta_θli, given that saturation occurs with cooling?

Sounds good - I'll clean up the code. I was debating using a centered difference since that's more accurate, but it's more costly (another allocation / computation where I don't think it really matters).

@tapios - In a cloud, the saturation temperature is reached with warming, so I think the reverse argument applies in that case. If you want, I can switch to minus and recalibrate (otherwise I'll just leave it as plus). I don't intuitively see which side will give less error...
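To make the one-sided versus centered trade-off concrete, here is a minimal sketch that compares forward, backward, and centered differences on a smooth saturation curve. The Magnus-type formula `e_sat` and its analytic derivative are illustrative stand-ins written in plain temperature, not the Thermodynamics.jl calls or the θli-space quantity used in the PR; the step `δ = 0.1` only mirrors `delta_θli`.

```julia
# Illustrative saturation curve (Magnus-type formula); NOT the Thermodynamics.jl call.
# T in kelvin, result in Pa.
e_sat(T) = 610.94 * exp(17.625 * (T - 273.15) / (T - 30.11))

# Analytic derivative, used only as a reference to measure the error.
de_sat_dT(T) = e_sat(T) * 17.625 * (273.15 - 30.11) / (T - 30.11)^2

forward_diff(f, x, δ) = (f(x + δ) - f(x)) / δ
backward_diff(f, x, δ) = (f(x) - f(x - δ)) / δ
centered_diff(f, x, δ) = (f(x + δ) - f(x - δ)) / (2δ)

T = 285.0
δ = 0.1   # mirrors delta_θli = FT(0.1), but in plain temperature
exact = de_sat_dT(T)
err_forward = abs(forward_diff(e_sat, T, δ) - exact)
err_backward = abs(backward_diff(e_sat, T, δ) - exact)
err_centered = abs(centered_diff(e_sat, T, δ) - exact)
```

For a smooth function, the forward and backward differences share the same leading-order error magnitude, so the sign of the step mostly matters for which side of saturation is sampled, while the centered difference gains one order of accuracy at the cost of an extra evaluation.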

tapios · 2025-12-22T18:26:10Z

src/cache/cloud_fraction.jl

+    # mixing length
+    ᶜmixing_length_field = p.scratch.ᶜtemp_scalar
+    ᶜmixing_length_field .= ᶜmixing_length(Y, p)
+    # distance to saturation in q space


It would be cleaner and clearer to move the distance to saturation calculations to their own function.

tapios · 2025-12-22T18:27:51Z

src/cache/cloud_fraction.jl

+    p.precomputed.cloud_diagnostics_tuple.cf .= cf
+
+    # ... and add contributions from the updrafts if using EDMF.
+    if turbconv_model isa PrognosticEDMFX || turbconv_model isa DiagnosticEDMFX


Also, it would be better to make the updraft cloud fraction its own function (we'll need it in several places).

haakon-e

Really cool work Julian! Left some suggestions for you to ponder

haakon-e · 2025-12-23T04:44:48Z

src/cache/cloud_fraction.jl

+    # ᶜρa⁰ = @. lazy(ρa⁰(Y.c.ρ, Y.c.sgsʲs, turbconv_model))
+    FT = eltype(p.params)
+
+    # COMPUTE quantities needed to form pi groups 


Most of this function except these lines forming pi groups and calling the NN seems to be identical to set_cloud_fraction! for QuadratureCloud. Perhaps it would be nicer to use that function directly, i.e.:

    NVTX.@annotate function set_cloud_fraction!(
        Y,
        p,
        ::Union{EquilMoistModel, NonEquilMoistModel},
        cloud_model::Union{QuadratureCloud, CloudML},
    )
        # ...
        if cloud_model isa CloudML
            # overwrite with the ML computed cloud fraction, leaving q_liq, q_ice computed via quadrature
            @. p.precomputed.cloud_diagnostics_tuple.cf = set_ml_cloud_fraction(
                cloud_model,
                ᶜmixing_length_field,
                Y.c,
                thermo_params,
                ᶜts,
                p.precomputed.ᶜgradᵥ_q_tot,
                p.precomputed.ᶜgradᵥ_θ_liq_ice,
            )
        end
        # ...

and then define set_ml_cloud_fraction as the pointwise version of the code below. This has the added benefit of avoiding all the allocations you're introducing, which may be the cause of the performance slowdown you're observing.

I also think something like this would be better.
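The allocation argument can be illustrated with a small, self-contained sketch. The function `toy_closure` below is a hypothetical placeholder for the pointwise closure (it is not the neural network), and the array names are made up; the point is only the difference between field-wise intermediates and one fused, in-place broadcast.

```julia
# Hypothetical pointwise closure: maps scalars at one grid point to a scalar.
# This is a placeholder, not the NN from the PR.
toy_closure(q_tot, q_sat) = clamp(q_tot / q_sat, zero(q_tot), one(q_tot))

# Field-wise version: each broadcast statement creates a temporary array.
function cf_fieldwise(q_tot, q_sat)
    ratio = q_tot ./ q_sat            # allocates a temporary
    return clamp.(ratio, 0.0, 1.0)    # allocates the result
end

# Pointwise version: one fused loop writing into a preallocated destination.
function cf_pointwise!(cf, q_tot, q_sat)
    @. cf = toy_closure(q_tot, q_sat)
    return cf
end

q_tot = [0.002, 0.010, 0.015]
q_sat = [0.010, 0.010, 0.010]
cf = similar(q_tot)
cf_pointwise!(cf, q_tot, q_sat)
```

Both versions return the same values; the pointwise one reuses `cf` instead of creating new arrays on every call, which is the property that matters inside a time-stepping loop.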

haakon-e · 2025-12-23T04:47:46Z

src/solver/model_getters.jl

        GridScaleCloud()
    elseif cloud_model == "quadrature"
        QuadratureCloud(SGSQuadrature(FT))
+    elseif cloud_model == "cloud_ml"


It might be better to give it a more descriptive name here, since the ML architecture choice is so specific, e.g. Schmitt2026MLCloudFraction or something similar.

haakon-e · 2025-12-23T04:49:50Z

src/solver/model_getters.jl

+    elseif cloud_model == "cloud_ml"
+        nn_filepath = joinpath(
+            @clima_artifact("cloud_fraction_nn"),
+            "arch_2layers_8nodes.jld2",


Could consider moving the strings "cloud_fraction_nn" and "arch_2layers_8nodes.jld2" to the config/toml file so that you can easily change them later as you refine the model.

@haakon-e cloud_fraction_nn is the name of the artifact, so it will stay fixed, but arch_2layers_8nodes could, for example, be set in the config. That seems like a good idea to me if we want to test different architectures down the line.
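A minimal sketch of that idea, assuming a dictionary-like config: the artifact directory stays fixed and only the weights file name is looked up, with the current file as the default. The key `"cloud_ml_nn_file"` and the helper `nn_filepath` are hypothetical names, and the string `artifact_dir` stands in for `@clima_artifact("cloud_fraction_nn")`.

```julia
# Hypothetical helper: resolve the NN weights file from a config dictionary.
# The config key "cloud_ml_nn_file" is an assumption, not an existing option.
function nn_filepath(
    artifact_dir::AbstractString,
    config::AbstractDict;
    default_file = "arch_2layers_8nodes.jld2",
)
    filename = get(config, "cloud_ml_nn_file", default_file)
    return joinpath(artifact_dir, filename)
end

# Stand-in for @clima_artifact("cloud_fraction_nn"), which stays fixed.
artifact_dir = "cloud_fraction_nn"

default_path = nn_filepath(artifact_dir, Dict{String, Any}())
custom_path = nn_filepath(
    artifact_dir,
    Dict{String, Any}("cloud_ml_nn_file" => "other_arch.jld2"),
)
```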

haakon-e · 2025-12-23T04:50:34Z

src/solver/model_getters.jl

@@ -1,3 +1,6 @@
+using Flux


Is using Flux needed because you do JLD2.load below?

We need it for the reconstruction of the network (just after the load, where we do network(weights)). @imreddyTeja should we be doing import Flux, or import it somewhere else?

Artifacts.toml

szy21 · 2025-12-23T19:39:29Z

src/cache/cloud_fraction.jl

+    #dqt_dz
+    ᶜ∇q =
+        dot.(
+            Geometry.WVector.(p.precomputed.ᶜgradᵥ_q_tot),
+            Ref(ClimaCore.Geometry.WVector(FT(1.0))),
+        )
+    #dθli_dz
+    ᶜ∇θ =
+        dot.(
+            Geometry.WVector.(p.precomputed.ᶜgradᵥ_θ_liq_ice),
+            Ref(ClimaCore.Geometry.WVector(FT(1.0))),
+        )


You can probably use the projected_vector_data function here, although it won't be much simpler.

szy21 · 2025-12-23T19:44:17Z

src/cache/cloud_fraction.jl

+    cf .= apply_cf_nn.(Ref(cloud_ml.model), π_1, π_2, π_3, π_4)
+    #Main.@infiltrate
+    # overwrite with the ML computed cloud fraction, leaving q_liq, q_ice computed via quadrature
+    p.precomputed.cloud_diagnostics_tuple.cf .= cf


This should be multiplied by ᶜρa⁰ for prognostic EDMFX if it's the environment cloud fraction (and I think it should be, as the updraft is treated separately below).
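As a rough sketch of the weighting being asked for, the snippet below combines an environment cloud fraction with updraft cloud fractions using area fractions. It works with plain area fractions rather than `ᶜρa⁰`, and the function name and arguments are hypothetical; how the area fraction is obtained from `ρa` in the model is not shown here.

```julia
# Hypothetical helper: area-weighted grid-mean cloud fraction.
# a_updrafts and cf_updrafts hold one entry per updraft (non-empty here).
function grid_mean_cloud_fraction(cf_env, a_updrafts, cf_updrafts)
    a_env = 1 - sum(a_updrafts)          # environment area fraction
    cf = a_env * cf_env                  # environment contribution
    for (a, cf_up) in zip(a_updrafts, cf_updrafts)
        cf += a * cf_up                  # add each updraft contribution
    end
    return clamp(cf, zero(cf), one(cf))
end

# One updraft covering 10% of the cell and fully cloudy,
# environment cloud fraction of 0.3.
cf_mean = grid_mean_cloud_fraction(0.3, (0.1,), (1.0,))
```

Without the environment weight, the environment contribution would be counted as if it covered the whole cell, which double-counts the updraft area.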

Working ML cloud fraction Adds updraft boolean and liq and ice water content via quadrature broadcasted wrapper, ready to gpu test change to use artifact

Co-authored-by: Haakon Ludvig Langeland Ervik <[email protected]>

szy21 · 2025-12-27T04:47:34Z

src/cache/cloud_fraction.jl

+    q_v = p.scratch.ᶜtemp_scalar_4
+    q_v .= specific.(Y.c.ρq_tot, Y.c.ρ)
+    # Saturation state at current thermodynamic state
+    q_sat = p.scratch.ᶜtemp_scalar_5


ᶜtemp_scalar_5 is used in ᶜmixing_length called on line 301. It is materialized, so it is probably ok to reuse it here, but it would be good to make sure. An easy way to check is to change the temp_scalar here and see if the result is the same.
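The distinction between lazy and materialized results is what makes scratch reuse safe or unsafe. The sketch below uses plain arrays and `Base.Broadcast` rather than ClimaCore fields, and the buffer `scratch` is only a stand-in for something like `ᶜtemp_scalar_5`.

```julia
scratch = [1.0, 2.0, 3.0]

# Lazy: holds a reference to `scratch`; nothing is computed yet.
lazy_result = Base.Broadcast.broadcasted(*, scratch, 2.0)

# Materialized: values are computed now and stored in a separate array.
materialized_result = scratch .* 2.0

# Reuse the scratch buffer for something else.
scratch .= 0.0

# Evaluating the lazy object now reads the overwritten buffer.
from_lazy = Base.Broadcast.materialize(lazy_result)
```

Here `materialized_result` keeps the original values while `from_lazy` reflects the overwritten buffer, which is why swapping in a different scratch field and comparing results is a reasonable check.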

szy21 · 2025-12-27T04:50:11Z

src/cache/cloud_fraction.jl

+    π_1 = (q_sat .- q_v) ./ q_sat
+    π_2 = Δθli ./ θli_sat
+    π_3 = @. (((dqsatdθli * ᶜ∇θ - ᶜ∇q) * ᶜmixing_length_field) / q_sat)
+    π_4 = @. (ᶜ∇θ * ᶜmixing_length_field) / θli_sat


I think these will allocate?
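One way to avoid the temporaries is to compute all four pi groups in a single scalar function and broadcast it once. The sketch below copies the formulas from the diff above; the function name `pi_groups`, the argument order, and the sample values are assumptions for illustration.

```julia
# Scalar (pointwise) version of the four pi groups from the diff.
function pi_groups(q_v, q_sat, Δθli, θli_sat, dqsatdθli, ∇θ, ∇q, mixing_length)
    π_1 = (q_sat - q_v) / q_sat
    π_2 = Δθli / θli_sat
    π_3 = (dqsatdθli * ∇θ - ∇q) * mixing_length / q_sat
    π_4 = ∇θ * mixing_length / θli_sat
    return (π_1, π_2, π_3, π_4)
end

# Made-up values at a single grid point.
groups = pi_groups(0.008, 0.010, 1.5, 300.0, 6.0e-4, 0.004, -2.0e-6, 100.0)

# Broadcasting over arrays fills one preallocated destination of tuples.
q_v = [0.008, 0.009]
q_sat = [0.010, 0.010]
out = Vector{NTuple{4, Float64}}(undef, length(q_v))
out .= pi_groups.(q_v, q_sat, 1.5, 300.0, 6.0e-4, 0.004, -2.0e-6, 100.0)
```

Because the scalar function returns a tuple, the four groups can be passed straight into a pointwise NN call without materializing four separate fields.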

szy21 · 2025-12-27T04:51:15Z

src/cache/cloud_fraction.jl

+    return cf
+end
+
+function saturation_distance(q_v, q_sat, ᶜts, θli, thermo_params, Δθli_fd)


I think this function allocates. Could you make it a point-wise function?

Actually, maybe it is better to make the entire set_ml_cloud_fraction point-wise, if you can. That will avoid the allocations.

@szy21 I made what I could pointwise to reduce allocations. Since there is no pointwise function for the mixing length, and I couldn't find one (or didn't see how to use one) for projected_vector_data, I had to add an extra function to keep the original set_cloud_fraction! function clean. Let me know if you can think of a better way (e.g., with one function) or better naming conventions.

This is looking good. I can take a look tonight / tomorrow. But if you are in a hurry to get this in I'm fine with merging it. (I didn't look at the changes in packages in Project and Manifest - maybe it would be good if a software engineer looks at them?)

tapios

LGTM

tapios · 2025-12-27T22:47:35Z

src/cache/cloud_fraction.jl

    p,
    ::Union{EquilMoistModel, NonEquilMoistModel},
-    qc::QuadratureCloud,
+    qc::Union{QuadratureCloud, MLCloud},


Why do we have both Schmitt2026ML and MLCloud? Seems confusing.

Good point - I agree it's currently not consistent with the naming convention for the other cloud fraction parameterizations. I think the most consistent option is snake case without "cloud" in the name, so just dispatching on cloud_model: ml. My only concern, as @haakon-e pointed out, is that we may have other ML models (@trontrytel was going to work on one for the covariance), so the name becomes ambiguous, and the architecture / pi groups will be specific to a paper I write next year. We could keep it simple for now, name it ml, and disambiguate when another ML cloud case arises. Thoughts?

I'll leave it up to you how to handle this. As a general rule, I think we should avoid naming parameterizations after people, because it leads to hesitancy downstream to change the parameterizations, a stasis I have seen many times that is not healthy. But I have no objection and no strong thoughts on how to handle it here. It may help to make clearer that only cloud fraction is ML; in principle other quantities (condensate path, effective radius, etc.) could be too. So something like MLCloudFraction may work. But again: your call. It's not crucially important.

I think we can call it MLCloud (or MLCloudFraction) and worry about other options when we have them, but up to you.

szy21

The physics part looks good to me now. It may be good if @dennisYatunin can take a look at the package and autodiff change. But I'm fine with merging it.

szy21 · 2025-12-28T05:22:39Z

Project.toml

 uuid = "b2c96348-7fb7-4fe0-8da9-78d88439e717"
 authors = ["Climate Modeling Alliance"]
-version = "0.33.0"
+version = "0.33.1"


Is the version change on purpose?

szy21 · 2025-12-28T05:24:14Z

src/cache/cloud_fraction.jl

    p,
    ::Union{EquilMoistModel, NonEquilMoistModel},
-    qc::QuadratureCloud,
+    qc::Union{QuadratureCloud, MLCloud},


I think we can call it MLCloud (or MLCloudFraction) and worry about other options when we have them, but up to you.

szy21 · 2025-12-28T05:27:02Z

src/cache/cloud_fraction.jl

+            Ref(thermo_params),
+            Ref(FT),


Suggested change

-            Ref(thermo_params),
-            Ref(FT),
+            thermo_params

(I'm not sure, but I don't think you need Ref here. And you can get FT from thermo_params.)
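Both points can be sketched with a toy parameter struct. `ToyParams` and `scaled` are hypothetical; whether the real `thermo_params` type defines `Base.broadcastable` and `Base.eltype` in this way is an assumption to verify against the package.

```julia
# Toy parameter struct carrying its float type as a type parameter.
struct ToyParams{FT}
    R_d::FT
end

# Lets callers recover FT from the parameters instead of passing it around.
Base.eltype(::ToyParams{FT}) where {FT} = FT

# Declares the struct a broadcast scalar, so an explicit Ref is unnecessary.
Base.broadcastable(p::ToyParams) = tuple(p)

function scaled(x, params)
    FT = eltype(params)        # no separate FT argument needed
    return FT(x) * params.R_d
end

params = ToyParams{Float32}(287.0f0)
x = Float32[1, 2, 3]

with_ref = scaled.(x, Ref(params))   # explicit scalar wrapper
without_ref = scaled.(x, params)     # relies on Base.broadcastable
```

If the parameter type does not define `Base.broadcastable`, broadcasting over it without `Ref` would fail or try to iterate it, so the `Ref` is only redundant when that method exists.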

szy21 · 2025-12-28T05:27:11Z

src/cache/cloud_fraction.jl

+    ρ,
+    ᶜts,
+    thermo_params,
+    FT,


Suggested change

-    FT,

szy21 · 2025-12-28T05:31:22Z

src/cache/cloud_fraction.jl

+    # Liquid–ice potential temperature at current thermodynamic state
+    θli = TD.liquid_ice_pottemp(thermo_params, ᶜts)
+
+    q_v = specific(ρq_tot, ρ)


Suggested change

-    q_v = specific(ρq_tot, ρ)

(and use q_tot, which is already an input argument, in place of q_v below)

And note q_tot can be negative so you may want to clip it.
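A minimal sketch of such a clip, written so that it preserves the float type of its input; the helper name `clip_nonnegative` and the sample values are hypothetical.

```julia
# Hypothetical helper: clip small negative values to zero without changing the float type.
clip_nonnegative(q) = max(q, zero(q))

q_tot_raw = Float32[-1.0f-6, 0.0f0, 0.012f0]
q_tot = clip_nonnegative.(q_tot_raw)
```

Using `zero(q)` rather than a literal `0.0` keeps the result in `Float32` when the model runs in single precision.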

Co-authored-by: Zhaoyi Shen <[email protected]>

Julians42 requested review from tapios and trontrytel December 22, 2025 04:15

tapios approved these changes Dec 22, 2025

tapios requested a review from szy21 December 22, 2025 18:39

haakon-e reviewed Dec 23, 2025

imreddyTeja force-pushed the j/mlcf2 branch from 6d04848 to 62f55a3 Compare December 23, 2025 18:27

szy21 reviewed Dec 23, 2025

Julians42 force-pushed the j/mlcf2 branch from 15095a3 to 1d87097 Compare December 27, 2025 00:42

Julians42 and others added 9 commits December 26, 2025 19:43

adds skeleton before rebase

811edb4

Working ML cloud fraction Adds updraft boolean and liq and ice water content via quadrature broadcasted wrapper, ready to gpu test change to use artifact

Rebase to anias changes

6c69f68

relax flux dependency and add new manifest

56069ee

Fix method ambiguity from StatsBase; Modify Manifests

cb498ca

Modify Flux compat

6de59cd

Update test/dependencies for Flux and JLD2

2f9ea01

Update Artifacts.toml

679010f

Co-authored-by: Haakon Ludvig Langeland Ervik <[email protected]>

Refactor

229e028

Reduce allocations, address feedback

65ce871

Julians42 force-pushed the j/mlcf2 branch from 1d87097 to 65ce871 Compare December 27, 2025 00:44

Julians42 requested review from haakon-e, szy21 and tapios December 27, 2025 00:52

Fix gpu nvtx overuse

b971387

szy21 reviewed Dec 27, 2025

Makes ML cloud fraction pointwise to reduce allocations

e5067eb

Julians42 requested a review from szy21 December 27, 2025 18:16

tapios approved these changes Dec 27, 2025

szy21 approved these changes Dec 28, 2025

Julians42 and others added 2 commits December 28, 2025 08:34

Cleanup

73da312

Apply suggestions from code review

e01d02f

Co-authored-by: Zhaoyi Shen <[email protected]>

q_tot changes

170087e
