
Conversation

@razdoburdin
Contributor

The current version of xgboost uses a fixed block_size = 256 for hist building.

This PR makes this value an adaptive function of model parameters and CPU cache size. The change matters mostly for ColsWiseBuildHistKernel and demonstrates up to a 2x speed-up on the epsilon dataset.
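
The core idea, roughly: pick the block size (rows per block) so that the per-row working set fits into the usable part of the L1 data cache, instead of a hard-coded 256. A simplified sketch (illustrative names, not the exact code from this PR):

#include <algorithm>
#include <cstddef>

std::size_t AdaptiveBlockSize(std::size_t usable_l1_size,       // bytes of L1d we allow ourselves to use
                              std::size_t occupied_space,       // bytes kept resident (hist, offsets, index)
                              std::size_t l1_row_foot_print) {  // bytes touched in L1 per processed row
  std::size_t space_for_rows =
      usable_l1_size > occupied_space ? usable_l1_size - occupied_space : 0;
  return std::max<std::size_t>(1, space_for_rows / l1_row_foot_print);
}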

@razdoburdin razdoburdin marked this pull request as draft November 12, 2025 17:18
@trivialfis
Member

Thank you for the optimizations! The code looks reasonable, but please add comments when the PR is ready for review. (and ping me).

std::size_t occupied_space = (hist_fit_to_l1 ? hist_size : 0) + offsets_size + idx_bin_size;
space_in_l1_for_rows = usable_l1_size > occupied_space ? usable_l1_size - occupied_space : 0;
}
std::size_t block_size = std::max<std::size_t>(1, space_in_l1_for_rows / l1_row_foot_print);

Previously block_size was always 256 rows, which is quite large. Now it becomes 1 row when no more rows fit into L1. Won't this change hurt performance in the case where there is not enough space for rows in L1?
Should it be max(256, space_in_l1_for_rows / l1_row_foot_print)?
Or maybe the L2 size should be used to calculate the block_size?

Contributor Author

I think cacheline_size / (2 * sizeof(float)) = 8 would be the best value in this case. Using L2 would result in a huge block_size (~1e4-1e5) and could underutilize the CPU cores (blocks are processed in parallel, and if blocks are very big, some cores would be left without work).
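
A minimal sketch of clamping the block size from below along these lines (hypothetical helper, not the PR code; assumes a 64-byte cache line and float gradient pairs, so the minimum is 64 / (2 * sizeof(float)) = 8 rows):

#include <algorithm>
#include <cstddef>

constexpr std::size_t kCacheLineSize = 64;  // assumed typical cache line size
constexpr std::size_t kMinBlockSize = kCacheLineSize / (2 * sizeof(float));  // == 8 rows

std::size_t ClampedBlockSize(std::size_t space_in_l1_for_rows, std::size_t l1_row_foot_print) {
  return std::max<std::size_t>(kMinBlockSize, space_in_l1_for_rows / l1_row_foot_print);
}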

@razdoburdin razdoburdin marked this pull request as ready for review November 13, 2025 16:20
@razdoburdin
Contributor Author

Hi @trivialfis, this PR is ready for review.

Member

@trivialfis trivialfis left a comment

Would you like to explain the cache info in code comments? Also, the construction of the hist space: how and why it depends on the cache size?

@razdoburdin
Contributor Author

Would you like to explain the cache info in code comments? Also, the construction of the hist space: how and why it depends on the cache size?

done

GetCacheInfo(cache_num++, &type, &level, &sets, &line_size, &partitions, &ways);
if (!trust_cpuid) return trust_cpuid;

if (type == kCpuidTypeNull) break;
Member

Is the cache_sizes[idx] valid if we break here?

Contributor Author

In this case we use the default values from SetDefaultCaches.

Member

If this loop breaks, the function returns true, and then this line does not execute:

  if (!trust_cpuid) SetDefaultCaches();

Are we using the default values?

Contributor Author

If the loop breaks, it means the CPU doesn't have all 4 cache levels, but all the values already read are correct. I have done some refactoring to make this part clearer.

std::size_t n_bins = gidx.cut.Ptrs().back();
std::size_t n_columns = gidx.cut.Ptrs().size() - 1;
bool any_missing = !gidx.IsDense();
std::size_t hist_size = 2 * sizeof(double) * n_bins;
Member

Consider using sizeof(GradientPair) and sizeof(GradientPairPrecise) instead of sizeof(float) * 2 (for all sizeof calls in this PR).

Contributor Author

done

Member

@trivialfis trivialfis Nov 17, 2025

Hmm, I made the comment for line 286, which is not done.

Contributor Author

fixed

*/

/* First step: determine whether one histogram column fits into L1.
* The maximum number of elements in a column is 2^8, 2^16, or 2^32,
Member

Could you please elaborate on what it means to be the maximum number of elements in a (histogram) column? I thought that's the number of histogram bins?

Contributor Author

You are right, "bins" is the correct term. I have fixed the description.

Member

Thank you for updating the comments. It's still not clear to me what it means to have "maximum number of bins" in a column. So, what happens if I specify the training parameter max_bin=53?

Contributor Author

You are right, it is better to use max_bin in this case, otherwise the estimation would be too conservative. I have updated the code.

Member

You are right, it is better to use max_bin in this case, otherwise the estimation would be too conservative

I didn't make any suggestion? I was curious about the constraint and where the numbers 2^8, 2^16 originate or what they are for.

Contributor Author

OK, I had assumed that max_bin would be 2^8, 2^16, or 2^32 as the limit cases for BinTypeSize = 1, 2, or 4. It is better to use the exact max_bin value to get a more accurate estimation.

/* First step: determine whether one histogram column fits into L1.
* Note: column-wise kernel is used for dense data only.
*/
std::size_t hist_col_size = 2 * sizeof(double) * max_bin;
Member

Contributor Author

fixed

Dmitry Razdoburdin added 2 commits November 18, 2025 02:51
Member

@trivialfis trivialfis left a comment

Will do some tests myself today. Will merge if nothing stands out.

The hypervisor prevention part is a bit concerning though, since most of the large jobs are run under VMs.

/* Detect CPU cache sizes at runtime using CPUID.
* CPUID cannot be used reliably on:
* 1. non-x86_64 architectures
* 2. virtualized environments (CPUID may report incorrect cache sizes)
Member

May I ask, does this pretty much rule out most of cloud instances?

Contributor Author

Yes :(
But we don't have any good way to find the real cache sizes in this case.

@trivialfis
Member

@razdoburdin I shared a WIP benchmark result with your GitHub email address [email protected], please take a look when you are available. I highlighted the regression cases.

@razdoburdin
Contributor Author

razdoburdin commented Nov 20, 2025

@razdoburdin I shared a WIP benchmark result with your GitHub email address [email protected], please take a look when you are available. I highlighted the regression cases.

I haven't received it. Could you please duplicate it to [email protected]?

@trivialfis
Member

CPU l1 hist.csv
I will upload a CSV instead. ;-(

@razdoburdin
Contributor Author

CPU l1 hist.csv I will upload a CSV instead. ;-(

Could you also share HW details? Is it a virtualized environment? ARM or x86?

@trivialfis
Member

Could you also share HW details? Is it a virtualized environment? ARM or x86?

Just my personal desktop:

  • AMD Ryzen 9 7900X3D 12-Core Processor. I can find an Intel server if needed, but I need to use machines from work.
  • Bare metal. No VM or container.

@razdoburdin
Contributor Author

  • AMD Ryzen 9 7900X3D 12-Core Processor. I can find an Intel server if needed, but I need to use machines from work.
  • Bare metal. No VM or container.

I see. The 7900X3D has a large per-core L3 capacity, which I hadn't taken into account before. I have updated the code.

@trivialfis
Member

branch       | n_samples_per_batch | n_features | n_batches | sparsity | size (GB) | max_bin | n_rounds | rmse        | DMatrix-Train | Train
L1 hist      | 1048576             | 245        | 32        | 0        | 30.625    | 257     | 128      | 14.84709263 | 89.5329814    | 438.3597314
L1 hist + L3 | 1048576             | 245        | 32        | 0        | 30.625    | 257     | 128      | 14.84709263 | 87.28941536   | 422.529346
master       | 1048576             | 245        | 32        | 0        | 30.625    | 257     | 128      | 14.84709263 | 89.73098493   | 284.3891058

@razdoburdin razdoburdin marked this pull request as draft November 20, 2025 15:36
@razdoburdin
Contributor Author

branch       | n_samples_per_batch | n_features | n_batches | sparsity | size (GB) | max_bin | n_rounds | rmse        | DMatrix-Train | Train
L1 hist      | 1048576             | 245        | 32        | 0        | 30.625    | 257     | 128      | 14.84709263 | 89.5329814    | 438.3597314
L1 hist + L3 | 1048576             | 245        | 32        | 0        | 30.625    | 257     | 128      | 14.84709263 | 87.28941536   | 422.529346
master       | 1048576             | 245        | 32        | 0        | 30.625    | 257     | 128      | 14.84709263 | 89.73098493   | 284.3891058

OK, I hope I have found the reason.
AMD stores the cache topology in a different CPUID leaf, so I have added a vendor switch. I have also removed the under_hypervisor flag, since the current implementation automatically falls back to the default values if a virtualized environment is unable to report cache sizes.

Unfortunately I don't have a 7900X3D to verify the performance myself.
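
For illustration, the kind of vendor switch meant here could look roughly like this (a sketch using the GCC/Clang __get_cpuid intrinsic on x86_64, not the exact PR code): Intel exposes deterministic cache parameters via CPUID leaf 0x4, while AMD uses leaf 0x8000001D.

#include <cpuid.h>
#include <cstdint>
#include <cstring>

inline std::uint32_t CacheTopologyLeaf() {
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  char vendor[13] = {0};
  __get_cpuid(0, &eax, &ebx, &ecx, &edx);   // leaf 0: vendor string is stored in EBX, EDX, ECX
  std::memcpy(vendor + 0, &ebx, 4);
  std::memcpy(vendor + 4, &edx, 4);
  std::memcpy(vendor + 8, &ecx, 4);
  return std::strcmp(vendor, "AuthenticAMD") == 0 ? 0x8000001Du : 0x4u;
}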

@razdoburdin razdoburdin marked this pull request as ready for review November 20, 2025 16:27
@trivialfis
Member

trivialfis commented Nov 20, 2025

Excellent! I can confirm that the regression is now fixed. Would you like to share your benchmark results for the datasets that you have tested?

I will run some more tests to deliberately disable cpuid (hence using the default value), just in case there's a regression for cloud users.

since the current implementation automatically falls back to the default values if a virtualized environment is unable to report cache sizes.

Could you please elaborate on how the current code detects the case and performs fallback? Is it guaranteed that under VM, if (type == kCpuidTypeNull) break is true? Asking since previously there was an explicit check for VM; now that the check is gone, it looks like the loop will finish all the way down to the bottom cache level. (I should probably create a VM to test this ....)

@trivialfis
Member

trivialfis commented Nov 22, 2025

Hi, I will be on holiday next week. Response might be slow.

I don't want to make this PR difficult, and the CPU implementation could benefit greatly from optimizations. Please feel free to create specialized code for targeted sets of CPUs; just make sure the specialization is well-scoped, say within 20 lines of code, and doesn't regress other CPUs.

@razdoburdin
Contributor Author

Excellent! I can confirm that the regression is now fixed. Would you like to share your benchmark results for the datasets that you have tested?

Here are the benchmark results I collected on my 56-core machine. epsilon is the only case where ColsWiseBuildHistKernel is in use.
[image: benchmark results table]

@razdoburdin
Contributor Author

Could you please elaborate on how the current code detects the case and performs fallback? Is it guaranteed that under VM, if (type == kCpuidTypeNull) break is true? Asking since previously there was an explicit check for VM; now that the check is gone, it looks like the loop will finish all the way down to the bottom cache level. (I should probably create a VM to test this ....)

All cache sizes are initialized to -1 by default.
If some cache level doesn't exist (or the VM is configured not to report it), the condition (type == kCpuidTypeNull) breaks out of the loop and the corresponding elements of the cache_size array remain equal to -1. The getters in the CacheManager class check whether the corresponding element is -1 and return the default value in that case.
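
A minimal sketch of that fallback pattern (illustrative names and default sizes, not the exact CacheManager from the PR):

#include <array>
#include <cstddef>
#include <cstdint>

class CacheSizes {
 public:
  // Getters substitute an assumed default whenever CPUID did not report the level.
  std::size_t L1() const { return Get(0, /*fallback=*/32 * 1024); }    // fallback sizes here are
  std::size_t L2() const { return Get(1, /*fallback=*/1024 * 1024); }  // illustrative only

 private:
  std::size_t Get(std::size_t idx, std::size_t fallback) const {
    return sizes_[idx] < 0 ? fallback : static_cast<std::size_t>(sizes_[idx]);
  }
  // -1 means "not reported by CPUID"; filled in by the detection loop.
  std::array<std::int64_t, 4> sizes_{-1, -1, -1, -1};
};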

@razdoburdin
Contributor Author

Hi @trivialfis,

What is your opinion about this?

@trivialfis
Member

trivialfis commented Dec 2, 2025

Running some tests (picking up from the last week). Out of curiosity, is the Linux /sys/devices/system/cpu/cpu0/cache/index2/size a reliable source of information? In addition, what happens to CPUs with efficient/performance cores, or what happens with CPUs that have different dies (for example, amd 3d cache)?

@razdoburdin
Contributor Author

Out of curiosity, is the Linux /sys/devices/system/cpu/cpu0/cache/index2/size a reliable source of information?

To the best of my understanding, the kernel populates it using a similar CPUID-based mechanism at load time, so the results should be the same if the kernel is recent (see the sketch at the end of this comment).

In addition, what happens to CPUs with efficient/performance cores

cpuid would report the L1/L2 of the logical core that executes the cpuid instruction, so the values are not guaranteed to be optimal for all cores.

, or what happens with CPUs that have different dies (for example, amd 3d cache)?

It would report the total L3 size per CPU (for example, in the case of 2 dies with 16 MB each, the reported value would be 32 MB).
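
For reference, reading the sysfs file mentioned above could be sketched like this (Linux only; assumes the usual "<number>K" format such as "512K", and is not part of this PR):

#include <cstddef>
#include <fstream>
#include <string>

std::size_t ReadSysfsCacheSize(const std::string& path) {
  std::ifstream in(path);
  std::string value;
  if (!(in >> value) || value.empty()) return 0;  // 0 means "unknown"
  std::size_t bytes = std::stoul(value);          // parses the leading digits
  if (value.back() == 'K') bytes *= 1024;
  else if (value.back() == 'M') bytes *= 1024 * 1024;
  return bytes;
}

// e.g. ReadSysfsCacheSize("/sys/devices/system/cpu/cpu0/cache/index2/size")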

@trivialfis
Member

@razdoburdin Could you please help take a look at the sycl error? It seems broken by dependency updates.

@razdoburdin
Contributor Author

@razdoburdin Could you please help take a look at the sycl error? It seems broken by dependency updates.

Yes, I will take a look.
