Make block_size of BuildHistKernel adaptive #11808
base: master
Conversation
Thank you for the optimizations! The code looks reasonable, but please add comments when the PR is ready for review (and ping me).
src/tree/hist/histogram.h
Outdated
std::size_t occupied_space = (hist_fit_to_l1 ? hist_size : 0) + offsets_size + idx_bin_size;
space_in_l1_for_rows = usable_l1_size > occupied_space ? usable_l1_size - occupied_space : 0;
}
std::size_t block_size = std::max<std::size_t>(1, space_in_l1_for_rows / l1_row_foot_print);
Previously block_size was always 256 rows, which is quite large. Now it is 1 row when no more rows fit into L1. Won't this change hurt performance when there is not enough space for rows in L1?
Should it be max(256, space_in_l1_for_rows / l1_row_foot_print)?
Or maybe the L2 size should be used to calculate the block_size?
I think cacheline_size / (2 * sizeof(float)) = 8 would be the best value in this case. Using L2 would result in a huge block_size (~1e4-1e5) and cause potential underutilization of CPU cores (blocks are processed in parallel, and if blocks are very big, some cores would run out of work).
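To make this trade-off concrete, here is a minimal sketch (the names `NumBlocks` and `MinBlockSize` are illustrative, not xgboost's actual API): L2-sized blocks shrink the number of parallel work items below the core count, while a cacheline-sized floor keeps the per-block stores cache-friendly.

```cpp
#include <cassert>
#include <cstddef>

// Number of parallel blocks for a given block size (ceiling division).
// If this drops below the thread count, some cores sit idle.
std::size_t NumBlocks(std::size_t n_rows, std::size_t block_size) {
  return (n_rows + block_size - 1) / block_size;
}

// Cacheline-based lower bound for block_size, as suggested above:
// one cacheline holds cacheline_size / (2 * sizeof(float)) gradient pairs.
std::size_t MinBlockSize(std::size_t cacheline_size = 64) {
  return cacheline_size / (2 * sizeof(float));  // 64 / 8 = 8 rows
}
```

For example, 1e6 rows with an L2-derived block_size of 1e5 yields only `NumBlocks(1000000, 100000) == 10` blocks, which cannot keep, say, 32 threads busy.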
Co-authored-by: Victoriya Fedotova <[email protected]>
Hi @trivialfis, this PR is ready for review.
Would you like to explain the cache info in code comments? Also, how and why the construction of the hist space depends on the cache size?
done
src/common/cache_manager.cc
Outdated
GetCacheInfo(cache_num++, &type, &level, &sets, &line_size, &partitions, &ways);
if (!trust_cpuid) return trust_cpuid;
if (type == kCpuidTypeNull) break;
Is the cache_sizes[idx] valid if we break here?
In this case we use the default values from SetDefaultCaches.
If this loop breaks, the function returns true, and then this line does not execute:
if (!trust_cpuid) SetDefaultCaches();
So are we using the default values?
If the loop breaks, it means the CPU doesn't have all 4 cache levels, but all values that were already read are correct. I have done some refactoring to make this part clearer.
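The control flow under discussion can be sketched as follows (a mocked illustration, not the real CPUID call; `CacheDesc`, `DetectCaches`, and the default sizes are hypothetical): enumeration stops at the first "null" cache type, but every level read before the break already holds a valid size, and levels the CPU does not report simply keep their defaults.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum CacheType { kNull = 0, kData = 1, kUnified = 3 };

struct CacheDesc {
  CacheType type;
  std::size_t size;
};

// Fallback values used for any level that detection does not report
// (the sizes here are placeholders, not xgboost's actual defaults).
void SetDefaultCaches(std::array<std::size_t, 4>* sizes) {
  *sizes = {32 * 1024, 1024 * 1024, 16 * 1024 * 1024, 0};
}

template <std::size_t N>
std::array<std::size_t, 4> DetectCaches(const std::array<CacheDesc, N>& cpu) {
  std::array<std::size_t, 4> sizes;
  SetDefaultCaches(&sizes);  // start from defaults
  std::size_t level = 0;
  for (const auto& c : cpu) {
    if (c.type == kNull) break;  // no more cache levels reported
    if (level < sizes.size()) sizes[level++] = c.size;  // valid before break
  }
  return sizes;
}
```

A CPU reporting only two levels before the null entry keeps the defaults for the remaining levels, which is why the values read before the break stay valid.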
src/tree/hist/histogram.h
Outdated
std::size_t n_bins = gidx.cut.Ptrs().back();
std::size_t n_columns = gidx.cut.Ptrs().size() - 1;
bool any_missing = !gidx.IsDense();
std::size_t hist_size = 2 * sizeof(double) * n_bins;
Consider using sizeof(GradientPair) and sizeof(GradientPairPrecise) instead of sizeof(float) * 2 (for all sizeof calls in this PR).
done
Hmm, I made the comment for line 286, which is not done.
fixed
src/tree/hist/histogram.h
Outdated
*/
/* First step: determine whether one histogram column fits into L1.
 * The maximum number of elements in a column is 2^8, 2^16, or 2^32,
Could you please elaborate on what it means to be the maximum number of elements in a (histogram) column? I thought that's the number of histogram bins?
You are right, bins is the correct term. I have fixed the description.
Thank you for updating the comments. It's still not clear to me what it means to have "maximum number of bins" in a column. So, what happens if I specify the training parameter max_bin=53?
You are right, it is better to use max_bin in this case; otherwise the estimation would be too conservative. I have updated the code.
You are right, it is better to use max_bin in this case, otherwise the estimation would be too conservative
I didn't make any suggestion? I was curious about the constraint and where the numbers 2^8, 2^16 originate or what they are for.
Ok, I had assumed that max_bin would be 2^8, 2^16, or 2^32 as the limit cases for BinTypeSize = 1, 2, or 4. It is better to use the exact max_bin value for a more accurate estimation.
src/tree/hist/histogram.h
Outdated
/* First step: determine whether one histogram column fits into L1.
 * Note: column-wise kernel is used for dense data only.
 */
std::size_t hist_col_size = 2 * sizeof(double) * max_bin;
fixed
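The estimate discussed in this thread can be sketched as follows (a hypothetical illustration; `GradientPairPrecise` here is a stand-in struct with the same layout as xgboost's double-precision gradient pair, and `HistColSize`/`HistColFitsL1` are assumed helper names): one histogram column stores max_bin entries of two doubles (gradient and hessian sums), so its footprint is 2 * sizeof(double) * max_bin bytes.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for xgboost's double-precision gradient pair: 16 bytes per bin.
struct GradientPairPrecise {
  double grad;
  double hess;
};

// Footprint of one histogram column in bytes.
std::size_t HistColSize(std::size_t max_bin) {
  return sizeof(GradientPairPrecise) * max_bin;
}

// Whether a single histogram column fits into the L1 data cache.
bool HistColFitsL1(std::size_t max_bin, std::size_t l1_size) {
  return HistColSize(max_bin) <= l1_size;
}
```

For example, with the default max_bin = 256 a column occupies 4 KiB and easily fits a 32 KiB L1, while max_bin = 4096 needs 64 KiB and does not; using the exact max_bin (e.g. 53) rather than the BinTypeSize limit avoids an overly conservative estimate.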
trivialfis
left a comment
Will do some tests myself today. Will merge if nothing stands out.
The hypervisor prevention part is a bit concerning though, as most of the large jobs run under VMs.
src/common/cache_manager.cc
Outdated
/* Detect CPU cache sizes at runtime using CPUID.
 * CPUID cannot be used reliably on:
 * 1. non-x86_64 architectures
 * 2. virtualized environments (CPUID may report incorrect cache sizes)
May I ask, does this pretty much rule out most of cloud instances?
yes :(
but we don't have any good way to find real cache sizes in this case.
@razdoburdin I shared a WIP benchmark result with your GitHub email address.
I haven't received it. Could you please duplicate it to [email protected]?
CPU l1 hist.csv
Could you also share HW details? Is it a virtualized environment? ARM or x86?
Just my personal desktop:
I see. The 7900X3D has a great L3-per-core capacity, which I didn't take into account before. I have updated the code.
Ok, I hope I have found the reason. Unfortunately I don't have a 7900X3D to verify the perf myself.
Excellent! I can confirm that the regression is now fixed. Would you like to share your benchmark results for the datasets that you have tested? I will run some more tests to deliberately disable cpuid (hence using the default value), just in case there's a regression for cloud users.
Could you please elaborate on how the current code detects the case and performs fallback? Is it guaranteed that under VM,
Hi, I will be on holiday next week, so responses might be slow. I don't want to make this PR difficult, and the CPU implementation could benefit greatly from optimizations. Please feel free to create specialized code for targeted sets of CPUs; make sure the specialization is well-scoped, say within 20 lines of code, and doesn't regress other CPUs.
Could you please elaborate on how the current code detects the case and performs fallback? Is it guaranteed that under VM,
All cache sizes are initialized by
Hi @trivialfis, what is your opinion about this?
Running some tests (picking up from last week). Out of curiosity, is the Linux
To the best of my understanding, the kernel obtains this information in a way similar to cpuid at boot, so the results should be the same if the kernel is recent.
cpuid reports the L1/L2 sizes corresponding to the logical core that executes it, so the values are not 100% optimal for all cores.
It reports the total L3 size per CPU (for example, in the case of 2 dies with 16MB each, the reported value would be 32MB).
@razdoburdin Could you please help take a look into the sycl error? It seems broken by dependency updates.
yes, I will take a look |

The current version of xgboost uses a fixed block_size = 256 for hist building.
This PR makes this value an adaptive function of the model parameters and CPU cache sizes. The change matters mostly for ColsWiseBuildHistKernel and demonstrates up to a 2x speed-up on the epsilon dataset.
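The adaptive idea can be summarized in a minimal sketch (assumed names, not the actual implementation): the block size is however many rows fit into the L1 space left after the histogram, offsets, and index data, clamped from below by a cacheline-sized minimum instead of the old fixed 256.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Adaptive block size: rows fitting into the L1 space left after the
// histogram/offsets/index data, with a cacheline-based floor (8 rows =
// one 64-byte cacheline of float gradient pairs).
std::size_t AdaptiveBlockSize(std::size_t usable_l1_size,
                              std::size_t occupied_space,
                              std::size_t l1_row_footprint,
                              std::size_t min_block = 8) {
  // Guard against underflow when the fixed data already overflows L1.
  std::size_t space_for_rows =
      usable_l1_size > occupied_space ? usable_l1_size - occupied_space : 0;
  return std::max(min_block, space_for_rows / l1_row_footprint);
}
```

With 16 KiB free and a 64-byte row footprint this yields 256 rows (matching the old fixed value), while a fully occupied L1 falls back to the 8-row floor rather than degenerating to 1 row per block.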