Tile-level partitioning in jr/ir loops (ex-trsm). (#695)
Details:
- Reimplemented parallelization of the JR loop in gemmt (which is
recycled for herk, her2k, syrk, and syr2k). Previously, the
rectangular region of the current MC x NC panel of C would be
    parallelized separately from the diagonal region of that same
submatrix, with the rectangular portion being assigned to threads via
slab or round-robin (rr) partitioning (as determined at configure-
time) and the diagonal region being assigned via round-robin. This
approach did not work well when extracting lots of parallelism from
the JR loop and was often suboptimal even for smaller degrees of
parallelism. This commit implements tile-level load balancing (tlb) in
which the IR loop is effectively subjugated in service of more
equitably dividing work in the JR loop. This approach is especially
    potent for certain situations where the diagonal region of the MC x NC
    panel of C is significant relative to the entire region. However, it
also seems to benefit many problem sizes of other level-3 operations
(excluding trsm, which has an inherent algorithmic dependency in the
IR loop that prevents the application of tlb). For now, tlb is
implemented as _var2b.c macrokernels for gemm (which forms the basis
for gemm, hemm, and symm), gemmt (which forms the basis of herk,
her2k, syrk, and syr2k), and trmm (which forms the basis of trmm and
trmm3). Which function pointers (_var2() or _var2b()) are embedded in
the control tree will depend on whether the BLIS_ENABLE_JRIR_TLB cpp
macro is defined, which is controlled by the value passed to the
existing --thread-part-jrir=METHOD (or -r METHOD) configure option.
    The configure script now accepts 'tlb' as a valid value alongside the
    previously supported values of 'slab' and 'rr'. ('slab' is still the
    default.)
Thanks to Leick Robinson for abstractly inspiring this work, and to
Minh Quan Ho for inquiring (in PR #562, and before that in Issue #437)
about the possibility of improved load balance in macrokernel loops,
and even prototyping what it might look like, long before I fully
understood the problem.
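  The core idea of tlb can be sketched as follows. This is an illustrative
  fragment, not the actual BLIS code: the names tlb_range() and
  tile_range_t are hypothetical, and the real macrokernels must also skip
  microtiles that fall outside the stored region of C. The sketch shows how
  the jr and ir loops are fused into one flat microtile index space that is
  then slab-partitioned evenly across threads.

  ```c
  #include <stddef.h>

  /* Hypothetical helper (not the BLIS API): a contiguous range of
     microtiles, counted across the fused jr/ir loops. */
  typedef struct { size_t start; size_t end; } tile_range_t;

  /* Flat tile index space: tile t maps to jr = t / m_iter and
     ir = t % m_iter, where m_iter x n_iter microtiles cover the current
     MC x NC panel of C. Each thread receives a contiguous slab of tiles,
     so work is balanced at tile granularity rather than by whole JR
     iterations. */
  tile_range_t tlb_range( size_t m_iter, size_t n_iter,
                          size_t nt, size_t tid )
  {
      const size_t n_tiles = m_iter * n_iter;
      const size_t lo      = n_tiles / nt;  /* tiles per thread (floor) */
      const size_t rem     = n_tiles % nt;  /* leftover tiles           */

      /* Threads 0..rem-1 each take one extra tile so that no thread has
         more than one tile beyond any other. */
      const size_t start = tid * lo + ( tid < rem ? tid : rem );
      const size_t end   = start + lo + ( tid < rem ? 1 : 0 );

      return ( tile_range_t ){ start, end };
  }
  ```

  For example, 12 tiles over 5 threads split as 3, 3, 2, 2, 2 consecutive
  tiles, rather than whole jr-loop columns of possibly very different cost.
  
  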
- In bli_thread_range_weighted_sub(), tweaked the way we compute the
area of the current MC x NC trapezoidal panel of C by better taking
into account the microtile structure along the diagonal. Previously,
    it was an underestimate, as it assumed MR = NR = 1 (that is, it
    assumed that the microtile column of C that overlapped with the
    diagonal coincided exactly with the diagonal). Now, we only assume
    MR = NR.
This is still a slight underestimate when MR != NR, so the additional
area is scaled by 1.5 in a hackish attempt to compensate for this, as
well as other additional effects that are difficult to model (such as
the increased cost of writing to temporary tiles before finally
updating C). The net effect of this better estimation of the
trapezoidal area should be (on average) slightly larger regions
assigned to threads that have little or no overlap with the diagonal
region (and correspondingly slightly smaller regions in the diagonal
region), which we expect will lead to slightly better load balancing
in most situations.
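  A minimal sketch of this style of area estimate, under the stated
  assumption MR = NR. The function name trap_area_est() is hypothetical
  and this is not the bli_thread_range_weighted_sub() code itself; it only
  illustrates counting the exact scalar area of a lower trapezoid, then
  adding 1.5x the overage that comes from rounding the diagonal up to
  whole NR-wide microtile columns.

  ```c
  /* Hedged sketch (not the BLIS implementation): estimate the weighted
     area of a lower-stored m x n region whose diagonal starts at column
     0, accounting for whole diagonal microtile columns of width nr. */
  double trap_area_est( int m, int n, int nr )
  {
      double exact = 0.0, rounded = 0.0;

      for ( int j = 0; j < n; ++j )
      {
          /* Exact scalar area: column j stores rows j..m-1. */
          int rows = m - j; if ( rows < 0 ) rows = 0;
          exact += rows;

          /* Microtile-aware area: round the diagonal offset down to a
             multiple of nr so that every microtile column touching the
             diagonal is counted in full. */
          int rows_r = m - ( j / nr ) * nr; if ( rows_r < 0 ) rows_r = 0;
          rounded += rows_r;
      }

      /* Scale the extra (microtile-induced) area by 1.5 to compensate
         for effects that are hard to model, per the commit message. */
      return exact + 1.5 * ( rounded - exact );
  }
  ```

  With nr = 1 the estimate reduces to the exact trapezoid area; larger nr
  inflates the diagonal region, nudging threads with diagonal overlap
  toward smaller assigned ranges.
  
  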
- Spun off the contents of bli_thread.[ch] that relate to computing
thread ranges into one of three source/header file pairs:
- bli_thread_range.[ch], which define functions that are not specific
to the jr/ir loops;
- bli_thread_range_slab_rr.[ch], which define functions that implement
slab or round-robin partitioning for the jr/ir loops;
- bli_thread_range_tlb.[ch], which define functions that implement
tlb for the jr/ir loops.
- Fixed the computation of a_next in the last iteration of the IR loop
in bli_gemmt_l_ker_var2(). Previously, it always "wrapped" back around
to the first micropanel of the current MC x KC packed block of A.
However, this is almost never actually the micropanel that is used
next. A new macro, bli_gemmt_l_wrap_a_upanel(), computes a_next
correctly, with a similarly named bli_gemmt_u_wrap_a_upanel() for use
in the upper-stored case (which *does* actually always choose the
first micropanel of A as its a_next at the end of the IR loop).
- Removed adjustments for a_next/b_next (a2/b2) for the diagonal-
intersecting case of gemmt_l_ker_var2() and the above-diagonal case
of gemmt_u_ker_var2() since these cases will only coincide with the
last iteration of the IR loop in very small problems.
- Defined bli_is_last_iter_l() and bli_is_last_iter_u(), the latter of
which explicitly considers whether the current microtile is the last
tile that intersects the diagonal. (The former does the same, but the
computation coincides with the original bli_is_last_iter().) These
functions are now used in gemmt to test when a_next (or a2) should
"wrap" (as discussed above). Also defined bli_is_last_iter_tlb_l()
and bli_is_last_iter_tlb_u(), which are similar to the aforementioned
functions but are used when employing tlb in gemmt.
- Redefined macros in bli_packm_thrinfo.h, which test whether an
iteration of work is assigned to a thread, as static inline functions
in bli_param_macro_defs.h (and then deleted bli_packm_thrinfo.h).
In the process of redefining these macros, I also renamed them from
bli_packm_my_iter_rr/sl() to bli_is_my_iter_rr/sl().
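  The behavior of these predicates can be sketched as static inline
  functions. The signatures below are assumptions for illustration, not
  copies of the BLIS definitions: they decide whether iteration i of a
  parallelized loop belongs to the caller's work id.

  ```c
  #include <stdbool.h>

  /* Round-robin: iterations are dealt out cyclically among n_way
     workers, so worker work_id owns iterations work_id, work_id + n_way,
     work_id + 2*n_way, and so on. */
  static inline bool is_my_iter_rr( int i, int work_id, int n_way )
  {
      return ( i % n_way ) == work_id;
  }

  /* Slab: the worker owns a contiguous [start, end) range of iterations
     that was previously computed by a thread-range function. */
  static inline bool is_my_iter_sl( int i, int start, int end )
  {
      return start <= i && i < end;
  }
  ```
  
  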
- Renamed
bli_thread_range_jrir_rr() -> bli_thread_range_rr()
bli_thread_range_jrir_sl() -> bli_thread_range_sl()
bli_thread_range_jrir() -> bli_thread_range_slrr()
- Renamed
bli_is_last_iter() -> bli_is_last_iter_slrr()
- Defined
bli_info_get_thread_jrir_tlb()
and renamed:
- bli_info_get_thread_part_jrir_slab() ->
bli_info_get_thread_jrir_slab()
- bli_info_get_thread_part_jrir_rr() ->
bli_info_get_thread_jrir_rr()
- Modified bli_rntm_set_ways_for_op() to redirect IR loop parallelism
into the JR loop when tlb is enabled for non-trsm level-3 operations.
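  The redirection amounts to folding the IR ways into the JR ways. The
  following is a hedged sketch with an assumed function name, not the
  actual bli_rntm_set_ways_for_op() logic: when tlb is active for a
  non-trsm operation, the IR loop is no longer an independent source of
  parallelism, so its ways multiply into the JR loop.

  ```c
  #include <stdbool.h>

  /* Illustrative only (hypothetical name and signature): fold IR-loop
     parallelism into the JR loop when tlb is enabled and the operation
     is not trsm. */
  void fold_ir_into_jr( int* jr_way, int* ir_way,
                        bool is_trsm, bool tlb_enabled )
  {
      if ( tlb_enabled && !is_trsm )
      {
          *jr_way *= *ir_way;  /* JR absorbs the IR-level parallelism */
          *ir_way  = 1;        /* IR loop runs sequentially per thread */
      }
  }
  ```
  
  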
- Added a sanity check to prevent bli_prune_unref_mparts() from being
used on packed objects. This prohibition is necessary because the
current implementation does not take into account the atomicity of
packed micropanel widths relative to the diagonal of structured
matrices. That is, the function prunes greedily without regard to
whether doing so would prune off part of a micropanel *which has
already been packed* and assigned to a thread for inclusion in the
computation.
- Further restricted early returns in bli_prune_unref_mparts() to
situations where the primary matrix is not only of general structure
but also dense (in terms of its uplo_t value). The addition of the
matrix's dense-ness to the conditional is required because gemmt is
somewhat unusual in that its C matrix has general structure but is
marked as lower- or upper-stored via its uplo_t. By only checking
for general structure, attempts to prune gemmt C matrices would
incorrectly result in early returns, even though that operation
effectively treats the matrix as symmetric (and stored in only one
triangle).
- Fixed a latent bug in bli_thread_range_rr() wherein incorrect ranges
were computed when 1 < bf. Thankfully, this bug was not yet
manifesting since all current invocations used bf == 1.
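  The intended bf > 1 semantics can be sketched as a block-cyclic
  ownership test. This is an illustrative predicate with an assumed name,
  not the repaired bli_thread_range_rr() itself: iterations are grouped
  into chunks of bf, and chunks (rather than single iterations) are dealt
  to threads cyclically.

  ```c
  #include <stdbool.h>

  /* Hypothetical sketch: does iteration i belong to thread tid when
     nt threads deal out chunks of bf consecutive iterations
     round-robin? The bug class fixed in the commit is mishandling
     the bf > 1 case. */
  static inline bool rr_owns( int i, int tid, int nt, int bf )
  {
      return ( ( i / bf ) % nt ) == tid;
  }
  ```

  With bf == 2 and nt == 2, thread 0 owns iterations 0, 1, 4, 5, ... and
  thread 1 owns 2, 3, 6, 7, ...; with bf == 1 this reduces to the plain
  round-robin assignment that all current call sites exercised.
  
  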
- Fixed a latent bug in some unexercised code in bli_?gemmt_l_ker_var2()
that would perform incorrect pruning of unreferenced regions above
where the diagonal of a lower-stored matrix intersects the right edge.
Thankfully, the bug was not harming anything since those unreferenced
regions were being pruned prior to the macrokernel.
- Rewrote slab/rr-based gemmt macrokernels so that they no longer carved
C into rectangular and diagonal regions prior to parallelizing each
separately. The new macrokernels use a unified loop structure where
quadratic (slab) partitioning is used.
- Updated all level-3 macrokernels to have a more uniform coding style,
  such as combining variable declarations with initializations as
well as the use of const.
- Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and
bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and
bli_thrinfo_thread_id(), respectively. This change probably should
have been included in aeb5f0c.
- Removed old prototypes in bli_gemmt_var.h and bli_trmm_var.h that
corresponded to functions that were removed in aeb5f0c.
- Other very minor cleanups.
- Comment updates.
- (cherry picked from commit 2e1ba9d)