Skip to content

shred: use non-temporal writes if available #4772

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 22, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions src/disco/shred/fd_shred_tile.c
Original file line number Diff line number Diff line change
Expand Up @@ -511,7 +511,34 @@ send_shred( fd_shred_ctx_t * ctx,
hdr->udp->net_dport = fd_ushort_bswap( dest->port );

ulong shred_sz = fd_ulong_if( is_data, FD_SHRED_MIN_SZ, FD_SHRED_MAX_SZ );
#if FD_HAS_AVX
/* We're going to copy this shred potentially a bunch of times without
reading it again, and we'd rather not thrash our cache, so we want
to use non-temporal writes here. We need to make sure we don't
touch the cache line containing the network headers that we just
wrote to though. We know the destination is 64 byte aligned. */
FD_STATIC_ASSERT( sizeof(*hdr)<64UL, non_temporal );
/* src[0:sizeof(hdrs)] is invalid, but now we want to copy
dest[i]=src[i] for i>=sizeof(hdrs), so it simplifies the code. */
uchar const * src = (uchar const *)((ulong)shred - sizeof(fd_ip4_udp_hdrs_t));
memcpy( packet+sizeof(fd_ip4_udp_hdrs_t), src+sizeof(fd_ip4_udp_hdrs_t), 64UL-sizeof(fd_ip4_udp_hdrs_t) );

ulong end_offset = shred_sz + sizeof(fd_ip4_udp_hdrs_t);
ulong i;
for( i=64UL; end_offset-i<64UL; i+=64UL ) {
# if FD_HAS_AVX512
_mm512_stream_si512( (void *)(packet+i ), _mm512_loadu_si512( (void const *)(src+i ) ) );
# else
_mm256_stream_si256( (void *)(packet+i ), _mm256_loadu_si256( (void const *)(src+i ) ) );
_mm256_stream_si256( (void *)(packet+i+32UL), _mm256_loadu_si256( (void const *)(src+i+32UL) ) );
# endif
}
_mm_sfence();
memcpy( packet+i, src+i, end_offset-i ); /* Copy the last partial cache line */

#else
fd_memcpy( packet+sizeof(fd_ip4_udp_hdrs_t), shred, shred_sz );
#endif

ulong pkt_sz = shred_sz + sizeof(fd_ip4_udp_hdrs_t);
ulong tspub = fd_frag_meta_ts_comp( fd_tickcount() );
Expand Down
Loading