Add only TLB support #118
Conversation
…s into existing ones.
Temporary change to aim at dev-tlb branch
@William-An, @FJShen - can you guys look at this? This is a lot of code.
Overall - I think we need to have a meeting with @William-An @LAhmos @JRPan @christindbose to discuss this.
Generally, the code has some small issues I would like to see fixed - but I would like you @yechen3 to create a presentation / readme that describes the feature to us so we are all on the same page with what is happening.
Also, some basic questions:
1 - do the config files have to change?
2 - is there any slowdown in sim time?
3 - does the change affect correlation?
Also please take the addition to the trace system seriously - this is invaluable to debug.
@@ -16,7 +16,7 @@ on:
 # By default regress against accel-sim's dev branch
 env:
   ACCELSIM_REPO: https://github.com/purdue-aalp/accel-sim-framework-public.git
-  ACCELSIM_BRANCH: dev
+  ACCELSIM_BRANCH: dev-tlb
Do we need the dev-tlb branch of accel-sim for gpgpu-sim to work?
If the config does not enable TLBs, does it still need this branch?
I think this is a circular dependency issue: the tlb version of Accel-Sim needs the tlb version of gpgpu-sim and vice versa. We should have some ways to deal with this. Some ideas I have:
- Specify the branch to use for CI runs? Not sure if github allows inputs from users.
- First use the dev branch to test Accel-Sim/dev with GPGPU-Sim/dev-tlb, make sure things are not broken. Then add an additional CI test for Accel-Sim/dev-tlb and GPGPU-Sim/dev-tlb.
Also, is it possible to control whether to enable TLB functionality or not?
The only change in the dev-tlb branch is adding an m_memory_stats argument to trace_shader_core_ctx() within trace_driven.cc. It still needs this change regardless of whether the tlb config is used or not. So would it be better to overload this function and create two variants?
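For illustration only, a minimal sketch of the overloading idea, with a heavily simplified parameter list (the real constructor takes many more arguments); a delegating overload would let Accel-Sim/dev keep the old call site while dev-tlb passes the stats object explicitly:

```cpp
// Sketch only: shader_id and memory_stats stand in for the full argument list.
class memory_stats_t;  // forward declaration, as in gpgpu-sim

class trace_shader_core_ctx_sketch {
 public:
  // variant used by the dev-tlb branch
  trace_shader_core_ctx_sketch(unsigned shader_id, memory_stats_t *memory_stats)
      : m_sid(shader_id), m_memory_stats(memory_stats) {}

  // variant kept for the plain dev branch: delegates with a null stats pointer
  explicit trace_shader_core_ctx_sketch(unsigned shader_id)
      : trace_shader_core_ctx_sketch(shader_id, nullptr) {}

 private:
  unsigned m_sid;
  memory_stats_t *m_memory_stats;  // only consulted when TLB stats are enabled
};
```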
@@ -448,9 +448,9 @@ void shader_core_ctx::create_exec_pipeline() {
    }
  }

-  m_ldst_unit = new ldst_unit(m_icnt, m_mem_fetch_allocator, this,
+  m_ldst_unit = new ldst_unit(m_gpu, m_icnt, m_mem_fetch_allocator, this,
the ldst unit now needs the whole GPU because it needs to look up some global page table information?
No, it needs the whole GPU to register TLB flush callback functions. The GPU page table is managed inside the gmmu class, which is instantiated once per GPU. Additionally, I'm assuming the page table never misses (as in regular memcpy and UVA setups), and each page table walk incurs a constant latency of 100 cycles.
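For illustration, a minimal sketch of how a constant-latency walk can be modeled with a delay queue; class and member names here are assumptions, not the PR's actual code:

```cpp
#include <cstdint>
#include <list>
#include <utility>

struct mem_fetch;  // opaque here

// Sketch: each TLB miss forwarded to the GMMU waits a fixed number of cycles
// (100 by default) before its translation is considered complete.
class gmmu_walk_sketch {
 public:
  explicit gmmu_walk_sketch(uint64_t walk_latency) : m_walk_latency(walk_latency) {}

  // called when a TLB miss arrives over the CU-to-GMMU queue
  void start_walk(mem_fetch *mf, uint64_t now_cycle) {
    m_walk_queue.emplace_back(mf, now_cycle + m_walk_latency);
  }

  // called once per GMMU cycle; returns translations whose latency has elapsed
  std::list<mem_fetch *> cycle(uint64_t now_cycle) {
    std::list<mem_fetch *> done;
    while (!m_walk_queue.empty() && m_walk_queue.front().second <= now_cycle) {
      done.push_back(m_walk_queue.front().first);
      m_walk_queue.pop_front();
    }
    return done;
  }

 private:
  uint64_t m_walk_latency;  // -page_table_walk_latency, 100 core cycles by default
  std::list<std::pair<mem_fetch *, uint64_t>> m_walk_queue;
};
```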
@@ -2011,7 +2013,7 @@ mem_stage_stall_type ldst_unit::process_cache_access(
  }
  if (status == HIT) {
    assert(!read_sent);
-    inst.accessq_pop_back();
+    inst.accessq_pop_front();
Why are we changing this to FIFO?
@@ -2618,20 +2744,24 @@ void ldst_unit::init(mem_fetch_interface *icnt,
  m_next_global = NULL;
  m_last_inst_gpu_sim_cycle = 0;
  m_last_inst_gpu_tot_sim_cycle = 0;

+  gpu->getGmmu()->register_tlbflush_callback(
ok - the global MMU gets a call from every SM?
This should really be done at a different level. I don't like pushing this global stuff down into the ldst unit.
This is only called during initialization to register TLB flush callback functions. These callbacks should be triggered when the TLB is flushed—for example, when a kernel completes.
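For illustration, a minimal sketch of this registration pattern, with assumed names (only register_tlbflush_callback appears in the diff above):

```cpp
#include <functional>
#include <vector>

// Each ldst_unit hands the GMMU a callable at init time; the GMMU invokes
// every registered callable when the TLBs must be flushed (e.g. kernel end).
class gmmu_flush_sketch {
 public:
  void register_tlbflush_callback(std::function<void()> cb) {
    m_flush_callbacks.push_back(std::move(cb));
  }

  // e.g. called from the kernel-completion path
  void flush_all_tlbs() {
    for (auto &cb : m_flush_callbacks) cb();
  }

 private:
  std::vector<std::function<void()>> m_flush_callbacks;
};

// In ldst_unit::init(), roughly:
//   gpu->getGmmu()->register_tlbflush_callback([this] { tlb.clear(); });
```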
I have been actively and eagerly following this PR. If the functionality works fine, I intend to manually add these changes to my own custom AccelSim.
Ok, I will prepare some slides to present next week and work on fixing those issues this week. Answers to @tgrogers questions:
@beneslami Thanks for following the PR! The current version should work, but it hasn’t been fully verified yet. We’re planning to have a meeting to discuss and review everything in more detail. I’ll make sure to update you with the outcome afterward.
Please consider adding a block of comments to provide an overview of the TLB feature: how it works, what it interacts with, its assumptions, etc.
  option_parser_register(
      opp, "-page_table_walk_latency", OPT_INT64, &page_table_walk_latency,
      "Average page table walk latency (in core cycle).", "100");
  option_parser_register(opp, "-page_size", OPT_CSTR, &page_size_string,
This argument seems unused in the whole PR. Is it used anywhere? If used, please consider implementing safety check code to test its validity; if not, please consider removing it.
It's used inside the gmmu_t constructor to derive the page number from an address. What do you mean by validity?
The description says "GDDR page size, only 4KB/2MB avaliable." It would be nice to have a sanity check in the code for this.
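For illustration, a minimal sketch of such a sanity check, assuming the option is parsed from page_size_string and the accepted spellings are "4KB" and "2MB":

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Sketch only (names assumed): reject anything other than the two supported
// page sizes right after option parsing.
static unsigned long long parse_page_size_or_die(const char *page_size_string) {
  if (strcmp(page_size_string, "4KB") == 0) return 4ULL * 1024;
  if (strcmp(page_size_string, "2MB") == 0) return 2ULL * 1024 * 1024;
  fprintf(stderr, "GPGPU-Sim: -page_size must be 4KB or 2MB, got \"%s\"\n",
          page_size_string);
  abort();
}
```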
Please try to accommodate Western Time Zone :)
Pull Request Overview
This PR adds TLB (Translation Lookaside Buffer) support to the GPU simulator to handle virtual memory translation. The implementation adds a fully associative TLB with configurable size (default 4096 entries) and an LRU replacement policy to the load/store units, along with page table walk latency simulation and memory management unit (GMMU) infrastructure.
Key changes include:
- Addition of TLB infrastructure with hit/miss tracking and statistics collection
- Integration of GMMU (Graphics Memory Management Unit) for handling page table walks
- Implementation of communication queues between cores and GMMU for TLB miss handling
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
File | Description |
---|---|
src/gpgpu-sim/shader.h | Adds TLB functionality to ldst_unit class, GMMU communication queues, and new constructor parameters |
src/gpgpu-sim/shader.cc | Implements TLB cycle processing, page table walk handling, and memory access queue modifications |
src/gpgpu-sim/mem_latency_stat.h | Adds TLB statistics tracking data structures |
src/gpgpu-sim/mem_latency_stat.cc | Implements TLB statistics collection and reporting functionality |
src/gpgpu-sim/mem_fetch.h | Adds getter method for memory access information |
src/gpgpu-sim/gpu-sim.h | Defines GMMU class and adds TLB configuration parameters |
src/gpgpu-sim/gpu-sim.cc | Implements GMMU cycle processing and page table walk latency simulation |
src/gpgpu-sim/gpu-misc.h | Adds min4 utility function for clock domain management |
src/abstract_hardware_model.h | Adds TLB miss tracking to warp instructions and memory access queue methods |
src/abstract_hardware_model.cc | Contains placeholder for page size conversion functionality |
.github/workflows/accelsim.yml | Updates CI branch reference for TLB development |
Comments suppressed due to low confidence (1)
src/gpgpu-sim/shader.h:1498
- The variable name 'tlb' is ambiguous. It should be renamed to something more descriptive like 'tlb_entries' or 'tlb_page_list' to clarify that it contains TLB entries.
std::list<mem_addr_t> tlb;
@@ -2011,7 +2013,7 @@ mem_stage_stall_type ldst_unit::process_cache_access(
  }
  if (status == HIT) {
    assert(!read_sent);
-    inst.accessq_pop_back();
+    inst.accessq_pop_front();
Changing from accessq_pop_back() to accessq_pop_front() alters the access pattern from stack-like (LIFO) to queue-like (FIFO). This change should be verified to ensure it doesn't break the expected memory access ordering semantics.
-    inst.accessq_pop_front();
+    inst.accessq_pop_back(); // Restore stack-like (LIFO) behavior for memory access ordering
  fprintf(fout, "========================================TLB "
                "statistics(thrashing)==============================\n");
  std::map<mem_addr_t, unsigned> tlb_thrash[num_cluster];
Variable-length array 'tlb_thrash[num_cluster]' may cause stack overflow for large cluster counts. Consider using dynamic allocation with std::vector instead.
-  std::map<mem_addr_t, unsigned> tlb_thrash[num_cluster];
+  std::vector<std::map<mem_addr_t, unsigned>> tlb_thrash(num_cluster);
      m_core->get_gpu()->gpu_tot_sim_cycle);

  // send it over downward queues (CU to GMMU) to suffer for far fetch latency
  m_cu_gmmu_queue.push_back(mf);
The TLB miss handling creates a memory fetch for every TLB miss, which could lead to excessive queue growth and memory usage under high TLB miss rates. Consider implementing backpressure or batching mechanisms.
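For illustration, a minimal sketch of one possible batching approach: an MSHR-like table that merges misses to the same virtual page before they reach m_cu_gmmu_queue. This is not part of the PR, and all names besides m_cu_gmmu_queue are assumptions:

```cpp
#include <list>
#include <map>

typedef unsigned long long mem_addr_t;
struct mem_fetch;  // opaque here

// Only the first miss to a page is forwarded to the GMMU; later misses to the
// same page wait in m_pending and are released together when the walk returns.
class tlb_miss_coalescer_sketch {
 public:
  // returns true if mf must actually be pushed onto the CU-to-GMMU queue
  bool on_tlb_miss(mem_addr_t page, mem_fetch *mf) {
    bool first_miss = (m_pending.find(page) == m_pending.end());
    m_pending[page].push_back(mf);
    return first_miss;
  }

  // called when the GMMU returns a translation for `page`
  std::list<mem_fetch *> on_fill(mem_addr_t page) {
    std::list<mem_fetch *> ready;
    auto it = m_pending.find(page);
    if (it != m_pending.end()) {
      ready.swap(it->second);
      m_pending.erase(it);
    }
    return ready;
  }

 private:
  std::map<mem_addr_t, std::list<mem_fetch *>> m_pending;
};
```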
Copilot has some valid points.
@tgrogers @JRPan @William-An @FJShen I've made some slides here (gpgpu-sim-tlb.pptx). Please let me know when you are available to meet or if you have any questions.
Hi @yechen3, I would also like to highlight a few points. Maybe these are already taken care of, or they may be insightful for you:
1- I went through your code and saw that the GMMU class is implemented. I was wondering whether it is possible to simulate the presence of an IOMMU by tweaking the internal attributes of the GMMU class? As you know, CPU-GPU executions use unified virtual memory, and each time a miss happens in the last-level TLB of the GPU, a request is sent to the IOMMU. This means we are supposed to experience some amount of PCIe latency (based on this paper).
2- It's good that the GMMU is defined as a class, because in my customized AccelSim I implemented a chiplet-based GPU system and can create a GMMU object per chiplet. Is the TLB also implemented as a class? As far as I remember, the TLB was defined as a global map (correct me if I'm wrong). It would be a scalable design to have a base TLB class, with an L1 TLB inheriting from it, and so on, because address translation is not yet scalable in MCM-GPU systems and exploring different micro-architectural tweaks requires an agile baseline TLB implementation.
Thanks
Hi @beneslami, we are going to have an internal review next Monday. Regarding your suggestions:
Thanks again for your feedback.
@@ -93,6 +93,7 @@ tr1_hash_map<new_addr_type, unsigned> address_random_interleaving;
 #define L2 0x02
 #define DRAM 0x04
 #define ICNT 0x08
+#define GMMU 0x10
Minor nitpicking: can you rewrite this in the form of 1 << 1, 1 << 2? Just to keep things a bit cleaner.
If we were to adopt the form "1 << n", we had better wrap them in a pair of parentheses at the macro definition. The C++ shift operators have low precedence.
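A small example of the pitfall (the 0x10 value is taken from the diff above; everything else is illustrative):

```cpp
#include <cassert>

#define GMMU_BAD  1 << 4    // unparenthesized
#define GMMU_GOOD (1 << 4)  // parenthesized

int main() {
  // Additive operators bind tighter than <<, so GMMU_BAD - 1 expands to 1 << (4 - 1).
  assert((GMMU_BAD - 1) == 8);    // surprising: 1 << 3
  assert((GMMU_GOOD - 1) == 15);  // intended mask 0xF
  return 0;
}
```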
  class memory_stats_t *m_memory_stats;
};

struct lp_tree_node {
Is this struct ever used?
  // page table walk delay queue
  std::list<page_table_walk_latency_t> page_table_walk_queue;

  enum class latency_type {
Same for this enum, is it ever used?
  std::list<mem_fetch *> m_cu_gmmu_queue;

  // set of virtual addresses present in TLB
  std::list<mem_addr_t> tlb;
I guess we can document somewhere that the TLB is a fully-associative cache.
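For documentation purposes, a minimal sketch of the fully-associative LRU behaviour implied by std::list<mem_addr_t> tlb; the entry count and method names are assumptions based on the PR description, not the actual implementation:

```cpp
#include <algorithm>
#include <list>

typedef unsigned long long mem_addr_t;

// Most recently used page at the back, victim taken from the front.
class tlb_sketch {
 public:
  explicit tlb_sketch(size_t num_entries) : m_num_entries(num_entries) {}

  // returns true on hit; on hit the page is moved to the MRU position
  bool access(mem_addr_t page) {
    auto it = std::find(m_tlb.begin(), m_tlb.end(), page);
    if (it != m_tlb.end()) {
      m_tlb.splice(m_tlb.end(), m_tlb, it);  // move to MRU
      return true;
    }
    if (m_tlb.size() >= m_num_entries) m_tlb.pop_front();  // evict LRU
    m_tlb.push_back(page);  // in the PR the fill happens after the page table walk
    return false;
  }

  void flush() { m_tlb.clear(); }  // e.g. at kernel completion

 private:
  size_t m_num_entries;         // 4096 by default in this PR
  std::list<mem_addr_t> m_tlb;  // front = LRU, back = MRU
};
```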
@@ -1366,6 +1367,9 @@ class ldst_unit : public pipelined_simd_unit {
   virtual void cycle();

   void fill(mem_fetch *mf);
+  // function to fill the gmmu to cu queue
+  // from the cluster to load/store unit
+  void fill_mem_access(mem_fetch *mf);
If this method is only used for TLB-related actions, we should give it a more specific name than the generic "mem_access".
  // for debugging
  unsigned long long m_last_inst_gpu_sim_cycle;
  unsigned long long m_last_inst_gpu_tot_sim_cycle;

  // two queues that interface with texture processor cluster
  std::list<mem_fetch *> m_gmmu_cu_queue;
These two queues are the core-specific queues. It would be nice if they had different names from the cluster queues.
  // queues that pass memory accesses between core and GMMU
  // as cluster interfaces between CU and GMMU
  std::list<mem_fetch *> m_gmmu_cu_queue;
We should use different names to differentiate the cluster queue and the core queue, something like m_mmu_gmmu2cluster_queue and m_mmu_cluster2core_queue.
Extracted TLB-only features from the UVM support in PR#108 and integrated them into existing classes to minimize API changes. Current features include:
1. A 4096-entry, fully associative TLB with LRU replacement policy by default
2. 4-port TLB capable of completing up to 4 lookups per cycle
3. Support for 4KB pages, with a page table walk latency of 100 cycles
4. TLB flushed at the end of each kernel
5. All parameters are configurable
To-do list:
1. Need a mechanism to disable the TLB (i.e., bypass ldst_unit::tlb_cycle); a possible sketch follows this list
2. Probably need to merge requests from the TLB to the GMMU that target the same virtual page to reduce traffic
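For illustration of to-do item 1, a minimal sketch of a config-driven bypass, assuming a hypothetical -enable_tlb option; names are illustrative, not the PR's actual code:

```cpp
// A boolean config option lets ldst_unit::cycle() skip TLB processing
// entirely, so existing configs behave exactly as before when it is off.
struct tlb_config_sketch {
  bool enable_tlb = false;  // e.g. registered via option_parser_register as "-enable_tlb"
};

class ldst_unit_sketch {
 public:
  explicit ldst_unit_sketch(const tlb_config_sketch &cfg) : m_cfg(cfg) {}

  void cycle() {
    if (m_cfg.enable_tlb) tlb_cycle();  // bypassed when TLBs are disabled
    // ... rest of the load/store pipeline ...
  }

 private:
  void tlb_cycle() { /* TLB lookups, miss handling, etc. */ }
  const tlb_config_sketch &m_cfg;
};
```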