
Add only TLB support #118


Open
wants to merge 13 commits into base: dev

Conversation

@yechen3 yechen3 commented May 15, 2025

Extracted TLB-only features from the UVM support in PR#108 and integrated them into existing classes to minimize API changes. Current features include:
1. A 4096-entry, fully associative TLB with LRU replacement policy by default
2. 4-port TLB capable of completing up to 4 lookups per cycle
3. Support for 4KB pages, with a page table walk latency of 100 cycles
4. TLB flushed at the end of each kernel
5. All parameters are configurable

To-do-list:
1. Need a mechanism to disable the TLB (i.e., bypass ldst_unit::tlb_cycle)
2. Probably need to merge TLB-to-GMMU requests for the same virtual page to reduce traffic.
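
A minimal sketch of the lookup/replacement behavior described above (illustrative only; the class and member names here are placeholders, not the identifiers used in this diff):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>

typedef uint64_t mem_addr_t;  // stand-in for the simulator's address type

// Fully associative TLB with LRU replacement; the front of the list is the
// most recently used virtual page number (VPN).
class tlb_sketch {
 public:
  tlb_sketch(size_t num_entries, unsigned page_size_bytes)
      : m_num_entries(num_entries), m_page_shift(log2u(page_size_bytes)) {}

  // Returns true on a hit and promotes the entry to the MRU position.
  // On a miss the caller starts a page table walk and calls fill() later.
  bool lookup(mem_addr_t vaddr) {
    const mem_addr_t vpn = vaddr >> m_page_shift;
    for (auto it = m_entries.begin(); it != m_entries.end(); ++it) {
      if (*it == vpn) {
        m_entries.erase(it);
        m_entries.push_front(vpn);
        return true;
      }
    }
    return false;
  }

  // Install a translation, evicting the LRU entry (back of the list) if full.
  void fill(mem_addr_t vaddr) {
    if (m_entries.size() >= m_num_entries) m_entries.pop_back();
    m_entries.push_front(vaddr >> m_page_shift);
  }

  // Flush everything, e.g. at the end of a kernel.
  void flush() { m_entries.clear(); }

 private:
  static unsigned log2u(unsigned v) {
    unsigned r = 0;
    while (v > 1) { v >>= 1; ++r; }
    return r;
  }
  size_t m_num_entries;            // 4096 by default in this PR
  unsigned m_page_shift;           // 12 for 4KB pages
  std::list<mem_addr_t> m_entries;
};
```

The 4-port behavior in feature 2 then amounts to allowing up to four such lookups from the ldst_unit per cycle before stalling.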

@tgrogers tgrogers commented Jul 4, 2025

@William-An , @FJShen - can you guys look at this? This is a lot of code.

@tgrogers tgrogers left a comment

Overall - I think we need to have a meeting with @William-An @LAhmos @JRPan @christindbose to discuss this.

Generally, the code has some small issues I would like to see fixed - but I would like you @yechen3 to create a presentation / readme that describes the feature to us so we are all on the same page with what is happening.

Also, some basic questions:
1 - do the config files have to change?
2 - is there any slowdown in sim time?
3 - does the change affect correlation?

Also please take the addition to the trace system seriously - this is invaluable to debug.

@@ -16,7 +16,7 @@ on:
# By default regress against accel-sim's dev branch
env:
ACCELSIM_REPO: https://github.com/purdue-aalp/accel-sim-framework-public.git
ACCELSIM_BRANCH: dev
ACCELSIM_BRANCH: dev-tlb

Do we need the dev-tlb branch of accel-sim for gpgpu-sim to work?
If the config is not configured to use tlbs, then does it still need this branch?


I think this is a circular dependency issue: the tlb version of Accel-Sim needs the tlb version of gpgpu-sim and vice versa. We should have some ways to deal with this. Some ideas I have:

  • Specify the branch to use for CI runs? Not sure if github allows inputs from users.
  • First use the dev branch to test Accel-Sim/dev with GPGPU-Sim/dev-tlb, make sure things are not broken. Then add additional CI test for Accel-Sim/dev-tlb and GPGPU-Sim/dev-tlb

Also, is it possible to control whether to enable TLB functionality or not?

Author

The only change in the dev-tlb branch is adding an m_memory_stats argument to trace_shader_core_ctx() within trace_driven.cc. This change is needed regardless of whether the TLB config is used. So would it be better to overload this function and create two variants?

@@ -448,9 +448,9 @@ void shader_core_ctx::create_exec_pipeline() {
}
}

m_ldst_unit = new ldst_unit(m_icnt, m_mem_fetch_allocator, this,
m_ldst_unit = new ldst_unit(m_gpu, m_icnt, m_mem_fetch_allocator, this,

the ldst unit now needs the whole GPU because it needs to look up some global page table information?

Author

No, it needs the whole GPU to register TLB flush callback functions. The GPU page table is managed inside the gmmu class, which is instantiated once per GPU. Additionally, I'm assuming page table lookups never miss (as in regular memcpy and UVA setups), so each page table walk incurs a constant latency of 100 cycles.
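
A minimal sketch of that fixed-latency model, assuming page_table_walk_latency_t pairs a request with the cycle at which its walk completes (the actual type and member names in the PR may differ):

```cpp
#include <list>
#include <utility>

class mem_fetch;  // simulator type, forward-declared for the sketch

// Assumed layout: the request plus the cycle at which its walk finishes.
typedef std::pair<mem_fetch *, unsigned long long> page_table_walk_latency_t;

class gmmu_sketch {
 public:
  explicit gmmu_sketch(unsigned long long walk_latency)
      : m_walk_latency(walk_latency) {}  // 100 core cycles by default

  // Called on a TLB miss; the walk always "hits" in the page table,
  // so the only cost modeled is the constant latency.
  void start_walk(mem_fetch *mf, unsigned long long now) {
    m_walk_queue.push_back(std::make_pair(mf, now + m_walk_latency));
  }

  // Called once per GMMU cycle: release every walk whose delay has elapsed.
  void cycle(unsigned long long now) {
    while (!m_walk_queue.empty() && m_walk_queue.front().second <= now) {
      mem_fetch *done = m_walk_queue.front().first;
      m_walk_queue.pop_front();
      return_to_core(done);
    }
  }

 private:
  // Placeholder for pushing the translated request back onto the
  // GMMU-to-CU queue so the ldst_unit can fill its TLB.
  void return_to_core(mem_fetch *mf) { (void)mf; }

  unsigned long long m_walk_latency;
  std::list<page_table_walk_latency_t> m_walk_queue;
};
```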

@@ -2011,7 +2013,7 @@ mem_stage_stall_type ldst_unit::process_cache_access(
}
if (status == HIT) {
assert(!read_sent);
inst.accessq_pop_back();
inst.accessq_pop_front();

Why are we changing this to FIFO?

@@ -2618,20 +2744,24 @@ void ldst_unit::init(mem_fetch_interface *icnt,
m_next_global = NULL;
m_last_inst_gpu_sim_cycle = 0;
m_last_inst_gpu_tot_sim_cycle = 0;

gpu->getGmmu()->register_tlbflush_callback(

ok - the global MMU gets a call from every SM?
This should really be done at a different level. I don't like pushing this global stuff down into the ldst unit.

Author

This is only called during initialization to register TLB flush callback functions. These callbacks should be triggered when the TLB is flushed—for example, when a kernel completes.
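
A minimal sketch of that registration pattern (the real register_tlbflush_callback signature in the PR may differ):

```cpp
#include <functional>
#include <vector>

// Each ldst_unit hands the GMMU a function that clears its local TLB;
// the GMMU invokes every registered callback when a kernel completes.
class gmmu_flush_sketch {
 public:
  void register_tlbflush_callback(std::function<void()> cb) {
    m_flush_callbacks.push_back(std::move(cb));
  }

  void on_kernel_finish() {
    for (auto &cb : m_flush_callbacks) cb();
  }

 private:
  std::vector<std::function<void()>> m_flush_callbacks;
};

// During ldst_unit::init() the registration could then look roughly like:
//   gpu->getGmmu()->register_tlbflush_callback([this] { tlb.clear(); });
```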

@beneslami

I have been actively and eagerly following this PR.
So far, is there no concrete verdict for this branch?

If the functionality works fine, I intend to manually add these changes to my own custom AccelSim.

@yechen3 yechen3 (Author) commented Jul 8, 2025

Overall - I think we need to have a meeting with @William-An @LAhmos @JRPan @christindbose to discuss this.

Generally, the code has some small issues I would like to see fixed - but I would like you @yechen3 to create a presentation / readme that describes the feature to us so we are all on the same page with what is happening.

Also, some basic questions: 1 - do the config files have to change? 2 - is there any slowdown in sim time? 3 - does the change affect correlation?

Also please take the addition to the trace system seriously - this is invaluable to debug.

Ok, I will prepare some slides to present next week and work on fixing those issues this week.

Answers to @tgrogers questions:

  1. I added three configurable variables: tlb_size, page_size, and page_table_walk_latency (see the illustrative registration fragment below). They all have default values, so existing config files don't need to be modified.
  2. There’s no noticeable slowdown for the GPU-Micro benchmarks or Rodinia-3.1 workloads.
  3. The correlations (gpc_cycles) remain largely consistent:
    Rodinia-3.1: QV100-SASS
    (17 apps: 1 < 1% Err, 6 under, 11 over, 4 < 10% Err)
    [Correl = 0.9836 | Err = 27.83% | Agg_Err = 27.84% | RPD = 25.20% | NMSE = 0.45]
    GPU-Microbenchmark: QV100-SASS
    (11 apps: 0 < 1% Err, 6 under, 5 over, 2 < 10% Err)
    [Correl = 0.9419 | Err = 98.93% | Agg_Err = 30.09% | RPD = 38.08% | NMSE = 0.50]
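
For reference, registering these options could look like the fragment below. The -page_table_walk_latency and -page_size calls are taken from this PR's diff; the -tlb_size line and the defaults shown are only assumed examples that mirror the PR description (4096 entries, 100-cycle walks, 4KB pages), so the actual flag names, types, and strings may differ:

```cpp
// Inside the simulator's option registration routine (illustrative fragment,
// not a standalone program).
option_parser_register(opp, "-tlb_size", OPT_UINT32, &tlb_size,
                       "Number of entries in the fully associative TLB",
                       "4096");  // assumed flag; see note above
option_parser_register(
    opp, "-page_table_walk_latency", OPT_INT64, &page_table_walk_latency,
    "Average page table walk latency (in core cycle).", "100");
option_parser_register(opp, "-page_size", OPT_CSTR, &page_size_string,
                       "GDDR page size, only 4KB/2MB available.",
                       "4KB");  // assumed default
```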

@yechen3 yechen3 (Author) commented Jul 8, 2025

I have been actively and eagerly following this PR. So far, is there no concrete verdict for this branch?

If the functionality works fine, I intend to manually add these changes to my own custom AccelSim.

@beneslami Thanks for following the PR! The current version should work, but it hasn’t been fully verified yet. We’re planning to have a meeting to discuss and review everything in more detail. I’ll make sure to update you with the outcome afterward.

@FJShen FJShen left a comment

Please consider adding a block of comments to provide an overview of the TLB feature: how it works, what it interacts with, its assumptions, etc.

option_parser_register(
opp, "-page_table_walk_latency", OPT_INT64, &page_table_walk_latency,
"Average page table walk latency (in core cycle).", "100");
option_parser_register(opp, "-page_size", OPT_CSTR, &page_size_string,

This argument seems unused in the whole PR. Is it used anywhere? If used, please consider implementing safety check code to test its validity; if not, please consider removing it.

@yechen3 yechen3 (Author) Jul 14, 2025

It's used inside the gmmu_t constructor to derive the page number from an address. What do you mean by validity?


The description says "GDDR page size, only 4KB/2MB avaliable." It would be nice to have a sanity check in the code for this.
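
A minimal sketch of such a check, run once right after option parsing (function and variable names here are placeholders):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Map the -page_size string to bytes and abort on unsupported values.
static unsigned parse_page_size(const char *page_size_string) {
  if (strcmp(page_size_string, "4KB") == 0) return 4u * 1024;
  if (strcmp(page_size_string, "2MB") == 0) return 2u * 1024 * 1024;
  fprintf(stderr,
          "GPGPU-Sim: invalid -page_size '%s' (only 4KB/2MB available)\n",
          page_size_string);
  abort();
}
```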

@JRPan JRPan commented Jul 10, 2025

Ok, I will prepare some slides to present next week and work on fixing those issue this week.

Please try to accommodate Western Time Zone :)
Thanks.

@JRPan JRPan requested a review from Copilot July 14, 2025 03:18
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds TLB (Translation Lookaside Buffer) support to the GPU simulator to handle virtual memory translation. The implementation adds a fully associative TLB with a configurable size (default 4096 entries) and an LRU replacement policy to the load/store units, along with page table walk latency simulation and memory management unit (GMMU) infrastructure.

Key changes include:

  • Addition of TLB infrastructure with hit/miss tracking and statistics collection
  • Integration of GMMU (Graphics Memory Management Unit) for handling page table walks
  • Implementation of communication queues between cores and GMMU for TLB miss handling

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

File: Description
src/gpgpu-sim/shader.h: Adds TLB functionality to the ldst_unit class, GMMU communication queues, and new constructor parameters
src/gpgpu-sim/shader.cc: Implements TLB cycle processing, page table walk handling, and memory access queue modifications
src/gpgpu-sim/mem_latency_stat.h: Adds TLB statistics tracking data structures
src/gpgpu-sim/mem_latency_stat.cc: Implements TLB statistics collection and reporting functionality
src/gpgpu-sim/mem_fetch.h: Adds getter method for memory access information
src/gpgpu-sim/gpu-sim.h: Defines GMMU class and adds TLB configuration parameters
src/gpgpu-sim/gpu-sim.cc: Implements GMMU cycle processing and page table walk latency simulation
src/gpgpu-sim/gpu-misc.h: Adds min4 utility function for clock domain management
src/abstract_hardware_model.h: Adds TLB miss tracking to warp instructions and memory access queue methods
src/abstract_hardware_model.cc: Contains placeholder for page size conversion functionality
.github/workflows/accelsim.yml: Updates CI branch reference for TLB development
Comments suppressed due to low confidence (1)

src/gpgpu-sim/shader.h:1498

  • The variable name 'tlb' is ambiguous. It should be renamed to something more descriptive like 'tlb_entries' or 'tlb_page_list' to clarify that it contains TLB entries.
  std::list<mem_addr_t> tlb;

@@ -2011,7 +2013,7 @@ mem_stage_stall_type ldst_unit::process_cache_access(
}
if (status == HIT) {
assert(!read_sent);
inst.accessq_pop_back();
inst.accessq_pop_front();
Copilot AI Jul 14, 2025

Changing from accessq_pop_back() to accessq_pop_front() alters the access pattern from stack-like (LIFO) to queue-like (FIFO). This change should be verified to ensure it doesn't break the expected memory access ordering semantics.

Suggested change
inst.accessq_pop_front();
inst.accessq_pop_back(); // Restore stack-like (LIFO) behavior for memory access ordering



fprintf(fout, "========================================TLB "
"statistics(thrashing)==============================\n");
std::map<mem_addr_t, unsigned> tlb_thrash[num_cluster];
Copilot AI Jul 14, 2025

Variable-length array 'tlb_thrash[num_cluster]' may cause stack overflow for large cluster counts. Consider using dynamic allocation with std::vector instead.

Suggested change
std::map<mem_addr_t, unsigned> tlb_thrash[num_cluster];
std::vector<std::map<mem_addr_t, unsigned>> tlb_thrash(num_cluster);


m_core->get_gpu()->gpu_tot_sim_cycle);

// send it over downward queues (CU to GMMU) to suffer for far fetch latency
m_cu_gmmu_queue.push_back(mf);
Copilot AI Jul 14, 2025

The TLB miss handling creates a memory fetch for every TLB miss, which could lead to excessive queue growth and memory usage under high TLB miss rates. Consider implementing backpressure or batching mechanisms.
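
This overlaps with item 2 of the PR's to-do list (merging requests for the same virtual page). A minimal sketch of that coalescing idea, with placeholder names:

```cpp
#include <list>
#include <map>

class mem_fetch;                        // simulator type (forward declaration)
typedef unsigned long long mem_addr_t;  // stand-in for the simulator's type

// One outstanding walk per virtual page; later misses to the same page
// just wait on the existing walk instead of entering the CU-to-GMMU queue.
struct cu_gmmu_coalescer {
  std::map<mem_addr_t, std::list<mem_fetch *>> pending;  // vpn -> waiters
  std::list<mem_fetch *> cu_gmmu_queue;

  void send(mem_fetch *mf, mem_addr_t vpn) {
    auto it = pending.find(vpn);
    if (it != pending.end()) {
      it->second.push_back(mf);     // merge with the in-flight walk
      return;
    }
    pending[vpn].push_back(mf);
    cu_gmmu_queue.push_back(mf);    // only the first miss per page goes down
  }
};
```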


@JRPan JRPan commented Jul 14, 2025

Copilot has some valid points

@yechen3 yechen3 (Author) commented Jul 14, 2025

Generally, the code has some small issues I would like to see fixed - but I would like you @yechen3 to create a presentation / readme that describes the feature to us so we are all on the same page with what is happening.

@tgrogers @JRPan @William-An @FJShen I've made some slides here (gpgpu-sim-tlb.pptx). Please let me know when you are available to meet or if you have any questions.

@beneslami

Hi @yechen3
I'm curious and excited to know the latest status of your conclusions regarding the TLB implementation in AccelSim.

Also, I would like to highlight a few points; they may already be taken care of, or they may be useful to you:

1- I went through your code and saw that the GMMU class is implemented. I was wondering whether it is possible to simulate the presence of an IOMMU by tweaking the internal attributes of the GMMU class? As you know, CPU-GPU executions use unified virtual memory, and each time a TLB miss happens in the last-level TLB of the GPU, a request is sent to the IOMMU. This means we should experience some amount of PCIe latency (based on this paper).

2- It's good that the GMMU is defined as a class, because in my customized AccelSim I implemented a chiplet-based GPU system and can create a GMMU object per chiplet. Is the TLB also implemented as a class? As far as I remember, the TLB was defined as a global map (correct me if I'm wrong). It would be more scalable to have a base TLB class, with an L1 TLB inheriting from it, and so on, because address translation is not yet scalable in MCM-GPU systems, and exploring different micro-architectural tweaks requires an agile baseline TLB implementation.

Thanks
Ben

@yechen3 yechen3 (Author) commented Jul 21, 2025

Hi @yechen3 I'm curious and excited to know the latest status of your conclusions regarding the TLB implementation in AccelSim.

Also, I would like to highlight a few points; they may already be taken care of, or they may be useful to you:

1- I went through your code and saw that the GMMU class is implemented. I was wondering whether it is possible to simulate the presence of an IOMMU by tweaking the internal attributes of the GMMU class? As you know, CPU-GPU executions use unified virtual memory, and each time a TLB miss happens in the last-level TLB of the GPU, a request is sent to the IOMMU. This means we should experience some amount of PCIe latency (based on this paper).

2- It's good that the GMMU is defined as a class, because in my customized AccelSim I implemented a chiplet-based GPU system and can create a GMMU object per chiplet. Is the TLB also implemented as a class? As far as I remember, the TLB was defined as a global map (correct me if I'm wrong). It would be more scalable to have a base TLB class, with an L1 TLB inheriting from it, and so on, because address translation is not yet scalable in MCM-GPU systems, and exploring different micro-architectural tweaks requires an agile baseline TLB implementation.

Thanks Ben

Hi @beneslami, we are going to have an internal review next Monday. Regarding your suggestions:

  1. We adopted the TLB implementation from UVMSmart, which simulates the full virtual memory system (i.e., PCIe queues, page tables, and the prefetcher). However, simulation time increased significantly as a result. That's the main reason we decided to simplify the design and retain only the TLB component. Also, in most scenarios CPU-GPU memory coherence (as in UVM) is uncommon, while GPU peer memory access via Unified Virtual Addressing (UVA) is more typical.

  2. That's a good suggestion; we'll go ahead and make that change.

Thanks again for your feedback.

@@ -93,6 +93,7 @@ tr1_hash_map<new_addr_type, unsigned> address_random_interleaving;
#define L2 0x02
#define DRAM 0x04
#define ICNT 0x08
#define GMMU 0x10


Minor nitpicking: Can you rewrite this in the form of 1 << 1, 1 << 2? Just to keep things a bit cleaner.

@FJShen FJShen Aug 4, 2025

If we were to adopt the form "1 << n", we had better wrap it in parentheses in the macro definition; the C++ shift operators have low precedence.
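
A quick demonstration of the pitfall:

```cpp
#include <cstdio>

#define GMMU_BAD 1 << 4     // expands textually, no grouping
#define GMMU_GOOD (1 << 4)  // parenthesized, as suggested above

int main() {
  // '<<' binds more loosely than '+', so GMMU_BAD + 1 becomes 1 << (4 + 1).
  printf("%d\n", GMMU_BAD + 1);   // prints 32
  printf("%d\n", GMMU_GOOD + 1);  // prints 17
  return 0;
}
```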

class memory_stats_t *m_memory_stats;
};

struct lp_tree_node {


Is this struct ever used?

// page table walk delay queue
std::list<page_table_walk_latency_t> page_table_walk_queue;

enum class latency_type {


Same for this enum, is it ever used?

std::list<mem_fetch *> m_cu_gmmu_queue;

// set of virtual addresses present in TLB
std::list<mem_addr_t> tlb;


I guess we can document somewhere that the TLB is a fully-associative cache.

@@ -1366,6 +1367,9 @@ class ldst_unit : public pipelined_simd_unit {
virtual void cycle();

void fill(mem_fetch *mf);
// function to fill the gmmu to cu queue
// from the cluster to load/store unit
void fill_mem_access(mem_fetch *mf);


If this method is only used for TLB-related actions, we should give it a more specific name than the generic "mem_access".


// for debugging
unsigned long long m_last_inst_gpu_sim_cycle;
unsigned long long m_last_inst_gpu_tot_sim_cycle;

// two queues that interface with texture processor cluster
std::list<mem_fetch *> m_gmmu_cu_queue;


These two queues are the core-specific queues. It would be nice if they had names distinct from the cluster queues.


// queues that pass memory accesses between core and GMMU
// as cluster interfaces between CU and GMMU
std::list<mem_fetch *> m_gmmu_cu_queue;


We should use different names to differentiate the cluster queue and core queue.

Something like m_mmu_gmmu2cluster_queue and m_mmu_cluster2core_queue
