
Add only TLB support #118


Open
wants to merge 13 commits into base: dev

Conversation

@yechen3 yechen3 commented May 15, 2025

Extracted TLB-only features from the UVM support in PR#108 and integrated them into existing classes to minimize API changes. Current features include:
1. A 4096-entry, fully associative TLB with LRU replacement policy by default
2. 4-port TLB capable of completing up to 4 lookups per cycle
3. Support for 4KB pages, with a page table walk latency of 100 cycles
4. TLB flushed at the end of each kernel
5. All parameters are configurable

To-do-list:
1. Need a mechanism to disable the TLB (i.e., bypass ldst_unit::tlb_cycle)
2. Probably need to merge TLB-to-GMMU requests for the same virtual page to reduce traffic.
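
A minimal sketch of the lookup/replacement behavior described above (illustrative only; the class and member names here are placeholders, not the identifiers used in this diff):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>

typedef uint64_t mem_addr_t;  // stand-in for the simulator's address type

// Fully associative TLB with LRU replacement; the front of the list is the
// most recently used virtual page number (VPN).
class tlb_sketch {
 public:
  tlb_sketch(size_t num_entries, unsigned page_size_bytes)
      : m_num_entries(num_entries), m_page_shift(log2u(page_size_bytes)) {}

  // Returns true on a hit and promotes the entry to the MRU position.
  // On a miss the caller starts a page table walk and calls fill() later.
  bool lookup(mem_addr_t vaddr) {
    const mem_addr_t vpn = vaddr >> m_page_shift;
    for (auto it = m_entries.begin(); it != m_entries.end(); ++it) {
      if (*it == vpn) {
        m_entries.erase(it);
        m_entries.push_front(vpn);
        return true;
      }
    }
    return false;
  }

  // Install a translation, evicting the LRU entry (back of the list) if full.
  void fill(mem_addr_t vaddr) {
    if (m_entries.size() >= m_num_entries) m_entries.pop_back();
    m_entries.push_front(vaddr >> m_page_shift);
  }

  // Flush everything, e.g. at the end of a kernel.
  void flush() { m_entries.clear(); }

 private:
  static unsigned log2u(unsigned v) {
    unsigned r = 0;
    while (v > 1) { v >>= 1; ++r; }
    return r;
  }
  size_t m_num_entries;            // 4096 by default in this PR
  unsigned m_page_shift;           // 12 for 4KB pages
  std::list<mem_addr_t> m_entries;
};
```

The 4-port behavior in feature 2 then amounts to allowing up to four such lookups from the ldst_unit per cycle before stalling.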

@tgrogers tgrogers commented Jul 4, 2025

@William-An , @FJShen - can you guys look at this? This is a lot of code.

@tgrogers tgrogers left a comment

Overall - I think we need to have a meeting with @William-An @LAhmos @JRPan @christindbose to discuss this.

Generally, the code has some small issues I would like to see fixed - but I would like you @yechen3 to create a presentation / readme that describes the feature to us so we are all on the same page with what is happening.

Also, some basic questions:
1 - do the config files have to change?
2 - is there any slowdown in sim time?
3 - does the change affect correlation?

Also please take the addition to the trace system seriously - this is invaluable to debug.

@@ -16,7 +16,7 @@ on:
# By default regress against accel-sim's dev branch
env:
ACCELSIM_REPO: https://github.com/purdue-aalp/accel-sim-framework-public.git
ACCELSIM_BRANCH: dev
ACCELSIM_BRANCH: dev-tlb

Do we need the dev-tlb branch of accel-sim for gpgpu-sim to work?
If the config is not configured to use tlbs, then does it still need this branch?


I think this is a circular dependency issue: the tlb version of Accel-Sim needs the tlb version of gpgpu-sim and vice versa. We should have some ways to deal with this. Some ideas I have:

  • Specify the branch to use for CI runs? Not sure if github allows inputs from users.
  • First use the dev branch to test Accel-Sim/dev with GPGPU-Sim/dev-tlb, make sure things are not broken. Then add additional CI test for Accel-Sim/dev-tlb and GPGPU-Sim/dev-tlb

Also, is it possible to control whether to enable TLB functionality or not?

Author

The only change in the dev-tlb branch is adding an m_memory_stats argument to trace_shader_core_ctx() within trace_driven.cc. This change is needed regardless of whether the TLB config is used. So would it be better to overload this function and create two variants?

@@ -448,9 +448,9 @@ void shader_core_ctx::create_exec_pipeline() {
}
}

m_ldst_unit = new ldst_unit(m_icnt, m_mem_fetch_allocator, this,
m_ldst_unit = new ldst_unit(m_gpu, m_icnt, m_mem_fetch_allocator, this,

the ldst unit now needs the whole GPU because it needs to look up some global page table information?

Author

No, it needs the whole GPU to register TLB flush callback functions. The GPU page table is managed inside the gmmu class, which is instantiated once per GPU. Additionally, I'm assuming page table lookups never miss (as in regular memcpy and UVA setups), so each page table walk incurs a constant latency of 100 cycles.
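
A minimal sketch of that fixed-latency model, assuming page_table_walk_latency_t pairs a request with the cycle at which its walk completes (the actual type and member names in the PR may differ):

```cpp
#include <list>
#include <utility>

class mem_fetch;  // simulator type, forward-declared for the sketch

// Assumed layout: the request plus the cycle at which its walk finishes.
typedef std::pair<mem_fetch *, unsigned long long> page_table_walk_latency_t;

class gmmu_sketch {
 public:
  explicit gmmu_sketch(unsigned long long walk_latency)
      : m_walk_latency(walk_latency) {}  // 100 core cycles by default

  // Called on a TLB miss; the walk always "hits" in the page table,
  // so the only cost modeled is the constant latency.
  void start_walk(mem_fetch *mf, unsigned long long now) {
    m_walk_queue.push_back(std::make_pair(mf, now + m_walk_latency));
  }

  // Called once per GMMU cycle: release every walk whose delay has elapsed.
  void cycle(unsigned long long now) {
    while (!m_walk_queue.empty() && m_walk_queue.front().second <= now) {
      mem_fetch *done = m_walk_queue.front().first;
      m_walk_queue.pop_front();
      return_to_core(done);
    }
  }

 private:
  // Placeholder for pushing the translated request back onto the
  // GMMU-to-CU queue so the ldst_unit can fill its TLB.
  void return_to_core(mem_fetch *mf) { (void)mf; }

  unsigned long long m_walk_latency;
  std::list<page_table_walk_latency_t> m_walk_queue;
};
```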

@@ -2011,7 +2013,7 @@ mem_stage_stall_type ldst_unit::process_cache_access(
}
if (status == HIT) {
assert(!read_sent);
inst.accessq_pop_back();
inst.accessq_pop_front();

Why are we changing this to FIFO?

@@ -2618,20 +2744,24 @@ void ldst_unit::init(mem_fetch_interface *icnt,
m_next_global = NULL;
m_last_inst_gpu_sim_cycle = 0;
m_last_inst_gpu_tot_sim_cycle = 0;

gpu->getGmmu()->register_tlbflush_callback(

ok - the global MMU gets a call from every SM?
This should really be done at a different level. I don't like pushing this global stuff down into the ldst unit.

Author

This is only called during initialization to register TLB flush callback functions. These callbacks should be triggered when the TLB is flushed—for example, when a kernel completes.
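
A minimal sketch of that registration pattern (the real register_tlbflush_callback signature in the PR may differ):

```cpp
#include <functional>
#include <vector>

// Each ldst_unit hands the GMMU a function that clears its local TLB;
// the GMMU invokes every registered callback when a kernel completes.
class gmmu_flush_sketch {
 public:
  void register_tlbflush_callback(std::function<void()> cb) {
    m_flush_callbacks.push_back(std::move(cb));
  }

  void on_kernel_finish() {
    for (auto &cb : m_flush_callbacks) cb();
  }

 private:
  std::vector<std::function<void()>> m_flush_callbacks;
};

// During ldst_unit::init() the registration could then look roughly like:
//   gpu->getGmmu()->register_tlbflush_callback([this] { tlb.clear(); });
```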

@beneslami

I have been actively and eagerly following this PR.
So far, is there no concrete verdict for this branch?

If the functionality works fine, I intend to manually add these changes to my own custom AccelSim.

@yechen3 yechen3 (Author) commented Jul 8, 2025

Overall - I think we need to have a meeting with @William-An @LAhmos @JRPan @christindbose to discuss this.

Generally, the code has some small issues I would like to see fixed - but I would like you @yechen3 to create a presentation / readme that describes the feature to us so we are all on the same page with what is happening.

Also, some basic questions: 1 - do the config files have to change? 2 - is there any slowdown in sim time? 3 - does the change affect correlation?

Also please take the addition to the trace system seriously - this is invaluable to debug.

Ok, I will prepare some slides to present next week and work on fixing those issues this week.

Answers to @tgrogers questions:

  1. I added three configurable variables: tlb_size, page_size, and page_table_walk_latency (see the illustrative registration fragment below). They all have default values, so existing config files don't need to be modified.
  2. There’s no noticeable slowdown for the GPU-Micro benchmarks or Rodinia-3.1 workloads.
  3. The correlations (gpc_cycles) remain largely consistent:
    Rodinia-3.1: QV100-SASS
    (17 apps: 1 < 1% Err, 6 under, 11 over, 4 < 10% Err)
    [Correl = 0.9836 | Err = 27.83% | Agg_Err = 27.84% | RPD = 25.20% | NMSE = 0.45]
    GPU-Microbenchmark: QV100-SASS
    (11 apps: 0 < 1% Err, 6 under, 5 over, 2 < 10% Err)
    [Correl = 0.9419 | Err = 98.93% | Agg_Err = 30.09% | RPD = 38.08% | NMSE = 0.50]
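
For reference, registering these options could look like the fragment below. The -page_table_walk_latency and -page_size calls are taken from this PR's diff; the -tlb_size line and the defaults shown are only assumed examples that mirror the PR description (4096 entries, 100-cycle walks, 4KB pages), so the actual flag names, types, and strings may differ:

```cpp
// Inside the simulator's option registration routine (illustrative fragment,
// not a standalone program).
option_parser_register(opp, "-tlb_size", OPT_UINT32, &tlb_size,
                       "Number of entries in the fully associative TLB",
                       "4096");  // assumed flag; see note above
option_parser_register(
    opp, "-page_table_walk_latency", OPT_INT64, &page_table_walk_latency,
    "Average page table walk latency (in core cycle).", "100");
option_parser_register(opp, "-page_size", OPT_CSTR, &page_size_string,
                       "GDDR page size, only 4KB/2MB available.",
                       "4KB");  // assumed default
```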

@yechen3 yechen3 (Author) commented Jul 8, 2025

I have been actively and eagerly following this PR. So far, is there no concrete verdict for this branch?

If the functionality works fine, I intend to manually add these changes to my own custom AccelSim.

@beneslami Thanks for following the PR! The current version should work, but it hasn’t been fully verified yet. We’re planning to have a meeting to discuss and review everything in more detail. I’ll make sure to update you with the outcome afterward.

@FJShen FJShen left a comment

Please consider adding a block of comments to provide an overview of the TLB feature: how it works, what it interacts with, its assumptions, etc.

option_parser_register(
opp, "-page_table_walk_latency", OPT_INT64, &page_table_walk_latency,
"Average page table walk latency (in core cycle).", "100");
option_parser_register(opp, "-page_size", OPT_CSTR, &page_size_string,

This argument seems unused in the whole PR. Is it used anywhere? If used, please consider implementing safety check code to test its validity; if not, please consider removing it.

@yechen3 yechen3 (Author) Jul 14, 2025

It's used inside the gmmu_t constructor to derive the page number from an address. What do you mean by validity?


The description says "GDDR page size, only 4KB/2MB avaliable." It would be nice to have a sanity check in the code for this.
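
A minimal sketch of such a check, run once right after option parsing (function and variable names here are placeholders):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Map the -page_size string to bytes and abort on unsupported values.
static unsigned parse_page_size(const char *page_size_string) {
  if (strcmp(page_size_string, "4KB") == 0) return 4u * 1024;
  if (strcmp(page_size_string, "2MB") == 0) return 2u * 1024 * 1024;
  fprintf(stderr,
          "GPGPU-Sim: invalid -page_size '%s' (only 4KB/2MB available)\n",
          page_size_string);
  abort();
}
```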

@JRPan JRPan commented Jul 10, 2025

Ok, I will prepare some slides to present next week and work on fixing those issue this week.

Please try to accommodate Western Time Zone :)
Thanks.

@JRPan JRPan requested a review from Copilot July 14, 2025 03:18
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds TLB (Translation Lookaside Buffer) support to the GPU simulator to handle virtual memory translation. The implementation adds a fully associative TLB with a configurable size (default 4096 entries) and an LRU replacement policy to the load/store units, along with page table walk latency simulation and memory management unit (GMMU) infrastructure.

Key changes include:

  • Addition of TLB infrastructure with hit/miss tracking and statistics collection
  • Integration of GMMU (Graphics Memory Management Unit) for handling page table walks
  • Implementation of communication queues between cores and GMMU for TLB miss handling

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

File: Description
src/gpgpu-sim/shader.h: Adds TLB functionality to the ldst_unit class, GMMU communication queues, and new constructor parameters
src/gpgpu-sim/shader.cc: Implements TLB cycle processing, page table walk handling, and memory access queue modifications
src/gpgpu-sim/mem_latency_stat.h: Adds TLB statistics tracking data structures
src/gpgpu-sim/mem_latency_stat.cc: Implements TLB statistics collection and reporting functionality
src/gpgpu-sim/mem_fetch.h: Adds getter method for memory access information
src/gpgpu-sim/gpu-sim.h: Defines GMMU class and adds TLB configuration parameters
src/gpgpu-sim/gpu-sim.cc: Implements GMMU cycle processing and page table walk latency simulation
src/gpgpu-sim/gpu-misc.h: Adds min4 utility function for clock domain management
src/abstract_hardware_model.h: Adds TLB miss tracking to warp instructions and memory access queue methods
src/abstract_hardware_model.cc: Contains placeholder for page size conversion functionality
.github/workflows/accelsim.yml: Updates CI branch reference for TLB development
Comments suppressed due to low confidence (1)

src/gpgpu-sim/shader.h:1498

  • The variable name 'tlb' is ambiguous. It should be renamed to something more descriptive like 'tlb_entries' or 'tlb_page_list' to clarify that it contains TLB entries.
  std::list<mem_addr_t> tlb;

@@ -2011,7 +2013,7 @@ mem_stage_stall_type ldst_unit::process_cache_access(
}
if (status == HIT) {
assert(!read_sent);
inst.accessq_pop_back();
inst.accessq_pop_front();
Copilot AI Jul 14, 2025

Changing from accessq_pop_back() to accessq_pop_front() alters the access pattern from stack-like (LIFO) to queue-like (FIFO). This change should be verified to ensure it doesn't break the expected memory access ordering semantics.

Suggested change
inst.accessq_pop_front();
inst.accessq_pop_back(); // Restore stack-like (LIFO) behavior for memory access ordering



fprintf(fout, "========================================TLB "
"statistics(thrashing)==============================\n");
std::map<mem_addr_t, unsigned> tlb_thrash[num_cluster];
Copilot AI Jul 14, 2025

Variable-length array 'tlb_thrash[num_cluster]' may cause stack overflow for large cluster counts. Consider using dynamic allocation with std::vector instead.

Suggested change
std::map<mem_addr_t, unsigned> tlb_thrash[num_cluster];
std::vector<std::map<mem_addr_t, unsigned>> tlb_thrash(num_cluster);


m_core->get_gpu()->gpu_tot_sim_cycle);

// send it over downward queues (CU to GMMU) to suffer for far fetch latency
m_cu_gmmu_queue.push_back(mf);
Copilot AI Jul 14, 2025

The TLB miss handling creates a memory fetch for every TLB miss, which could lead to excessive queue growth and memory usage under high TLB miss rates. Consider implementing backpressure or batching mechanisms.
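
This overlaps with item 2 of the PR's to-do list (merging requests for the same virtual page). A minimal sketch of that coalescing idea, with placeholder names:

```cpp
#include <list>
#include <map>

class mem_fetch;                        // simulator type (forward declaration)
typedef unsigned long long mem_addr_t;  // stand-in for the simulator's type

// One outstanding walk per virtual page; later misses to the same page
// just wait on the existing walk instead of entering the CU-to-GMMU queue.
struct cu_gmmu_coalescer {
  std::map<mem_addr_t, std::list<mem_fetch *>> pending;  // vpn -> waiters
  std::list<mem_fetch *> cu_gmmu_queue;

  void send(mem_fetch *mf, mem_addr_t vpn) {
    auto it = pending.find(vpn);
    if (it != pending.end()) {
      it->second.push_back(mf);     // merge with the in-flight walk
      return;
    }
    pending[vpn].push_back(mf);
    cu_gmmu_queue.push_back(mf);    // only the first miss per page goes down
  }
};
```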


@JRPan JRPan commented Jul 14, 2025

Copilot has some valid points

@yechen3 yechen3 (Author) commented Jul 14, 2025

Generally, the code has some small issues I would like to see fixed - but I would like you @yechen3 to create a presentation / readme that describes the feature to us so we are all on the same page with what is happening.

@tgrogers @JRPan @William-An @FJShen I've made some slides here (gpgpu-sim-tlb.pptx). Please let me know when you are available to meet or if you have any questions.

@beneslami

Hi @yechen3
I'm curious and excited to know the latest status of your conclusions regarding the TLB implementation in AccelSim.

Also, I would like to highlight a few points; they may already be taken care of, or they may be useful to you:

1- I went through your code and saw that the GMMU class is implemented. I was wondering whether it is possible to simulate the presence of an IOMMU by tweaking the internal attributes of the GMMU class? As you know, CPU-GPU executions use unified virtual memory, and each time a TLB miss happens in the last-level TLB of the GPU, a request is sent to the IOMMU. This means we should experience some amount of PCIe latency (based on this paper).

2- It's good that the GMMU is defined as a class, because in my customized AccelSim I implemented a chiplet-based GPU system and can create a GMMU object per chiplet. Is the TLB also implemented as a class? As far as I remember, the TLB was defined as a global map (correct me if I'm wrong). It would be more scalable to have a base TLB class, with an L1 TLB inheriting from it, and so on, because address translation is not yet scalable in MCM-GPU systems, and exploring different micro-architectural tweaks requires an agile baseline TLB implementation.

Thanks
Ben

@yechen3 yechen3 (Author) commented Jul 21, 2025

Hi @yechen3 I'm curious and excited to know the latest status of your conclusions regarding the TLB implementation in AccelSim.

Also, I would like to highlight a few points; they may already be taken care of, or they may be useful to you:

1- I went through your code and saw that the GMMU class is implemented. I was wondering whether it is possible to simulate the presence of an IOMMU by tweaking the internal attributes of the GMMU class? As you know, CPU-GPU executions use unified virtual memory, and each time a TLB miss happens in the last-level TLB of the GPU, a request is sent to the IOMMU. This means we should experience some amount of PCIe latency (based on this paper).

2- It's good that the GMMU is defined as a class, because in my customized AccelSim I implemented a chiplet-based GPU system and can create a GMMU object per chiplet. Is the TLB also implemented as a class? As far as I remember, the TLB was defined as a global map (correct me if I'm wrong). It would be more scalable to have a base TLB class, with an L1 TLB inheriting from it, and so on, because address translation is not yet scalable in MCM-GPU systems, and exploring different micro-architectural tweaks requires an agile baseline TLB implementation.

Thanks Ben

Hi @beneslami, we are going to have an internal review next Monday. Regarding your suggestions:

  1. We adopted the TLB implementation from UVMSmart, which simulates the full virtual memory system (i.e., PCIe queues, page tables, and the prefetcher). However, simulation time increased significantly as a result. That's the main reason we decided to simplify the design and retain only the TLB component. Also, in most scenarios CPU-GPU memory coherence (as in UVM) is uncommon, while GPU peer memory access via Unified Virtual Addressing (UVA) is more typical.

  2. That's a good suggestion; we'll go ahead and make that change.

Thanks again for your feedback.

@@ -93,6 +93,7 @@ tr1_hash_map<new_addr_type, unsigned> address_random_interleaving;
#define L2 0x02
#define DRAM 0x04
#define ICNT 0x08
#define GMMU 0x10


Minor nitpicking: Can you rewrite this in the form of 1 << 1, 1 << 2? Just to keep things a bit cleaner.

@FJShen FJShen Aug 4, 2025

If we were to adopt the form "1 << n", we had better wrap it in parentheses in the macro definition; the C++ shift operators have low precedence.
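
A quick demonstration of the pitfall:

```cpp
#include <cstdio>

#define GMMU_BAD 1 << 4     // expands textually, no grouping
#define GMMU_GOOD (1 << 4)  // parenthesized, as suggested above

int main() {
  // '<<' binds more loosely than '+', so GMMU_BAD + 1 becomes 1 << (4 + 1).
  printf("%d\n", GMMU_BAD + 1);   // prints 32
  printf("%d\n", GMMU_GOOD + 1);  // prints 17
  return 0;
}
```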

class memory_stats_t *m_memory_stats;
};

struct lp_tree_node {


Is this struct ever used?

// page table walk delay queue
std::list<page_table_walk_latency_t> page_table_walk_queue;

enum class latency_type {


Same for this enum, is it ever used?

std::list<mem_fetch *> m_cu_gmmu_queue;

// set of virtual addresses present in TLB
std::list<mem_addr_t> tlb;


I guess we can document somewhere that the TLB is a fully-associative cache.

@@ -1366,6 +1367,9 @@ class ldst_unit : public pipelined_simd_unit {
virtual void cycle();

void fill(mem_fetch *mf);
// function to fill the gmmu to cu queue
// from the cluster to load/store unit
void fill_mem_access(mem_fetch *mf);


If this method is only used for TLB-related actions, we should give it a more specific name than the generic "mem_access".


// for debugging
unsigned long long m_last_inst_gpu_sim_cycle;
unsigned long long m_last_inst_gpu_tot_sim_cycle;

// two queues that interface with texture processor cluster
std::list<mem_fetch *> m_gmmu_cu_queue;


These two queues are the core-specific queues. It would be nice if they had names distinct from the cluster queues.


// queues that pass memory accesses between core and GMMU
// as cluster interfaces between CU and GMMU
std::list<mem_fetch *> m_gmmu_cu_queue;


We should use different names to differentiate the cluster queue and core queue.

Something like m_mmu_gmmu2cluster_queue and m_mmu_cluster2core_queue
