Commit 4a4fc87

Merge pull request #5 from accel-sim/dev
GPGPU-Sim Latest Dev Integration
2 parents 90ec339 + 84c4f46 commit 4a4fc87

22 files changed (+1273, -453 lines)

CHANGES

Lines changed: 12 additions & 0 deletions
@@ -1,4 +1,16 @@
 LOG:
+Version 4.1.0 versus 4.0.0
+-Features:
+1- Supporting L1 write-allocate with sub-sector writing policy as in Volta+ hardware, and changing the Volta+ cards config to make L1 write-allocate with write-through
+2- Making the L1 adaptive cache policy to be configurable
+3- Adding Ampere RTX 3060 config files
+-Bugs:
+1- Fixing L1 bank hash function bug
+2- Fixing L1 read hit counters in gpgpu-sim to match nvprof, to achieve more accurate L1 correlation with the HW
+3- Fixing bugs in lazy write handling, thanks to Gwendolyn Voskuilen from Sandia labs for this fix
+4- Fixing the backend pipeline for sub_core model
+5- Fixing Memory stomp bug at the shader_config
+6- Some code refactoring:
 Version 4.0.0 (development branch) versus 3.2.3
 -Front-End:
 1- Support .nc cache modifier and __ldg function to access the read-only L1D cache
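To make the feature list concrete, the knobs these features add surface in the updated config files below; a minimal sketch using the RTX 2060 values (the inline notes are our reading of the options, not text from this commit):

# adaptive L1/shared split is now configurable (feature 2)
-gpgpu_adaptive_cache_config 1
-gpgpu_shmem_option 32,64            # candidate shared-memory carve-outs, in KB
-gpgpu_unified_l1d_size 96           # total unified L1D + shared capacity, in KB
# L1 write-allocate with sub-sector writes and write-through on Volta+ cards (feature 1)
-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:256:32,16:0,32
-gpgpu_l1_cache_write_ratio 25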

README.md

Lines changed: 7 additions & 2 deletions
@@ -11,22 +11,26 @@ This version of GPGPU-Sim has been tested with a subset of CUDA version 4.2,
 Please see the copyright notice in the file COPYRIGHT distributed with this
 release in the same directory as this file.
 
+GPGPU-Sim 4.0 is compatible with Accel-Sim simulation framework. With the support
+of Accel-Sim, GPGPU-Sim 4.0 can run NVIDIA SASS traces (trace-based simulation)
+generated by NVIDIA's dynamic binary instrumentation tool (NVBit). For more information
+about Accel-Sim, see [https://accel-sim.github.io/](https://accel-sim.github.io/)
+
 If you use GPGPU-Sim 4.0 in your research, please cite:
 
 Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, Timothy G Rogers.
 Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling.
 In proceedings of the 47th IEEE/ACM International Symposium on Computer Architecture (ISCA),
 May 29 - June 3, 2020.
 
-If you use CuDNN or PyTorch support, checkpointing or our new debugging tool for functional
+If you use CuDNN or PyTorch support (execution-driven simulation), checkpointing or our new debugging tool for functional
 simulation errors in GPGPU-Sim for your research, please cite:
 
 Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla,
 Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers, Tor M. Aamodt
 Analyzing Machine Learning Workloads Using a Detailed GPU Simulator, arXiv:1811.08933,
 https://arxiv.org/abs/1811.08933
 
-
 If you use the Tensor Core model in GPGPU-Sim or GPGPU-Sim's CUTLASS Library
 for your research please cite:
 
@@ -261,6 +265,7 @@ To clean the docs run
 The documentation resides at doc/doxygen/html.
 
 To run Pytorch applications with the simulator, install the modified Pytorch library as well by following instructions [here](https://github.com/gpgpu-sim/pytorch-gpgpu-sim).
+
 ## Step 3: Run
 
 Before we run, we need to make sure the application's executable file is dynamically linked to CUDA runtime library. This can be done during compilation of your program by introducing the nvcc flag "--cudart shared" in makefile (quotes should be excluded).
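As an example of that flag (the target and source names here are placeholders, shown only to illustrate the nvcc invocation):

# link libcudart dynamically so GPGPU-Sim's runtime library can be interposed when the app runs
nvcc --cudart shared -o myapp myapp.cu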

configs/tested-cfgs/SM75_RTX2060/gpgpusim.config

Lines changed: 57 additions & 62 deletions
@@ -1,8 +1,3 @@
-# This config models the Turing RTX 2060
-# For more info about turing architecture:
-# 1- https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
-# 2- "RTX on—The NVIDIA Turing GPU", IEEE MICRO 2020
-
 # functional simulator specification
 -gpgpu_ptx_instruction_classification 0
 -gpgpu_ptx_sim_mode 0
@@ -14,6 +9,7 @@
 -gpgpu_runtime_sync_depth_limit 2
 -gpgpu_runtime_pending_launch_count_limit 2048
 -gpgpu_kernel_launch_latency 5000
+-gpgpu_TB_launch_latency 0
 
 # Compute Capability
 -gpgpu_compute_capability_major 7
@@ -27,91 +23,93 @@
 -gpgpu_n_clusters 30
 -gpgpu_n_cores_per_cluster 1
 -gpgpu_n_mem 12
--gpgpu_n_sub_partition_per_mchannel 2
+-gpgpu_n_sub_partition_per_mchannel 2
 
-# volta clock domains
+# clock domains
 #-gpgpu_clock_domains <Core Clock>:<Interconnect Clock>:<L2 Clock>:<DRAM Clock>
--gpgpu_clock_domains 1365.0:1365.0:1365.0:3500.0
-# boost mode
-# -gpgpu_clock_domains 1680.0:1680.0:1680.0:3500.0
+-gpgpu_clock_domains 1365:1365:1365:3500.5
 
 # shader core pipeline config
 -gpgpu_shader_registers 65536
 -gpgpu_registers_per_block 65536
 -gpgpu_occupancy_sm_number 75
 
-# This implies a maximum of 32 warps/SM
--gpgpu_shader_core_pipeline 1024:32
--gpgpu_shader_cta 32
+-gpgpu_shader_core_pipeline 1024:32
+-gpgpu_shader_cta 16
 -gpgpu_simd_model 1
 
 # Pipeline widths and number of FUs
 # ID_OC_SP,ID_OC_DP,ID_OC_INT,ID_OC_SFU,ID_OC_MEM,OC_EX_SP,OC_EX_DP,OC_EX_INT,OC_EX_SFU,OC_EX_MEM,EX_WB,ID_OC_TENSOR_CORE,OC_EX_TENSOR_CORE
-## Turing has 4 SP SIMD units, 4 INT units, 4 SFU units, 8 Tensor core units
-## We need to scale the number of pipeline registers to be equal to the number of SP units
--gpgpu_pipeline_widths 4,0,4,4,4,4,0,4,4,4,8,4,4
+-gpgpu_pipeline_widths 4,4,4,4,4,4,4,4,4,4,8,4,4
 -gpgpu_num_sp_units 4
 -gpgpu_num_sfu_units 4
+-gpgpu_num_dp_units 4
 -gpgpu_num_int_units 4
 -gpgpu_tensor_core_avail 1
 -gpgpu_num_tensor_core_units 4
 
 # Instruction latencies and initiation intervals
 # "ADD,MAX,MUL,MAD,DIV"
 # All Div operations are executed on SFU unit
--ptx_opcode_latency_int 4,13,4,5,145,32
--ptx_opcode_initiation_int 2,2,2,2,8,4
--ptx_opcode_latency_fp 4,13,4,5,39
+-ptx_opcode_latency_int 4,4,4,4,21
+-ptx_opcode_initiation_int 2,2,2,2,2
+-ptx_opcode_latency_fp 4,4,4,4,39
 -ptx_opcode_initiation_fp 2,2,2,2,4
--ptx_opcode_latency_dp 8,19,8,8,330
--ptx_opcode_initiation_dp 4,4,4,4,130
--ptx_opcode_latency_sfu 100
+-ptx_opcode_latency_dp 64,64,64,64,330
+-ptx_opcode_initiation_dp 64,64,64,64,130
+-ptx_opcode_latency_sfu 21
 -ptx_opcode_initiation_sfu 8
 -ptx_opcode_latency_tesnor 64
 -ptx_opcode_initiation_tensor 64
 
-# Turing has four schedulers per core
--gpgpu_num_sched_per_core 4
-# Greedy then oldest scheduler
--gpgpu_scheduler gto
-## In Turing, a warp scheduler can issue 1 inst per cycle
--gpgpu_max_insn_issue_per_warp 1
--gpgpu_dual_issue_diff_exec_units 1
-
-# shared memory bankconflict detection
--gpgpu_shmem_num_banks 32
--gpgpu_shmem_limited_broadcast 0
--gpgpu_shmem_warp_parts 1
--gpgpu_coalesce_arch 75
-
-# Trung has sub core model, in which each scheduler has its own register file and EUs
+# sub core model: in which each scheduler has its own register file and EUs
 # i.e. schedulers are isolated
 -gpgpu_sub_core_model 1
 # disable specialized operand collectors and use generic operand collectors instead
 -gpgpu_enable_specialized_operand_collector 0
 -gpgpu_operand_collector_num_units_gen 8
 -gpgpu_operand_collector_num_in_ports_gen 8
 -gpgpu_operand_collector_num_out_ports_gen 8
-# turing has 8 banks dual-port, 4 schedulers, two banks per scheduler
-# we increase #banks to 16 to mitigate the effect of Regisrer File Cache (RFC) which we do not implement in the current version
--gpgpu_num_reg_banks 16
+# register banks
+-gpgpu_num_reg_banks 8
 -gpgpu_reg_file_port_throughput 2
 
+# warp scheduling
+-gpgpu_num_sched_per_core 4
+-gpgpu_scheduler lrr
+# a warp scheduler issue mode
+-gpgpu_max_insn_issue_per_warp 1
+-gpgpu_dual_issue_diff_exec_units 1
+
+## L1/shared memory configuration
 # <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>:<set_index_fn>,<mshr>:<N>:<merge>,<mq>:**<fifo_entry>
 # ** Optional parameter - Required when mshr_type==Texture Fifo
--gpgpu_adaptive_cache_config 0
+# In adaptive cache, we adaptively assign the remaining shared memory to L1 cache
+# For more info, see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x
+-gpgpu_adaptive_cache_config 1
+-gpgpu_shmem_option 32,64
+-gpgpu_unified_l1d_size 96
+# L1 cache configuration
 -gpgpu_l1_banks 4
--gpgpu_cache:dl1 S:1:128:512,L:L:s:N:L,A:256:8,16:0,32
--gpgpu_shmem_size 65536
--gpgpu_shmem_sizeDefault 65536
--gpgpu_shmem_per_block 65536
+-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:256:32,16:0,32
+-gpgpu_l1_latency 32
 -gpgpu_gmem_skip_L1D 0
--gpgpu_n_cluster_ejection_buffer_size 32
--gpgpu_l1_latency 20
--gpgpu_smem_latency 20
 -gpgpu_flush_l1_cache 1
+-gpgpu_n_cluster_ejection_buffer_size 32
+-gpgpu_l1_cache_write_ratio 25
 
-# 64 sets, each 128 bytes 16-way for each memory sub partition (128 KB per memory sub partition). This gives us 3MB L2 cache
+# shared memory configuration
+-gpgpu_shmem_size 65536
+-gpgpu_shmem_sizeDefault 65536
+-gpgpu_shmem_per_block 49152
+-gpgpu_smem_latency 30
+# shared memory bankconflict detection
+-gpgpu_shmem_num_banks 32
+-gpgpu_shmem_limited_broadcast 0
+-gpgpu_shmem_warp_parts 1
+-gpgpu_coalesce_arch 75
+
+# L2 cache
 -gpgpu_cache:dl2 S:64:128:16,L:B:m:L:P,A:192:4,32:0,32
 -gpgpu_cache:dl2_texture_only 0
 -gpgpu_dram_partition_queues 64:64:64:64
@@ -122,44 +120,41 @@
 -gpgpu_cache:il1 N:64:128:16,L:R:f:N:L,S:2:48,4
 -gpgpu_inst_fetch_throughput 4
 # 128 KB Tex
-# Note, TEX is deprected in Volta, It is used for legacy apps only. Use L1D cache instead with .nc modifier or __ldg mehtod
+# Note, TEX is deprected since Volta, It is used for legacy apps only. Use L1D cache instead with .nc modifier or __ldg mehtod
 -gpgpu_tex_cache:l1 N:4:128:256,L:R:m:N:L,T:512:8,128:2
 # 64 KB Const
 -gpgpu_const_cache:l1 N:128:64:8,L:R:f:N:L,S:2:64,4
 -gpgpu_perfect_inst_const_cache 1
 
 # interconnection
-#-network_mode 1
-#-inter_config_file config_turing_islip.icnt
 # use built-in local xbar
 -network_mode 2
 -icnt_in_buffer_limit 512
 -icnt_out_buffer_limit 512
 -icnt_subnets 2
--icnt_arbiter_algo 1
 -icnt_flit_size 40
+-icnt_arbiter_algo 1
 
 # memory partition latency config
--gpgpu_l2_rop_latency 160
--dram_latency 100
+-gpgpu_l2_rop_latency 194
+-dram_latency 96
 
-# dram model config
+# dram sched config
 -gpgpu_dram_scheduler 1
 -gpgpu_frfcfs_dram_sched_queue_size 64
 -gpgpu_dram_return_queue_size 192
 
-# Turing has GDDR6
-# http://monitorinsider.com/GDDR6.html
+# dram model config
 -gpgpu_n_mem_per_ctrlr 1
 -gpgpu_dram_buswidth 2
 -gpgpu_dram_burst_length 16
 -dram_data_command_freq_ratio 4
 -gpgpu_mem_address_mask 1
 -gpgpu_mem_addr_mapping dramid@8;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.RBBBCCCC.BCCSSSSS
 
-# Use the same GDDR5 timing, scaled to 3500MHZ
--gpgpu_dram_timing_opt "nbk=16:CCD=4:RRD=10:RCD=20:RAS=50:RP=20:RC=62:
-CL=20:WL=8:CDLR=9:WR=20:nbkgrp=4:CCDL=4:RTPL=4"
+# Mem timing
+-gpgpu_dram_timing_opt nbk=16:CCD=4:RRD=12:RCD=24:RAS=55:RP=24:RC=78:CL=24:WL=8:CDLR=10:WR=24:nbkgrp=4:CCDL=6:RTPL=4
+-dram_dual_bus_interface 0
 
 # select lower bits for bnkgrp to increase bnkgrp parallelism
 -dram_bnk_indexing_policy 0
@@ -174,7 +169,7 @@
 -enable_ptx_file_line_stats 1
 -visualizer_enabled 0
 
-# power model configs, disable it untill we create a real energy model for Volta
+# power model configs, disable it untill we create a real energy model
 -power_simulation_enabled 0
 
 # tracing functionality
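For readers decoding the new L1 line, here is a field-by-field reading of -gpgpu_cache:dl1 against the <nsets>:<bsize>:<assoc>,... format comment kept in the config; the letter-code meanings are our interpretation of the simulator's option descriptions, not something this diff states, so treat it as a sketch:

# -gpgpu_cache:dl1  S:4:128:64,L:T:m:L:L,A:256:32,16:0,32
#   S:4:128:64  -> sectored cache, 4 sets x 128-byte lines x 64 ways = 32 KB baseline L1D
#   L:T:m:L:L   -> <rep>:<wr>:<alloc>:<wr_alloc>:<set_index_fn>; 'T' selects write-through, and the
#                  <wr_alloc> field is where the new sub-sector (lazy) write-allocate behavior is chosen
#   A:256:32    -> <mshr>:<N>:<merge> (MSHR type and sizing)
#   16:0,32     -> miss-queue size plus trailing fields not covered by the format comment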

configs/tested-cfgs/SM7_QV100/gpgpusim.config

Lines changed: 13 additions & 10 deletions
@@ -94,12 +94,12 @@
 -gpgpu_shmem_num_banks 32
 -gpgpu_shmem_limited_broadcast 0
 -gpgpu_shmem_warp_parts 1
--gpgpu_coalesce_arch 60
+-gpgpu_coalesce_arch 70
 
 # Volta has four schedulers per core
 -gpgpu_num_sched_per_core 4
 # Greedy then oldest scheduler
--gpgpu_scheduler gto
+-gpgpu_scheduler lrr
 ## In Volta, a warp scheduler can issue 1 inst per cycle
 -gpgpu_max_insn_issue_per_warp 1
 -gpgpu_dual_issue_diff_exec_units 1
@@ -113,17 +113,21 @@
 # For more info, see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x
 # disable this mode in case of multi kernels/apps execution
 -gpgpu_adaptive_cache_config 1
-# Volta unified cache has four banks
+-gpgpu_shmem_option 0,8,16,32,64,96
+-gpgpu_unified_l1d_size 128
+# L1 cache configuration
 -gpgpu_l1_banks 4
--gpgpu_cache:dl1 S:1:128:256,L:L:s:N:L,A:256:8,16:0,32
+-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:8,16:0,32
+-gpgpu_l1_cache_write_ratio 25
+-gpgpu_l1_latency 20
+-gpgpu_gmem_skip_L1D 0
+-gpgpu_flush_l1_cache 1
+-gpgpu_n_cluster_ejection_buffer_size 32
+# shared memory configuration
 -gpgpu_shmem_size 98304
 -gpgpu_shmem_sizeDefault 98304
 -gpgpu_shmem_per_block 65536
--gpgpu_gmem_skip_L1D 0
--gpgpu_n_cluster_ejection_buffer_size 32
--gpgpu_l1_latency 20
 -gpgpu_smem_latency 20
--gpgpu_flush_l1_cache 1
 
 # 32 sets, each 128 bytes 24-way for each memory sub partition (96 KB per memory sub partition). This gives us 6MB L2 cache
 -gpgpu_cache:dl2 S:32:128:24,L:B:m:L:P,A:192:4,32:0,32
@@ -201,5 +205,4 @@
 # tracing functionality
 #-trace_enabled 1
 #-trace_components WARP_SCHEDULER,SCOREBOARD
-#-trace_sampling_core 0
-
+#-trace_sampling_core 0
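A quick worked reading of the new adaptive-cache lines above, following the "remaining shared memory goes to L1" comment in these configs and the linked CUDA programming-guide section (our interpretation of the option pair, not text from the commit):

# -gpgpu_unified_l1d_size 128         -> 128 KB of unified L1D + shared storage per SM
# -gpgpu_shmem_option 0,8,16,32,64,96 -> allowed shared-memory carve-outs, in KB
# the L1 data cache gets whatever is left, e.g.:
#   96 KB shared -> 128 - 96 = 32 KB L1  (matches the 4 x 128 B x 64-way dl1 baseline)
#    0 KB shared -> 128 -  0 = 128 KB L1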

configs/tested-cfgs/SM7_TITANV/gpgpusim.config

Lines changed: 11 additions & 7 deletions
@@ -100,7 +100,7 @@
 # Volta has four schedulers per core
 -gpgpu_num_sched_per_core 4
 # Greedy then oldest scheduler
--gpgpu_scheduler gto
+-gpgpu_scheduler lrr
 ## In Volta, a warp scheduler can issue 1 inst per cycle
 -gpgpu_max_insn_issue_per_warp 1
 -gpgpu_dual_issue_diff_exec_units 1
@@ -114,17 +114,21 @@
 # For more info, see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x
 # disable this mode in case of multi kernels/apps execution
 -gpgpu_adaptive_cache_config 1
-# Volta unified cache has four banks
+-gpgpu_shmem_option 0,8,16,32,64,96
+-gpgpu_unified_l1d_size 128
+# L1 cache configuration
 -gpgpu_l1_banks 4
--gpgpu_cache:dl1 S:1:128:256,L:L:s:N:L,A:256:8,16:0,32
+-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:8,16:0,32
+-gpgpu_l1_cache_write_ratio 25
+-gpgpu_gmem_skip_L1D 0
+-gpgpu_l1_latency 20
+-gpgpu_flush_l1_cache 1
+-gpgpu_n_cluster_ejection_buffer_size 32
+# shared memory configuration
 -gpgpu_shmem_size 98304
 -gpgpu_shmem_sizeDefault 98304
 -gpgpu_shmem_per_block 65536
--gpgpu_gmem_skip_L1D 0
--gpgpu_n_cluster_ejection_buffer_size 32
--gpgpu_l1_latency 20
 -gpgpu_smem_latency 20
--gpgpu_flush_l1_cache 1
 
 # 32 sets, each 128 bytes 24-way for each memory sub partition (96 KB per memory sub partition). This gives us 4.5MB L2 cache
 -gpgpu_cache:dl2 S:32:128:24,L:B:m:L:P,A:192:4,32:0,32
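The L2 sizing comments in the two Volta configs are the per-sub-partition geometry multiplied out; a sketch of the arithmetic (the sub-partition counts are implied by the stated totals and come from -gpgpu_n_mem and -gpgpu_n_sub_partition_per_mchannel elsewhere in each config):

# per memory sub partition: 32 sets x 128 B x 24 ways = 96 KB of L2
# TITAN V: 4.5 MB / 96 KB = 48 sub partitions
# QV100:   6 MB   / 96 KB = 64 sub partitions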
