
Commit 51dd200

Dmytro Dzhulgakov authored and facebook-github-bot committed
unify c2 and TH allocator (pytorch#16892)
Summary:
Pull Request resolved: pytorch#16892

Replaces pytorch#14517

Merged caffe2 and TH CPU Allocators. Mostly using the code from caffe2 allocators.
`memset` of caffe2 allocator is gone now. These two allocators should be almost the same.

Baseline:
```
Running ./tensor_allocation
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
-------------------------------------------------------------------------
Benchmark                              Time           CPU      Iterations
-------------------------------------------------------------------------
BM_MakeStorageImpl                   148 ns        148 ns         4676594
BM_StorageImplCtor                    54 ns         54 ns        12957810
BM_MallocStorageImpl                  62 ns         62 ns        11254745
BM_TensorImplCtor                     22 ns         22 ns        31939472
BM_MallocTensorImpl                  105 ns        105 ns         6505661
BM_Malloc_1                           43 ns         43 ns        16464905
BM_MakeTensorFromStorage             126 ns        126 ns         5586116
BM_MakeVariableFromTensor            236 ns        236 ns         2995528
BM_ATenCPUTensorAllocationSmall1     319 ns        319 ns         2268884
BM_ATenCPUTensorAllocationSmall2     318 ns        318 ns         2163332
BM_ATenCPUTensorAllocationMedium1    403 ns        403 ns         1663228
BM_ATenCPUTensorAllocationMedium2    448 ns        448 ns         1595004
BM_ATenCPUTensorAllocationBig1       532 ns        532 ns         1352634
BM_ATenCPUTensorAllocationBig2      4486 ns       4486 ns          160978
```

Changed:
```
Running ./tensor_allocation
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
-------------------------------------------------------------------------
Benchmark                              Time           CPU      Iterations
-------------------------------------------------------------------------
BM_MakeStorageImpl                   141 ns        141 ns         4803576
BM_StorageImplCtor                    55 ns         55 ns        13129391
BM_MallocStorageImpl                  64 ns         64 ns        11088143
BM_TensorImplCtor                     23 ns         23 ns        31616273
BM_MallocTensorImpl                  101 ns        101 ns         7017585
BM_Malloc_1                           39 ns         39 ns        18523954
BM_MakeTensorFromStorage             118 ns        118 ns         5877919
BM_MakeVariableFromTensor            452 ns        452 ns         1565722
BM_ATenCPUTensorAllocationSmall1     384 ns        384 ns         1819763
BM_ATenCPUTensorAllocationSmall2     389 ns        389 ns         1857483
BM_ATenCPUTensorAllocationMedium1    425 ns        425 ns         1646284
BM_ATenCPUTensorAllocationMedium2    430 ns        430 ns         1561319
BM_ATenCPUTensorAllocationBig1       508 ns        508 ns         1309969
BM_ATenCPUTensorAllocationBig2      3799 ns       3799 ns          173674
```

lstm benchmark:

Before:
```
INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k
```

After:
```
INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k
```

Reviewed By: ezyang

Differential Revision: D13202632

fbshipit-source-id: db6d2ec756ed15b0732b15396c82ad42302bb79d
1 parent f87022b commit 51dd200

14 files changed: +303 -333 lines changed

aten/src/TH/THAllocator.cpp

+3 -12
@@ -10,6 +10,8 @@
 #define TH_ATOMIC_IPC_REFCOUNT 1
 #endif
 
+#include <c10/core/CPUAllocator.h>
+
 #if HAVE_MMAP
 #include <sys/types.h>
 #include <sys/mman.h>
@@ -19,19 +21,8 @@
 #endif
 /* end of stuff for mapped files */
 
-struct THDefaultAllocator final : public at::Allocator {
-  at::DataPtr allocate(size_t size) const override {
-    auto* ptr = THAlloc(size);
-    return {ptr, ptr, &THFree, at::DeviceType::CPU};
-  }
-  at::DeleterFnPtr raw_deleter() const override {
-    return &THFree;
-  }
-};
-
-static THDefaultAllocator th_default_allocator;
 at::Allocator* getTHDefaultAllocator() {
-  return &th_default_allocator;
+  return c10::GetCPUAllocator();
 }
 
 #if defined(_WIN32) || defined(HAVE_MMAP)

aten/src/TH/THGeneral.cpp

+5 -41
@@ -1,5 +1,9 @@
 #include <TH/THGeneral.h>
 
+#ifdef __cplusplus
+#include <c10/core/CPUAllocator.h>
+#endif
+
 #ifdef _OPENMP
 #include <omp.h>
 #endif
@@ -155,52 +159,12 @@ void THSetGCHandler( void (*torchGCFunction_)(void *data), void *data )
   torchGCData = data;
 }
 
-static void* THAllocInternal(ptrdiff_t size)
-{
-  void *ptr;
-
-  if (size > 5120)
-  {
-#if (defined(__unix) || defined(__APPLE__)) && (!defined(DISABLE_POSIX_MEMALIGN))
-    if (posix_memalign(&ptr, 64, size) != 0)
-      ptr = NULL;
-/*
-#elif defined(_WIN32)
-    ptr = _aligned_malloc(size, 64);
-*/
-#else
-    ptr = malloc(size);
-#endif
-  }
-  else
-  {
-    ptr = malloc(size);
-  }
-
-  return ptr;
-}
-
 void* THAlloc(ptrdiff_t size)
 {
-  void *ptr;
-
   if(size < 0)
     THError("$ Torch: invalid memory size -- maybe an overflow?");
 
-  if(size == 0)
-    return NULL;
-
-  ptr = THAllocInternal(size);
-
-  if(!ptr && torchGCFunction) {
-    torchGCFunction(torchGCData);
-    ptr = THAllocInternal(size);
-  }
-
-  if(!ptr)
-    THError("$ Torch: not enough memory: you tried to allocate %dGB. Buy new RAM!", size/1073741824);
-
-  return ptr;
+  return c10::alloc_cpu(size);
 }
 
 void* THRealloc(void *ptr, ptrdiff_t size)
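
Reading this hunk: the removed `THAllocInternal` only requested 64-byte alignment for sizes above 5120 bytes and retried once after calling the GC handler, whereas the new `THAlloc` forwards every non-negative size to `c10::alloc_cpu`. The following is a minimal sketch of that "always aligned, zero bytes yields a null pointer" policy, not the actual c10 code. The names `sketch_alloc_cpu`, `sketch_free_cpu`, `is_aligned` and `kSketchAlignment` are hypothetical, and the sketch assumes a POSIX or MSVC toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#if defined(_MSC_VER)
#include <malloc.h>
#endif

// Hypothetical constant mirroring the 64-byte alignment used in the diff.
constexpr std::size_t kSketchAlignment = 64;

// Returns nullptr for a zero-byte request; otherwise returns a block aligned
// to kSketchAlignment, or nullptr if the system allocator fails.
void* sketch_alloc_cpu(std::size_t nbytes) {
  if (nbytes == 0) {
    return nullptr;
  }
  void* data = nullptr;
#if defined(_MSC_VER)
  data = _aligned_malloc(nbytes, kSketchAlignment);
#else
  if (posix_memalign(&data, kSketchAlignment, nbytes) != 0) {
    data = nullptr;
  }
#endif
  return data;
}

// Memory from _aligned_malloc must go to _aligned_free; memory from
// posix_memalign is released with free.
void sketch_free_cpu(void* data) {
#if defined(_MSC_VER)
  _aligned_free(data);
#else
  std::free(data);
#endif
}

// True when the address is a multiple of `alignment`.
bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Under these assumptions even a small request such as `sketch_alloc_cpu(100)` comes back 64-byte aligned, which the pre-change TH path did not guarantee for sizes of 5120 bytes or less.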

c10/core/Allocator.cpp

+2 -7
@@ -16,12 +16,7 @@ at::DataPtr InefficientStdFunctionContext::makeDataPtr(
       device};
 }
 
-} // namespace c10
-
-namespace caffe2 {
-
-C10_API at::Allocator* allocator_array[static_cast<int>(
-    at::DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES)];
+C10_API at::Allocator* allocator_array[at::COMPILE_TIME_MAX_DEVICE_TYPES];
 
 void SetAllocator(at::DeviceType t, at::Allocator* alloc) {
   allocator_array[static_cast<int>(t)] = alloc;
@@ -33,4 +28,4 @@ at::Allocator* GetAllocator(const at::DeviceType& t) {
   return alloc;
 }
 
-} // namespace caffe2
+} // namespace c10
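
The registry moved into `c10` here is a plain array of allocator pointers indexed by device type. A simplified sketch of that pattern follows; `DeviceKind`, `SketchAllocator`, `NullAllocator`, `SetSketchAllocator` and `GetSketchAllocator` are hypothetical stand-ins, not the c10 types, and whatever checks the real `GetAllocator` performs are not reproduced.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for at::DeviceType.
enum class DeviceKind : int { CPU = 0, CUDA = 1, NumKinds = 2 };

// Hypothetical stand-in for at::Allocator.
struct SketchAllocator {
  virtual ~SketchAllocator() = default;
  virtual void* allocate(std::size_t nbytes) const = 0;
};

// One slot per device kind. Static storage is zero-initialized, so every
// slot starts out as nullptr.
static SketchAllocator* g_allocators[static_cast<int>(DeviceKind::NumKinds)];

// Stores a non-owning pointer; the allocator is expected to outlive all users.
void SetSketchAllocator(DeviceKind k, SketchAllocator* alloc) {
  g_allocators[static_cast<int>(k)] = alloc;
}

SketchAllocator* GetSketchAllocator(DeviceKind k) {
  return g_allocators[static_cast<int>(k)];
}

// Trivial allocator used only to exercise the registry.
struct NullAllocator final : SketchAllocator {
  void* allocate(std::size_t) const override { return nullptr; }
};
```

Because the table holds raw pointers, replacing an entry does not invalidate data allocated through the previous allocator, which matches the "static lifetime, no ownership" wording in the `SetAllocator` comment in `Allocator.h`.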

c10/core/Allocator.h

+2 -7
@@ -16,7 +16,7 @@ namespace c10 {
 // nullptr DataPtrs can still have a nontrivial device; this allows
 // us to treat zero-size allocations uniformly with non-zero allocations.
 //
-class DataPtr {
+class C10_API DataPtr {
  private:
   c10::detail::UniqueVoidPtr ptr_;
   Device device_;
@@ -181,11 +181,6 @@ struct C10_API InefficientStdFunctionContext {
       Device device);
 };
 
-} // namespace c10
-
-// TODO: move to c10
-namespace caffe2 {
-
 /** Set the allocator for DeviceType `t`. The passed in allocator pointer is
  * expected to have static lifetime; this function does NOT take ownership
  * of the raw pointer. (The reason for this is to prevent existing pointers
@@ -210,4 +205,4 @@ struct AllocatorRegisterer {
   static AllocatorRegisterer<t> g_allocator_##d(f); \
   }
 
-} // namespace caffe2
+} // namespace c10
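
The tail of the `REGISTER_ALLOCATOR` macro visible in this hunk relies on a static object whose constructor performs the registration before `main` runs. The sketch below shows that idiom in isolation. It is a simplified assumption-laden version: `SketchAllocator`, `SetCpuSlot`, `SketchRegisterer` and `SKETCH_REGISTER_ALLOCATOR` are hypothetical names, and the real `AllocatorRegisterer` is templated on the device type, which is omitted here.

```cpp
#include <cassert>

// Hypothetical stand-in for an allocator object.
struct SketchAllocator {
  int id;
};

// Single registry slot, constant-initialized to nullptr before any dynamic
// initialization takes place.
static SketchAllocator* g_cpu_slot = nullptr;

void SetCpuSlot(SketchAllocator* alloc) {
  g_cpu_slot = alloc;
}

// Registration happens as a side effect of constructing this object.
struct SketchRegisterer {
  explicit SketchRegisterer(SketchAllocator* alloc) {
    SetCpuSlot(alloc);
  }
};

// Declares a uniquely named static registerer at namespace scope.
#define SKETCH_REGISTER_ALLOCATOR(name, ptr) \
  static SketchRegisterer g_sketch_registerer_##name(ptr)

// The allocator is defined before the registerer in the same translation
// unit, so it is already initialized when the registerer's constructor runs.
static SketchAllocator g_default_alloc{42};
SKETCH_REGISTER_ALLOCATOR(cpu, &g_default_alloc);
```

By the time `main` starts, `g_cpu_slot` already points at `g_default_alloc`. The idiom is only safe across translation units when the registry slot itself needs no dynamic initialization, as is the case for a zero-initialized pointer array.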

c10/core/CPUAllocator.cpp

+170
@@ -0,0 +1,170 @@
+#include <c10/core/CPUAllocator.h>
+#include <c10/util/typeid.h>
+#include <c10/core/DeviceType.h>
+
+// TODO: rename flags to C10
+C10_DEFINE_bool(
+    caffe2_report_cpu_memory_usage,
+    false,
+    "If set, print out detailed memory usage");
+
+C10_DEFINE_bool(
+    caffe2_cpu_allocator_do_zero_fill,
+    false,
+    "If set, do memory zerofilling when allocating on CPU");
+
+C10_DEFINE_bool(
+    caffe2_cpu_allocator_do_junk_fill,
+    false,
+    "If set, fill memory with deterministic junk when allocating on CPU");
+
+namespace c10 {
+
+void memset_junk(void* data, size_t num) {
+  // This garbage pattern is NaN when interpreted as floating point values,
+  // or as very large integer values.
+  static constexpr int32_t kJunkPattern = 0x7fedbeef;
+  static constexpr int64_t kJunkPattern64 =
+      static_cast<int64_t>(kJunkPattern) << 32 | kJunkPattern;
+  int32_t int64_count = num / sizeof(kJunkPattern64);
+  int32_t remaining_bytes = num % sizeof(kJunkPattern64);
+  int64_t* data_i64 = reinterpret_cast<int64_t*>(data);
+  for (int i = 0; i < int64_count; i++) {
+    data_i64[i] = kJunkPattern64;
+  }
+  if (remaining_bytes > 0) {
+    memcpy(data_i64 + int64_count, &kJunkPattern64, remaining_bytes);
+  }
+}
+
+void* alloc_cpu(size_t nbytes) {
+  if (nbytes == 0) {
+    return nullptr;
+  }
+
+  void* data;
+#ifdef __ANDROID__
+  data = memalign(gAlignment, nbytes);
+#elif defined(_MSC_VER)
+  data = _aligned_malloc(nbytes, gAlignment);
+#else
+  CAFFE_ENFORCE_EQ(posix_memalign(&data, gAlignment, nbytes), 0);
+#endif
+
+  CAFFE_ENFORCE(
+      data,
+      "DefaultCPUAllocator: not enough memory: you tried to allocate %dGB. Buy new RAM!",
+      nbytes / 1073741824);
+
+  // move data to a thread's NUMA node
+  NUMAMove(data, nbytes, GetCurrentNUMANode());
+  CHECK(
+      !FLAGS_caffe2_cpu_allocator_do_zero_fill ||
+      !FLAGS_caffe2_cpu_allocator_do_junk_fill)
+      << "Cannot request both zero-fill and junk-fill at the same time";
+  if (FLAGS_caffe2_cpu_allocator_do_zero_fill) {
+    memset(data, 0, nbytes);
+  } else if (FLAGS_caffe2_cpu_allocator_do_junk_fill) {
+    memset_junk(data, nbytes);
+  }
+
+  return data;
+}
+
+// A virtual struct that is used to report C10's memory allocation and
+// deallocation status
+class C10_API MemoryAllocationReporter {
+ public:
+  MemoryAllocationReporter() : allocated_(0) {}
+  void New(void* ptr, size_t nbytes);
+  void Delete(void* ptr);
+
+ private:
+  std::mutex mutex_;
+  std::unordered_map<void*, size_t> size_table_;
+  size_t allocated_;
+};
+
+struct C10_API DefaultCPUAllocator final : at::Allocator {
+  DefaultCPUAllocator() {}
+  ~DefaultCPUAllocator() override {}
+  at::DataPtr allocate(size_t nbytes) const override {
+    void* data = alloc_cpu(nbytes);
+    if (FLAGS_caffe2_report_cpu_memory_usage && nbytes > 0) {
+      getMemoryAllocationReporter().New(data, nbytes);
+      return {data, data, &ReportAndDelete, at::Device(at::DeviceType::CPU)};
+    }
+    return {data, data, &Delete, at::Device(at::DeviceType::CPU)};
+  }
+
+#ifdef _MSC_VER
+  static void Delete(void* data) {
+    _aligned_free(data);
+  }
+#else
+  static void Delete(void* data) {
+    free(data);
+  }
+#endif
+
+  static void ReportAndDelete(void* ptr) {
+    if (!ptr) {
+      return;
+    }
+    getMemoryAllocationReporter().Delete(ptr);
+    Delete(ptr);
+  }
+
+  at::DeleterFnPtr raw_deleter() const override {
+    if (FLAGS_caffe2_report_cpu_memory_usage) {
+      return &ReportAndDelete;
+    }
+    return &Delete;
+  }
+
+ protected:
+  static MemoryAllocationReporter& getMemoryAllocationReporter() {
+    static MemoryAllocationReporter reporter_;
+    return reporter_;
+  }
+
+};
+
+void NoDelete(void*) {}
+
+at::Allocator* GetCPUAllocator() {
+  return GetAllocator(DeviceType::CPU);
+}
+
+void SetCPUAllocator(at::Allocator* alloc) {
+  SetAllocator(DeviceType::CPU, alloc);
+}
+
+// Global default CPU Allocator
+static DefaultCPUAllocator g_cpu_alloc;
+
+at::Allocator* GetDefaultCPUAllocator() {
+  return &g_cpu_alloc;
+}
+
+REGISTER_ALLOCATOR(DeviceType::CPU, &g_cpu_alloc);
+
+void MemoryAllocationReporter::New(void* ptr, size_t nbytes) {
+  std::lock_guard<std::mutex> guard(mutex_);
+  size_table_[ptr] = nbytes;
+  allocated_ += nbytes;
+  LOG(INFO) << "C10 alloc " << nbytes << " bytes, total alloc " << allocated_
+            << " bytes.";
+}
+
+void MemoryAllocationReporter::Delete(void* ptr) {
+  std::lock_guard<std::mutex> guard(mutex_);
+  auto it = size_table_.find(ptr);
+  CHECK(it != size_table_.end());
+  allocated_ -= it->second;
+  LOG(INFO) << "C10 deleted " << it->second << " bytes, total alloc "
+            << allocated_ << " bytes.";
+  size_table_.erase(it);
+}
+
+} // namespace c10
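
To illustrate the junk-fill option, here is a standalone sketch of the same fill pattern with a check that 32-bit floats read back as NaN. It is not the c10 function: `sketch_memset_junk` and `all_nan` are hypothetical names, the sketch uses `std::size_t` counters and `memcpy` stores instead of the `int32_t` counters and direct `int64_t` writes in the diff, and it assumes IEEE 754 `float`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Fills `num` bytes with the repeating 32-bit pattern 0x7fedbeef.
void sketch_memset_junk(void* data, std::size_t num) {
  static constexpr std::int32_t kJunkPattern = 0x7fedbeef;
  static constexpr std::int64_t kJunkPattern64 =
      static_cast<std::int64_t>(kJunkPattern) << 32 | kJunkPattern;
  auto* bytes = static_cast<unsigned char*>(data);
  const std::size_t full = num / sizeof(kJunkPattern64);
  const std::size_t rest = num % sizeof(kJunkPattern64);
  // Copy whole 8-byte words; memcpy avoids any alignment assumption.
  for (std::size_t i = 0; i < full; ++i) {
    std::memcpy(bytes + i * sizeof(kJunkPattern64), &kJunkPattern64,
                sizeof(kJunkPattern64));
  }
  // Copy the leading bytes of the pattern into the tail, if any.
  if (rest > 0) {
    std::memcpy(bytes + full * sizeof(kJunkPattern64), &kJunkPattern64, rest);
  }
}

// True when every element of the buffer is NaN.
bool all_nan(const std::vector<float>& values) {
  for (float v : values) {
    if (!std::isnan(v)) {
      return false;
    }
  }
  return true;
}
```

The pattern has all eight exponent bits set and a non-zero mantissa, so any `float` read from a junk-filled buffer is NaN. Viewed as a 64-bit `double`, the same bytes decode to a very large finite value rather than NaN, since the 11-bit exponent field is `0x7fe`.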

c10/core/CPUAllocator.h

+41
@@ -0,0 +1,41 @@
+#pragma once
+
+#include <cstring>
+#include <unordered_map>
+
+#include <c10/core/Allocator.h>
+#include <c10/util/Logging.h>
+#include <c10/util/numa.h>
+
+// TODO: rename to c10
+C10_DECLARE_bool(caffe2_report_cpu_memory_usage);
+C10_DECLARE_bool(caffe2_cpu_allocator_do_zero_fill);
+C10_DECLARE_bool(caffe2_cpu_allocator_do_junk_fill);
+
+namespace c10 {
+
+// Use 64-byte alignment should be enough for computation up to AVX512.
+constexpr size_t gAlignment = 64;
+
+using MemoryDeleter = void (*)(void*);
+
+// A helper function that is basically doing nothing.
+C10_API void NoDelete(void*);
+
+// Fill the data memory region of num bytes with a particular garbage pattern.
+// The garbage value is chosen to be NaN if interpreted as floating point value,
+// or a very large integer.
+C10_API void memset_junk(void* data, size_t num);
+
+C10_API void* alloc_cpu(size_t nbytes);
+
+// Get the CPU Alloctor.
+C10_API at::Allocator* GetCPUAllocator();
+// Sets the CPU allocator to the given allocator: the caller gives away the
+// ownership of the pointer.
+C10_API void SetCPUAllocator(at::Allocator* alloc);
+
+// Get the Default CPU Allocator
+C10_API at::Allocator* GetDefaultCPUAllocator();
+
+} // namespace c10
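
The header exposes `MemoryDeleter` as a plain function pointer together with a do-nothing `NoDelete`. One way to see why both are useful is to pair a pointer with its deleter, as sketched below. `std::unique_ptr` is used only as a stand-in and is not how `at::DataPtr` is implemented; `SketchDataPtr`, `CountingFree`, `NoDeleteSketch`, `MakeOwned` and `MakeBorrowed` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Same shape as the alias declared in the header.
using MemoryDeleter = void (*)(void*);

// Counts how many times the owning deleter ran, for demonstration only.
static int g_free_calls = 0;

void CountingFree(void* p) {
  ++g_free_calls;
  std::free(p);
}

// Deleter for memory owned by someone else: intentionally does nothing.
void NoDeleteSketch(void*) {}

// A void pointer bundled with the function that releases it.
using SketchDataPtr = std::unique_ptr<void, MemoryDeleter>;

// Owns the buffer; CountingFree runs when the handle is destroyed.
SketchDataPtr MakeOwned(std::size_t nbytes) {
  return SketchDataPtr(std::malloc(nbytes), &CountingFree);
}

// Wraps external memory; destroying the handle leaves it untouched.
SketchDataPtr MakeBorrowed(void* external) {
  return SketchDataPtr(external, &NoDeleteSketch);
}
```

Carrying the deleter with the pointer is also what allows `DefaultCPUAllocator::allocate` in this commit to hand out either `Delete` or `ReportAndDelete` per allocation, depending on the reporting flag.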

c10/core/StorageImpl.h

+1 -1
@@ -53,7 +53,7 @@ struct C10_API StorageImpl final : public c10::intrusive_ptr_target {
             data_type,
             0,
             at::DataPtr(nullptr, device),
-            caffe2::GetAllocator(device.type()),
+            GetAllocator(device.type()),
             true) {}
 
   StorageImpl& operator=(StorageImpl&& other) = default;

c10/core/TensorImpl.h

+1 -1
@@ -1188,7 +1188,7 @@ struct C10_API TensorImpl : public c10::intrusive_ptr_target {
       // know how to reallocate it. However, in order to preserve legacy C2
       // behavior, we allow reallocating the memory using default allocator.
       if (allocator == nullptr) {
-        allocator = caffe2::GetAllocator(storage_.device_type());
+        allocator = GetAllocator(storage_.device_type());
       }
       if (meta.placementNew()) {
         // For types that need placement new, we will call it, as well as
