-
Notifications
You must be signed in to change notification settings - Fork 161
Open
Description
Performance Issue: Unnecessary Matrix Copy in MMCS Sampling
📋 Summary
The MMCS (Multiphase Monte Carlo Sampling) implementation performs an expensive matrix transpose operation that creates an unnecessary full matrix copy. This impacts performance, especially for high-dimensional sampling scenarios with large sample sizes.
🔍 Problem Description
Current Implementation
In include/sampling/mmcs.hpp (line 234) and related files, the code performs:
MT Samples = TotalRandPoints.transpose(); //do not copy TODO!
for (int i = 0; i < total_samples; i++)
{
Samples.col(i) = T * Samples.col(i) + T_shift;
}
S.conservativeResize(P.dimension(), total_number_of_samples_in_P0 + total_samples);
S.block(0, total_number_of_samples_in_P0, P.dimension(), total_samples) =
Samples.block(0, 0, P.dimension(), total_samples);The Problem
Current Flow (Inefficient):
TotalRandPoints [EXPENSIVE COPY] Samples S (Final)
(n × d matrix) transpose() creates (d × n matrix) (d × n matrix)
┌─────────────┐ full copy ┌─────────────┐ ┌─────────────┐
│ Row 0: x₁ │ ──────────────────────────────>│ Col 0: x₁ │ │ Col 0: Tx₁ │
│ Row 1: x₂ │ ❌ Memory Copy │ Col 1: x₂ │ │ Col 1: Tx₂ │
│ Row 2: x₃ │ ❌ O(n×d) operation │ Col 2: x₃ │ │ Col 2: Tx₃ │
│ ... │ │ ... │ │ ... │
│ Row n: xₙ │ │ Col n: xₙ │ │ Col n: Txₙ │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
│ │ │
│ Transform: T * col + shift │ │
│ └────────────────────┘
│ │
└──────────────────────────────────────────────────────────┘
Total: 2× memory operations + transpose overhead
Memory Flow:
- Read
TotalRandPoints(n×d elements) - Allocate new matrix
Samples(d×n elements) ← WASTED MEMORY - Copy and transpose all elements ← EXPENSIVE OPERATION
- Transform each column:
T * Samples.col(i) + T_shift - Copy transformed data to
S
Issues:
-
Memory Overhead: Creates a full copy of the transposed matrix
- For typical case: 50 dimensions × 5000 samples = 250,000 elements
- Memory: ~2MB (double precision) wasted per iteration
-
Computation Overhead: O(n×d) transpose operation
- Unnecessary for the actual computation needed
-
Memory Bandwidth: Doubles memory traffic
- Read from
TotalRandPoints→ Write toSamples→ Read fromSamples→ Write toS
- Read from
Matrix Dimensions
TotalRandPoints: (num_samples × dimension)
- Each row represents one sample point
- Shape: [total_samples, P.dimension()]
Samples (transposed copy): (dimension × num_samples)
- Each column represents one sample point (transposed)
- Shape: [P.dimension(), total_samples]
S (final result): (dimension × total_samples)
- Each column is a transformed sample
- Shape: [P.dimension(), total_number_of_samples_in_P0 + total_samples]
💡 Proposed Solution
Optimized Implementation
Instead of creating a transposed copy, work directly with rows of TotalRandPoints:
// Optimized: avoid transpose copy by working directly with rows
S.conservativeResize(P.dimension(), total_number_of_samples_in_P0 + total_samples);
for (int i = 0; i < total_samples; i++)
{
// Transform each sample: T * sample + T_shift, then store as column in S
S.col(total_number_of_samples_in_P0 + i).noalias() =
T * TotalRandPoints.row(i).transpose() + T_shift;
}Optimized Flow
Optimized Flow (Efficient):
TotalRandPoints S (Final)
(n × d matrix) (d × n matrix)
┌─────────────┐ ┌─────────────┐
│ Row 0: x₁ │ ──Direct: T * row(i).T + shift──────>│ Col 0: Tx₁ │
│ Row 1: x₂ │ ✅ No copy │ Col 1: Tx₂ │
│ Row 2: x₃ │ ✅ Single pass │ Col 2: Tx₃ │
│ ... │ ✅ Efficient │ ... │
│ Row n: xₙ │ │ Col n: Txₙ │
└─────────────┘ └─────────────┘
│ │
└───────────────────────────────────────────────────────┘
Direct transformation, no intermediate copy
Memory Flow:
- Read
TotalRandPoints.row(i)(single row, d elements) - Transform:
T * row(i).transpose() + T_shift(in-place computation) - Assign directly to
S.col(i)withnoalias()← NO COPY - Repeat for all rows
Key Improvement:
- ❌ Before: Read → Copy → Transpose → Transform → Copy = 5 operations
- ✅ After: Read → Transform → Assign = 3 operations
Benefits:
- ✅ No Memory Copy: Eliminates intermediate
Samplesmatrix - ✅ Direct Computation: Transform and assign in one step
- ✅ Better Cache Usage: Single pass through data
- ✅ Uses
noalias(): Prevents Eigen from creating temporaries
References
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels