Skip to content

Performance Issue: Unnecessary Matrix Copy in MMCS Sampling #413

@Jitmisra

Description

@Jitmisra

Performance Issue: Unnecessary Matrix Copy in MMCS Sampling

📋 Summary

The MMCS (Multiphase Monte Carlo Sampling) implementation performs an expensive matrix transpose operation that creates an unnecessary full matrix copy. This impacts performance, especially for high-dimensional sampling scenarios with large sample sizes.

🔍 Problem Description

Current Implementation

In include/sampling/mmcs.hpp (line 234) and related files, the code performs:

MT Samples = TotalRandPoints.transpose(); //do not copy TODO!
for (int i = 0; i < total_samples; i++)
{
    Samples.col(i) = T * Samples.col(i) + T_shift;
}
S.conservativeResize(P.dimension(), total_number_of_samples_in_P0 + total_samples);
S.block(0, total_number_of_samples_in_P0, P.dimension(), total_samples) = 
    Samples.block(0, 0, P.dimension(), total_samples);

The Problem

Current Flow (Inefficient):

TotalRandPoints          [EXPENSIVE COPY]        Samples              S (Final)
(n × d matrix)          transpose() creates    (d × n matrix)       (d × n matrix)
┌─────────────┐         full copy              ┌─────────────┐     ┌─────────────┐
│ Row 0: x₁   │ ──────────────────────────────>│ Col 0: x₁   │     │ Col 0: Tx₁  │
│ Row 1: x₂   │         ❌ Memory Copy          │ Col 1: x₂   │     │ Col 1: Tx₂  │
│ Row 2: x₃   │         ❌ O(n×d) operation    │ Col 2: x₃   │     │ Col 2: Tx₃  │
│    ...      │                                  │    ...      │     │    ...      │
│ Row n: xₙ   │                                  │ Col n: xₙ   │     │ Col n: Txₙ  │
└─────────────┘                                  └─────────────┘     └─────────────┘
     │                                                  │                    │
     │                                                  │                    │
     │                    Transform: T * col + shift   │                    │
     │                                                  └────────────────────┘
     │                                                          │
     └──────────────────────────────────────────────────────────┘
                    Total: 2× memory operations + transpose overhead

Memory Flow:

  1. Read TotalRandPoints (n×d elements)
  2. Allocate new matrix Samples (d×n elements) ← WASTED MEMORY
  3. Copy and transpose all elements ← EXPENSIVE OPERATION
  4. Transform each column: T * Samples.col(i) + T_shift
  5. Copy transformed data to S

Issues:

  1. Memory Overhead: Creates a full copy of the transposed matrix

    • For typical case: 50 dimensions × 5000 samples = 250,000 elements
    • Memory: ~2MB (double precision) wasted per iteration
  2. Computation Overhead: O(n×d) transpose operation

    • Unnecessary for the actual computation needed
  3. Memory Bandwidth: Doubles memory traffic

    • Read from TotalRandPoints → Write to Samples → Read from Samples → Write to S

Matrix Dimensions

TotalRandPoints: (num_samples × dimension)
  - Each row represents one sample point
  - Shape: [total_samples, P.dimension()]

Samples (transposed copy): (dimension × num_samples)
  - Each column represents one sample point (transposed)
  - Shape: [P.dimension(), total_samples]

S (final result): (dimension × total_samples)
  - Each column is a transformed sample
  - Shape: [P.dimension(), total_number_of_samples_in_P0 + total_samples]

💡 Proposed Solution

Optimized Implementation

Instead of creating a transposed copy, work directly with rows of TotalRandPoints:

// Optimized: avoid transpose copy by working directly with rows
S.conservativeResize(P.dimension(), total_number_of_samples_in_P0 + total_samples);
for (int i = 0; i < total_samples; i++)
{
    // Transform each sample: T * sample + T_shift, then store as column in S
    S.col(total_number_of_samples_in_P0 + i).noalias() = 
        T * TotalRandPoints.row(i).transpose() + T_shift;
}

Optimized Flow

Optimized Flow (Efficient):

TotalRandPoints                                        S (Final)
(n × d matrix)                                         (d × n matrix)
┌─────────────┐                                       ┌─────────────┐
│ Row 0: x₁   │ ──Direct: T * row(i).T + shift──────>│ Col 0: Tx₁  │
│ Row 1: x₂   │    ✅ No copy                        │ Col 1: Tx₂  │
│ Row 2: x₃   │    ✅ Single pass                     │ Col 2: Tx₃  │
│    ...      │    ✅ Efficient                       │    ...      │
│ Row n: xₙ   │                                       │ Col n: Txₙ  │
└─────────────┘                                       └─────────────┘
     │                                                       │
     └───────────────────────────────────────────────────────┘
                    Direct transformation, no intermediate copy

Memory Flow:

  1. Read TotalRandPoints.row(i) (single row, d elements)
  2. Transform: T * row(i).transpose() + T_shift (in-place computation)
  3. Assign directly to S.col(i) with noalias()NO COPY
  4. Repeat for all rows

Key Improvement:

  • Before: Read → Copy → Transpose → Transform → Copy = 5 operations
  • After: Read → Transform → Assign = 3 operations

Benefits:

  1. No Memory Copy: Eliminates intermediate Samples matrix
  2. Direct Computation: Transform and assign in one step
  3. Better Cache Usage: Single pass through data
  4. Uses noalias(): Prevents Eigen from creating temporaries

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions