.. title: Optimizing chunks for matrix multiplication: a new approach to matrix processing
.. author: Ricardo Sales Piquer
.. slug: optimizing-chunks-blosc2
.. date: 2025-04-12 9:00:00 UTC
.. tags: blosc2, optimization, matrix multiplication, matmul, compression
.. category:
.. link:
.. description: Exploring how to optimize chunk sizes in Blosc2 to improve performance in matrix multiplication.
.. type: text

As data volumes continue to grow in fields like machine learning and scientific computing,
optimizing fundamental operations like matrix multiplication becomes increasingly critical.
Blosc2's chunk-based approach offers a new path to efficiency in these scenarios.

Matrix Multiplication
---------------------
Matrix multiplication is a fundamental operation in many scientific and
engineering applications. With the introduction of matrix multiplication into
Blosc2, users can now perform this operation on compressed arrays efficiently.
The key advantages of having matrix multiplication in Blosc2 include:

- **Compressed matrices in memory:**
  Blosc2 enables matrices to be stored in a compressed format without sacrificing
  the ability to perform operations directly on them.

- **Efficiency with chunks:**
  In computation-intensive applications, matrix multiplication can be executed
  without fully decompressing the data, operating on small blocks of data independently,
  saving both time and memory.

- **Out-of-core computation:**
  When matrices are too large to fit in main memory, Blosc2 facilitates out-of-core
  processing. Data stored on disk is read and processed in optimized chunks,
  allowing matrix multiplication operations without loading the entire dataset into
  memory (see the sketch at the end of this section).

These features are especially valuable in big data environments and in scientific
or engineering applications where matrix sizes can be overwhelming, as they enable
complex calculations to be carried out efficiently.

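As a taste of the out-of-core workflow, here is a minimal sketch. It assumes the
operands live on disk as ``.b2nd`` files (the file names are illustrative) and
that ``blosc2.matmul`` accepts on-disk ``NDArray`` objects opened with
``blosc2.open``; check the python-blosc2 documentation for the exact keyword
arguments available in your version.

.. code-block:: python

    import numpy as np
    import blosc2

    shape = (4_000, 4_000)
    rng = np.random.default_rng(0)

    # Persist two compressed operands on disk (in a real out-of-core setting
    # they would have been written chunk by chunk by some earlier process).
    blosc2.asarray(rng.random(shape), urlpath="A.b2nd", mode="w")
    blosc2.asarray(rng.random(shape), urlpath="B.b2nd", mode="w")

    # Open them lazily: only the chunks needed at each step of the
    # multiplication are read from disk and decompressed.
    A = blosc2.open("A.b2nd")
    B = blosc2.open("B.b2nd")

    C = blosc2.matmul(A, B)   # the result is again a compressed NDArray
    print(C.shape, C.dtype)   # (4000, 4000) float64
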
Implementation
--------------
The matrix multiplication functionality is implemented in the ``matmul``
function. It supports Blosc2 ``NDArray`` objects and leverages chunked
operations to perform the multiplication efficiently.

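In practice, a call might look like the following sketch. The chunk shapes are
purely illustrative, and leaving ``chunks`` unset should let Blosc2 pick a chunk
shape automatically (the "Auto" mode shown in the benchmarks below).

.. code-block:: python

    import numpy as np
    import blosc2

    n = 3_000
    a = np.linspace(0, 1, n * n, dtype=np.float64).reshape(n, n)
    b = np.linspace(1, 2, n * n, dtype=np.float64).reshape(n, n)

    # Compress the operands into NDArray containers with an explicit chunk shape.
    a_b2 = blosc2.asarray(a, chunks=(1_000, 1_000))
    b_b2 = blosc2.asarray(b, chunks=(1_000, 1_000))

    # Multiply the compressed matrices; only the blocks needed at each step
    # are decompressed.
    c_b2 = blosc2.matmul(a_b2, b_b2)

    # Slicing an NDArray returns a NumPy array, so the result is easy to verify.
    np.testing.assert_allclose(c_b2[:], a @ b)
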
.. image:: /images/blosc2-matmul/blocked-gemm.png
   :align: center
   :alt: How blocked matrix multiplication works

The image illustrates a **blocked matrix multiplication** approach. The key idea
is to divide matrices into smaller blocks (or chunks) to optimize memory
access and computational efficiency.

In the image, matrix :math:`A` (:math:`M \times K`) and matrix :math:`B`
(:math:`K \times N`) are partitioned into chunks, and these chunks are further
partitioned into blocks. The resulting matrix :math:`C` (:math:`M \times N`) is
computed as a sum of block-wise multiplications.

This method significantly improves cache utilization by ensuring that only the
necessary parts of the matrices are loaded into memory at any given time. In
Blosc2, storing matrix blocks as compressed chunks reduces memory footprint and
enhances performance by enabling on-the-fly decompression.

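The following toy NumPy function is not Blosc2's implementation, but it captures
the blocking idea that the figure illustrates: each tile of :math:`C` is
accumulated from products of the corresponding tiles of :math:`A` and :math:`B`.

.. code-block:: python

    import numpy as np

    def blocked_matmul(A, B, bs=256):
        """Naive blocked (tiled) matrix multiplication, purely illustrative."""
        M, K = A.shape
        K2, N = B.shape
        assert K == K2, "inner dimensions must match"
        C = np.zeros((M, N), dtype=np.result_type(A, B))
        for i in range(0, M, bs):
            for j in range(0, N, bs):
                for k in range(0, K, bs):
                    # Each (i, j) tile of C accumulates the products of the
                    # corresponding tiles of A and B.
                    C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
        return C

    A = np.random.rand(500, 300)
    B = np.random.rand(300, 400)
    np.testing.assert_allclose(blocked_matmul(A, B), A @ B)
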
Also, Blosc2 supports a wide range of data types. In addition to standard Python
types such as ``int``, ``float``, and ``complex``, it also fully supports various NumPy
types. The currently supported types include:

- ``np.int8``
- ``np.int16``
- ``np.int32``
- ``np.int64``
- ``np.float32``
- ``np.float64``
- ``np.complex64``
- ``np.complex128``

This versatility allows compression and subsequent processing to be
applied across diverse scenarios, tailored to the specific needs of each
application.

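As a quick illustration of this flexibility (assuming, as the list above states,
that ``matmul`` accepts ``NDArray`` operands of these dtypes), the same call
works unchanged across types:

.. code-block:: python

    import numpy as np
    import blosc2

    for dtype in (np.int32, np.float64, np.complex128):
        a = blosc2.asarray(np.arange(64).reshape(8, 8).astype(dtype))
        b = blosc2.asarray(np.eye(8, dtype=dtype))
        c = blosc2.matmul(a, b)
        # Inspect the shape and dtype of the compressed result.
        print(dtype.__name__, c.shape, c.dtype)
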
Together, these features make Blosc2 a flexible and adaptable tool for many
scenarios, and one especially well suited to handling large datasets.

Benchmarks
----------
The benchmarks have been designed to evaluate the performance of the ``matmul``
function under various conditions. Here are the key aspects of our
experimental setup and findings:

Different matrix sizes were tested using both ``float32`` and ``float64``
data types. All the matrices used for multiplication are square.
Varying the matrix size helps us observe how the function scales and
how the overhead of chunk management impacts performance.

In the plots below, the x-axis represents the size of the resulting matrix in
megabytes (MB). We used GFLOPS (Giga Floating-Point Operations per Second) to
gauge the computational throughput, allowing us to compare the efficiency of the
``matmul`` function relative to highly optimized libraries like NumPy.

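The GFLOPS figures follow the usual convention of counting roughly
:math:`2n^3` floating-point operations for an :math:`n \times n`
multiplication. A simplified timing sketch (not the exact benchmark script)
looks like this:

.. code-block:: python

    import time
    import numpy as np
    import blosc2

    n = 4_000
    a = blosc2.asarray(np.random.rand(n, n).astype(np.float32))
    b = blosc2.asarray(np.random.rand(n, n).astype(np.float32))

    t0 = time.perf_counter()
    blosc2.matmul(a, b)
    elapsed = time.perf_counter() - t0

    # A dense n x n by n x n multiplication performs about 2 * n**3 flops.
    gflops = 2 * n**3 / (elapsed * 1e9)
    print(f"{gflops:.1f} GFLOPS in {elapsed:.2f} s")
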
Blosc2 can also select the chunk shape automatically; this mode is labeled
"Auto" in the plots.

.. image:: /images/blosc2-matmul/float32.png
   :align: center
   :alt: Benchmark float32

.. image:: /images/blosc2-matmul/float64.png
   :align: center
   :alt: Benchmark float64

For smaller matrices, the overhead of managing chunks in Blosc2 can result in
lower GFLOPS compared to NumPy. As the matrix size increases, Blosc2 scales
well and its performance approaches that of NumPy.

Each chunk shape exhibits peak performance when the matrix shape matches the
chunk shape, or is a multiple of it.

Conclusion
----------
The new matrix multiplication feature in Blosc2 introduces efficient, chunked
computation for compressed arrays. This allows users to handle large datasets
both in memory and on disk without sacrificing performance. The implementation
supports a wide range of data types, making it versatile for various numerical
applications.

Real-world applications, such as neural network training, demonstrate the
potential benefits in scenarios where memory constraints and large data sizes
are common. While there are some limitations, such as support only for 2D arrays
and the overhead of blocking, the outlook is promising, with potential
integration into deep learning frameworks.

Overall, Blosc2 offers a compelling alternative for applications where the
advantages of compression and out-of-core computation are critical, paving
the way for more efficient processing of massive datasets.

Getting my feet wet with Blosc2
-------------------------------
In the initial phase of the project, my biggest challenge was understanding how
Blosc2 manages data internally. For matrix multiplication, it was critical to
grasp how to choose the right chunks, since the operation requires that the
chunked ranges of both matrices line up. After some consideration and a few insightful
conversations with Francesc, I finally understood the underlying mechanics.
This breakthrough allowed me to begin implementing the first versions of my
solution, adjusting the data partitioning so that each block was properly
aligned for correct computation.

Another important aspect was adapting to the professional workflow of using Git
for version control. Embracing Git, with its branch creation, regular commits,
and conflict resolution, represented a significant shift in my development
approach. This experience not only improved the organization of my code and
facilitated collaboration, but also instilled a structured and disciplined
mindset in managing my projects. Git has proven to be both valuable and
extremely helpful.

Finally, the moment when the function returned the correct result for the first
time was really exciting. After multiple iterations, the rigorous debugging
process paid off as everything fell into place. This breakthrough validated the
robustness of the implementation and boosted my confidence to further optimize
and tackle new challenges in data processing.