.. title: Optimizing chunks for matrix multiplication. A new approach to matrix processing
.. author: Ricardo Sales Piquer
.. slug: optimizing-chunks-blosc2
.. date: 2025-04-12 9:00:00 UTC
.. tags: blosc2, optimization, matrix multiplication, matmul, compression
.. category:
.. link:
.. description: Exploring how to optimize chunk sizes in Blosc2 to improve performance in matrix multiplication.
.. type: text

As data volumes continue to grow in fields like machine learning and scientific computing,
optimizing fundamental operations like matrix multiplication becomes increasingly critical.
Blosc2's chunk-based approach offers a new path to efficiency in these scenarios.

Matrix Multiplication
---------------------

Matrix multiplication is a fundamental operation in many scientific and
engineering applications. With the introduction of matrix multiplication into
Blosc2, users can now perform this operation on compressed arrays efficiently.
The key advantages of matrix multiplication in Blosc2 include:

- **Compressed matrices in memory:**
  Blosc2 stores matrices in a compressed format without sacrificing the
  ability to perform operations directly on them.

- **Efficiency with chunks:**
  In computation-intensive applications, matrix multiplication can be executed
  without fully decompressing the data: small blocks are processed
  independently, saving both time and memory.

- **Out-of-core computation:**
  When matrices are too large to fit in main memory, Blosc2 facilitates
  out-of-core processing. Data stored on disk is read and processed in
  optimized chunks, so matrix multiplication can run without loading the
  entire dataset into memory.

These features are especially valuable in big data environments and in scientific
or engineering applications where matrix sizes can be overwhelming, enabling
complex calculations to be carried out efficiently.
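
The out-of-core path can be sketched in a few lines. This is a minimal,
hypothetical example assuming python-blosc2's ``blosc2.asarray``,
``blosc2.open``, and ``blosc2.matmul``; the ``.b2nd`` file names are
illustrative:

.. code-block:: python

    import numpy as np
    import blosc2

    # Persist two compressed matrices on disk (chunks are written compressed).
    blosc2.asarray(np.full((256, 256), 2.0), urlpath="a.b2nd", mode="w")
    blosc2.asarray(np.full((256, 256), 3.0), urlpath="b.b2nd", mode="w")

    # Reopen them lazily: chunks are read and decompressed only on demand.
    a = blosc2.open("a.b2nd")
    b = blosc2.open("b.b2nd")

    # Multiply without materializing the full operands in memory.
    c = blosc2.matmul(a, b)
    print(c.shape)  # (256, 256)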

Implementation
--------------

The matrix multiplication functionality is implemented in the ``matmul``
function. It supports Blosc2 ``NDArray`` objects and leverages chunked
operations to perform the multiplication efficiently.
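
As a quick illustration of the API, the following sketch (assuming the
``blosc2.matmul`` and ``blosc2.asarray`` functions of python-blosc2)
multiplies two compressed in-memory matrices:

.. code-block:: python

    import numpy as np
    import blosc2

    # Compress two small square matrices into NDArray containers.
    a = blosc2.asarray(np.arange(16, dtype=np.float64).reshape(4, 4))
    b = blosc2.asarray(np.eye(4, dtype=np.float64))

    # The multiplication operates chunk by chunk on the compressed data.
    c = blosc2.matmul(a, b)

    # Slicing decompresses the result back into a NumPy array.
    print(c[:])  # a times the identity, i.e. the original matrix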

.. image:: /images/blosc2-matmul/blocked-gemm.png
   :align: center
   :alt: How blocked matrix multiplication works

The image illustrates a **blocked matrix multiplication** approach. The key idea
is to divide matrices into smaller blocks (or chunks) to optimize memory
access and computational efficiency.

In the image, matrix :math:`A` (:math:`M \times K`) and matrix :math:`B`
(:math:`K \times N`) are partitioned into chunks, and the chunks are further
partitioned into blocks. The resulting matrix :math:`C` (:math:`M \times N`)
is computed as a sum of block-wise products.

This method significantly improves cache utilization by ensuring that only the
necessary parts of the matrices are loaded into memory at any given time. In
Blosc2, storing matrix blocks as compressed chunks reduces the memory footprint
and enhances performance by enabling on-the-fly decompression.
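
The blocked scheme itself is easy to express in plain NumPy. This toy sketch
(not Blosc2's actual implementation) accumulates each block of :math:`C` as a
sum of block products over the :math:`K` dimension:

.. code-block:: python

    import numpy as np

    def blocked_matmul(A, B, bs=2):
        """Multiply A (M x K) by B (K x N) one block at a time.

        Each C block accumulates a sum of small A-block @ B-block products,
        mirroring how chunked multiplication touches only small pieces of
        the operands at any given moment.
        """
        M, K = A.shape
        K2, N = B.shape
        assert K == K2, "inner dimensions must match"
        C = np.zeros((M, N), dtype=np.result_type(A, B))
        for i in range(0, M, bs):
            for j in range(0, N, bs):
                for k in range(0, K, bs):
                    C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
        return C

    A = np.arange(36, dtype=np.float64).reshape(6, 6)
    B = np.ones((6, 6))
    print(np.allclose(blocked_matmul(A, B, bs=2), A @ B))  # True

Because the slices clamp at the array edges, the sketch also works when the
matrix size is not an exact multiple of the block size.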

Blosc2 also supports a wide range of data types. In addition to standard Python
types such as ``int``, ``float``, and ``complex``, it fully supports various
NumPy types. The currently supported types include:

- ``np.int8``
- ``np.int16``
- ``np.int32``
- ``np.int64``
- ``np.float32``
- ``np.float64``
- ``np.complex64``
- ``np.complex128``

This versatility allows compression and subsequent processing to be
applied across diverse scenarios, tailored to the specific needs of each
application.

Together, these features make Blosc2 a flexible and adaptable tool for many
scenarios, and one especially well suited to handling large datasets.

Benchmarks
----------

The benchmarks were designed to evaluate the performance of the ``matmul``
function under various conditions. Here are the key aspects of our
experimental setup and findings:

Different matrix sizes were tested using both ``float32`` and ``float64``
data types. All matrices used in the multiplications are square. Varying
the matrix size helps show how the function scales and how the overhead of
chunk management impacts performance.

The x-axis represents the size of the resulting matrix in megabytes (MB).
We used GFLOPS (giga floating-point operations per second) to gauge
computational throughput, allowing us to compare the efficiency of the
``matmul`` function against a highly optimized library like NumPy.

Blosc2 can also select chunk shapes automatically; this mode is labeled
"Auto" in the benchmark plots.

.. image:: /images/blosc2-matmul/float32.png
   :align: center
   :alt: Benchmark float32

.. image:: /images/blosc2-matmul/float64.png
   :align: center
   :alt: Benchmark float64

For smaller matrices, the overhead of managing chunks in Blosc2 can result in
lower GFLOPS compared to NumPy. As the matrix size increases, Blosc2 scales
well, and its performance approaches NumPy's.

Each chunk shape exhibits peak performance when the matrix size matches,
or is a multiple of, the chunk size.

Conclusion
----------

The new matrix multiplication feature in Blosc2 introduces efficient, chunked
computation for compressed arrays. This allows users to handle large datasets
both in memory and on disk without sacrificing performance. The implementation
supports a wide range of data types, making it versatile for various numerical
applications.

Real-world applications, such as neural network training, demonstrate the
potential benefits in scenarios where memory constraints and large data sizes
are common. While there are some limitations, such as support only for 2D
arrays and the overhead of blocking, the outlook is promising, including
potential integration with deep learning frameworks.

Overall, Blosc2 offers a compelling alternative for applications where the
advantages of compression and out-of-core computation are critical, paving
the way for more efficient processing of massive datasets.

Getting my feet wet with Blosc2
-------------------------------

In the initial phase of the project, my biggest challenge was understanding how
Blosc2 manages data internally. For matrix multiplication, it was critical to
grasp how to choose the right chunks, since the operation requires the chunk
ranges of both matrices to be aligned. After some thought and a few insightful
conversations with Francesc, I finally understood the underlying mechanics.
This breakthrough allowed me to begin implementing the first versions of my
solution, adjusting the data partitioning so that each block was properly
aligned for correct computation.

Another important aspect was adapting to the professional workflow of using Git
for version control. Embracing Git, with its branch creation, regular commits,
and conflict resolution, represented a significant shift in my development
approach. This experience not only improved the organization of my code and
facilitated collaboration, but also instilled a structured and disciplined
mindset for managing my projects. This tool has proven to be both valuable and
extremely helpful.

Finally, the moment when the function first returned the correct result was
really exciting. After multiple iterations, the rigorous debugging process paid
off as everything fell into place. This breakthrough validated the robustness
of the implementation and boosted my confidence to further optimize and tackle
new challenges in data processing.
