
About Efficiency Sparse Dense Multiplication #550

Open
khoinpd0411 opened this issue Nov 24, 2024 · 4 comments
@khoinpd0411

Hi, thank you so much for a great project. May I ask a small question about sparse-dense matrix operations? I want to multiply a sparse matrix by a dense matrix; the shapes and sparsity of the two matrices are given below:

X shape: (200000, 50000) 
C shape: (50000, 100)
#nnz X: 99500704
#nnz C: 3160328
Sparsity of X:  0.0099500704
Sparsity of C:  0.6320656

where X is a sparse matrix and C is a dense one. In theory, C should be stored in a dense format, rather than a sparse one, to make the operation efficient. However, the bitmap/full dense formats provided by GraphBLAS are not as efficient as MKL for the sparse * dense operation:
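For reference, a setup like this can be reproduced at smaller scale with a minimal scipy harness (hypothetical: the matrix sizes and timing loop here are stand-ins, not the exact script that produced the numbers below):

```python
import time

import numpy as np
import scipy.sparse as sp

# Small stand-ins for the matrices in the question: X is ~1% dense sparse,
# C is ~63% dense. The real shapes are (200000, 50000) and (50000, 100).
X = sp.random(2000, 500, density=0.01, format="csr", random_state=0)
C = sp.random(500, 100, density=0.63, format="csr", random_state=1).toarray()

times = []
for _ in range(5):
    t0 = time.perf_counter()
    Y = X @ C  # sparse * dense -> dense ndarray in scipy
    times.append(time.perf_counter() - t0)

print("Mean Runtime: ", np.mean(times))
print("Std Runtime: ", np.std(times))
```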

##################### python-graphblas (Default Setting - 32 Threads - CSR Format * BitmapC Format) #####################
Convert Scipy Sparse Format -> python-graphblas Format (s) 0.6137485504150391
Sparse * Sparse computing time (s) 2.929335117340088
Sparse * Sparse computing time (s) 2.9536073207855225
Sparse * Sparse computing time (s) 2.9327011108398438
Sparse * Sparse computing time (s) 2.946722984313965
Sparse * Sparse computing time (s) 2.942711114883423
Sparse * Sparse computing time (s) 2.959540843963623
Sparse * Sparse computing time (s) 2.9557604789733887
Sparse * Sparse computing time (s) 2.946617364883423
Sparse * Sparse computing time (s) 2.9554543495178223
Sparse * Sparse computing time (s) 2.9580225944519043
Sparse * Sparse computing time (s) 2.9371848106384277
Sparse * Sparse computing time (s) 2.9504570960998535
Sparse * Sparse computing time (s) 2.956547498703003
Sparse * Sparse computing time (s) 2.9532806873321533
Sparse * Sparse computing time (s) 2.947199821472168
Sparse * Sparse computing time (s) 2.952831745147705
Sparse * Sparse computing time (s) 2.950216293334961
Sparse * Sparse computing time (s) 2.950721025466919
Sparse * Sparse computing time (s) 2.958610773086548
Sparse * Sparse computing time (s) 2.952730894088745
Mean Runtime:  2.9495126962661744
Std Runtime:  0.008198427827772714

##################### python-graphblas (Default Setting - 32 Threads - CSR Format * FullC Format) #####################
Convert Scipy Sparse Format -> python-graphblas Format (s) 0.5988132953643799
Sparse * Sparse computing time (s) 1.73736572265625
Sparse * Sparse computing time (s) 1.744694709777832
Sparse * Sparse computing time (s) 1.7468664646148682
Sparse * Sparse computing time (s) 1.7360756397247314
Sparse * Sparse computing time (s) 1.7450006008148193
Sparse * Sparse computing time (s) 1.7598528861999512
Sparse * Sparse computing time (s) 1.7438948154449463
Sparse * Sparse computing time (s) 1.7407243251800537
Sparse * Sparse computing time (s) 1.7503554821014404
Sparse * Sparse computing time (s) 1.740696907043457
Sparse * Sparse computing time (s) 1.757706642150879
Sparse * Sparse computing time (s) 1.7401001453399658
Sparse * Sparse computing time (s) 1.7460541725158691
Sparse * Sparse computing time (s) 1.7468533515930176
Sparse * Sparse computing time (s) 1.751969337463379
Sparse * Sparse computing time (s) 1.7436254024505615
Sparse * Sparse computing time (s) 1.7455754280090332
Sparse * Sparse computing time (s) 1.7475886344909668
Sparse * Sparse computing time (s) 1.7609961032867432
Sparse * Sparse computing time (s) 1.7456836700439453
Mean Runtime:  1.7465840220451354
Std Runtime:  0.006642586857441714
##################### MKL (Default Settings - 16 Threads - Sparse CSR Format * Dense Numpy Format) #####################
Sparse * Sparse computing time (s) 0.6445832252502441
Sparse * Sparse computing time (s) 0.6223111152648926
Sparse * Sparse computing time (s) 0.6190366744995117
Sparse * Sparse computing time (s) 0.6191632747650146
Sparse * Sparse computing time (s) 0.617084264755249
Sparse * Sparse computing time (s) 0.6194536685943604
Sparse * Sparse computing time (s) 0.6174299716949463
Sparse * Sparse computing time (s) 0.6181790828704834
Sparse * Sparse computing time (s) 0.6196620464324951
Sparse * Sparse computing time (s) 0.6216928958892822
Sparse * Sparse computing time (s) 0.6174044609069824
Sparse * Sparse computing time (s) 0.6212775707244873
Sparse * Sparse computing time (s) 0.6180646419525146
Sparse * Sparse computing time (s) 0.6177294254302979
Sparse * Sparse computing time (s) 0.6187028884887695
Sparse * Sparse computing time (s) 0.6197123527526855
Sparse * Sparse computing time (s) 0.6219933032989502
Sparse * Sparse computing time (s) 0.6392252445220947
Sparse * Sparse computing time (s) 0.6407394409179688
Sparse * Sparse computing time (s) 0.6248526573181152

Mean Runtime:  0.6229149103164673
Std Runtime:  0.008091237575214866

Could you please suggest a GraphBLAS format or approach for this problem? Thank you so much in advance!

@eriknw (Member) commented Jan 8, 2025

How interesting, thanks for sharing!

I don't believe SuiteSparse:GraphBLAS currently uses MKL or any dense acceleration libraries for dense or bitmap operations, but perhaps someday it could. CC @DrTimothyAldenDavis, who may have more insight.

In Python, you can enable the SuiteSparse:GraphBLAS "burble" diagnostics output with

import graphblas as gb
gb.ss.burble.enable()
# or as a context manager, such as
# with gb.ss.burble:
#     ...

This will print many details, including dtypes, which internal methods are used, and timings. When optimizing performance, one thing to watch for is "Generic" operations (i.e., if the burble output contains "Generic"), which occur for kernels that are neither pre-compiled nor JIT-compiled (btw, the SuiteSparse JIT is currently disabled by default until we get GraphBLAS/python-suitesparse-graphblas#118 merged).

> Hi, thank you so much for a great project.

Thanks, I'm glad you find it useful ❤️!

> May I ask a small question related to sparse and dense matrix operation?

Of course! We're a little busy at times and may not always reply promptly, but we try :)

Possibly related: #552

@DrTimothyAldenDavis

I don't use the Intel sparse MKL for sparse-times-dense. I tried to, with Intel's help, but the sparse MKL was slower than any of my methods; I've benchmarked the two libraries.

The only time I could get a speedup from an MKL library is in the dense-times-dense methods, in the dense BLAS (dgemm, sgemm, cgemm, and zgemm), for the PLUS_TIMES semirings on those 4 data types.

@DrTimothyAldenDavis

The timings you posted above are not enough for me to know what you are doing. Are you computing just y = A*x, where A is a sparse matrix and x is a dense vector or matrix?

If so, then you are not getting the fastest GraphBLAS kernel, because I must assume that y could be SPARSE, not DENSE. The Intel MKL can assume y must be dense; I cannot do that, per the GraphBLAS spec. I cannot fill in y with zeros, and y can be sparse if A has any empty rows.

If you want to mimic what the MKL does, you want to compute as follows:

y = 0 ;       // a dense vector of all zeros
y = y + A*x ; // using GrB_mxv or GrB_mxm with a PLUS accum operator

THEN I can know that y must be dense on output, even if A has one or more empty rows with no entries at all, and I use a different algorithm inside. The burble should say "saxpy4", "saxpy5", "dot4", or "dot5". If it says "saxpy3" or "dot2", then you are getting a slower kernel; those two kernels compute the sparsity of y.

Without the ACCUM operator, I must compute the sparsity pattern of y, which is costly. If I then find it to be dense, I drop the pattern and you get a dense vector y, but it took extra time to do so.
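Tim's two-line recipe can be sketched with numpy/scipy standing in for GraphBLAS (a sketch only: scipy does not track the sparsity of the output, so this just illustrates the values involved; in python-graphblas the accumulation step would be spelled something like `Y(gb.binary.plus) << ...`, which is an assumption here, not verified against a specific version):

```python
import numpy as np
import scipy.sparse as sp

# A has a completely empty row (row 1), so per the GraphBLAS spec the
# product y = A*x may be sparse: row 1 of y has no entry, not a zero.
A = sp.csr_matrix(np.array([[1.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0],
                            [0.0, 0.0, 2.0]]))
x = np.ones(3)

# MKL-style recipe: start from an explicit dense vector of zeros and
# accumulate, so the output is known to be dense before the multiply.
y = np.zeros(3)   # y = 0 ;       a dense vector of all zeros
y += A @ x        # y = y + A*x ; mxv with a PLUS accum operator

print(y)
```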

Show me the burble output. I would also need to know whether the matrices are held by row or by column (that is apparent in the burble output).

@DrTimothyAldenDavis

> Where X is a sparse matrix and C is a dense one. In theory, C should be stored in dense format in order to make the operation to be efficient instead of sparse one. However, using bitmap/dense matrix provided by GraphBLAS is not as efficient as compared to MKL code for sparse * dense operation

If the matrix C is missing even a single entry, then I cannot store it in a dense format. Per the spec, I am not permitted to fill in zeros. What is a zero? In a graph algorithm it means "add an edge here with weight zero", but the existence of an edge with weight zero is not the same thing as "there is no edge here".

The Intel MKL can make this assumption because it only has one semiring to deal with. I cannot do it. The sparsity pattern of the result must be preserved.

If, on the other hand, you want to give me a matrix C that has been filled in with explicit zeros so that it is now a dense matrix, that is fine. I just cannot do that myself inside GraphBLAS; it would break the contract with the user. In some cases I would have to fill in with +infinity, -infinity, etc., and the fill-in value would change the moment you use another semiring (which happens all the time in graph algorithms).
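In scipy terms, "filling in with explicit zeros" is just materializing the sparse matrix as a dense ndarray before handing it to GraphBLAS (a sketch; the import into a full-format GraphBLAS matrix is left out, since the exact call depends on the python-graphblas version):

```python
import numpy as np
import scipy.sparse as sp

C = sp.random(500, 100, density=0.63, format="csr", random_state=0)

# toarray() stores an explicit 0.0 at every missing position -- a choice
# only the user can make, since the "zero" value is semiring-dependent.
C_dense = C.toarray()

print(C_dense.shape)
```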
