parallelism causes problems with very large datasets #2

scrollsaw · 2019-06-06T15:21:32Z

An issue I've noticed with large data sets: once the input gets very large (> 1.5 million rows in my case), SQL Server chooses a parallel plan for the query. The cluster connections are then not calculated correctly, and you only get a fraction of the clusters back. I'm guessing this is because, when running in parallel, each chunk of the query doesn't know about the others. You can reproduce it by running the query against larger and larger data sets until SQL Server produces a parallel plan.

A workaround is to add OPTION (MAXDOP 1) to the query, like so:

SELECT dbo.TCC(id1, id2) FROM dbo.TestData OPTION (MAXDOP 1);

This forces the query to run serially, and the clusters are then returned correctly.
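To check whether a given run is actually hitting the parallel case, something like the following might help. It is only a sketch: it reuses the dbo.TCC and dbo.TestData names from the query above, and it assumes the table is large enough for the optimizer to pick a parallel plan on its own, which depends on the server's settings and hardware.

```sql
-- Sketch only: dbo.TCC and dbo.TestData are the names used in the query above.
-- Return the actual execution plan (as XML) alongside each result set.
SET STATISTICS XML ON;

-- 1) Let the optimizer decide; on a large enough table it may go parallel.
SELECT dbo.TCC(id1, id2) FROM dbo.TestData;

-- 2) Same query, forced to run serially.
SELECT dbo.TCC(id1, id2) FROM dbo.TestData OPTION (MAXDOP 1);

SET STATISTICS XML OFF;
```

If the plan for the first query contains a Parallelism operator (for example Gather Streams) and its result differs from the second query's, you are seeing the problem described here.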

yorek · 2019-06-06T20:34:49Z

Thanks a lot for reporting this. With parallelism, SQL Server uses the Merge method to combine two partial results into one. I'll try to run some tests as soon as possible to figure out what's not working. In the meantime, thanks for the workaround!