I don't have the hardware on hand to test these algorithms, but here's the first distributed sorting algorithm that comes to mind. I'm writing it down in case anyone wants to implement it. (Since this is the first algorithm that came to mind, it has probably been come up with countless times before; I make no claim of inventing it.)
Let n be the number of elements being sorted, k the number of cores/nodes/machines, and l the amount of memory each node has. Assume that k is a power of two and let K = log2(k).
This algorithm is designed for the case when 3*k^2 < n < k*l/3. For k = 32 and l = 256MB, sorting 64-bit records, that means roughly 3,000 < n < 360,000,000.
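For concreteness, the bounds can be checked numerically (a sketch assuming MiB units and 8-byte records; the variable names are mine):

```julia
# Plugging in k = 32 nodes, l = 256 MiB of memory per node, 8-byte records.
k = 32
l = 256 * 2^20 ÷ 8      # memory per node, measured in records
lower = 3 * k^2         # smallest n the scheme is designed for
upper = k * l ÷ 3       # largest n: each node's share fits in a third of memory
println((lower, upper)) # roughly 3 thousand to 360 million
```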
1. Begin with the data partitioned evenly among nodes.
2. Perform a single K-bit pass of MSB radix sort at each node.
3. Redistribute data among nodes.
4. Sort the data locally at each node.
5. The concatenation of the data stored at each node is now sorted.
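As a sanity check, the five steps can be simulated in a single process (a sketch only; the "nodes" are plain vectors, the redistribution is an array copy, and all names here are made up):

```julia
# Single-process simulation of the five steps above, for illustration only.
# "Nodes" are plain vectors and the network is replaced by array copies.
function simulate_distributed_sort(v::Vector{UInt64}, k::Integer)
    K = Int(log2(k))
    bucket(x) = Int(x >> (64 - K)) + 1   # top K bits pick the destination node
    # Step 1: data starts out partitioned evenly among the k nodes
    nodes = [v[i:k:end] for i in 1:k]
    # Steps 2-3: one K-bit MSB pass at each node, then ship bucket b to node b
    redistributed = [UInt64[] for _ in 1:k]
    for node in nodes, x in node
        push!(redistributed[bucket(x)], x)
    end
    # Step 4: sort locally at each node
    foreach(sort!, redistributed)
    # Step 5: the concatenation of the nodes' data is globally sorted
    reduce(vcat, redistributed)
end
```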
In step 2, in the event of an unradixable record type or a highly nonuniform distribution of leading bits, compute a random sample of about k^2 elements and use the k-quantiles of the sample as keys in a binary search tree to create a balanced bucket assignment.
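That fallback might look something like this (a sketch under my own naming; a sorted splitter vector plus `searchsortedlast` stands in for the binary search tree):

```julia
# Sketch of the sampling fallback: draw ~k^2 random elements, take the
# k-quantiles of the sample as splitters, and assign buckets by binary search.
function sample_splitters(v::AbstractVector, k::Integer)
    sample = sort!([v[rand(eachindex(v))] for _ in 1:k^2])
    [sample[i * length(sample) ÷ k] for i in 1:k-1]   # k-1 interior quantiles
end

# Bucket index in 1:k; splitters must be sorted.
assign_bucket(splitters, x) = searchsortedlast(splitters, x) + 1
```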
Naively, the redistribution requires all-pairs connectedness and O(k) time: each node sends the appropriate chunk of data to each other node. This can be accomplished with message passing, or with shared memory plus a preliminary exchange of the counts of how many elements move from each node to each other node.
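A shared-memory version of that count exchange could be sketched as follows (hypothetical names; `bucket` is the assignment produced by the radix pass or the sampling fallback):

```julia
# Sketch of the naive redistribution. Phase 1 exchanges counts so each
# receiver can preallocate; phase 2 moves every element to its target node.
function redistribute(node_data::Vector{Vector{T}}, bucket) where T
    k = length(node_data)
    # Phase 1: counts[i, j] = how many elements node i sends to node j
    counts = [count(x -> bucket(x) == j, node_data[i]) for i in 1:k, j in 1:k]
    received = [Vector{T}(undef, sum(counts[:, j])) for j in 1:k]
    # Phase 2: each node writes its chunks into the receivers' buffers
    offsets = ones(Int, k)
    for i in 1:k, x in node_data[i]
        j = bucket(x)
        received[j][offsets[j]] = x
        offsets[j] += 1
    end
    received
end
```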
I prefer radix sort for the within-node sorting whenever possible, even for highly nonuniform bit distributions. Another algorithm may be necessary for unradixable types, or unconventional hardware. For example, if each node is a computer with a GPU, then the local sorting may use this very algorithm!
This algorithm also serves as an external single- or multithreaded sorting algorithm. In that case, each "node" is a region of disk space, and the regions are loaded into memory sequentially; provisionally, though, I prefer an external-sort-specific version for that case. As a TODO, it would be nice to unify these two (or even unify both with standard single-threaded internal sorting).
```julia
function msb!(v, bits=1, n=2^bits, key=x -> x >> (64 - bits) + 1)
    # Indices point to the next thing to go:
    # for a source, the next element to take out;
    # for a sink, the next place to put in.
    chunk = length(v) ÷ n^2
    source = similar(v, chunk * n)
    sink = similar(v, chunk * n)
    counts = zeros(UInt, n + 1)
    counts[1] = firstindex(v)
    for x in v
        counts[key(x) + 1] += 1
    end
    cumsum!(counts, counts)
    original_counts = copy(counts)
    sink_limits = chunk:chunk:n*chunk
    sink_indices = collect(1:chunk:n*chunk)
    @assert length(sink_indices) == length(sink_limits)
    source_index = lastindex(source) + 1
    for k in 1:n
        amount = min(chunk, original_counts[k+1] - counts[k])
        copyto!(source, source_index - amount, v, counts[k], amount)
        source_index -= amount
    end
    while true
        x = source[source_index]
        source_index += 1
        k = key(x)
        sink_index = sink_indices[k]
        sink[sink_index] = x
        sink_indices[k] += 1
        source_index <= lastindex(source) || break
        if sink_index >= sink_limits[k]
            # sink bucket is full: copy it to v and copy back some more elements from v
            copyto!(v, counts[k], sink, sink_index - chunk + 1, chunk)
            sink_indices[k] -= chunk
            counts[k] += chunk
            # there may not be a full chunk to copy back
            amount = min(chunk, original_counts[k+1] - counts[k])
            copyto!(source, source_index - amount, v, counts[k], amount)
            source_index -= amount
        end
    end
    # copy the rest of the sink into the original vector
    for i in 1:n
        start = sink_limits[i] - chunk + 1
        copyto!(v, counts[i], sink, start, sink_indices[i] - start)
    end
    v, original_counts
end

export msb_sort!
function msb_sort!(v, args...; kw...)
    _, c = msb!(v, args...; kw...)
    buffer = similar(v, 0)
    for i in firstindex(c):lastindex(c)-1
        sort!(view(v, c[i]:c[i+1]-1); buffer)
    end
    v
end
```
Asymptotic analysis (assuming radixable keys and reasonably balanced most significant bits) gives O(n/k) time, which is pretty good (the same as the distributed runtime of sum).