radixsort much slower than buit-in sort (regarding string sorting) #2

rapus95 · 2018-03-05T23:40:51Z

lines = open("f.txt") do file
    readlines(file) #has approx 4.3m lines
end
@time sort(lines)
@time radixsort(lines)

delivers the following output:

2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time)
18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time)

The text was updated successfully, but these errors were encountered:

xiaodaigh · 2018-03-05T23:44:48Z

Interesting can you let me know the length of each line? If very long then it will be slow I think.

…

On 6 Mar. 2018 10:40 am, "Aaron" ***@***.***> wrote: lines = open("f.txt") do file readlines(file) #has approx 4.3m ***@***.*** ***@***.*** radixsort(lines) delivers the following output: 2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time) 18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#2>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AESfJXwa63RlcjBqZzrXLXd0LBiwRQjJks5tbc0DgaJpZM4Sd27u> .

xiaodaigh · 2018-03-05T23:45:19Z

Please also run it twice.

…

On 6 Mar. 2018 10:40 am, "Aaron" ***@***.***> wrote: lines = open("f.txt") do file readlines(file) #has approx 4.3m ***@***.*** ***@***.*** radixsort(lines) delivers the following output: 2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time) 18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#2>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AESfJXwa63RlcjBqZzrXLXd0LBiwRQjJks5tbc0DgaJpZM4Sd27u> .

rapus95 · 2018-03-05T23:50:22Z

longest entry is 212 chars long
but only ~8k out of 4.3m entries are longer than 50 chars...

xiaodaigh · 2018-03-05T23:53:45Z

Maybe I should just add in some comments and give a warning radixsort in its current implemebtation will be slow for strings longer than 16bytes. Apologies, no way around it at the moment. Should just use quicksort which should be the faster than r and python. I assume your strings are mostly unique? That is also not gonna be that fast with radixsort

…

On 6 Mar. 2018 10:50 am, "Aaron" ***@***.***> wrote: longest entry is 212 chars long but only ~8k out of 4.3m entries are longer than 50 chars... — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#2 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AESfJaJRFv8K9T9w5vgRUf3aZ-Cpu8q1ks5tbc8-gaJpZM4Sd27u> .

rapus95 · 2018-03-05T23:56:40Z

unique approx after the first 10 characters. Is there an easy way to sort only after the first 10 characters? alias you might want to extend the radixsort function by something similar to the "by" keyword :D

xiaodaigh · 2018-03-06T00:02:35Z

I need to think. You can use the by parameter and use quicksort

…

On 6 Mar. 2018 10:56 am, "Aaron" ***@***.***> wrote: unique approx after the first 10 characters. Is there an easy way to sort only after the first 10 characters? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#2 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AESfJarnJy85mbfrOTEpAMQ-BO13yIk3ks5tbdC5gaJpZM4Sd27u> .

rapus95 · 2018-03-06T00:07:06Z

for the Base.sort it results in that pretty bad performance while i guess your radix sort would outperform this easily...

@time sort(lines, by=x->x[1:min(end,10)])
11.629258 seconds (188.29 M allocations: 5.660 GiB, 12.06% gc time)
@time sort(lines)
 2.552390 seconds (9 allocations: 49.539 MiB, 0.22% gc time)

xiaodaigh · 2018-03-06T00:17:36Z

You sorted upto first 10 characters. So you only want to sort by first 10 characters? Or after? If after radixsort will also be slow. It's slow at sorting long strings. Unfortunately there is no by functionality in radixsort yet. I will see if I can whip something later. Also 2 second ain't bad. Unless you need to sort by thing that are 100 times larger

…

On 6 Mar. 2018 11:07 am, "Aaron" ***@***.***> wrote: for the Base.sort it results in that pretty bad performance while i guess your radix sort would outperform this easily... @time sort(lines, by=x->x[1:min(end,10)]) 11.629258 seconds (188.29 M allocations: 5.660 GiB, 12.06% gc time) @time sort(lines) 2.552390 seconds (9 allocations: 49.539 MiB, 0.22% gc time) — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#2 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AESfJbrhd_H_702aYnHGBzh1zqmwtdxxks5tbdMqgaJpZM4Sd27u> .

rapus95 · 2018-03-06T00:23:43Z

well i got more than 100 of these files -.- you can guess that my RAM isn't big enough -.-
what i was thinking about is sorting all of these files only after the first 2 chars and then saving each beginning into its own file again. and after that sorting the smaller files by quicksort.
thus it'd be cool if there was a way to append each "bucket" to its own file. :D

xiaodaigh · 2018-03-06T05:50:24Z

honestly, i am a bit confused by the description. for each you file you want sort by the first 2 characters or sort by every character after the first 2?

Then you want to break it up into smaller files? IO is a huge bottleneck, so I don't see why that might be faster.

rapus95 · 2018-03-06T12:05:15Z

as i was saying before, i still only want to sort by the first 2 caracters, not those after the first 2!
bcs that way i want to set up an external sort:
sort every file (only for 2 chars instead of whole length should be faster)
regroup parts of the files (some sort of multiway merge)
finally sort the contents of the resulting files while maintaining a good file structure based on the first 2 chars

xiaodaigh · 2018-03-06T12:24:22Z

In that case you can just use SortingAlgorithms.jl and not use SortingLab.jl. I can get a 2x speed up.

a = randstring()
rand(2:128)
nn = abs.(randn(4_300_000))
x = randstring.(Int.(round.(nn .* 106)).+2)
@time sort(x, by = x-> x |> pointer |> Ptr{UInt16} |> unsafe_load |> hton); # 2
@time sort(x); # 4
using SortingAlgorithms
@time sort(x, by = x-> x |> pointer |> Ptr{UInt16} |> unsafe_load |> hton, alg = RadixSort); # 1.1

xiaodaigh added the enhancement New feature or request label Mar 6, 2018

rapus95 mentioned this issue Mar 6, 2018

Implement LSD radix sort JuliaCollections/SortingAlgorithms.jl#29

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

radixsort much slower than buit-in sort (regarding string sorting) #2

radixsort much slower than buit-in sort (regarding string sorting) #2

rapus95 commented Mar 5, 2018

xiaodaigh commented Mar 5, 2018 via email

xiaodaigh commented Mar 5, 2018 via email

rapus95 commented Mar 5, 2018

xiaodaigh commented Mar 5, 2018 via email

rapus95 commented Mar 5, 2018 •

edited

Loading

xiaodaigh commented Mar 6, 2018 via email

rapus95 commented Mar 6, 2018

xiaodaigh commented Mar 6, 2018 via email

rapus95 commented Mar 6, 2018 •

edited

Loading

xiaodaigh commented Mar 6, 2018

rapus95 commented Mar 6, 2018

xiaodaigh commented Mar 6, 2018 •

edited

Loading

radixsort much slower than buit-in sort (regarding string sorting) #2

radixsort much slower than buit-in sort (regarding string sorting) #2

Comments

rapus95 commented Mar 5, 2018

xiaodaigh commented Mar 5, 2018 via email

xiaodaigh commented Mar 5, 2018 via email

rapus95 commented Mar 5, 2018

xiaodaigh commented Mar 5, 2018 via email

rapus95 commented Mar 5, 2018 • edited Loading

xiaodaigh commented Mar 6, 2018 via email

rapus95 commented Mar 6, 2018

xiaodaigh commented Mar 6, 2018 via email

rapus95 commented Mar 6, 2018 • edited Loading

xiaodaigh commented Mar 6, 2018

rapus95 commented Mar 6, 2018

xiaodaigh commented Mar 6, 2018 • edited Loading

rapus95 commented Mar 5, 2018 •

edited

Loading

rapus95 commented Mar 6, 2018 •

edited

Loading

xiaodaigh commented Mar 6, 2018 •

edited

Loading