-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
radixsort much slower than buit-in sort (regarding string sorting) #2
Comments
Interesting can you let me know the length of each line? If very long then
it will be slow I think.
…On 6 Mar. 2018 10:40 am, "Aaron" ***@***.***> wrote:
lines = open("f.txt") do file
readlines(file) #has approx 4.3m ***@***.*** ***@***.*** radixsort(lines)
delivers the following output:
2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time)
18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time)
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#2>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AESfJXwa63RlcjBqZzrXLXd0LBiwRQjJks5tbc0DgaJpZM4Sd27u>
.
|
Please also run it twice.
…On 6 Mar. 2018 10:40 am, "Aaron" ***@***.***> wrote:
lines = open("f.txt") do file
readlines(file) #has approx 4.3m ***@***.*** ***@***.*** radixsort(lines)
delivers the following output:
2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time)
18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time)
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#2>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AESfJXwa63RlcjBqZzrXLXd0LBiwRQjJks5tbc0DgaJpZM4Sd27u>
.
|
longest entry is 212 chars long |
Maybe I should just add in some comments and give a warning radixsort in
its current implemebtation will be slow for strings longer than 16bytes.
Apologies, no way around it at the moment. Should just use quicksort which
should be the faster than r and python. I assume your strings are mostly
unique? That is also not gonna be that fast with radixsort
…On 6 Mar. 2018 10:50 am, "Aaron" ***@***.***> wrote:
longest entry is 212 chars long
but only ~8k out of 4.3m entries are longer than 50 chars...
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#2 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AESfJaJRFv8K9T9w5vgRUf3aZ-Cpu8q1ks5tbc8-gaJpZM4Sd27u>
.
|
unique approx after the first 10 characters. Is there an easy way to sort only after the first 10 characters? alias you might want to extend the radixsort function by something similar to the "by" keyword :D |
I need to think. You can use the by parameter and use quicksort
…On 6 Mar. 2018 10:56 am, "Aaron" ***@***.***> wrote:
unique approx after the first 10 characters. Is there an easy way to sort
only after the first 10 characters?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#2 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AESfJarnJy85mbfrOTEpAMQ-BO13yIk3ks5tbdC5gaJpZM4Sd27u>
.
|
for the Base.sort it results in that pretty bad performance while i guess your radix sort would outperform this easily...
|
You sorted upto first 10 characters. So you only want to sort by first 10
characters? Or after? If after radixsort will also be slow. It's slow at
sorting long strings.
Unfortunately there is no by functionality in radixsort yet. I will see if
I can whip something later.
Also 2 second ain't bad. Unless you need to sort by thing that are 100
times larger
…On 6 Mar. 2018 11:07 am, "Aaron" ***@***.***> wrote:
for the Base.sort it results in that pretty bad performance while i guess
your radix sort would outperform this easily...
@time sort(lines, by=x->x[1:min(end,10)])
11.629258 seconds (188.29 M allocations: 5.660 GiB, 12.06% gc time)
@time sort(lines)
2.552390 seconds (9 allocations: 49.539 MiB, 0.22% gc time)
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#2 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AESfJbrhd_H_702aYnHGBzh1zqmwtdxxks5tbdMqgaJpZM4Sd27u>
.
|
well i got more than 100 of these files -.- you can guess that my RAM isn't big enough -.- |
honestly, i am a bit confused by the description. for each you file you want sort by the first 2 characters or sort by every character after the first 2? Then you want to break it up into smaller files? IO is a huge bottleneck, so I don't see why that might be faster. |
as i was saying before, i still only want to sort by the first 2 caracters, not those after the first 2! |
In that case you can just use a = randstring()
rand(2:128)
nn = abs.(randn(4_300_000))
x = randstring.(Int.(round.(nn .* 106)).+2)
@time sort(x, by = x-> x |> pointer |> Ptr{UInt16} |> unsafe_load |> hton); # 2
@time sort(x); # 4
using SortingAlgorithms
@time sort(x, by = x-> x |> pointer |> Ptr{UInt16} |> unsafe_load |> hton, alg = RadixSort); # 1.1 |
delivers the following output:
The text was updated successfully, but these errors were encountered: