Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

radixsort much slower than buit-in sort (regarding string sorting) #2

Open
rapus95 opened this issue Mar 5, 2018 · 12 comments
Open
Labels
enhancement New feature or request

Comments

@rapus95
Copy link

rapus95 commented Mar 5, 2018

lines = open("f.txt") do file
    readlines(file) #has approx 4.3m lines
end
@time sort(lines)
@time radixsort(lines)

delivers the following output:

2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time)
18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time)
@xiaodaigh
Copy link
Owner

xiaodaigh commented Mar 5, 2018 via email

@xiaodaigh
Copy link
Owner

xiaodaigh commented Mar 5, 2018 via email

@rapus95
Copy link
Author

rapus95 commented Mar 5, 2018

longest entry is 212 chars long
but only ~8k out of 4.3m entries are longer than 50 chars...

@xiaodaigh
Copy link
Owner

xiaodaigh commented Mar 5, 2018 via email

@rapus95
Copy link
Author

rapus95 commented Mar 5, 2018

unique approx after the first 10 characters. Is there an easy way to sort only after the first 10 characters? alias you might want to extend the radixsort function by something similar to the "by" keyword :D

@xiaodaigh
Copy link
Owner

xiaodaigh commented Mar 6, 2018 via email

@rapus95
Copy link
Author

rapus95 commented Mar 6, 2018

for the Base.sort it results in that pretty bad performance while i guess your radix sort would outperform this easily...

@time sort(lines, by=x->x[1:min(end,10)])
11.629258 seconds (188.29 M allocations: 5.660 GiB, 12.06% gc time)
@time sort(lines)
 2.552390 seconds (9 allocations: 49.539 MiB, 0.22% gc time)

@xiaodaigh
Copy link
Owner

xiaodaigh commented Mar 6, 2018 via email

@rapus95
Copy link
Author

rapus95 commented Mar 6, 2018

well i got more than 100 of these files -.- you can guess that my RAM isn't big enough -.-
what i was thinking about is sorting all of these files only after the first 2 chars and then saving each beginning into its own file again. and after that sorting the smaller files by quicksort.
thus it'd be cool if there was a way to append each "bucket" to its own file. :D

@xiaodaigh
Copy link
Owner

honestly, i am a bit confused by the description. for each you file you want sort by the first 2 characters or sort by every character after the first 2?

Then you want to break it up into smaller files? IO is a huge bottleneck, so I don't see why that might be faster.

@rapus95
Copy link
Author

rapus95 commented Mar 6, 2018

as i was saying before, i still only want to sort by the first 2 caracters, not those after the first 2!
bcs that way i want to set up an external sort:
sort every file (only for 2 chars instead of whole length should be faster)
regroup parts of the files (some sort of multiway merge)
finally sort the contents of the resulting files while maintaining a good file structure based on the first 2 chars

@xiaodaigh
Copy link
Owner

xiaodaigh commented Mar 6, 2018

In that case you can just use SortingAlgorithms.jl and not use SortingLab.jl. I can get a 2x speed up.

a = randstring()
rand(2:128)
nn = abs.(randn(4_300_000))
x = randstring.(Int.(round.(nn .* 106)).+2)
@time sort(x, by = x-> x |> pointer |> Ptr{UInt16} |> unsafe_load |> hton); # 2
@time sort(x); # 4
using SortingAlgorithms
@time sort(x, by = x-> x |> pointer |> Ptr{UInt16} |> unsafe_load |> hton, alg = RadixSort); # 1.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants