Add serialization, non-tableized automaton and optimisations #9
…round half the speed for search) but requires much less memory. I have observed memory usage of 1% of the tableized version.
Add methods to create a non-tableized automaton. These are slower but require much less memory. The default method still creates a tableized automaton, so there is no impact on legacy code.
…his creates the same number of threads as there are processors available. For small numbers of patterns this makes little difference; however, when testing on my Mac Pro with 8 processors and large numbers of patterns (1,000 to 5,000), the multithreaded make uses 4-6 threads and is around 3 to 4 times faster.
Remove assert from while loop. Call notify rather than notifyAll, as we only need to wake a single thread. Don't check whether the process has finished on each thread wakeup.
replace hashmap contains and get with a single get. Note that the contains call will more often than not return false resulting in the need for a get
remove pointless assignment
…s optimising the "next()" method. Also provide methods to return all the matching patterns (and their starts and ends); the default is still to return the first pattern.
Switch allocation of start/end arrays out of next() method to start(), end() methods.
Initialise maps and lists with known size
… with ".*" and "^" do not have prefix attached.
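Two of the smaller commits above (replacing the `contains` plus `get` pair with a single `get`, and initialising maps with a known size) follow a common Java pattern. A minimal sketch, assuming a `Map<String, Integer>` that never stores null values; the class and method names are hypothetical and are not taken from the PR:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupSketch {
    // Before: two hash lookups whenever the key is present.
    static int idWithContains(Map<String, Integer> ids, String key) {
        if (ids.containsKey(key)) {
            return ids.get(key);
        }
        return -1;
    }

    // After: one lookup. A null result means "missing", which is only
    // safe because this map is assumed never to hold null values.
    static int idWithSingleGet(Map<String, Integer> ids, String key) {
        Integer id = ids.get(key);
        return id == null ? -1 : id;
    }

    // Pre-size the map when the number of entries is known up front.
    static Map<String, Integer> index(List<String> keys) {
        // HashMap grows once its size exceeds capacity * 0.75 (the default
        // load factor), so request enough capacity to avoid rehashing.
        Map<String, Integer> ids = new HashMap<>((int) (keys.size() / 0.75f) + 1);
        for (int i = 0; i < keys.size(); i++) {
            ids.put(keys.get(i), i);
        }
        return ids;
    }
}
```

Both lookup methods return the same result; the second simply avoids hashing the key twice on a hit.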
A bunch o' changes, mainly:
I have about 400 regexes that I am trying to do a multimatch for. I began using this PR/branch in the hope that it would speed up my initialization (the multithreaded init), but there appears to be some sort of infinite loop going on. Maybe it's a rogue regex? It's not apparent to me how I can determine the offending regex by looking at the multistate/multipatternautomation classes. Here's the stack trace:
Here are the regexes:
Split your 400 regexps into packs of 40 or so and it should work. It will be 10 times slower, but still between 50 and 100 times faster than looping over Java's regexp.
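The packing suggested above can be sketched as a plain list partition. This is only an illustration: `PatternPacks` and `partition` are hypothetical names, and the step that compiles each pack into its own automaton is left out because it depends on the library's API.

```java
import java.util.ArrayList;
import java.util.List;

class PatternPacks {
    // Split items into consecutive packs of at most packSize elements.
    static <T> List<List<T>> partition(List<T> items, int packSize) {
        if (packSize <= 0) {
            throw new IllegalArgumentException("packSize must be positive");
        }
        // Ceiling division gives the number of packs needed.
        List<List<T>> packs = new ArrayList<>((items.size() + packSize - 1) / packSize);
        for (int start = 0; start < items.size(); start += packSize) {
            int end = Math.min(start + packSize, items.size());
            // Copy the slice so each pack is independent of the source list.
            packs.add(new ArrayList<>(items.subList(start, end)));
        }
        return packs;
    }
}
```

With 400 patterns and a pack size of 40 this yields 10 packs, so the input is scanned once per pack, which is where the "10 times slower" figure comes from.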
@neilireson sorry for not coming back to you earlier. The pull request does not address a single issue, so I am afraid I will not merge it. Handling "^" is a very important addition. All 4 changes should be separate issues and PRs. If you want to split your PR accordingly, I'd be happy to review it and merge it in.
@fulmicoton I would have liked to create separate PRs, but every time I attempted to create a new PR it was just appended to my original one. I will attempt to overcome my Git ignorance and work out how to split the PR into separate issues.
@aantix Unfortunately there is exponential growth in the time taken to create the multiregexp automaton. Most of my "patterns" are actually just simple strings, so it was only taking me a minute to create a ~3,000 pattern automaton; for 20k patterns, however, I reckon it would take around 8 days. I haven't tried the approach with more complex regex patterns. To overcome the time limitations I divided my 20k+ patterns into sets and generated multiple multiregexp automata. By the way, almost all of the memory is needed for generating the lookup table, which speeds up matching (I think it about halves the time). However, if you set the tablized parameter to false, the memory requirements are about two orders of magnitude lower, so your 1GB object will be around 10MB. In the end I have taken @fulmicoton's advice: I've expanded all my patterns into (~50k) strings and I'm using an Aho-Corasick implementation (https://github.com/robert-bor/aho-corasick). I'm looking to extend it to include very simple patterns that cover my use case, i.e. ignoring multiple spaces and punctuation, folding characters into ASCII, etc.
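One way to get the "very simple patterns" mentioned above without touching the matcher is to normalise both the dictionary strings and the input text before matching. A rough sketch using only the JDK; `TextFolding.fold` is a hypothetical helper and is not part of multiregexp or the linked Aho-Corasick library:

```java
import java.text.Normalizer;

class TextFolding {
    // Fold text so that accents, punctuation and repeated spaces
    // no longer affect plain string matching.
    static String fold(String input) {
        // Decompose accented characters into base letter + combining mark,
        // then drop the marks. Letters with no decomposition (for example
        // "ø" or "ß") are left unchanged by this step.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Replace ASCII punctuation with a space so neighbouring words
        // do not run together.
        String withoutPunct = withoutMarks.replaceAll("\\p{Punct}+", " ");
        // Collapse whitespace runs into a single space and trim the ends.
        return withoutPunct.replaceAll("\\s+", " ").trim();
    }
}
```

Folding changes string lengths, so match offsets reported against the folded text do not map directly back to the original input; keeping an index map would be needed if original positions matter.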
@neilireson I think it is reasonable to put all of your strings aside, treat them with Aho-Corasick, and handle the remaining regular expressions with multiregexp. Ideally the library would do that for you. In the case of @aantix, his use case requires regular expressions, so I believe using multiregexp makes sense. Unfortunately the automaton explodes in memory. Packing the queries should work.
The automata I create can be large (1GB+) and can take a long time to create on slower systems, so it's handy to be able to serialize them so that the creation process only happens once.
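The build-once, load-later idea can be sketched with standard Java object serialization. This is an assumption about the general approach, not the PR's actual code: `ToyAutomaton` is a hypothetical stand-in for the real automaton class, and `AutomatonStore` is a hypothetical helper.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in: a transition table indexed by [state][symbol].
class ToyAutomaton implements Serializable {
    private static final long serialVersionUID = 1L;
    final int[][] transitions;

    ToyAutomaton(int[][] transitions) {
        this.transitions = transitions;
    }
}

class AutomatonStore {
    // Write the automaton to disk after the expensive build step.
    static void save(ToyAutomaton automaton, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(automaton);
        }
    }

    // Read a previously saved automaton instead of rebuilding it.
    static ToyAutomaton load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (ToyAutomaton) in.readObject();
        }
    }
}
```

Java serialization should only be used to read files from a trusted source, and a saved file may stop loading if the class layout changes between versions.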