Add serialization, non-tableized automaton and optimisations #9

neilireson · 2016-04-13T13:06:35Z

The automata I create can be large (1GB+) and can take a long time to build on slower systems, so it's handy to be able to serialize them so that the build only happens once.
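A minimal sketch of the build-once, load-later idea using standard Java serialization. `CompiledAutomaton` and `AutomatonCache` are hypothetical stand-ins for illustration only; they are not classes from this PR, and the PR's own serialization code may look quite different.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

class AutomatonCache {
    // Hypothetical stand-in for an automaton that is expensive to build.
    static class CompiledAutomaton implements Serializable {
        private static final long serialVersionUID = 1L;
        final int[] transitions;

        CompiledAutomaton(int[] transitions) {
            this.transitions = transitions;
        }
    }

    // Write the built automaton to disk so later runs can skip the build.
    static void save(CompiledAutomaton automaton, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(automaton);
        }
    }

    // Read a previously saved automaton back into memory.
    static CompiledAutomaton load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (CompiledAutomaton) in.readObject();
        }
    }
}
```

A caller would typically check whether the file exists, load it if so, and otherwise build and save.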

…round half the speed for search) but requires much less memory. I have observed memory usage of 1% the tableized version.

neilireson · 2016-04-13T22:57:43Z

Add methods to create non-tableized automaton. These are slower but require much less memory. The default method still creates a tableized automaton so there is no impact on legacy code.

…his creates the same number of threads as there are processors available. For small numbers of patterns this makes little difference; however, in testing on my Mac Pro with 8 processors and large numbers of patterns (1,000 - 5,000), the multithreaded make uses 4-6 threads and is around 3 to 4 times faster.

remove assert from while loop; call notify rather than notifyAll as we only need to wake a single thread; don't check whether the process has finished on each thread wakeup

replace hashmap contains and get with a single get. Note that the contains call will more often than not return false resulting in the need for a get
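The single-lookup pattern can be sketched as below. This is an illustration under assumptions, not the PR's code: `StateIds` and `idFor` are hypothetical names, and the trick is only safe when the map never stores `null` values.

```java
import java.util.HashMap;
import java.util.Map;

class StateIds {
    // Return the id already assigned to a state, or assign the next free id.
    // A single get() replaces the containsKey() + get() pair: a null result
    // means the key is absent (valid because no null values are ever stored).
    static int idFor(Map<String, Integer> ids, String state) {
        Integer id = ids.get(state);
        if (id == null) {
            id = ids.size();
            ids.put(state, id);
        }
        return id;
    }

    public static void main(String[] args) {
        Map<String, Integer> ids = new HashMap<>();
        System.out.println(idFor(ids, "q0")); // first state seen
        System.out.println(idFor(ids, "q1")); // second state seen
        System.out.println(idFor(ids, "q0")); // same id as the first call
    }
}
```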

remove pointless assignment

…s optimising the "next()" method. Also provide methods to return all the matching patterns (and their starts and ends); the default is still to return the first pattern.

Switch allocation of start/end arrays out of next() method to start(), end() methods.

Initialise maps and lists with known size

… with ".*" and "^" do not have prefix attached.

neilireson · 2016-04-19T14:39:10Z

A bunch o' changes, mainly:

Add multithreaded make to MultiPatternAutomaton (this is set as the default)
MultiPatternSearcher next() now only finds matches; all computation of pattern start and end positions has moved to start() and end().
Bug fix for patterns starting with "^"
A bunch of small optimisations

aantix · 2016-05-07T18:31:45Z

I have about 400 regexes that I am trying to do a multimatch for; I began to utilize this PR/branch as I hoped it would speed up my initialization process (the multithreaded init).

But there appears to be some sort of infinite loop going on? Maybe it's a rogue regex? It's not apparent to me how I can determine the offending regex by looking at the MultiState/MultiPatternAutomaton classes.

Here's the stack trace:

Exception in thread "Thread-8" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.fulmicoton.multiregexp.MultiState.step(MultiState.java:40)
    at com.fulmicoton.multiregexp.MultiPatternAutomaton$1.run(MultiPatternAutomaton.java:105)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.fulmicoton.multiregexp.MultiState.step(MultiState.java:40)
    at com.fulmicoton.multiregexp.MultiPatternAutomaton$1.run(MultiPatternAutomaton.java:105)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-7" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.fulmicoton.multiregexp.MultiState.step(MultiState.java:40)
    at com.fulmicoton.multiregexp.MultiPatternAutomaton$1.run(MultiPatternAutomaton.java:105)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.fulmicoton.multiregexp.MultiState.step(MultiState.java:40)
    at com.fulmicoton.multiregexp.MultiPatternAutomaton$1.run(MultiPatternAutomaton.java:105)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-5" java.lang.OutOfMemoryError: GC overhead limit exceeded
^C^C^C^C^C^C^C^CException in thread "Thread-6" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-9" java.lang.OutOfMemoryError: GC overhead limit exceeded
16/05/07 13:17:49 WARN util.ShutdownHookManager: ShutdownHook '' failed, java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

Here are the regexes:

(?:<link[^>]+components/bitrix|(?:src|href)=\"/bitrix/(?:js|templates))
1c-bitrix
(?:twlh(?:track)?\\.asp|3d_upsell\\.js)
<div class=\"[^\"]*parbase
<div[^>]+data-component-path=\"[^\"+]jcr:
/etc/designs/
ametys\\.js
(?:Powered by <a href=\"[^>]+BIGACE|<!--\\s+Site is running BIGACE)
Built upon the <a href=\"[^>]+banshee-php\\.org/\">[a-z]+</a>(?:v([\\d.]+))?\\;
<!-- BC_OBNW -->
CatalystScripts
<link [^>]+Cargo feed
/cargo\\.
concrete/js/
<!--[^>]+powered by (?:TYPOlight|Contao)[^>]*-->
<link[^>]+(?:typolight|contao)\\.css
<(?:link [^>]*href|img [^>]*src)=\"/polopoly_fs/
<!-- by DotNetNuke Corporation
<!-- DNN Platform
/js/dnncore\\.js
/js/dnn\\.js
<a[^>]+Site Powered by DTG
dedeajax
<(?:link|style)[^>]+sites/(?:default|all)/(?:themes|modules)/
drupal\\.js
<!--[^>]+FlexCMP[^>v]+v\\. ([\\d.]+)\\;
<!--\\s+Powered by GX
/graffiti\\.js
<img[^>]+/dsresource\\?objectid=
 <[^>]+/binaries/(?:[^/]+/)*content/gallery/
include/linkexternal\\.js
<!-- CSS InProces Portaal default -->
brein/inproces/website/websitefuncties\\.js
<(?:link|a href) [^>]+ndxz-studio
Powered by\\s+(?:CERN )?<a href=\"http://(?:cdsware\\.cern\\.ch/indico/|indico-software\\.org|cern\\.ch/indico)\">(?:CDS )?Indico( [\\d\\.]+)?\\;
(?:<div[^>]+id=\"wrapper_r\"|<[^>]+(?:feed|components)/com_|<table[^>]+class=\"pill)\\;confidence:50
<!--[^>]+This website is powered by Koala Web Framework CMS
<html lang=\"en\" class=\"k-source-essays k-lens-essays\">
<!--\\s+KOKEN DEBUGGING
koken(?:\\.js\\?([\\d.]+)|/storage)\\;
<!--[^K>-]+Koobi ([a-z\\d.]+)\\;
/Kooboo
kotisivukone(?:\\.min)?\\.js
<!-- Lightmon Engine Copyright Lightmon
 <a [^>]+Powered by Lithium
<link[^>]*/sites/[a-z\\d]{24}/theme/stylesheets
<a[^>]+>Powered by MODx</a>
<(?:link|script)[^>]+assets/snippets/\\;confidence:20
<!-- Methode uuid: \"[a-f\\d]+\" ?-->
(?:<script|link)[^>]*mg-(?:core|plugins|templates)
monotracker(?:\\.min)?\\.js
<link[^>]* href=[^>]+/web/css/(?:web\\.assets_common/|website\\.assets_frontend/)\\;confidence:25
/web/js/(?:web\\.assets_common/|website\\.assets_frontend/)\\;confidence:25
<link href=\"/opencms/
opencms
<!--[^>]+published by Open Text Web Solutions
ophal\\.js
Powered by <a href=\"[^>]+php-fusion
<[^>]+class=\"perc-region\"
<span[^>]+id=\"xvotes-0
<div class=\"posterous
<a href=\"[^>]+opensolution\\.org/\">CMS by
<html[^>]+xmlns:change=
<img[^>]+_tcm\\d{2,3}-\\d{6}\\.
/sim(?:site|core)/js
Powered by <a href=\"[^>]+SilverStripe
<img[^>]+src=\"[^>]*/~/media/[^>]+\\.ashx
<[^>]+/smartsite\\.(?:dws|shtml)\\?id=
<div class='dynamicDiv' id='dd\\.\\d\\.\\d'>
<!--\\s+Running (?:MySource|Squiz) Matrix
<(?:script[^>]+ src|link[^>]+ href)=[^>]+typo3temp/
<html[^>]+xmlns:typo3=\"[^\"]+Flow/Packages/Neos/
<(?:link|style|script)[^>]+/assets/frontOffice/
<[^>]*type=[^>]text\\/vnd\\.tiddlywiki
(?:/|_)tiki
powered by <a href=[^>]+umbraco
/js/ushahidi\\.js$
<[^>]+=\"vgn-?ext
cdn\\d+\\.editmysite\\.com
static\\.wixstatic\\.com
(?:<a href=\"[^>]+wolfcms\\.org[^>]+>Wolf CMS(?:</a>)? inside|Thank you for using <a[^>]+>Wolf CMS)
<link rel=[\"']stylesheet[\"'] [^>]+wp-(?:content|includes)
<link[^>]+s\\d+\\.wp\\.com
/wp-includes/
actionheroClient\\.js
[^a-z\\d]e107\\.js
<link[^>]*/papaya-themes/
Powered by <a href=\"[^\"]+phpwind\\.net
<a[^>]+>Powered by uKnowva</a>
/media/conv/js/jquery.js
powered by <a href=\"[^>]+viennacms

ping\\.src = node\\.href;\\s+[^>]+\\s+}\\s+</script>
<a href=\"[^>]+woltlab\\.com[^<]+<strong>Burning Board
Powered by (?:<strong>)?<a href=\"[^>]+fluxbb
<link[^>]+ipb_[^>]+\\.css
jscripts/ips_
<a href=\"[^\"]+minibb[^<]+</a>[^<]+\n<!--End of copyright link
(?:<script [^>]+\\s+<!--\\s+lang\\.no_new_posts|<a[^>]* title=\"Powered By MyBB)
<[^>]+Powered by PHP-Nuke
(?:<a[^>]+Powered by Reddit|powered by <a[^>]+>reddit<)
<body id=\"(?:DiscussionsPage|vanilla)
<!-- Powered by XMB
(?:jQuery\\.extend\\(true, XenForo|Forum software by XenForo&trade;|<!--XF:branding|<html[^>]+id=\"XenForo\")
Powered by <a href=\"[^>]+yabbforum
(?:Powered by <a[^>]+phpbb|<a[^>]+phpbb[^>]+class=\\.copyright|\tphpBB style name|<[^>]+styles/(?:sub|pro)silver/theme|<img[^>]+i_icon_mini|<table class=\"forumline)
Powered by <a href=\"[^>]+punbb

<!-- <h1>BigDump: Staggered MySQL Dump Importer ver\\. ([\\d.b]+)\\;
(?:<title>SQL Buddy</title>|<[^>]+onclick=\"sideMainClick\\(\"home\\.php)
(?: \\| phpMyAdmin ([\\d.]+)<\\/title>|PMA_sendHeaderLocation\\(|<link [^>]*href=\"[^\"]*phpmyadmin\\.css\\.php)\\;
(?:<title>phpPgAdmin</title>|<span class=\"appname\">phpPgAdmin)

(?:wh(?:utils|ver|proxy|lang|topic|msg)|ehlpdhtm)\\.js
(?:<!-- Generated by Doxygen ([\\d.]+)|<link[^>]+doxygen\\.css)\\;
<link[^>]+href=\"[^\"]*rdoc-style\\.css
Generated by <a[^>]+href=\"https?://rdoc\\.rubyforge\\.org[^>]+>RDoc</a> ([\\d.]*\\d)\\;
(?:<html[^>]* yuilibrary\\.com/rdf/[\\d.]+/yui\\.rdf|<body[^>]+class=\"yui3-skin-sam)
<!-- Generated by phpDocumentor

cdn\\.shop\\.pe/widget/
addthis\\.com/js/
hellobar\\.js
addtoany\\.com/menu/page\\.js
\\/assets\\/js\\/manycontacts\\.min\\.js
(?:<iframe id=\"meebo-iframe\"|Meebo\\('domReady'\\))
pub\\.mybloglog\\.com
<link [^>]*href=\"[^\"]+owl.carousel(?:\\.min)?\\.css
owl.carousel.*\\.js
widgets\\.outbrain\\.com/outbrain\\.js
w\\.sharethis\\.com/
assetscdn\\.stackla\\.com\\/media\\/js\\/widget\\/(?:[a-zA-Z0-9.]+)?\\.js
load\\.sumome\\.com

<a href=\"http://www.strato.de/\" target=\"_blank\">
<a href=\"https://ssl.mietshop.d
<div class=\"BoxContainer\">
<dd>This OnlineStore is brought to you by ViA-Online GmbH Afterbuy. Information and contribution at https://www.afterbuy.de</dd>
shop-static\\.afterbuy\\.de
Powered by <a href=\"http://www.xonic-solutions.de/index.php\" target=\"_blank\">xonic-solutions Shopsoftware</a>
core/jslib/jquery\\.xonic\\.js\\.php
Powered by <a [^>]*href=\"https?://(?:www\\.)?arastta\\.org[^>]+>Arastta
arastta\\.js
<link[^>]* href=\"^https?://edge\\.avangate\\.net/
^https?://edge\\.avangate\\.net/
<link href=[^>]+cdn\\d+\\.bigcommerce\\.com/v
cdn\\d+\\.bigcommerce\\.com/v
(?:Diese <a href=[^>]+bigware\\.de|<a href=[^>]+/main_bigware_\\d+\\.php)
&nbsp;Powered by (?:<a href=[^>]+cs-cart\\.com|CS-Cart)
.cm-noscript[^>]+</style>
clientexec\\.[^>]*\\s?=\\s?[^>]*;
cosmoshop_functions\\.js
(?:Powered by <a href=[^>]+cubecart\\.com|<p[^>]+>Powered by CubeCart)
<[^>]+demandware\\.edgesuite
<[^>]+(?:id=\"block[_-]commerce[_-]cart[_-]cart|class=\"commerce[_-]product[_-]field)
cdn\\.e-merchant\\.com
<!--\\s+FwP Systems
(?:<link [^>]*href=\"[^\\/]*\\/\\/www\\.fortune3\\.com\\/[^\"]*siterate\\/rate\\.css|Powered by <a [^>]*href=\"[^\"]+fortune3\\.com)
cartjs\\.php\\?(?:.*&)?s=[^&]*myfortune3cart\\.com
(?:<link[^>]* href=\"templates/gambio/|<a[^>]content\\.php\\?coID=\\d|<!-- gambio eof -->|<!--[\\s=]+Shopsoftware by Gambio GmbH \\(c\\))
gm_javascript\\.js\\.php
<[^>]+(?:/sys_master/|/hybr/|/_ui/desktop/)
(?:is-bin|INTERSHOP)
(?:<input[^>]+name=\"JTLSHOP|<a href=\"jtl\\.php)
js/mage
//skin/frontend/(?:default|(enterprise))\\;?Enterprise:Community
<!--[^-]*OXID eShop
(?:index\\.php\\?route=[a-z]+/|Powered By <a href=\"[^>]+OpenCart)
<[^>]+_dyncharset
<[^>]+id=\"oracle-cc\"
<a[^>]+title=\"POWERGAP
<input type=\"hidden\" name=\"shopid\"
Powered by <a\\s+[^>]+>PrestaShop
<a href=\"[^>]+opensolution\\.org/\">(?:Shopping cart by|Sklep internetowy)
<a[^>]+title=\"SEOshop
<body class=\"shopatron
<img[^>]+mediacdn\\.shopatron\\.com\\;confidence:50
mediacdn\\.shopatron\\.com
<link[^>]+=['\"]//cdn\\.shopify\\.com
<link [^>]*href=\"https?://cdn\\.myshoptet\\.com/
^https?://cdn\\.myshoptet\\.com/
<title>Shopware ([\\d\\.]+) [^<]+\\;\\;confidence:90
(?:(shopware)|/web/cache/[0-9]{10}_.+)\\.js\\;?4:5
smjslib\\.js
(?:<link[^>]*/assets/store/all-[a-z\\d]{32}\\.css[^>]+>|<script>\\s*Spree\\.(?:routes|translations|api_key))
Shopsystem von <a href=[^>]+store-systems\\.de\"|\\.mws_boxTop
uc_cart/uc_cart_block\\.js
<form [^>]*action=\"[^\"]*\\/cgi-bin\\/UCEditor\\?(?:[^\"]*&)?merchantId=[^\"]
cgi-bin\\/UCJavaScript\\?(?:[^\"]*&)?merchantid=.
<a[^>]+>Powered By VP-ASP Shopping Cart</a>
vs350\\.js
<div id=\"vmMainPage
<link [^>]*href=\"[^\"]*/vspfiles/
/volusion\\.js(?:\\?([\\d.]*))?\\;
<!-- WooCommerce
woocommerce
Powered by X-Cart(?: (\\d+))? <a[^>]+href=\"http://www\\.x-cart\\.com/\"[^>]*>\\;
<a[^>]+href=\"[^\"]*(?:\\?|&)xcart_form_id=[a-z\\d]{32}(?:&|$)
/skin/common_files/modules/Product_Options/func\\.js
<link[^>]+store\\.yahoo\\.net
(?:<!--Powered by nopCommerce|Powered by: <a[^>]+nopcommerce)
<body onload=\"window\\.defaultStatus='oscss templates';\"
(?:<a[^>]*(?:\\?|&)osCsid|Powered by (?:<[^>]+>)?osCommerce</a>|<[^>]+class=\"[^>]*infoBoxHeading)
<div class=\"copyright\">[^<]+<a[^>]+>xt:Commerce

<!--Coppermine Photo Gallery ([\\d.]+)\\;
<div id=\"gsNavBar\" class=\"gcBorder1\">
<link [^>]*href=\"[^\"]+lightbox(?:\\.min)?\\.css
lightbox.*\\.js
<link [^>]*href=\"[^/]*slimbox(?:-rtl)?\\.css
slimbox\\.js
<link [^>]*href=\"[^/]*slimbox2(?:-rtl)?\\.css
slimbox2\\.js
supersized(?:\\.([\\d.]*[\\d]))?.*\\.js\\;
<!--phpalbum ([.\\d\\s]+)-->\\;
(?:<link [^>]*href=\"[^\"]*prettyPhoto(?:\\.min)?\\.css|<a [^>]*rel=\"prettyPhoto)
jquery\\.prettyPhoto\\.js

<html[^>]* xmlns:jspwiki=
jspwiki
Powered by <a href=[^>]+atlassian\\.com/software/confluence(?:[^>]+>Atlassian Confluence</a> ([\\d.]+))?\\;
(?:<a[^>]+>Powered by MediaWiki</a>|<[^>]+id=\"t-specialpages)
moin(?:_static(\\d)(\\d)(\\d)|.+)/common/js/common\\.js\\;
<img [^>]*(?:title|alt)=\"This site is powered by the TWiki collaboration platform
(?:TWikiJavascripts|twikilib(?:\\.min)?\\.js)
<script[^>]*>[^<]*session_url:\\s*'https://session\\.wikispaces\\.com/
<\\w+[^>]*\\s+class=\"[^\"]*WikispacesContent\\s+WikispacesBs3[^\"]*\"
Powered by <a href=\"[^>]+WikkaWiki

<a[^>]+>DirectAdmin</a> Web Control Panel
common\\.js\\?plesk
<!-- cPanel

xiti\\.com/hit\\.xiti
aws\\.src = [^<]+caphyon-analytics
bugsense\\.js
bugsnag.*\\.js
src=[^>]+co2stats\\.com/propres\\.php
chartbeat\\.js
clickheat.*\\.js
static\\.getclicky\\.com
conversionlab\\.trackset\\.com/track/tsend\\.js
cetrk\\.com/pages/scripts/\\d+/\\d+\\.js
tag\\.crsspxl\\.com/s1\\.js
dtagent.*\\.js
^https?://[^\\/]+\\.google-analytics\\.com\\/(?:ga|urchin|(analytics))\\.js\\;?UA:
heap-\\d+.js
cmdatatagutils\\.js
^https?://(?:[^/]+\\.)?i(?:oam|v)wbox\\.de/
(?:api\\.intercom\\.io/api|static\\.intercomcdn\\.com/intercom\\.v1)
/jirafe\\.js
cf\\.kampyle\\.com/k_button\\.js
tracking\\.koego\\.com/end/ego\\.js
<script[^<>]*>[^]{0,128}?src\\s*=\\s*['\"]//counter\\.yadro\\.ru/hit(?:;\\S+)?\\?(?:t\\d+\\.\\d+;)?r
<!--LiveInternet counter-->
<!--/LiveInternet-->
<a href=\"http://www.liveinternet.ru/click\"
/js/al/common.js\\?[0-9_]+
mint/\\?js
api\\.mixpanel\\.com/track
netmonitor\\.fi/nmtracker\\.js
<!-- (?:Start|End) Open Web Analytics Tracker -->
optimizely\\.com.*\\.js
atgsvcs.+atgsvcs\\.js
piwik\\.js|piwik\\.php
edge\\.quantserve\\.com/quant\\.js
ruxitagentjs
<img[^>]*\\s+src=['\"]?https?://www\\.shinystat\\.com/cgi-bin/shinystat\\.cgi\\?[^'\"\\s>]*['\"\\s/>]
^https?://codice(?:business|ssl|pro|isp)?\\.shinystat\\.com/cgi-bin/getcod\\.cgi
sitemeter\\.com/js/counter\\.js\\?site=
/s[_-]code.*\\.js
snoobi\\.com/snoop\\.php
statcounter\\.com/counter/counter
tracker.js
visualpath[^/]*\\.trackset\\.it/[^/]+/track/include\\.js
w3counter\\.com/tracker\\.js
<title [^>]*lang=\"wo\">
<img[^>]+id=\"DCSIMG\"[^>]+webtrends
static\\.woopra\\.com
d\\.yimg\\.com/mi/ywa\\.js
mc\\.yandex\\.ru\\/metrika\\/watch\\.js
<iframe[^>]* (?:id=\"comscore\"|scr=[^>]+comscore)|\\.scorecardresearch\\.com/beacon\\.js|COMSCORE\\.beacon
\\.scorecardresearch\\.com/beacon\\.js|COMSCORE\\.beacon
<script[\\s\\S]*cdn\\.segment\\.com/analytics.js[\\s\\S]*script>
cdn\\.segment\\.com/analytics\\.js

<iframe src=\"[^>]+tumblr\\.com

^https?://cdn\\.alloyui\\.com/
angular(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
/([\\d.]+(?:\\-?rc[.\\d]*)*)/angular(?:\\.min)?\\.js\\;
angular.*\\.js
backbone.*\\.js
boba(?:\\.min)?\\.js
dhtmlxcommon\\.js
dataTables.*\\.js
dpd\\.js
([\\d.]+)/dojo/dojo(?:\\.xd)?\\.js\\;
\benyo\\.js
ext-base\\.js
hammer(?:\\.min)?\\.js
<[^>]*type=[^>]text\\/x-handlebars-template
handlebars(?:\\.runtime)?(?:-v([\\d.]+?))?(?:\\.min)?\\.js\\;
<[^>]*data-headjs-load
head\\.(?:core|load)(?:\\.min)?\\.js
hogan-(?:-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
([\\d.]+)/hogan(?:\\.min)?\\.js\\;
^immutable\\.(?:min\\.)?js$
lazy(?:\\.browser)?(?:\\.min)?\\.js
lodash.*\\.js
backbone\\.marionette.*\\.js
<link[^>]+__meteor-css__
MochiKit(?:\\.min)?\\.js
modernizr(?:-([\\d.]*[\\d]))?.*\\.js\\;
moment-timezone(?:\\-data)?(?:\\.min)?\\.js
moment(?:\\.min)?\\.js
mootools.*\\.js
mustache(?:\\.min)?\\.js
petrojs(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
(?:/([\\d.]+)/)?petrojs(?:\\.min)?\\.js\\;
(?:<polymer-[^>]+|<link[^>]+rel=\"import\"[^>]+/polymer\\.html\")
polymer\\.js
(?:prototype|protoaculous)(?:-([\\d.]*[\\d]))?.*\\.js\\;
ramda.*\\.js
<[^>]+data-react
react(?:\\-with\\-addons)?(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
/([\\d.]+)/react(?:\\.min)?\\.js\\;
react.*\\.js
require.*\\.js
reveal(?:\\.min)?\\.js
right\\.js
riot(?:\\+compiler)?(?:\\.min)?\\.js
rx(?:\\.\\w+)?(?:\\.compat)?(?:\\.min)?\\.js
select2.*\\.js
sencha-touch.*\\.js
snap\\.svg(?:-min)?\\.js
socket.io.*\\.js
<link[^>]+?href=\"[^\"]+sweet-alert(?:\\.min)?\\.css
sweet-alert(?:\\.min)?\\.js
TweenMax(?:\\.min)?\\.js
(?:typeahead|bloodhound)\\.(?:jquery|bundle)?(?:\\.min)?\\.js
underscore.*\\.js
vue(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
/([\\d.]+)/vue(?:\\.min)?\\.js\\;
vue.*\\.js\\;confidence:20
\bwebix\\.js
xregexp(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
/([\\d.]+)/xregexp(?:\\.min)?\\.js\\;
xregexp.*\\.js
xajax_core.*\\.js
(?:/yui/|yui\\.yahooapis\\.com)
zepto.*\\.js
basket.*\\.js\\;confidence:10
jquery(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
/([\\d.]+)/jquery(?:\\.min)?\\.js\\;
jquery.*\\.js
jquery-ui(?:-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
([\\d.]+)/jquery-ui(?:\\.min)?\\.js\\;
jquery-ui.*\\.js
math(?:\\.min)?\\.js
(?:scriptaculous|protoaculous)\\.js
spin(?:\\.min)?\\.js
yepnope-(?:-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
([\\d.]+)/yepnope(?:\\.min)?\\.js\\;
yepnope.*\\.js
^list\\.(?:min\\.)?js$

Powered by\\s+<a href=[^>]+atlassian\\.com/(?:software/jira|jira-bug-tracking/)[^>]+>Atlassian\\s+JIRA(?:[^v]*v(?:ersion: )?(\\d+\\.\\d+(?:\\.\\d+)?))?\\;
jira-issue-collector-plugin
atlassian\\.jira\\.collector\\.plugin
href=\"enter_bug\\.cgi\">
(?:<a[^>]+>Powered by Flyspray|<map id=\"projectsearchform)
<img[^>]+ alt=\"Powered by Mantis Bugtracker
Powered by <a href=\"[^>]+Redmine
<a id=\"tracpowered
Powered by <a href=\"[^\"]*\"><strong>Trac(?:[ /]([\\d.]+))?\\;

<(?:param|embed|iframe)[^>]+blip\\.tv/play
cdn\\.sublimevideo\\.net/js/[a-z\\d]+\\.js
<div[^>]+class=\"video-js+\">
zencdn\\.net/c/video\\.js
(?:<(?:param|embed)[^>]+vimeo\\.com/moogaloop|<iframe[^>]player\\.vimeo\\.com)
<(?:param|embed|iframe)[^>]+youtube(?:-nocookie)?\\.com/(?:v|embed)

<div[^>]+id=\"disqus_thread\"
disqus_url
intensedebate\\.com
<[^>]+(?:id|class)=\"livefyre
livefyre_init\\.js

^https?://api\\.captchme\\.net/
<img[^>]+\\.mollom\\.com
mollom(?:\\.min)?\\.js
^https?://api\\.solvemedia\\.com/
(?:<div[^>]+id=\"recaptcha_image|<link[^>]+recaptcha|document\\.getElementById\\('recaptcha')
(?:api-secure\\.recaptcha\\.net|recaptcha_ajax\\.js)

cufon-yui\\.js
<link[^>]* href=[^>]+font-awesome(?:\\.min)?\\.css
(?:<link[^>]* href=[^>]+glyphicons(?:\\.min)?\\.css|<img[^>]* src=[^>]+glyphicons)
<link[^>]* href=[^>]+fonts\\.(?:googleapis|google)\\.com
googleapis\\.com/.+webfont
<link[^>]* href=[^>]+ionicons(?:\\.min)?\\.css
use\\.typekit\\.com
sifr\\.js

<!-- START headerTags\\.cfm
/cfajax/
/([\\d.]+(?:\\-?rc[.\\d]*)*)/angular-material(?:\\.min)?\\.js\\;
angular-material.*\\.js
Powered by <a[^>]+href=\"https?://(?:www\\.)?cibonfire\\.com[^>]*>Bonfire v([^<]+)\\;
<input[^>]+name=\"ci_csrf_token\"\\;version:2+
(?:powered by <a[^>]+>Django ?([\\d.]+)?|<input[^>]*name=[\"']csrfmiddlewaretoken[\"'][^>]*>)\\;
<link [^>]*href=\"[^\"]+ink(?:\\.min)?\\.css
ink.*\\.js
<link[^>]*\\s+href=[^>]*styles/kendo\\.common(?:\\.min)?\\.css[^>]*/>
<link[^>]* href=\"[^\"]*materialize(?:\\.min)?\\.css
materialize(?:\\.min)?\\.js
<input[^>]+name=\"__VIEWSTATE
<link[^>]+?href=\"[^\"]+milligram(?:\\.min)?\\.css
<link[^>]+?href=\"[^\"]+penguin(?:\\.min)?\\.css
penguin(?:\\.min)?\\.js
<link[^>]+(?:([\\d.])+/)?pure(?:-min)?\\.css\\;
/assets/application-[a-z\\d]{32}/\\.js\\;confidence:50
(?:<div class=\"ui\\s[^>]+\">)\\;confidence:30
(?:<link[^>]+semantic(?:\\.css|\\.min\\.css)\">)
(?:semantic(?:\\.js|\\.min\\.js))
Powered by <a href=\"[^>]+Swiftlet
<style>/\\*!\\* Bootstrap v(\\d\\.\\d\\.\\d)\\;
<link[^>]+?href=\"[^\"]+bootstrap(?:\\.min)?\\.css
<div [^>]*class=\"[^\"]*col-(?:xs|sm|md|lg)-\\d{1,2}
(?:twitter\\.github\\.com/bootstrap|bootstrap(?:\\.js|\\.min\\.js))
uikit.*\\.js
\\.js\\.wgx$
web2py\\.js
Powered by <a href=\"http://www.yiiframework.com/\" rel=\"external\">Yii Framework</a>
<!-- ZK [\\.\\d\\s]+-->
zkau/
<div [^>]*class=\"[^\"]*(?:small|medium|large)-\\d{1,2} columns
var WCF_PATH[^>]+
WCF\\..*\\.js

fulmicoton · 2016-05-08T01:52:56Z

Split your 400 regexps into packs of 40 or so and it should work. It will be 10 times slower, but still between 50 and 100 times faster than looping over Java's regexp.
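A minimal sketch of the splitting step, assuming the patterns are held in a list. `PatternBatches` and `partition` are hypothetical helper names, not part of multiregexp; each resulting pack would then be compiled into its own automaton.

```java
import java.util.ArrayList;
import java.util.List;

class PatternBatches {
    // Split a list into consecutive packs of at most batchSize items.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batchCount = (items.size() + batchSize - 1) / batchSize;
        List<List<T>> batches = new ArrayList<>(batchCount);
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the slice so each pack is independent of the source list.
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }
}
```

With 400 patterns and packs of 40 this yields 10 packs, so the input is scanned once per pack, which is where the "10 times slower" estimate comes from.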

fulmicoton · 2016-05-08T03:20:51Z

@neilireson sorry for not coming back to you earlier.
Thank you for your pull request.

The pull request is not addressing a single issue, so I am afraid I will not merge it.

Handling "^" is a very important addition.
I think serialization and the non-tableized automaton are nice-to-haves.
Multithreading I am not too sure about.

All four should be separate issues and PRs.

If you want to split your PR accordingly, I'd be happy to review it and merge it in.
If not, I'm afraid it is going to stay in that state.

neilireson · 2016-05-10T05:36:39Z

@fulmicoton I would have liked to create different PRs, but every time I attempted to create a new PR it just appended it to my original one. I will attempt to overcome my Git ignorance and work out how to split the PR into separate issues.

neilireson · 2016-05-10T05:46:51Z

@aantix Unfortunately there is an exponential growth in the time taken to create the multiregexp automaton. Most of my “patterns” are actually just simple strings, so it was only taking me a minute to create a ~3,000 pattern automaton; for 20k patterns, however, I reckon it would take around 8 days. I haven’t tried the approach with more complex regex patterns. To overcome the time limitations I divided my 20k+ patterns into sets and generated multiple multiregexp automata.

By the way, almost all the memory requirements are for generating the lookup table, which speeds up matching (I think it about halves the time). However, if you set the tableized parameter to false the memory requirements are about two orders of magnitude less, so your 1GB object will be around 10MB.
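To illustrate the general trade-off between a lookup table and a compact transition list, here is a toy sketch. It is not how this PR or multiregexp stores transitions: `TransitionStorage`, the 256-symbol alphabet, and the array layout are all assumptions made for the example.

```java
import java.util.Arrays;

class TransitionStorage {
    static final int ALPHABET = 256; // assumed alphabet size, for illustration only
    static final int DEAD = -1;      // marker for "no transition"

    // Dense (table) form: one slot per (state, symbol) pair, one array read per step.
    static int stepDense(int[] table, int state, char symbol) {
        return table[state * ALPHABET + symbol];
    }

    // Sparse form: each state keeps only its sorted outgoing symbols and
    // the matching targets; a step costs a binary search instead of one read.
    static int stepSparse(char[][] symbols, int[][] targets, int state, char symbol) {
        int i = Arrays.binarySearch(symbols[state], symbol);
        return i >= 0 ? targets[state][i] : DEAD;
    }

    // Expand the sparse form into the dense table (symbols must be below ALPHABET).
    static int[] toDense(char[][] symbols, int[][] targets) {
        int[] table = new int[symbols.length * ALPHABET];
        Arrays.fill(table, DEAD);
        for (int state = 0; state < symbols.length; state++) {
            for (int i = 0; i < symbols[state].length; i++) {
                table[state * ALPHABET + symbols[state][i]] = targets[state][i];
            }
        }
        return table;
    }
}
```

For a three-state automaton recognising "ab", the dense table holds 768 slots while the sparse form stores just two transitions; the same effect at scale is what makes the table fast but memory-hungry.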

In the end I have taken @fulmicoton's advice: I've expanded all my patterns into (~50k) strings and I'm using an Aho-Corasick implementation (https://github.com/robert-bor/aho-corasick). I'm looking to extend it, though, to include very simple patterns that cover my use case, i.e. ignoring multiple spaces, punctuation, folding the characters into ASCII, etc.
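One way to approximate those simple patterns is to normalise both the keywords and the searched text before plain string matching. The sketch below is an assumption about how that could be done with the JDK alone; `TextFolding` and `fold` are hypothetical names, and stripping combining marks only covers characters that decompose under NFD (it will not turn "ß" or "ø" into ASCII).

```java
import java.text.Normalizer;

class TextFolding {
    // Fold accents, replace ASCII punctuation with spaces, and collapse
    // runs of whitespace so that plain string matching can ignore them.
    static String fold(String input) {
        // Decompose accented characters, then drop the combining marks.
        String unaccented = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        return unaccented
                .replaceAll("\\p{Punct}+", " ") // punctuation becomes a separator
                .replaceAll("\\s+", " ")        // collapse multiple spaces
                .trim();
    }
}
```

Match positions found this way refer to the folded text, so mapping them back to the original input would need an extra offset table.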

fulmicoton · 2016-05-10T06:20:31Z

@neilireson I think it is reasonable to put all of your strings aside and treat them with Aho-Corasick, and handle the remaining regular expressions with multiregexp. Ideally the library would do that for you.

In the case of @aantix, his use case requires regular expressions, so I believe using multiregexp makes sense. The automaton explodes in memory, unfortunately. Packing the queries should work.

neilireson added 2 commits April 13, 2016 14:01

Add serialization

37b2c95

Add ability to use non-tableized automaton. This is slower (I think a…

e23c406

…round half the speed for search) but requires much less memory. I have observed memory usage of 1% the tableized version.

neilireson closed this Apr 13, 2016

neilireson reopened this Apr 13, 2016

Spelling correction

0997a30

neilireson changed the title ~~Add serialization~~ Add serialization and non-tableized automaton Apr 14, 2016

neilireson added 11 commits April 14, 2016 19:28

Optimisations -

d627b74

remove assert from while loop; call notify rather than notifyAll as we only need to wake a single thread; don't check whether the process has finished on each thread wakeup

Optimisations -

26c90e4

replace hashmap contains and get with a single get. Note that the contains call will more often than not return false resulting in the need for a get

Optimisations -

5ab713e

remove pointless assignment

trivial - add final to variable

2a9b3f6

Only calculate pattern start and end if those methods are called, thu…

0b2a813

…s optimising the "next()" method. Also provide methods to return all the matching patterns (and their starts and ends); the default is still to return the first pattern.

Optimisation

24af33a

Switch allocation of start/end arrays out of next() method to start(), end() methods.

Optimisation

bf6e7e4

Initialise maps and lists with known size

Add exceptions to makeAutomatonWithPrefix() so that patterns starting…

1f2f207

… with ".*" and "^" do not have prefix attached.

Trivial - comment change version to 0.5.1

f612e5c

Optimisation - initialise maps and lists with known size

b9a11f6

neilireson changed the title ~~Add serialization and non-tableized automaton~~ Add serialization, non-tableized automaton and optimisations Apr 25, 2016
