Add serialization, non-tableized automaton and optimisations #9

Open
wants to merge 14 commits into
base: master

neilireson

The automata I create can be large (1 GB+) and can take a long time to build on slower systems, so it is handy to be able to serialize them so that the build only happens once.
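
As a rough illustration of the "build once, then reload" idea, here is a minimal sketch using plain Java serialization. The class name `AutomatonCache` and the method `loadOrBuild` are hypothetical and are not part of this PR; the sketch only assumes that the built object implements `Serializable`.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

// Hypothetical helper, not part of multiregexp.
public final class AutomatonCache {
    private AutomatonCache() {}

    /** Loads a previously serialized object from cacheFile, or builds and saves it. */
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T loadOrBuild(Path cacheFile, Supplier<T> builder)
            throws IOException, ClassNotFoundException {
        if (Files.exists(cacheFile)) {
            // Cache hit: skip the expensive build and read the object back.
            try (ObjectInputStream in = new ObjectInputStream(
                    new BufferedInputStream(Files.newInputStream(cacheFile)))) {
                return (T) in.readObject();
            }
        }
        // Cache miss: build once, then persist for the next run.
        T built = builder.get();
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(cacheFile)))) {
            out.writeObject(built);
        }
        return built;
    }
}
```

The cache file has to be deleted by hand whenever the pattern set changes, since this sketch does not detect stale files.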

…round half the speed for search) but requires much less memory. I have observed memory usage of 1% the tableized version.
@neilireson
Author

Add methods to create non-tableized automaton. These are slower but require much less memory. The default method still creates a tableized automaton so there is no impact on legacy code.

@neilireson neilireson closed this Apr 13, 2016
@neilireson neilireson reopened this Apr 13, 2016
@neilireson neilireson changed the title Add serialization Add serialization and non-tableized automaton Apr 14, 2016
…his creates the same number of threads as there are processors available.

For small numbers of patterns this makes little difference. However, testing on my Mac Pro with 8 processors and large numbers of patterns (1,000 - 5,000), the multithreaded make uses 4-6 threads and is around 3 to 4 times faster.
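
For readers unfamiliar with sizing a worker pool to the machine, here is a minimal sketch of "one thread per available processor". It uses an `ExecutorService` rather than the wait/notify worker threads this PR describes, so it is a stand-in technique and not the PR's implementation; `ParallelBuild` and `runAll` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of multiregexp.
public final class ParallelBuild {
    private ParallelBuild() {}

    /** Runs all tasks on a pool with one thread per available processor. */
    public static <T> List<T> runAll(List<Callable<T>> tasks)
            throws InterruptedException, ExecutionException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```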
   remove assert from while loop
   call notify rather than notifyAll as we only need to wake a single thread
   don't check for whether process has finished on each thread wakeup
   replace hashmap contains and get with a single get. Note that the contains call will more often than not return false resulting in the need for a get
  remove pointless assignment
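
The "single get" change in the commit notes above can be sketched as follows. This is an illustration only: `StateIds` and `idFor` are hypothetical names, and the sketch assumes the map never stores `null` values, which is what makes a `null` result from `get` mean "absent".

```java
import java.util.Map;

// Hypothetical helper, not part of multiregexp.
public final class StateIds {
    private StateIds() {}

    /** Returns the id for key, assigning the next free id the first time the key is seen. */
    public static int idFor(Map<String, Integer> ids, String key) {
        Integer id = ids.get(key);   // one hash lookup instead of containsKey + get
        if (id == null) {            // absent, because the map never holds null values
            id = ids.size();
            ids.put(key, id);
        }
        return id;
    }
}
```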
…s optimising the "next()" method.

Also provide methods to return all the matching patterns (and their starts and ends); the default is still to return the first pattern.
Switch allocation of start/end arrays out of next() method to start(), end() methods.
Initialise maps and lists with known size
… with ".*" and "^" do not have prefix attached.
@neilireson
Author

A bunch o' changes, mainly:

  1. Add multithreaded make to MultiPatternAutomaton (this is set as the default)
  2. MultiPatternSearcher next() only finds matches, moved all finding of pattern start and end processes to start() and end().
  3. Bug fix for patterns starting with "^"
  4. A bunch of small optimisations

@neilireson neilireson changed the title Add serialization and non-tableized automaton Add serialization, non-tableized automaton and optimisations Apr 25, 2016
@aantix

aantix commented May 7, 2016

I have about 400 regexes that I am trying to do a multimatch for; I began to use this PR/branch as I hoped it would speed up my initialization process (the multithreaded init).

But there appears to be some sort of infinite loop going on? Maybe it's a rogue regex? It's not apparent to me how I can determine the offending regex by looking at the MultiState/MultiPatternAutomaton classes.

Here's the stack trace:

Exception in thread "Thread-8" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.fulmicoton.multiregexp.MultiState.step(MultiState.java:40)
    at com.fulmicoton.multiregexp.MultiPatternAutomaton$1.run(MultiPatternAutomaton.java:105)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.fulmicoton.multiregexp.MultiState.step(MultiState.java:40)
    at com.fulmicoton.multiregexp.MultiPatternAutomaton$1.run(MultiPatternAutomaton.java:105)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-7" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.fulmicoton.multiregexp.MultiState.step(MultiState.java:40)
    at com.fulmicoton.multiregexp.MultiPatternAutomaton$1.run(MultiPatternAutomaton.java:105)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.fulmicoton.multiregexp.MultiState.step(MultiState.java:40)
    at com.fulmicoton.multiregexp.MultiPatternAutomaton$1.run(MultiPatternAutomaton.java:105)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-5" java.lang.OutOfMemoryError: GC overhead limit exceeded
^C^C^C^C^C^C^C^CException in thread "Thread-6" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-9" java.lang.OutOfMemoryError: GC overhead limit exceeded
16/05/07 13:17:49 WARN util.ShutdownHookManager: ShutdownHook '' failed, java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

Here are the regexes:

(?:<link[^>]+components/bitrix|(?:src|href)=\"/bitrix/(?:js|templates))
1c-bitrix
(?:twlh(?:track)?\\.asp|3d_upsell\\.js)
<div class=\"[^\"]*parbase
<div[^>]+data-component-path=\"[^\"+]jcr:
/etc/designs/
ametys\\.js
(?:Powered by <a href=\"[^>]+BIGACE|<!--\\s+Site is running BIGACE)
Built upon the <a href=\"[^>]+banshee-php\\.org/\">[a-z]+</a>(?:v([\\d.]+))?\\;
<!-- BC_OBNW -->
CatalystScripts
<link [^>]+Cargo feed
/cargo\\.
concrete/js/
<!--[^>]+powered by (?:TYPOlight|Contao)[^>]*-->
<link[^>]+(?:typolight|contao)\\.css
<(?:link [^>]*href|img [^>]*src)=\"/polopoly_fs/
<!-- by DotNetNuke Corporation
<!-- DNN Platform
/js/dnncore\\.js
/js/dnn\\.js
<a[^>]+Site Powered by DTG
dedeajax
<(?:link|style)[^>]+sites/(?:default|all)/(?:themes|modules)/
drupal\\.js
<!--[^>]+FlexCMP[^>v]+v\\. ([\\d.]+)\\;
<!--\\s+Powered by GX
/graffiti\\.js
<img[^>]+/dsresource\\?objectid=
 <[^>]+/binaries/(?:[^/]+/)*content/gallery/
include/linkexternal\\.js
<!-- CSS InProces Portaal default -->
brein/inproces/website/websitefuncties\\.js
<(?:link|a href) [^>]+ndxz-studio
Powered by\\s+(?:CERN )?<a href=\"http://(?:cdsware\\.cern\\.ch/indico/|indico-software\\.org|cern\\.ch/indico)\">(?:CDS )?Indico( [\\d\\.]+)?\\;
(?:<div[^>]+id=\"wrapper_r\"|<[^>]+(?:feed|components)/com_|<table[^>]+class=\"pill)\\;confidence:50
<!--[^>]+This website is powered by Koala Web Framework CMS
<html lang=\"en\" class=\"k-source-essays k-lens-essays\">
<!--\\s+KOKEN DEBUGGING
koken(?:\\.js\\?([\\d.]+)|/storage)\\;
<!--[^K>-]+Koobi ([a-z\\d.]+)\\;
/Kooboo
kotisivukone(?:\\.min)?\\.js
<!-- Lightmon Engine Copyright Lightmon
 <a [^>]+Powered by Lithium
<link[^>]*/sites/[a-z\\d]{24}/theme/stylesheets
<a[^>]+>Powered by MODx</a>
<(?:link|script)[^>]+assets/snippets/\\;confidence:20
<!-- Methode uuid: \"[a-f\\d]+\" ?-->
(?:<script|link)[^>]*mg-(?:core|plugins|templates)
monotracker(?:\\.min)?\\.js
<link[^>]* href=[^>]+/web/css/(?:web\\.assets_common/|website\\.assets_frontend/)\\;confidence:25
/web/js/(?:web\\.assets_common/|website\\.assets_frontend/)\\;confidence:25
<link href=\"/opencms/
opencms
<!--[^>]+published by Open Text Web Solutions
ophal\\.js
Powered by <a href=\"[^>]+php-fusion
<[^>]+class=\"perc-region\"
<span[^>]+id=\"xvotes-0
<div class=\"posterous
<a href=\"[^>]+opensolution\\.org/\">CMS by
<html[^>]+xmlns:change=
<img[^>]+_tcm\\d{2,3}-\\d{6}\\.
/sim(?:site|core)/js
Powered by <a href=\"[^>]+SilverStripe
<img[^>]+src=\"[^>]*/~/media/[^>]+\\.ashx
<[^>]+/smartsite\\.(?:dws|shtml)\\?id=
<div class='dynamicDiv' id='dd\\.\\d\\.\\d'>
<!--\\s+Running (?:MySource|Squiz) Matrix
<(?:script[^>]+ src|link[^>]+ href)=[^>]+typo3temp/
<html[^>]+xmlns:typo3=\"[^\"]+Flow/Packages/Neos/
<(?:link|style|script)[^>]+/assets/frontOffice/
<[^>]*type=[^>]text\\/vnd\\.tiddlywiki
(?:/|_)tiki
powered by <a href=[^>]+umbraco
/js/ushahidi\\.js$
<[^>]+=\"vgn-?ext
cdn\\d+\\.editmysite\\.com
static\\.wixstatic\\.com
(?:<a href=\"[^>]+wolfcms\\.org[^>]+>Wolf CMS(?:</a>)? inside|Thank you for using <a[^>]+>Wolf CMS)
<link rel=[\"']stylesheet[\"'] [^>]+wp-(?:content|includes)
<link[^>]+s\\d+\\.wp\\.com
/wp-includes/
actionheroClient\\.js
[^a-z\\d]e107\\.js
<link[^>]*/papaya-themes/
Powered by <a href=\"[^\"]+phpwind\\.net
<a[^>]+>Powered by uKnowva</a>
/media/conv/js/jquery.js
powered by <a href=\"[^>]+viennacms

ping\\.src = node\\.href;\\s+[^>]+\\s+}\\s+</script>
<a href=\"[^>]+woltlab\\.com[^<]+<strong>Burning Board
Powered by (?:<strong>)?<a href=\"[^>]+fluxbb
<link[^>]+ipb_[^>]+\\.css
jscripts/ips_
<a href=\"[^\"]+minibb[^<]+</a>[^<]+\n<!--End of copyright link
(?:<script [^>]+\\s+<!--\\s+lang\\.no_new_posts|<a[^>]* title=\"Powered By MyBB)
<[^>]+Powered by PHP-Nuke
(?:<a[^>]+Powered by Reddit|powered by <a[^>]+>reddit<)
<body id=\"(?:DiscussionsPage|vanilla)
<!-- Powered by XMB
(?:jQuery\\.extend\\(true, XenForo|Forum software by XenForo&trade;|<!--XF:branding|<html[^>]+id=\"XenForo\")
Powered by <a href=\"[^>]+yabbforum
(?:Powered by <a[^>]+phpbb|<a[^>]+phpbb[^>]+class=\\.copyright|\tphpBB style name|<[^>]+styles/(?:sub|pro)silver/theme|<img[^>]+i_icon_mini|<table class=\"forumline)
Powered by <a href=\"[^>]+punbb

<!-- <h1>BigDump: Staggered MySQL Dump Importer ver\\. ([\\d.b]+)\\;
(?:<title>SQL Buddy</title>|<[^>]+onclick=\"sideMainClick\\(\"home\\.php)
(?: \\| phpMyAdmin ([\\d.]+)<\\/title>|PMA_sendHeaderLocation\\(|<link [^>]*href=\"[^\"]*phpmyadmin\\.css\\.php)\\;
(?:<title>phpPgAdmin</title>|<span class=\"appname\">phpPgAdmin)

(?:wh(?:utils|ver|proxy|lang|topic|msg)|ehlpdhtm)\\.js
(?:<!-- Generated by Doxygen ([\\d.]+)|<link[^>]+doxygen\\.css)\\;
<link[^>]+href=\"[^\"]*rdoc-style\\.css
Generated by <a[^>]+href=\"https?://rdoc\\.rubyforge\\.org[^>]+>RDoc</a> ([\\d.]*\\d)\\;
(?:<html[^>]* yuilibrary\\.com/rdf/[\\d.]+/yui\\.rdf|<body[^>]+class=\"yui3-skin-sam)
<!-- Generated by phpDocumentor

cdn\\.shop\\.pe/widget/
addthis\\.com/js/
hellobar\\.js
addtoany\\.com/menu/page\\.js
\\/assets\\/js\\/manycontacts\\.min\\.js
(?:<iframe id=\"meebo-iframe\"|Meebo\\('domReady'\\))
pub\\.mybloglog\\.com
<link [^>]*href=\"[^\"]+owl.carousel(?:\\.min)?\\.css
owl.carousel.*\\.js
widgets\\.outbrain\\.com/outbrain\\.js
w\\.sharethis\\.com/
assetscdn\\.stackla\\.com\\/media\\/js\\/widget\\/(?:[a-zA-Z0-9.]+)?\\.js
load\\.sumome\\.com

<a href=\"http://www.strato.de/\" target=\"_blank\">
<a href=\"https://ssl.mietshop.d
<div class=\"BoxContainer\">
<dd>This OnlineStore is brought to you by ViA-Online GmbH Afterbuy. Information and contribution at https://www.afterbuy.de</dd>
shop-static\\.afterbuy\\.de
Powered by <a href=\"http://www.xonic-solutions.de/index.php\" target=\"_blank\">xonic-solutions Shopsoftware</a>
core/jslib/jquery\\.xonic\\.js\\.php
Powered by <a [^>]*href=\"https?://(?:www\\.)?arastta\\.org[^>]+>Arastta
arastta\\.js
<link[^>]* href=\"^https?://edge\\.avangate\\.net/
^https?://edge\\.avangate\\.net/
<link href=[^>]+cdn\\d+\\.bigcommerce\\.com/v
cdn\\d+\\.bigcommerce\\.com/v
(?:Diese <a href=[^>]+bigware\\.de|<a href=[^>]+/main_bigware_\\d+\\.php)
&nbsp;Powered by (?:<a href=[^>]+cs-cart\\.com|CS-Cart)
.cm-noscript[^>]+</style>
clientexec\\.[^>]*\\s?=\\s?[^>]*;
cosmoshop_functions\\.js
(?:Powered by <a href=[^>]+cubecart\\.com|<p[^>]+>Powered by CubeCart)
<[^>]+demandware\\.edgesuite
<[^>]+(?:id=\"block[_-]commerce[_-]cart[_-]cart|class=\"commerce[_-]product[_-]field)
cdn\\.e-merchant\\.com
<!--\\s+FwP Systems
(?:<link [^>]*href=\"[^\\/]*\\/\\/www\\.fortune3\\.com\\/[^\"]*siterate\\/rate\\.css|Powered by <a [^>]*href=\"[^\"]+fortune3\\.com)
cartjs\\.php\\?(?:.*&)?s=[^&]*myfortune3cart\\.com
(?:<link[^>]* href=\"templates/gambio/|<a[^>]content\\.php\\?coID=\\d|<!-- gambio eof -->|<!--[\\s=]+Shopsoftware by Gambio GmbH \\(c\\))
gm_javascript\\.js\\.php
<[^>]+(?:/sys_master/|/hybr/|/_ui/desktop/)
(?:is-bin|INTERSHOP)
(?:<input[^>]+name=\"JTLSHOP|<a href=\"jtl\\.php)
js/mage
//skin/frontend/(?:default|(enterprise))\\;?Enterprise:Community
<!--[^-]*OXID eShop
(?:index\\.php\\?route=[a-z]+/|Powered By <a href=\"[^>]+OpenCart)
<[^>]+_dyncharset
<[^>]+id=\"oracle-cc\"
<a[^>]+title=\"POWERGAP
<input type=\"hidden\" name=\"shopid\"
Powered by <a\\s+[^>]+>PrestaShop
<a href=\"[^>]+opensolution\\.org/\">(?:Shopping cart by|Sklep internetowy)
<a[^>]+title=\"SEOshop
<body class=\"shopatron
<img[^>]+mediacdn\\.shopatron\\.com\\;confidence:50
mediacdn\\.shopatron\\.com
<link[^>]+=['\"]//cdn\\.shopify\\.com
<link [^>]*href=\"https?://cdn\\.myshoptet\\.com/
^https?://cdn\\.myshoptet\\.com/
<title>Shopware ([\\d\\.]+) [^<]+\\;\\;confidence:90
(?:(shopware)|/web/cache/[0-9]{10}_.+)\\.js\\;?4:5
smjslib\\.js
(?:<link[^>]*/assets/store/all-[a-z\\d]{32}\\.css[^>]+>|<script>\\s*Spree\\.(?:routes|translations|api_key))
Shopsystem von <a href=[^>]+store-systems\\.de\"|\\.mws_boxTop
uc_cart/uc_cart_block\\.js
<form [^>]*action=\"[^\"]*\\/cgi-bin\\/UCEditor\\?(?:[^\"]*&)?merchantId=[^\"]
cgi-bin\\/UCJavaScript\\?(?:[^\"]*&)?merchantid=.
<a[^>]+>Powered By VP-ASP Shopping Cart</a>
vs350\\.js
<div id=\"vmMainPage
<link [^>]*href=\"[^\"]*/vspfiles/
/volusion\\.js(?:\\?([\\d.]*))?\\;
<!-- WooCommerce
woocommerce
Powered by X-Cart(?: (\\d+))? <a[^>]+href=\"http://www\\.x-cart\\.com/\"[^>]*>\\;
<a[^>]+href=\"[^\"]*(?:\\?|&)xcart_form_id=[a-z\\d]{32}(?:&|$)
/skin/common_files/modules/Product_Options/func\\.js
<link[^>]+store\\.yahoo\\.net
(?:<!--Powered by nopCommerce|Powered by: <a[^>]+nopcommerce)
<body onload=\"window\\.defaultStatus='oscss templates';\"
(?:<a[^>]*(?:\\?|&)osCsid|Powered by (?:<[^>]+>)?osCommerce</a>|<[^>]+class=\"[^>]*infoBoxHeading)
<div class=\"copyright\">[^<]+<a[^>]+>xt:Commerce

<!--Coppermine Photo Gallery ([\\d.]+)\\;
<div id=\"gsNavBar\" class=\"gcBorder1\">
<link [^>]*href=\"[^\"]+lightbox(?:\\.min)?\\.css
lightbox.*\\.js
<link [^>]*href=\"[^/]*slimbox(?:-rtl)?\\.css
slimbox\\.js
<link [^>]*href=\"[^/]*slimbox2(?:-rtl)?\\.css
slimbox2\\.js
supersized(?:\\.([\\d.]*[\\d]))?.*\\.js\\;
<!--phpalbum ([.\\d\\s]+)-->\\;
(?:<link [^>]*href=\"[^\"]*prettyPhoto(?:\\.min)?\\.css|<a [^>]*rel=\"prettyPhoto)
jquery\\.prettyPhoto\\.js

<html[^>]* xmlns:jspwiki=
jspwiki
Powered by <a href=[^>]+atlassian\\.com/software/confluence(?:[^>]+>Atlassian Confluence</a> ([\\d.]+))?\\;
(?:<a[^>]+>Powered by MediaWiki</a>|<[^>]+id=\"t-specialpages)
moin(?:_static(\\d)(\\d)(\\d)|.+)/common/js/common\\.js\\;
<img [^>]*(?:title|alt)=\"This site is powered by the TWiki collaboration platform
(?:TWikiJavascripts|twikilib(?:\\.min)?\\.js)
<script[^>]*>[^<]*session_url:\\s*'https://session\\.wikispaces\\.com/
<\\w+[^>]*\\s+class=\"[^\"]*WikispacesContent\\s+WikispacesBs3[^\"]*\"
Powered by <a href=\"[^>]+WikkaWiki

<a[^>]+>DirectAdmin</a> Web Control Panel
common\\.js\\?plesk
<!-- cPanel

xiti\\.com/hit\\.xiti
aws\\.src = [^<]+caphyon-analytics
bugsense\\.js
bugsnag.*\\.js
src=[^>]+co2stats\\.com/propres\\.php
chartbeat\\.js
clickheat.*\\.js
static\\.getclicky\\.com
conversionlab\\.trackset\\.com/track/tsend\\.js
cetrk\\.com/pages/scripts/\\d+/\\d+\\.js
tag\\.crsspxl\\.com/s1\\.js
dtagent.*\\.js
^https?://[^\\/]+\\.google-analytics\\.com\\/(?:ga|urchin|(analytics))\\.js\\;?UA:
heap-\\d+.js
cmdatatagutils\\.js
^https?://(?:[^/]+\\.)?i(?:oam|v)wbox\\.de/
(?:api\\.intercom\\.io/api|static\\.intercomcdn\\.com/intercom\\.v1)
/jirafe\\.js
cf\\.kampyle\\.com/k_button\\.js
tracking\\.koego\\.com/end/ego\\.js
<script[^<>]*>[^]{0,128}?src\\s*=\\s*['\"]//counter\\.yadro\\.ru/hit(?:;\\S+)?\\?(?:t\\d+\\.\\d+;)?r
<!--LiveInternet counter-->
<!--/LiveInternet-->
<a href=\"http://www.liveinternet.ru/click\"
/js/al/common.js\\?[0-9_]+
mint/\\?js
api\\.mixpanel\\.com/track
netmonitor\\.fi/nmtracker\\.js
<!-- (?:Start|End) Open Web Analytics Tracker -->
optimizely\\.com.*\\.js
atgsvcs.+atgsvcs\\.js
piwik\\.js|piwik\\.php
edge\\.quantserve\\.com/quant\\.js
ruxitagentjs
<img[^>]*\\s+src=['\"]?https?://www\\.shinystat\\.com/cgi-bin/shinystat\\.cgi\\?[^'\"\\s>]*['\"\\s/>]
^https?://codice(?:business|ssl|pro|isp)?\\.shinystat\\.com/cgi-bin/getcod\\.cgi
sitemeter\\.com/js/counter\\.js\\?site=
/s[_-]code.*\\.js
snoobi\\.com/snoop\\.php
statcounter\\.com/counter/counter
tracker.js
visualpath[^/]*\\.trackset\\.it/[^/]+/track/include\\.js
w3counter\\.com/tracker\\.js
<title [^>]*lang=\"wo\">
<img[^>]+id=\"DCSIMG\"[^>]+webtrends
static\\.woopra\\.com
d\\.yimg\\.com/mi/ywa\\.js
mc\\.yandex\\.ru\\/metrika\\/watch\\.js
<iframe[^>]* (?:id=\"comscore\"|scr=[^>]+comscore)|\\.scorecardresearch\\.com/beacon\\.js|COMSCORE\\.beacon
\\.scorecardresearch\\.com/beacon\\.js|COMSCORE\\.beacon
<script[\\s\\S]*cdn\\.segment\\.com/analytics.js[\\s\\S]*script>
cdn\\.segment\\.com/analytics\\.js

<iframe src=\"[^>]+tumblr\\.com

^https?://cdn\\.alloyui\\.com/
angular(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
/([\\d.]+(?:\\-?rc[.\\d]*)*)/angular(?:\\.min)?\\.js\\;
angular.*\\.js
backbone.*\\.js
boba(?:\\.min)?\\.js
dhtmlxcommon\\.js
dataTables.*\\.js
dpd\\.js
([\\d.]+)/dojo/dojo(?:\\.xd)?\\.js\\;
\benyo\\.js
ext-base\\.js
hammer(?:\\.min)?\\.js
<[^>]*type=[^>]text\\/x-handlebars-template
handlebars(?:\\.runtime)?(?:-v([\\d.]+?))?(?:\\.min)?\\.js\\;
<[^>]*data-headjs-load
head\\.(?:core|load)(?:\\.min)?\\.js
hogan-(?:-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
([\\d.]+)/hogan(?:\\.min)?\\.js\\;
^immutable\\.(?:min\\.)?js$
lazy(?:\\.browser)?(?:\\.min)?\\.js
lodash.*\\.js
backbone\\.marionette.*\\.js
<link[^>]+__meteor-css__
MochiKit(?:\\.min)?\\.js
modernizr(?:-([\\d.]*[\\d]))?.*\\.js\\;
moment-timezone(?:\\-data)?(?:\\.min)?\\.js
moment(?:\\.min)?\\.js
mootools.*\\.js
mustache(?:\\.min)?\\.js
petrojs(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
(?:/([\\d.]+)/)?petrojs(?:\\.min)?\\.js\\;
(?:<polymer-[^>]+|<link[^>]+rel=\"import\"[^>]+/polymer\\.html\")
polymer\\.js
(?:prototype|protoaculous)(?:-([\\d.]*[\\d]))?.*\\.js\\;
ramda.*\\.js
<[^>]+data-react
react(?:\\-with\\-addons)?(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
/([\\d.]+)/react(?:\\.min)?\\.js\\;
react.*\\.js
require.*\\.js
reveal(?:\\.min)?\\.js
right\\.js
riot(?:\\+compiler)?(?:\\.min)?\\.js
rx(?:\\.\\w+)?(?:\\.compat)?(?:\\.min)?\\.js
select2.*\\.js
sencha-touch.*\\.js
snap\\.svg(?:-min)?\\.js
socket.io.*\\.js
<link[^>]+?href=\"[^\"]+sweet-alert(?:\\.min)?\\.css
sweet-alert(?:\\.min)?\\.js
TweenMax(?:\\.min)?\\.js
(?:typeahead|bloodhound)\\.(?:jquery|bundle)?(?:\\.min)?\\.js
underscore.*\\.js
vue(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
/([\\d.]+)/vue(?:\\.min)?\\.js\\;
vue.*\\.js\\;confidence:20
\bwebix\\.js
xregexp(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
/([\\d.]+)/xregexp(?:\\.min)?\\.js\\;
xregexp.*\\.js
xajax_core.*\\.js
(?:/yui/|yui\\.yahooapis\\.com)
zepto.*\\.js
basket.*\\.js\\;confidence:10
jquery(?:\\-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
/([\\d.]+)/jquery(?:\\.min)?\\.js\\;
jquery.*\\.js
jquery-ui(?:-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
([\\d.]+)/jquery-ui(?:\\.min)?\\.js\\;
jquery-ui.*\\.js
math(?:\\.min)?\\.js
(?:scriptaculous|protoaculous)\\.js
spin(?:\\.min)?\\.js
yepnope-(?:-|\\.)([\\d.]*\\d)[^/]*\\.js\\;
([\\d.]+)/yepnope(?:\\.min)?\\.js\\;
yepnope.*\\.js
^list\\.(?:min\\.)?js$

Powered by\\s+<a href=[^>]+atlassian\\.com/(?:software/jira|jira-bug-tracking/)[^>]+>Atlassian\\s+JIRA(?:[^v]*v(?:ersion: )?(\\d+\\.\\d+(?:\\.\\d+)?))?\\;
jira-issue-collector-plugin
atlassian\\.jira\\.collector\\.plugin
href=\"enter_bug\\.cgi\">
(?:<a[^>]+>Powered by Flyspray|<map id=\"projectsearchform)
<img[^>]+ alt=\"Powered by Mantis Bugtracker
Powered by <a href=\"[^>]+Redmine
<a id=\"tracpowered
Powered by <a href=\"[^\"]*\"><strong>Trac(?:[ /]([\\d.]+))?\\;

<(?:param|embed|iframe)[^>]+blip\\.tv/play
cdn\\.sublimevideo\\.net/js/[a-z\\d]+\\.js
<div[^>]+class=\"video-js+\">
zencdn\\.net/c/video\\.js
(?:<(?:param|embed)[^>]+vimeo\\.com/moogaloop|<iframe[^>]player\\.vimeo\\.com)
<(?:param|embed|iframe)[^>]+youtube(?:-nocookie)?\\.com/(?:v|embed)

<div[^>]+id=\"disqus_thread\"
disqus_url
intensedebate\\.com
<[^>]+(?:id|class)=\"livefyre
livefyre_init\\.js

^https?://api\\.captchme\\.net/
<img[^>]+\\.mollom\\.com
mollom(?:\\.min)?\\.js
^https?://api\\.solvemedia\\.com/
(?:<div[^>]+id=\"recaptcha_image|<link[^>]+recaptcha|document\\.getElementById\\('recaptcha')
(?:api-secure\\.recaptcha\\.net|recaptcha_ajax\\.js)

cufon-yui\\.js
<link[^>]* href=[^>]+font-awesome(?:\\.min)?\\.css
(?:<link[^>]* href=[^>]+glyphicons(?:\\.min)?\\.css|<img[^>]* src=[^>]+glyphicons)
<link[^>]* href=[^>]+fonts\\.(?:googleapis|google)\\.com
googleapis\\.com/.+webfont
<link[^>]* href=[^>]+ionicons(?:\\.min)?\\.css
use\\.typekit\\.com
sifr\\.js

<!-- START headerTags\\.cfm
/cfajax/
/([\\d.]+(?:\\-?rc[.\\d]*)*)/angular-material(?:\\.min)?\\.js\\;
angular-material.*\\.js
Powered by <a[^>]+href=\"https?://(?:www\\.)?cibonfire\\.com[^>]*>Bonfire v([^<]+)\\;
<input[^>]+name=\"ci_csrf_token\"\\;version:2+
(?:powered by <a[^>]+>Django ?([\\d.]+)?|<input[^>]*name=[\"']csrfmiddlewaretoken[\"'][^>]*>)\\;
<link [^>]*href=\"[^\"]+ink(?:\\.min)?\\.css
ink.*\\.js
<link[^>]*\\s+href=[^>]*styles/kendo\\.common(?:\\.min)?\\.css[^>]*/>
<link[^>]* href=\"[^\"]*materialize(?:\\.min)?\\.css
materialize(?:\\.min)?\\.js
<input[^>]+name=\"__VIEWSTATE
<link[^>]+?href=\"[^\"]+milligram(?:\\.min)?\\.css
<link[^>]+?href=\"[^\"]+penguin(?:\\.min)?\\.css
penguin(?:\\.min)?\\.js
<link[^>]+(?:([\\d.])+/)?pure(?:-min)?\\.css\\;
/assets/application-[a-z\\d]{32}/\\.js\\;confidence:50
(?:<div class=\"ui\\s[^>]+\">)\\;confidence:30
(?:<link[^>]+semantic(?:\\.css|\\.min\\.css)\">)
(?:semantic(?:\\.js|\\.min\\.js))
Powered by <a href=\"[^>]+Swiftlet
<style>/\\*!\\* Bootstrap v(\\d\\.\\d\\.\\d)\\;
<link[^>]+?href=\"[^\"]+bootstrap(?:\\.min)?\\.css
<div [^>]*class=\"[^\"]*col-(?:xs|sm|md|lg)-\\d{1,2}
(?:twitter\\.github\\.com/bootstrap|bootstrap(?:\\.js|\\.min\\.js))
uikit.*\\.js
\\.js\\.wgx$
web2py\\.js
Powered by <a href=\"http://www.yiiframework.com/\" rel=\"external\">Yii Framework</a>
<!-- ZK [\\.\\d\\s]+-->
zkau/
<div [^>]*class=\"[^\"]*(?:small|medium|large)-\\d{1,2} columns
var WCF_PATH[^>]+
WCF\\..*\\.js

@fulmicoton
Owner

Split your 400 regexps into packs of 40 or so and it should work. It will be 10 times slower, but still between 50 and 100 times faster than looping over Java's regexps.
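
A minimal sketch of the splitting step, assuming the patterns are held in a `List`. `PatternPacks` and `partition` are hypothetical names; building one automaton per pack is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of multiregexp.
public final class PatternPacks {
    private PatternPacks() {}

    /** Splits items into consecutive packs of at most packSize elements. */
    public static <T> List<List<T>> partition(List<T> items, int packSize) {
        if (packSize <= 0) {
            throw new IllegalArgumentException("packSize must be positive");
        }
        List<List<T>> packs = new ArrayList<>();
        for (int start = 0; start < items.size(); start += packSize) {
            int end = Math.min(start + packSize, items.size());
            // Copy the slice so each pack is independent of the source list.
            packs.add(new ArrayList<>(items.subList(start, end)));
        }
        return packs;
    }
}
```

With 400 patterns and a pack size of 40 this gives 10 packs. A pattern's index inside pack `k` maps back to its original index as `k * packSize + indexInPack`, which is needed if match results are reported by pattern id.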

@fulmicoton
Owner

@neilireson sorry for not coming back to you earlier.
Thank you for your pull request.

The pull request is not addressing a single issue, so I am afraid I will not merge it.

Handling "^" is a very important addition.
I think serialization and the non-tableized automaton are nice-to-haves.
I am not too sure about multithreading.

All 4 should be separate issues and PRs.

If you want to split your PR accordingly, I'd be happy to review it and merge it in.
If not, I'm afraid it is going to stay in that state.

@neilireson
Author

@fulmicoton I would have liked to create different PRs, but every time I attempted to create a new PR it was just appended to my original one. I will attempt to overcome my Git ignorance and work out how to split the PR into separate issues.

@neilireson
Author

@aantix Unfortunately the time taken to create the multiregexp automaton grows exponentially. Most of my “patterns” are actually just simple strings, so it was only taking me a minute to create a ~3,000 pattern automaton; for 20k patterns, however, I reckon it would take around 8 days. I haven’t tried the approach with more complex regex patterns. To overcome the time limitations I divided my 20k+ patterns into sets and generated multiple multiregexp automata.

By the way, almost all of the memory is needed for generating the lookup table, which speeds up matching (I think it about halves the time). However, if you set the tableized parameter to false the memory requirements are about two orders of magnitude lower, so your 1GB object will be around 10MB.

In the end I have taken @fulmicoton's advice: I've expanded all my patterns into (~50k) strings and I'm using an Aho-Corasick implementation (https://github.com/robert-bor/aho-corasick). I'm also looking to extend it to include very simple patterns that cover my use case, i.e. ignoring multiple spaces, punctuation, folding the characters into ASCII, etc.
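
One way to approximate those simple patterns is to normalize both the search strings and the input text before string matching. The sketch below is an assumption about how that could look, not something from this PR or from the aho-corasick library; `TextFolding` and `fold` are hypothetical names. It only strips combining accents, so characters without a decomposition (for example "ø" or "ß") are left unchanged.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of multiregexp or aho-corasick.
public final class TextFolding {
    private TextFolding() {}

    /** Strips accents, replaces ASCII punctuation with spaces, and collapses whitespace. */
    public static String fold(String input) {
        // Decompose accented characters into base letter + combining mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop the combining marks, leaving the base letters.
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Treat ASCII punctuation as a separator.
        String withoutPunctuation = withoutMarks.replaceAll("\\p{Punct}+", " ");
        // Collapse runs of whitespace into a single space.
        return withoutPunctuation.replaceAll("\\s+", " ").trim();
    }
}
```

Match offsets found in the folded text no longer line up with the original text, so an offset map would be needed if the original positions matter.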

@fulmicoton
Owner

@neilireson I think it is reasonable to put all of your strings aside and treat them with Aho-Corasick, and handle the remaining regular expressions with multiregexp. Ideally the library would do that for you.
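
A rough sketch of how the two groups could be separated up front. The set of metacharacters is an assumption and is deliberately conservative, so anything that might carry regex meaning is routed to the regex group; it should be adjusted to the regex dialect actually in use. `PatternSplitter`, `isPlainLiteral` and `split` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of multiregexp.
public final class PatternSplitter {
    private PatternSplitter() {}

    // Assumed, conservative list of characters that may have regex meaning.
    private static final String META = "\\.[]{}()*+?^$|<>\"~&@#";

    /** True if the pattern contains none of the assumed metacharacters. */
    public static boolean isPlainLiteral(String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            if (META.indexOf(pattern.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    /** Key true holds plain strings, key false holds everything treated as a regex. */
    public static Map<Boolean, List<String>> split(List<String> patterns) {
        return patterns.stream()
                .collect(Collectors.partitioningBy(PatternSplitter::isPlainLiteral));
    }
}
```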

In the case of @aantix, his use case requires regular expressions, so I believe using multiregexp makes sense. Unfortunately, the automaton explodes in memory. Splitting the patterns into packs should work.

3 participants