Adding Aho-Corasick to Boost.Algorithm #24

Open
wants to merge 27 commits into
base: develop
4275bcc
Added TODO list
zamazan4ik Jul 23, 2016
697e816
Initialize Aho-Corasik
zamazan4ik Jul 26, 2016
f5681d8
Added return-reference find
zamazan4ik Jul 29, 2016
47778dd
New version of Aho-Corasick
zamazan4ik Aug 13, 2016
559ad90
Added guards and copyright
zamazan4ik Aug 13, 2016
1c64e19
Added example for Aho-Corasick
zamazan4ik Aug 13, 2016
19bc400
Added doxygen and documentation
zamazan4ik Aug 14, 2016
df8321a
Removed range interface for operator()
zamazan4ik Aug 14, 2016
14392f8
Added in doc 'insert' method
zamazan4ik Aug 15, 2016
4f325ae
Added tests.
zamazan4ik Aug 16, 2016
a40fe59
New version 'Aho-Corasick'
zamazan4ik Aug 24, 2016
a4dcb83
[micro] Fix doxygen comment
zamazan4ik Aug 24, 2016
98ec38a
Delete useless README
zamazan4ik Aug 24, 2016
ea9f65a
[micro] Fixed doxygen doc
zamazan4ik Aug 27, 2016
95ce8ab
Merged branch feature_branch/aho_corasik into feature_branch/aho_corasik
zamazan4ik Aug 27, 2016
a638394
[micro] Fixed C++11 compatibility and include guard name
zamazan4ik Aug 28, 2016
63c7077
[micro] Fixed comment
zamazan4ik Aug 28, 2016
517b6d5
Fix in matching patterns
zamazan4ik Aug 28, 2016
dbd435b
Fixed multiple init in 'find', renamed to aho_corasick
zamazan4ik Aug 31, 2016
3212f73
Added range-based 'find' .
zamazan4ik Sep 3, 2016
86e4514
Changed type of root std::unique_ptr<node_type> -> node_type
zamazan4ik Sep 3, 2016
3b37858
Deleted memory allocations.
zamazan4ik Sep 6, 2016
b6382e6
[micro] Refactoring
zamazan4ik Sep 13, 2016
ae8c3c3
Fix serious bug in searching.
zamazan4ik Sep 13, 2016
87117ce
Deleted std::map and std::unordered_map versions.
zamazan4ik Sep 15, 2016
48b2bc1
Fixed compile error
zamazan4ik Sep 17, 2016
47e6b97
Added source of implementation, refactoring
zamazan4ik Oct 7, 2016
143 changes: 143 additions & 0 deletions doc/aho_corasick.qbk
@@ -0,0 +1,143 @@
[/ QuickBook Document version 1.5 ]

[section:AhoCorasick Aho-Corasick Search]

[/license

Copyright (c) 2016 Alexander Zaitsev

Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt)
]


[heading Overview]

The header file 'aho_corasick.hpp' contains an implementation of the Aho-Corasick algorithm for searching sequences of values. It is primarily used to search for multiple patterns within a corpus.

The Aho-Corasick algorithm works by building a trie (a tree in which each node corresponds to an object) from the pattern sequences and traversing the trie to search for the patterns in a given corpus sequence. Additionally, the Aho-Corasick algorithm introduces the concept of a "failure pointer" (or "failure node"), which is the node to move to when there is a mismatch.
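
The trie and failure-pointer idea described above can be sketched in a few functions. This is a minimal illustration only, not the code from 'aho_corasick.hpp': the names `Node`, `build_trie`, `build_failure_links` and `search` are hypothetical, nodes are stored by index in a vector, and matches are reported as (start offset, length) pairs instead of through a callback.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for this sketch (not the library's node).
struct Node {
    std::map<char, int> next;          // trie edges: symbol -> child index
    int fail = 0;                      // failure link: index of the fallback node
    std::vector<std::size_t> lengths;  // lengths of all patterns ending here
};

// Build the trie: one path from the root (index 0) per pattern.
std::vector<Node> build_trie(const std::vector<std::string>& patterns) {
    std::vector<Node> trie(1);
    for (const std::string& p : patterns) {
        if (p.empty()) continue;       // empty patterns are ignored in this sketch
        int cur = 0;
        for (char c : p) {
            auto it = trie[cur].next.find(c);
            if (it == trie[cur].next.end()) {
                const int id = static_cast<int>(trie.size());
                trie[cur].next[c] = id;  // record the edge before the vector grows
                trie.emplace_back();
                cur = id;
            } else {
                cur = it->second;
            }
        }
        trie[cur].lengths.push_back(p.size());
    }
    return trie;
}

// Breadth-first pass: each node's failure link points to the node that spells
// the longest proper suffix of its own path that is also a path in the trie.
void build_failure_links(std::vector<Node>& trie) {
    assert(!trie.empty());             // the root must exist
    std::queue<int> pending;
    for (const auto& edge : trie[0].next) pending.push(edge.second);  // depth 1: fail = root
    while (!pending.empty()) {
        const int u = pending.front();
        pending.pop();
        for (const auto& edge : trie[u].next) {
            const char c = edge.first;
            const int v = edge.second;
            int f = trie[u].fail;
            while (f != 0 && trie[f].next.count(c) == 0) f = trie[f].fail;
            auto it = trie[f].next.find(c);
            trie[v].fail = (it != trie[f].next.end()) ? it->second : 0;
            // Every pattern recognised at the fallback node also ends at v.
            const std::vector<std::size_t>& inherited = trie[trie[v].fail].lengths;
            trie[v].lengths.insert(trie[v].lengths.end(), inherited.begin(), inherited.end());
            pending.push(v);
        }
    }
}

// Scan the corpus once; on a mismatch follow failure links instead of restarting.
std::vector<std::pair<std::size_t, std::size_t>>
search(const std::vector<Node>& trie, const std::string& corpus) {
    assert(!trie.empty());
    std::vector<std::pair<std::size_t, std::size_t>> matches;  // (start offset, length)
    int state = 0;
    for (std::size_t i = 0; i < corpus.size(); ++i) {
        const char c = corpus[i];
        while (state != 0 && trie[state].next.count(c) == 0) state = trie[state].fail;
        auto it = trie[state].next.find(c);
        if (it != trie[state].next.end()) state = it->second;
        for (std::size_t len : trie[state].lengths) matches.emplace_back(i + 1 - len, len);
    }
    return matches;
}
```

With the patterns "he", "she", "his" and "hers" and the corpus "ushers", this sketch reports "she" and "he" when it reaches the 'e', then falls back from "she" to "he" on the 'r' and goes on to report "hers".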

The algorithm was conceived in 1975 by Alfred V. Aho and Margaret J. Corasick. Their paper "Efficient string matching: An aid to bibliographic search" was published in the Communications of the ACM.

Nomenclature: The nomenclature is similar to that of the Knuth-Morris-Pratt implementation in Boost.Algorithm. The sequence being searched for is referred to as the "pattern", and the sequence being searched in is referred to as the "corpus".

For more details, see "Set Matching and Aho–Corasick Algorithm", lecture slides by Pekka Kilpeläinen (http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf).

[heading Interface]

For flexibility, the Aho-Corasick algorithm has two interfaces: an object-based interface and a procedural one. The object-based interface builds the trie in the constructor and uses 'find()' to build the suffix links and perform the search. The procedural interface builds the trie (including the suffix links) and performs the search in a single step. If you are going to search for the same patterns in several corpora, use the object interface and build the trie only once.

The header file 'aho_corasick.hpp' contains two versions of Aho-Corasick: one based on std::map and one based on std::unordered_map. There is also a class AhoCorasick, which you can customize. For every version, the header provides both procedural and object-based interfaces.

Procedural interfaces:

The procedural interfaces are based on iterators.

For Aho-Corasick based on std::map:

``
template <typename T, typename Predicate = std::less<T>, typename RAIterator,
          typename ForwardIterator, typename Callback>
bool aho_corasick_map ( RAIterator corpus_first, RAIterator corpus_last,
                        ForwardIterator pat_first, ForwardIterator pat_last,
                        Callback cb);
``

For Aho-Corasick based on std::unordered_map:

``
template <typename T, typename Hash = std::hash<T>, typename Comp = std::equal_to<T>, typename RAIterator,
          typename ForwardIterator, typename Callback>
bool aho_corasick_hashmap ( RAIterator corpus_first, RAIterator corpus_last,
                            ForwardIterator pat_first, ForwardIterator pat_last,
                            Callback cb);
``



Object interface (typedefs):
``
template <typename T, typename Pred = std::less<T>>
using Aho_Corasick_Map = AhoCorasick<T, std::map, Pred>;

template <typename T, typename Hash = std::hash<T>, typename Comp = std::equal_to<T>>
using Aho_Corasick_HashMap = AhoCorasick<T, std::unordered_map, Hash, Comp>;
``

The interface (constructors, etc.) is the same for Aho_Corasick_Map, Aho_Corasick_HashMap and the basic AhoCorasick:
``
AhoCorasick();

template<typename ForwardIterator>
explicit AhoCorasick(ForwardIterator patBegin, ForwardIterator patEnd);

template<typename Range>
explicit AhoCorasick(const Range& range);

template<typename ForwardIterator>
void insert(ForwardIterator begin, ForwardIterator end);

template<typename Range>
void insert(const Range& range);

template <typename RAIterator, typename Callback>
bool find(RAIterator begin, RAIterator end, Callback cb);
``

[heading Return value]

The 'find' method returns true if every call to the Callback returns true; otherwise it returns false.

[heading Requirements]

A C++11-compatible compiler is required.

For Aho_Corasick_HashMap and aho_corasick_hashmap: by default, std::hash<ValueType> is used as Hash and std::equal_to<ValueType> as Comparator. If your type does not support them, you must supply your own function objects. The algorithm does not work without a Hash and a Comparator.

For Aho_Corasick_Map and aho_corasick_map: by default, std::less<ValueType> is used as Predicate. If your type does not support it, you must supply your own function object. The algorithm does not work without a Predicate.
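
As an illustration of these requirements, the sketch below defines function objects for a user-defined value type that has neither std::hash nor operator<. The type `Token` and the names `TokenHash`, `TokenEqual`, `TokenLess` and `check_hash_contract` are hypothetical. The function objects are exercised here only with the standard containers, which is where the documented typedefs forward them.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <unordered_map>

// Hypothetical value type without std::hash or operator< of its own.
struct Token {
    int kind;
    std::string text;
};

// Hash for the std::unordered_map-based version: combines both members.
struct TokenHash {
    std::size_t operator()(const Token& t) const {
        const std::size_t h1 = std::hash<int>()(t.kind);
        const std::size_t h2 = std::hash<std::string>()(t.text);
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

// Equality for the std::unordered_map-based version.
struct TokenEqual {
    bool operator()(const Token& a, const Token& b) const {
        return a.kind == b.kind && a.text == b.text;
    }
};

// Strict weak ordering for the std::map-based version.
struct TokenLess {
    bool operator()(const Token& a, const Token& b) const {
        return std::tie(a.kind, a.text) < std::tie(b.kind, b.text);
    }
};

// Equal keys must hash equally, otherwise unordered lookups break.
inline void check_hash_contract(const Token& a, const Token& b) {
    if (TokenEqual()(a, b)) assert(TokenHash()(a) == TokenHash()(b));
}

// The same function objects plug straight into the standard containers.
typedef std::unordered_map<Token, int, TokenHash, TokenEqual> HashEdges;
typedef std::map<Token, int, TokenLess> OrderedEdges;
```

Following the typedefs shown in the Interface section, these would be passed as Aho_Corasick_HashMap<Token, TokenHash, TokenEqual> and Aho_Corasick_Map<Token, TokenLess>.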

[heading Performance]

The performance of Aho_Corasick_Map and Aho_Corasick_HashMap is similar on small alphabets. On large alphabets, Aho_Corasick_HashMap is faster than Aho_Corasick_Map. Keep in mind that computing the hash of an element can be a slow operation. Also, if you use Aho_Corasick_HashMap, std::unordered_map may occasionally rehash, which costs O(Alphabet).

[heading Memory Use]

Every trie node consists of a container of std::shared_ptr to trie nodes (of the type you choose: std::map, std::unordered_map or something else), two std::shared_ptr to trie nodes, and a std::vector<size_t> holding the lengths of the patterns that end in this node. The number of nodes is linear in the sum of the lengths of the patterns.
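
The node layout described above might look roughly like the following sketch. The member names (`children`, `suffix_link`, `output_link`, `pattern_lengths`) and the roles given to the two extra pointers are assumptions, since the text does not name them. The helpers `insert_pattern` and `count_nodes` are hypothetical and exist only to show that the node count is bounded by one plus the sum of the pattern lengths.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node layout; std::map stands in for the user-chosen container.
struct Node {
    std::map<char, std::shared_ptr<Node>> children;  // edges to child nodes
    std::shared_ptr<Node> suffix_link;               // assumed role; left unset here
    std::shared_ptr<Node> output_link;               // assumed role; left unset here
    std::vector<std::size_t> pattern_lengths;        // lengths of patterns ending here
};

// Insert one pattern: at most one new node per symbol of the pattern.
void insert_pattern(const std::shared_ptr<Node>& root, const std::string& pattern) {
    assert(root);  // the trie must have a root node
    Node* cur = root.get();
    for (char c : pattern) {
        std::shared_ptr<Node>& child = cur->children[c];
        if (!child) child = std::make_shared<Node>();
        cur = child.get();
    }
    cur->pattern_lengths.push_back(pattern.size());
}

// Count the nodes in the subtree rooted at n (including n itself).
std::size_t count_nodes(const Node& n) {
    std::size_t total = 1;
    for (const auto& edge : n.children) total += count_nodes(*edge.second);
    return total;
}
```

For the patterns "he", "she", "his" and "hers" (total length 12), shared prefixes bring the count down to 10 nodes including the root.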

[heading Complexity]

Nomenclature: M is the sum of the pattern lengths, N is the length of the corpus, K is the alphabet size, and T is the number of matches.

* std::unordered_map-based version: Time: O(M + N + T) on average, Memory: O(M).

* std::map-based version: Time: O((M + N)log(K) + T), Memory: O(M).

[heading Exception Safety]

Both the object-oriented and procedural versions of the Aho-Corasick algorithm take all their parameters by value (except the output container, which is taken by non-const reference). Therefore, both interfaces provide the strong exception guarantee.

[heading Notes]

* When using the object-based interface, the pattern must remain unchanged while it is being inserted.

* The Aho-Corasick algorithm requires forward iterators for patterns and random-access iterators for the corpus.

[heading Customization points]

To use the Aho-Corasick algorithms you must supply your own Callback(RAIterator, RAIterator) -> bool. The Callback must return true if everything is fine, and false otherwise.
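
A callback of this shape could be written as a small function object. In the sketch below, `CollectMatches` is a hypothetical callback that stores each match as a string and returns false once more than `limit` matches have been seen. `naive_multi_search` is a naive stand-in with the same parameter list as the procedural interface; it is not the Aho-Corasick implementation and is here only so the callback can be exercised. The stand-in keeps calling the callback after a false result and returns whether all calls returned true; whether the real 'find' stops early is not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical callback: records each match [first, last) as a string.
struct CollectMatches {
    std::vector<std::string>& out;
    std::size_t limit;

    bool operator()(std::string::const_iterator first,
                    std::string::const_iterator last) const {
        assert(first <= last);              // a match is a valid iterator range
        out.push_back(std::string(first, last));
        return out.size() <= limit;         // false once too many matches were seen
    }
};

// Naive stand-in (NOT Aho-Corasick): tries every pattern at every position.
template <typename RAIterator, typename ForwardIterator, typename Callback>
bool naive_multi_search(RAIterator corpus_first, RAIterator corpus_last,
                        ForwardIterator pat_first, ForwardIterator pat_last,
                        Callback cb) {
    typedef typename std::iterator_traits<RAIterator>::difference_type diff_t;
    bool all_ok = true;
    for (RAIterator pos = corpus_first; pos != corpus_last; ++pos) {
        for (ForwardIterator p = pat_first; p != pat_last; ++p) {
            const diff_t len = static_cast<diff_t>(p->size());
            if (len == 0 || len > corpus_last - pos) continue;
            if (std::equal(p->begin(), p->end(), pos))
                all_ok = cb(pos, pos + len) && all_ok;  // callback is always invoked
        }
    }
    return all_ok;
}
```

Keeping the state in a referenced vector lets the caller inspect the matches after the search, in the same way the lambda in example/aho_corasick_example.cpp captures its output vector by reference.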

In Aho_Corasick_HashMap and aho_corasick_hashmap() you can customize the value type and the hash and compare functions.

In Aho_Corasick_Map and aho_corasick_map() you can customize the value type and the predicate.

In AhoCorasick you can customize the value type, the container type and any other template parameters. This container is used in the nodes of the trie. The container is instantiated as Container<Value_type, std::shared_ptr<Node>, Args...>, so your other template parameters are passed as Args... . Your container must also support the 'find' method.
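
A custom container for the trie nodes could, for example, be a sorted vector of key/value pairs. The sketch below is an assumption-laden illustration: `flat_map`, `DemoNode` and `child_of` are hypothetical names, and apart from 'find' the text does not say which members the trie requires, so `end()`, `operator[]` and `size()` are guesses modelled on std::map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical container with the shape Container<Key, Mapped, Args...>.
template <typename Key, typename Mapped, typename Pred = std::less<Key>>
class flat_map {
public:
    typedef std::pair<Key, Mapped> value_type;
    typedef typename std::vector<value_type>::iterator iterator;

    iterator end() { return items_.end(); }
    std::size_t size() const { return items_.size(); }

    // Binary search; returns end() when the key is absent.
    iterator find(const Key& key) {
        iterator it = lower(key);
        return (it != items_.end() && !pred_(key, it->first)) ? it : items_.end();
    }

    // Returns the mapped value, inserting a default one if the key is new.
    Mapped& operator[](const Key& key) {
        iterator it = lower(key);
        if (it == items_.end() || pred_(key, it->first))
            it = items_.insert(it, value_type(key, Mapped()));
        return it->second;
    }

private:
    // First element whose key is not ordered before `key`.
    iterator lower(const Key& key) {
        return std::lower_bound(items_.begin(), items_.end(), key,
            [this](const value_type& item, const Key& k) { return pred_(item.first, k); });
    }

    std::vector<value_type> items_;  // kept sorted by key
    Pred pred_;
};

// Hypothetical trie node using the container as Container<Value_type, shared_ptr<Node>>.
struct DemoNode {
    flat_map<char, std::shared_ptr<DemoNode>> children;
};

// Walks (and creates if needed) one edge, the way a trie insert would.
DemoNode& child_of(DemoNode& node, char symbol) {
    std::shared_ptr<DemoNode>& slot = node.children[symbol];
    if (!slot) slot = std::make_shared<DemoNode>();
    assert(node.children.find(symbol) != node.children.end());
    return *slot;
}
```

A sorted vector trades O(K) insertion for compact storage and cache-friendly lookups, which may suit tries whose nodes have only a few children.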

[endsect]

[/ File aho_corasick.qbk
Copyright 2016 Alexander Zaitsev
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt).
]

1 change: 1 addition & 0 deletions doc/algorithm.qbk
@@ -41,6 +41,7 @@ Thanks to all the people who have reviewed this library and made suggestions for


[section:Searching Searching Algorithms]
+[include aho_corasick.qbk]
[include boyer_moore.qbk]
[include boyer_moore_horspool.qbk]
[include knuth_morris_pratt.qbk]
3 changes: 2 additions & 1 deletion example/Jamfile.v2
@@ -20,5 +20,6 @@ project /boost/algorithm/example

exe clamp_example : clamp_example.cpp ;
exe search_example : search_example.cpp ;
-exe is_palindrome_example : is_palindrome_example.cpp;
+exe is_palindrome_example : is_palindrome_example.cpp ;
+exe aho_corasick_example : aho_corasick_example.cpp ;

41 changes: 41 additions & 0 deletions example/aho_corasick_example.cpp
@@ -0,0 +1,41 @@
/*
Copyright (c) Alexander Zaitsev <[email protected]>, 2016

Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

For more information, see http://www.boost.org
*/

#include <vector>
#include <string>
#include <utility>
#include <iostream>

#include <boost/algorithm/searching/aho_corasick.hpp>


int main()
{
    // Patterns to search for; several of them do not occur in the corpus.
    std::vector<std::string> pat({"228", "he", "is", "1488", "she", "his", "322", "her",
                                  "h", "hishera", "azaza"});
    std::string corp = "hisher";
    // Each match is stored as an iterator pair [begin, end) into the corpus.
    std::vector<std::pair<std::string::const_iterator, std::string::const_iterator>> out;

    // The callback records every match and returns true to report success.
    bool result = boost::algorithm::aho_corasick_map<char>(corp.begin(), corp.end(), pat.begin(), pat.end(),
        [&out](std::string::const_iterator begin, std::string::const_iterator end) -> bool
        { out.push_back({begin, end}); return true; });

    std::cout << result << std::endl;
    // Print each matched substring on its own line.
    for(const auto& val: out)
    {
        auto begin = val.first;
        auto end = val.second;
        while (begin != end)
        {
            std::cout << *begin;
            ++begin;
        }
        std::cout << std::endl;
    }
    return 0;
}
2 changes: 1 addition & 1 deletion include/boost/algorithm/is_palindrome.hpp
@@ -61,7 +61,7 @@ bool is_palindrome(BidirectionalIterator begin, BidirectionalIterator end, Predi
/// \return true if the entire sequence is palindrome
///
/// \param begin The start of the input sequence
-/// \param end One past the end of the input sequence
+/// \param end One past the end of the input sequence
///
/// \note This function will return true for empty sequences and for palindromes.
/// For other sequences function will return false.