Updating an Aho-Corasick trie in the face of inserts and deletes
All the literature and implementations I've found of Aho-Corasick are about building the entire trie beforehand from a set of phrases. However, I'm interested in ways to work with it as a mutable data structure, where it can handle occasional adds and removes without needing to rebuild the whole trie (imagine there are 1 million entries in it). It's OK if the worst case is awful, as long as the average case is close to logarithmic.
From how I figure it, the fail state for each node is another node that uses the same symbol. So if we have a hash multimap from each symbol to the list of nodes which use that symbol, we have our candidates whose fail states need updating.
Remove is straightforward. You find all other nodes which use the deleted node as a fail state and recompute their fail state. Go from the end of the string backwards, and the tree should still be in good shape.
Add is a bit trickier. Any node that failed to that symbol could have the new node as a better candidate. However, again it seems to be enough to traverse every other node with that symbol and fully recompute its fail state.
In other words, if we're adding or removing a node with symbol "A", we need to visit every other "A" node anywhere in the trie, and recompute the fail state (look for its closest ancestor with "A" as a child, or the root). This would require visiting every node for symbol "A", but in my case, that would be several orders of magnitude less than visiting every node in the trie.
Would that algorithm work, or am I missing something obvious?
database algorithm search data-structures aho-corasick
add a comment |
All the literature and implementations I've found of Aho-Corasick are about building the entire trie beforehand from a set of phrases. However, I'm interested in ways to work with it as a mutable data structure, where it can handle occasional adds and removes without needing to rebuild the whole trie (imagine there are 1 million entries in it). It's OK if the worst case is awful, as long as the average case is close to logarithmic.
From how I figure it, the fail state for each node is another node that uses the same symbol. So if we have a hash multimap from each symbol to the list of nodes which use that symbol, we have our candidates whose fail states need updating.
Remove is straightforward. You find all other nodes which use the deleted node as a fail state and recompute their fail state. Go from the end of the string backwards, and the tree should still be in good shape.
Add is a bit trickier. Any node that failed to that symbol could have the new node as a better candidate. However, again it seems to be enough to traverse every other node with that symbol and fully recompute its fail state.
In other words, if we're adding or removing a node with symbol "A", we need to visit every other "A" node anywhere in the trie, and recompute the fail state (look for its closest ancestor with "A" as a child, or the root). This would require visiting every node for symbol "A", but in my case, that would be several orders of magnitude less than visiting every node in the trie.
Would that algorithm work, or am I missing something obvious?
database algorithm search data-structures aho-corasick
Visiting every node with a symbol "A" to update its fail state wouldn't be so difficult, I guess. Finding every such node, though, will require a linear scan of the tree, unless you keep an auxiliary data structure that maps every node by symbol. I think to do this update efficiently will require O(n) extra space, where n is the number of nodes in your trie.
– Jim Mischel
Nov 14 '18 at 12:35
I think this is going to be more than just updating the fail states. Consider removing the string "the" from the list of matched terms. But you still have to match "their," "they," "them," "other," etc. And, yeah ... adding gets pretty tricky. I've never seen an article on updating an Aho-Corasick trie in-place, which leads me to believe that this is a pretty hard problem.
– Jim Mischel
Nov 14 '18 at 12:42
@JimMischel Yes, I expect it to take a bit more memory, my main concern is correctness and time complexity. The question is whether there is a sequence of operations that would leave the trie in a bad state. In that particular case, you could remove the "the" output from the "e" node, but leave the node intact as a non-output node, as long as you're not compressing nodes. My intuition says that would work fine, but I'm not certain; there could be cases where a node would need to move, which would mean other seemingly-unrelated fail states could be wrong.
– Robert Fraser
Nov 14 '18 at 17:02
add a comment |
All the literature and implementations I've found of Aho-Corasick are about building the entire trie beforehand from a set of phrases. However, I'm interested in ways to work with it as a mutable data structure, where it can handle occasional adds and removes without needing to rebuild the whole trie (imagine there are 1 million entries in it). It's OK if the worst case is awful, as long as the average case is close to logarithmic.
From how I figure it, the fail state for each node is another node that uses the same symbol. So if we have a hash multimap from each symbol to the list of nodes which use that symbol, we have our candidates whose fail states need updating.
Remove is straightforward. You find all other nodes which use the deleted node as a fail state and recompute their fail state. Go from the end of the string backwards, and the tree should still be in good shape.
Add is a bit trickier. Any node that failed to that symbol could have the new node as a better candidate. However, again it seems to be enough to traverse every other node with that symbol and fully recompute its fail state.
In other words, if we're adding or removing a node with symbol "A", we need to visit every other "A" node anywhere in the trie, and recompute the fail state (look for its closest ancestor with "A" as a child, or the root). This would require visiting every node for symbol "A", but in my case, that would be several orders of magnitude less than visiting every node in the trie.
Would that algorithm work, or am I missing something obvious?
database algorithm search data-structures aho-corasick
All the literature and implementations I've found of Aho-Corasick are about building the entire trie beforehand from a set of phrases. However, I'm interested in ways to work with it as a mutable data structure, where it can handle occasional adds and removes without needing to rebuild the whole trie (imagine there are 1 million entries in it). It's OK if the worst case is awful, as long as the average case is close to logarithmic.
From how I figure it, the fail state for each node is another node that uses the same symbol. So if we have a hash multimap from each symbol to the list of nodes which use that symbol, we have our candidates whose fail states need updating.
Remove is straightforward. You find all other nodes which use the deleted node as a fail state and recompute their fail state. Go from the end of the string backwards, and the tree should still be in good shape.
Add is a bit trickier. Any node that failed to that symbol could have the new node as a better candidate. However, again it seems to be enough to traverse every other node with that symbol and fully recompute its fail state.
In other words, if we're adding or removing a node with symbol "A", we need to visit every other "A" node anywhere in the trie, and recompute the fail state (look for its closest ancestor with "A" as a child, or the root). This would require visiting every node for symbol "A", but in my case, that would be several orders of magnitude less than visiting every node in the trie.
Would that algorithm work, or am I missing something obvious?
database algorithm search data-structures aho-corasick
database algorithm search data-structures aho-corasick
edited Nov 14 '18 at 17:05
Robert Fraser
asked Nov 13 '18 at 20:03
Robert FraserRobert Fraser
6,89764778
6,89764778
Visiting every node with a symbol "A" to update its fail state wouldn't be so difficult, I guess. Finding every such node, though, will require a linear scan of the tree, unless you keep an auxiliary data structure that maps every node by symbol. I think to do this update efficiently will require O(n) extra space, where n is the number of nodes in your trie.
– Jim Mischel
Nov 14 '18 at 12:35
I think this is going to be more than just updating the fail states. Consider removing the string "the" from the list of matched terms. But you still have to match "their," "they," "them," "other," etc. And, yeah ... adding gets pretty tricky. I've never seen an article on updating an Aho-Corasick trie in-place, which leads me to believe that this is a pretty hard problem.
– Jim Mischel
Nov 14 '18 at 12:42
@JimMischel Yes, I expect it to take a bit more memory, my main concern is correctness and time complexity. The question is whether there is a sequence of operations that would leave the trie in a bad state. In that particular case, you could remove the "the" output from the "e" node, but leave the node intact as a non-output node, as long as you're not compressing nodes. My intuition says that would work fine, but I'm not certain; there could be cases where a node would need to move, which would mean other seemingly-unrelated fail states could be wrong.
– Robert Fraser
Nov 14 '18 at 17:02
add a comment |
Visiting every node with a symbol "A" to update its fail state wouldn't be so difficult, I guess. Finding every such node, though, will require a linear scan of the tree, unless you keep an auxiliary data structure that maps every node by symbol. I think to do this update efficiently will require O(n) extra space, where n is the number of nodes in your trie.
– Jim Mischel
Nov 14 '18 at 12:35
I think this is going to be more than just updating the fail states. Consider removing the string "the" from the list of matched terms. But you still have to match "their," "they," "them," "other," etc. And, yeah ... adding gets pretty tricky. I've never seen an article on updating an Aho-Corasick trie in-place, which leads me to believe that this is a pretty hard problem.
– Jim Mischel
Nov 14 '18 at 12:42
@JimMischel Yes, I expect it to take a bit more memory, my main concern is correctness and time complexity. The question is whether there is a sequence of operations that would leave the trie in a bad state. In that particular case, you could remove the "the" output from the "e" node, but leave the node intact as a non-output node, as long as you're not compressing nodes. My intuition says that would work fine, but I'm not certain; there could be cases where a node would need to move, which would mean other seemingly-unrelated fail states could be wrong.
– Robert Fraser
Nov 14 '18 at 17:02
Visiting every node with a symbol "A" to update its fail state wouldn't be so difficult, I guess. Finding every such node, though, will require a linear scan of the tree, unless you keep an auxiliary data structure that maps every node by symbol. I think to do this update efficiently will require O(n) extra space, where n is the number of nodes in your trie.
– Jim Mischel
Nov 14 '18 at 12:35
Visiting every node with a symbol "A" to update its fail state wouldn't be so difficult, I guess. Finding every such node, though, will require a linear scan of the tree, unless you keep an auxiliary data structure that maps every node by symbol. I think to do this update efficiently will require O(n) extra space, where n is the number of nodes in your trie.
– Jim Mischel
Nov 14 '18 at 12:35
I think this is going to be more than just updating the fail states. Consider removing the string "the" from the list of matched terms. But you still have to match "their," "they," "them," "other," etc. And, yeah ... adding gets pretty tricky. I've never seen an article on updating an Aho-Corasick trie in-place, which leads me to believe that this is a pretty hard problem.
– Jim Mischel
Nov 14 '18 at 12:42
I think this is going to be more than just updating the fail states. Consider removing the string "the" from the list of matched terms. But you still have to match "their," "they," "them," "other," etc. And, yeah ... adding gets pretty tricky. I've never seen an article on updating an Aho-Corasick trie in-place, which leads me to believe that this is a pretty hard problem.
– Jim Mischel
Nov 14 '18 at 12:42
@JimMischel Yes, I expect it to take a bit more memory, my main concern is correctness and time complexity. The question is whether there is a sequence of operations that would leave the trie in a bad state. In that particular case, you could remove the "the" output from the "e" node, but leave the node intact as a non-output node, as long as you're not compressing nodes. My intuition says that would work fine, but I'm not certain; there could be cases where a node would need to move, which would mean other seemingly-unrelated fail states could be wrong.
– Robert Fraser
Nov 14 '18 at 17:02
@JimMischel Yes, I expect it to take a bit more memory, my main concern is correctness and time complexity. The question is whether there is a sequence of operations that would leave the trie in a bad state. In that particular case, you could remove the "the" output from the "e" node, but leave the node intact as a non-output node, as long as you're not compressing nodes. My intuition says that would work fine, but I'm not certain; there could be cases where a node would need to move, which would mean other seemingly-unrelated fail states could be wrong.
– Robert Fraser
Nov 14 '18 at 17:02
add a comment |
1 Answer
1
active
oldest
votes
I went ahead and implemented it* and it seems to be working. The outlined algorithm would also be slow on tries that have many of the same symbol (for example, genetic sequences or antivirus databases) and/or are deep, because it requires visiting every node that uses the symbol. It would also require some tweaking for tries that use compression (which seems like most of them). But it's practical for matching n-grams in text documents.
* Incredibly messy and mostly uncommented code, but if you're curious, here's the header, the body, and some tests.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53288664%2fupdating-an-aho-corasick-trie-in-the-face-of-inserts-and-deletes%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
I went ahead and implemented it* and it seems to be working. The outlined algorithm would also be slow on tries that have many of the same symbol (for example, genetic sequences or antivirus databases) and/or are deep, because it requires visiting every node that uses the symbol. It would also require some tweaking for tries that use compression (which seems like most of them). But it's practical for matching n-grams in text documents.
* Incredibly messy and mostly uncommented code, but if you're curious, here's the header, the body, and some tests.
add a comment |
I went ahead and implemented it* and it seems to be working. The outlined algorithm would also be slow on tries that have many of the same symbol (for example, genetic sequences or antivirus databases) and/or are deep, because it requires visiting every node that uses the symbol. It would also require some tweaking for tries that use compression (which seems like most of them). But it's practical for matching n-grams in text documents.
* Incredibly messy and mostly uncommented code, but if you're curious, here's the header, the body, and some tests.
add a comment |
I went ahead and implemented it* and it seems to be working. The outlined algorithm would also be slow on tries that have many of the same symbol (for example, genetic sequences or antivirus databases) and/or are deep, because it requires visiting every node that uses the symbol. It would also require some tweaking for tries that use compression (which seems like most of them). But it's practical for matching n-grams in text documents.
* Incredibly messy and mostly uncommented code, but if you're curious, here's the header, the body, and some tests.
I went ahead and implemented it* and it seems to be working. The outlined algorithm would also be slow on tries that have many of the same symbol (for example, genetic sequences or antivirus databases) and/or are deep, because it requires visiting every node that uses the symbol. It would also require some tweaking for tries that use compression (which seems like most of them). But it's practical for matching n-grams in text documents.
* Incredibly messy and mostly uncommented code, but if you're curious, here's the header, the body, and some tests.
answered Nov 27 '18 at 6:51
Robert FraserRobert Fraser
6,89764778
6,89764778
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53288664%2fupdating-an-aho-corasick-trie-in-the-face-of-inserts-and-deletes%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Visiting every node with a symbol "A" to update its fail state wouldn't be so difficult, I guess. Finding every such node, though, will require a linear scan of the tree, unless you keep an auxiliary data structure that maps every node by symbol. I think to do this update efficiently will require O(n) extra space, where n is the number of nodes in your trie.
– Jim Mischel
Nov 14 '18 at 12:35
I think this is going to be more than just updating the fail states. Consider removing the string "the" from the list of matched terms. But you still have to match "their," "they," "them," "other," etc. And, yeah ... adding gets pretty tricky. I've never seen an article on updating an Aho-Corasick trie in-place, which leads me to believe that this is a pretty hard problem.
– Jim Mischel
Nov 14 '18 at 12:42
@JimMischel Yes, I expect it to take a bit more memory, my main concern is correctness and time complexity. The question is whether there is a sequence of operations that would leave the trie in a bad state. In that particular case, you could remove the "the" output from the "e" node, but leave the node intact as a non-output node, as long as you're not compressing nodes. My intuition says that would work fine, but I'm not certain; there could be cases where a node would need to move, which would mean other seemingly-unrelated fail states could be wrong.
– Robert Fraser
Nov 14 '18 at 17:02