Why does performance differ when I change the order of compile and findall in Python?
I noticed that precompiling a pattern speeds up the match operation, as in the following example.
python3 -m timeit -s "import re; t = re.compile(r'[w+][d]+')" "t.findall('abc eft123&aaa123')"
1000000 loops, best of 3: 1.42 usec per loop
python3 -m timeit -s "import re;" "re.findall(r'[w+][d]+', 'abc eft123&aaa123')"
100000 loops, best of 3: 2.45 usec per loop
But if I swap the order and pass the compiled pattern to the re module function instead, the result is different: it is much slower. Why does this happen?
python3 -m timeit -s "import re; t = re.compile(r'[w+][d]+')" "re.findall(t, 'abc eft123&aaa123')"
100000 loops, best of 3: 3.66 usec per loop
python regex compilation findall
edited Nov 12 at 8:36
asked Nov 12 at 8:00
colin-zhou
3 Answers
By "changing the order" you are actually using findall in its "static" form, pretty much the equivalent of calling str.lower('ABC') instead of 'ABC'.lower().
Depending on the exact implementation of the Python interpreter you are using, this probably causes some overhead (for method lookups, for example).
In other words, this is more related to the way Python works than to regex or the re module in particular.
from timeit import Timer

def a():
    str.lower('ABC')

def b():
    'ABC'.lower()

print(min(Timer(a).repeat(5000, 5000)))
print(min(Timer(b).repeat(5000, 5000)))
Outputs
0.001060427000000086 # str.lower('ABC')
0.0008686820000001205 # 'ABC'.lower()
edited Nov 12 at 8:13
answered Nov 12 at 8:07
DeepSpace
Thanks for your reply.
– colin-zhou
Nov 12 at 8:12
I ran a test as you described, and the result suggests that this may not be the root cause:
python3 -m timeit -s "'AAAAAAAAAAAAABBBBBBBBBBBBBBBA'.lower()"
100000000 loops, best of 3: 0.00995 usec per loop
python3 -m timeit -s "str.lower('AAAAAAAAAAAAABBBBBBBBBBBBBBBA')"
100000000 loops, best of 3: 0.00979 usec per loop
– colin-zhou
Nov 12 at 8:18
Let's say that word1, word2, ... are regexes.
Let's rewrite those parts:
    allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]
I would create a single regex for all patterns:
    allWords = re.compile("|".join(["word1", "word2", "word3"]))
To support regexes that contain |, you would have to parenthesize each expression (with non-capturing groups, so that split does not return the matched text as well):
    allWords = re.compile("|".join("(?:{})".format(x) for x in ["word1", "word2", "word3"]))
(This also works with plain words, of course, and it's still worth using regexes because of the | part.)
Now, this is a disguised loop with each term hardcoded:
    def bar(data, allWords):
        if allWords[0].search(data) is not None:
            temp = data.split("word1", 1)[1]  # that works only on non-regexes BTW
            return temp
        elif allWords[1].search(data) is not None:
            temp = data.split("word2", 1)[1]
            return temp
It can be rewritten simply as:
    def bar(data, allWords):
        return allWords.split(data, maxsplit=1)[1]
In terms of performance:
- The regular expression is compiled once at the start, so it's as fast as it can be.
- There's no loop or pasted expressions; the "or" part is done by the regex engine, which is usually compiled code, and that is hard to beat in pure Python.
- The match and the split are done in one operation.
The last hiccup is that internally the regex engine tries the alternatives one after another, which makes matching O(n) in the number of patterns. To make it faster, you would have to predict which pattern is the most frequent and put it first. (My assumption is that the regexes are "disjoint", meaning that a text cannot be matched by several of them; otherwise the longer one would have to come before the shorter one.)
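As a rough, self-contained sketch of the single-regex approach described in this answer: the pattern list and the sample strings are placeholders, and returning None when nothing matches is an assumption chosen to mirror the if/elif version, which falls through without returning.

```python
import re

# Hypothetical patterns, used only for illustration.
patterns = ["word1", "word2", "word3"]

# Non-capturing groups keep each alternative intact without making
# split() include the matched delimiter in its result.
all_words = re.compile("|".join("(?:{})".format(p) for p in patterns))

def bar(data, all_words):
    """Return the text after the first match, or None if nothing matches."""
    parts = all_words.split(data, maxsplit=1)
    return parts[1] if len(parts) > 1 else None

print(bar("foo word2 bar word1 baz", all_words))  # ' bar word1 baz'
print(bar("nothing to split here", all_words))    # None
```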
edited Nov 12 at 9:10
Adrian W
answered Nov 12 at 8:09
prasanth ashok
I took some time to investigate the implementation of re.findall and re.match, and I copied the standard library source code here.
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p
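This shows that if we call re.findall(compiled_pattern, string) directly, it triggers an additional call to _compile(pattern, flags), which performs some checks and looks the pattern up in the cache dictionary. In the code above, a compiled pattern is returned by the isinstance branch before it is ever stored in _cache, so that lookup raises and catches KeyError on every call, which probably explains why this form is slower even than passing the pattern string. If we call compiled_pattern.findall(string) instead, that additional work doesn't happen. So compiled_pattern.findall(string) is faster than re.findall(compiled_pattern, string).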
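To reproduce the comparison from within Python, here is a minimal sketch that times the three call forms with timeit. The pattern, the helper function names, and the loop counts are illustrative assumptions rather than the values from the question, and the absolute timings will vary with the interpreter version and machine.

```python
import re
import timeit

PATTERN = r"[a-z]+\d+"        # illustrative pattern, not the one from the question
TEXT = "abc eft123&aaa123"
compiled = re.compile(PATTERN)

def via_method():
    # Calls the method on the compiled pattern directly.
    return compiled.findall(TEXT)

def via_module_str():
    # Goes through re._compile, which finds the string pattern in its cache.
    return re.findall(PATTERN, TEXT)

def via_module_compiled():
    # Goes through re._compile with an already compiled pattern.
    return re.findall(compiled, TEXT)

forms = (via_method, via_module_str, via_module_compiled)

# All three forms return the same matches; only the call overhead differs.
results = {fn.__name__: fn() for fn in forms}

# Best of 3 runs of 10000 calls each, in seconds.
timings = {fn.__name__: min(timeit.repeat(fn, number=10000, repeat=3))
           for fn in forms}

for name, seconds in timings.items():
    print("{:<22} {:.6f} s".format(name, seconds))
```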
answered Nov 12 at 18:26
colin-zhou