Why does the performance differ if I change the order of compile and findall in Python?

I noticed that precompiling a pattern speeds up the match operation, as in the following example.

python3 -m timeit -s "import re; t = re.compile(r'[w+][d]+')" "t.findall('abc eft123&aaa123')"

1000000 loops, best of 3: 1.42 usec per loop

python3 -m timeit -s "import re;" "re.findall(r'[w+][d]+', 'abc eft123&aaa123')"

100000 loops, best of 3: 2.45 usec per loop

But if I change the order and pass the compiled pattern to the re module function instead, the result is different: it is much slower. Why does this happen?

python3 -m timeit -s "import re; t = re.compile(r'[w+][d]+')" "re.findall(t, 'abc eft123&aaa123')"

100000 loops, best of 3: 3.66 usec per loop

edited Nov 12 at 8:36

asked Nov 12 at 8:00

colin-zhou

1115

python regex compilation findall


3 Answers

By "changing the order" you are actually using findall in its "static" form, pretty much the equivalent of calling str.lower('ABC') instead of 'ABC'.lower().

Depending on the exact implementation of the Python interpreter you are using, this is probably causing some overhead (for method lookups for example).

In other words, this is more related to the way Python works and not specifically to regex or the re module in particular.

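from timeit import Timer

def a():
    # "static" form: look the method up on the type, pass the instance explicitly
    str.lower('ABC')

def b():
    # bound form: call the method on the instance
    'ABC'.lower()

# best of 5000 repetitions, each running the function 5000 times
print(min(Timer(a).repeat(5000, 5000)))
print(min(Timer(b).repeat(5000, 5000)))

Output:

0.001060427000000086    # str.lower('ABC')
0.0008686820000001205   # 'ABC'.lower()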

edited Nov 12 at 8:13

answered Nov 12 at 8:07

DeepSpace

35.7k44067

Thanks for your reply.
– colin-zhou
Nov 12 at 8:12

I ran the test you described, and the result suggests this may not be the root cause:
python3 -m timeit -s "'AAAAAAAAAAAAABBBBBBBBBBBBBBBA'.lower()"
100000000 loops, best of 3: 0.00995 usec per loop
python3 -m timeit -s "str.lower('AAAAAAAAAAAAABBBBBBBBBBBBBBBA')"
100000000 loops, best of 3: 0.00979 usec per loop
– colin-zhou
Nov 12 at 8:18


Let's say that word1, word2 ... are regexes:

Let's rewrite those parts:

allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]

I would create one single regex for all patterns:

allWords = re.compile("|".join(["word1", "word2", "word3"]))

To support regexes with | in them, you would have to parenthesize the expressions (using non-capturing groups, because re.split also returns the text of capturing groups):

allWords = re.compile("|".join("(?:{})".format(x) for x in ["word1", "word2", "word3"]))

(that also works with standard words of course, and it's still worth using regexes because of the | part)

Now, this is a disguised loop with each term hardcoded:

def bar(data, allWords):
    if allWords[0].search(data) is not None:
        temp = data.split("word1", 1)[1]  # that works only on non-regexes BTW
        return temp

    elif allWords[1].search(data) is not None:
        temp = data.split("word2", 1)[1]
        return temp

It can be rewritten simply as:

def bar(data, allWords):
    return allWords.split(data, maxsplit=1)[1]

In terms of performance:

The regular expression is compiled once at the start, so it's as fast as it can be.
There is no loop and there are no pasted expressions; the "or" part is done by the regex engine, which is usually compiled code. You can't beat that in pure Python.
The match and the split are done in one operation.
The last hiccup is that internally the regex engine tries the alternatives in a loop, which makes it an O(n) algorithm in the number of patterns. To make it faster, you would have to predict which pattern is the most frequent and put it first (my hypothesis is that the regexes are "disjoint", meaning a text cannot be matched by several of them; otherwise the longer one would have to come before the shorter one).
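As a rough sketch of the single-regex idea, the snippet below joins a few literal words into one alternation and splits on the first match. The word list and the helper name text_after_first_match are hypothetical and only serve as an illustration; the None fallback for the no-match case is also an assumption, not part of the answer above.

```python
import re

# Hypothetical patterns, used only for illustration.
PATTERNS = ["word1", "word2", "word3"]

# Non-capturing groups keep each alternative intact without making
# re.split return the matched text as an extra list element.
ALL_WORDS = re.compile("|".join("(?:{})".format(p) for p in PATTERNS))


def text_after_first_match(data):
    """Return the text following the first match, or None if nothing matches."""
    parts = ALL_WORDS.split(data, maxsplit=1)
    # Without a match, split returns the whole string as a single element.
    return parts[1] if len(parts) > 1 else None


print(text_after_first_match("foo word2 bar word1 baz"))
print(text_after_first_match("nothing here"))
```

The leftmost match wins, whichever alternative it belongs to, so the order of the words only matters when several alternatives could match at the same position.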

edited Nov 12 at 9:10

Adrian W

1,75831320

answered Nov 12 at 8:09

prasanth ashok


I took some time to investigate the implementation of re.findall and re.match, and I copied the standard library source code here.

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)


def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)


def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p

This shows that calling re.findall(compiled_pattern, string) triggers an additional call to _compile(pattern, flags), which performs some checks and looks the pattern up in the cache dictionary. Calling compiled_pattern.findall(string) directly skips that extra work, so compiled_pattern.findall(string) is faster than re.findall(compiled_pattern, string).
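To check this on your own machine, a minimal sketch along these lines times the three call forms from the question side by side. The names PATTERN, TEXT, candidates and RUNS are my own choices, the lambdas add the same small overhead to every variant, and the absolute numbers will vary with the Python version and hardware.

```python
import re
import timeit

PATTERN = r"[w+][d]+"   # pattern text exactly as it appears in the question
TEXT = "abc eft123&aaa123"
compiled = re.compile(PATTERN)

# Passing an already compiled pattern back through the module-level API
# returns the very same object, so no recompilation takes place.
same_object = re.compile(compiled) is compiled

candidates = {
    "compiled.findall(TEXT)": lambda: compiled.findall(TEXT),
    "re.findall(PATTERN, TEXT)": lambda: re.findall(PATTERN, TEXT),
    "re.findall(compiled, TEXT)": lambda: re.findall(compiled, TEXT),
}

# All three spellings should agree on the result; only the call path differs.
results = {name: call() for name, call in candidates.items()}

# RUNS is an arbitrary choice for this sketch.
RUNS = 50_000
timings = {name: timeit.timeit(call, number=RUNS) for name, call in candidates.items()}

for name, seconds in timings.items():
    print(f"{name:28s} {seconds / RUNS * 1e6:.3f} usec per call")
```

Any gap between the first and the third line is the cost of going through the module-level wrapper, since both end up calling findall on the same compiled object.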

answered Nov 12 at 18:26

colin-zhou

1115
