Regex doesn't match all foreign characters



























Here's my regex: ^([\p{L}-|a-zA-Z0-9-_]+)$. It is supposed to allow letters from any language as well as ASCII letters, digits, hyphens and underscores. But for some reason, Hindi characters do not match.



I wrote an xUnit test to demonstrate this.



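[Fact]
public void test()
{
    var hindiChar = "इम्तहान";
    var input = "12345ABCDPrüfungテスト中文테스트إسرائيل" + hindiChar;
    // Verbatim string (@"...") so that \p reaches the regex engine unchanged
    var regex = @"^([\p{L}-|a-zA-Z0-9-_]+)$";
    // Fails as long as hindiChar is part of the input
    Assert.True(new Regex(regex).IsMatch(input));
}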


If you remove hindiChar from the input, the test passes; with hindiChar appended, the test fails.



I thought the \p{L} part of the regex covered letters from all languages, so I'm not sure why it doesn't match Hindi characters.




























  • It is a known fact that \p{L} only matches letters from the BMP plane. Do you want to match diacritics too? Add \p{M}. Use @"^[\p{M}\p{L}-|a-zA-Z0-9-_]+$". What is the | there for? Note that | inside a character class matches a literal | char. It seems to me you want to use @"^[\p{L}\p{M}0-9_-]+$"

    – Wiktor Stribiżew
    Nov 15 '18 at 18:47













  • @WiktorStribiżew Thanks, it worked. The | is meant as "or": the regex should allow foreign characters or ASCII letters and digits.

    – WW pana
    Nov 15 '18 at 18:51






  • OK, but [|] is not an OR operator: inside a character class, | is a literal character, so here it must be removed.

    – Wiktor Stribiżew
    Nov 15 '18 at 18:51
















c# regex xunit
















asked Nov 15 '18 at 18:45 by WW pana




















1 Answer














It is not enough to use \p{L} to match words; you also need to match diacritics. The Hindi word in your test contains two combining marks, the virama (U+094D, category Mn) and the vowel sign aa (U+093E, category Mc), and \p{L} matches neither of them. You can allow them by adding \p{M} to your regex. Note that even the \w shorthand "word" character class in .NET regex by default also matches a set of diacritics, \p{Mn} (Mark, Nonspacing Unicode char category); see the .NET regex reference. However, here you need \p{M} to allow any diacritics.
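To check which characters fall outside \p{L}, one option is to list the Unicode category of every UTF-16 code unit in the input. This is only a minimal sketch: the class and method names (CategoryDump, Describe) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CategoryDump
{
    // Returns one "U+XXXX CategoryName" entry per UTF-16 code unit of the input.
    // Hypothetical helper, for illustration only.
    public static List<string> Describe(string text)
    {
        var result = new List<string>();
        foreach (char c in text)
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            result.Add($"U+{(int)c:X4} {category}");
        }
        return result;
    }
}
```

Calling CategoryDump.Describe("इम्तहान") should report OtherLetter for the consonants and vowels, plus one NonSpacingMark and one SpacingCombiningMark entry. The last of these is the reason \w alone would not be enough here.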



Note that | inside a character class matches a literal | char, so you need to remove the | from your pattern.
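The difference between | inside and outside a character class can be shown with two tiny patterns. This is a sketch with hypothetical names (PipeInClass, ClassMatches, AlternationMatches); the patterns are simplified stand-ins, not the ones from the question.

```csharp
using System.Text.RegularExpressions;

public static class PipeInClass
{
    // Inside [...] the | is an ordinary character: this class means 'a', '|' or 'b'.
    public static bool ClassMatches(string input) =>
        Regex.IsMatch(input, @"^[a|b]$");

    // Outside a character class, | separates alternatives: 'a' or 'b' only.
    public static bool AlternationMatches(string input) =>
        Regex.IsMatch(input, @"^(?:a|b)$");
}
```

With these, ClassMatches("|") is true while AlternationMatches("|") is false, so leaving the | in the character class silently allows pipe characters in the input.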



It seems to me you want to use

@"^[\p{L}\p{M}0-9_-]+$"


It will match any string of one or more letters, diacritics, ASCII digits, _ or - chars.
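As a rough way to compare the two patterns side by side, the sketch below wraps both in a small helper. The class and member names (NameValidator, IsValid, MatchesOriginal) are assumptions made for this example; only the two pattern strings come from the thread.

```csharp
using System.Text.RegularExpressions;

public static class NameValidator
{
    // Suggested pattern: letters, combining marks, ASCII digits, '_' and '-'.
    private static readonly Regex Allowed =
        new Regex(@"^[\p{L}\p{M}0-9_-]+$");

    // Pattern from the question, kept only for comparison.
    private static readonly Regex Original =
        new Regex(@"^([\p{L}-|a-zA-Z0-9-_]+)$");

    public static bool IsValid(string input) => Allowed.IsMatch(input);

    public static bool MatchesOriginal(string input) => Original.IsMatch(input);
}
```

For the Hindi word from the question, IsValid should return true and MatchesOriginal false; for an input such as "a|b" the results are the other way round, because the original character class contains a literal |.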



See the regex demo.



Note that if you want to allow any Unicode digit chars, you may even use

@"^[\w\p{M}-]+$"


See another demo
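The practical difference between the two suggested patterns shows up with non-ASCII digits. A small sketch, assuming default regex options (no RegexOptions.ECMAScript, which would narrow \w to ASCII); the names DigitComparison, MatchesAsciiDigitPattern and MatchesWordPattern are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class DigitComparison
{
    // Only ASCII 0-9 count as digits here.
    private static readonly Regex AsciiDigits =
        new Regex(@"^[\p{L}\p{M}0-9_-]+$");

    // \w covers letters, nonspacing marks, decimal digits of any script
    // and connector punctuation such as '_'.
    private static readonly Regex WordChars =
        new Regex(@"^[\w\p{M}-]+$");

    public static bool MatchesAsciiDigitPattern(string input) => AsciiDigits.IsMatch(input);

    public static bool MatchesWordPattern(string input) => WordChars.IsMatch(input);
}
```

Devanagari digits such as "१२३" should pass MatchesWordPattern but not MatchesAsciiDigitPattern. Keep in mind that \w also admits connector punctuation other than the underscore, which may or may not be wanted.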






answered Nov 15 '18 at 18:54 by Wiktor Stribiżew, edited Nov 15 '18 at 18:59































