Regex doesn't match all foreign characters
Here's my regex: ^([\p{L}-|a-zA-Z0-9-_]+)$
It is supposed to allow letters from any language as well as ASCII letters and digits. But for some reason, Hindi characters do not match.
I wrote an xUnit test to demonstrate this.
[Fact]
public void test()
{
    var hindiChar = "इम्तहान";
    var input = "12345ABCDPrüfungテスト中文테스트إسرائيل" + hindiChar;
    // Verbatim string (@"...") so that \p{L} reaches the regex engine unchanged.
    var regex = @"^([\p{L}-|a-zA-Z0-9-_]+)$";
    Assert.True(new Regex(regex).IsMatch(input));
}
If you remove hindiChar from the input, the test passes; if you add it, the test fails.
I thought part of the regex was meant to match all foreign characters, so I'm not sure why it doesn't match Hindi characters.
c# regex xunit
It is a known fact that \p{L} only matches letters from the BMP plane. Do you want to match diacritics too? Add \p{M}. Use @"^[\p{M}\p{L}-|a-zA-Z0-9-_]+$". What is the | there for? Note the | inside a character class matches a literal | char. It seems to me you want to use @"^[\p{L}\p{M}0-9_-]+$"
– Wiktor Stribiżew
Nov 15 '18 at 18:47
@WiktorStribiżew thanks, it worked. The | is meant as "or": the regex allows foreign characters or digits.
– WW pana
Nov 15 '18 at 18:51
Ok, [|] is not an or operator, and here | must be removed.
– Wiktor Stribiżew
Nov 15 '18 at 18:51
asked Nov 15 '18 at 18:45
WW pana
1 Answer
It is not enough to use \p{L} to match words; you also need to match diacritics. That can be done by adding \p{M} to your regex. Note that even the \w shorthand "word" character class in .NET regex by default also matches a set of diacritics, \p{Mn} (Mark, Nonspacing Unicode char category); see the .NET regex reference. However, here you need \p{M} to allow any diacritics: the Hindi word in the question contains the vowel sign ा (U+093E), a spacing combining mark (\p{Mc}) that \p{Mn} does not cover.
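To see why \p{L} alone falls short, it can help to print the Unicode general category of each character in the Hindi word. The following is a minimal sketch, assuming C# 9 or later (top-level statements and `or` patterns); the variable names are illustrative and not from the question.

```csharp
using System;
using System.Globalization;
using System.Linq;

// The Hindi word from the question.
var hindi = "इम्तहान";

// Print each UTF-16 code unit with its Unicode general category.
foreach (var ch in hindi)
{
    Console.WriteLine($"U+{(int)ch:X4} {CharUnicodeInfo.GetUnicodeCategory(ch)}");
}

// Count the characters that are marks (\p{M}) rather than letters (\p{L}).
var markCount = hindi.Count(ch => CharUnicodeInfo.GetUnicodeCategory(ch) is
    UnicodeCategory.NonSpacingMark or
    UnicodeCategory.SpacingCombiningMark or
    UnicodeCategory.EnclosingMark);

Console.WriteLine($"marks: {markCount}");
```

Any character reported as NonSpacingMark or SpacingCombiningMark is outside \p{L}, so a class built only from \p{L} rejects the whole string.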
Note that | inside a character class matches a literal | char, so you need to remove the | from your pattern.
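The difference between | inside and outside a character class can be checked directly. This is a small sketch with made-up sample patterns and input ("a|b"), assuming C# 9 or later top-level statements; only the last pattern comes from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

// Inside [...], '|' is an ordinary character, so the string "a|b" is accepted.
var classWithPipe = new Regex(@"^[a|b]+$");
bool pipeIsLiteral = classWithPipe.IsMatch("a|b");

// Outside a character class, '|' means alternation, so a literal '|' is rejected.
var alternation = new Regex(@"^(?:a|b)+$");
bool alternationRejectsPipe = !alternation.IsMatch("a|b");

// The suggested pattern has no '|' in the class, so it rejects the pipe too.
var suggested = new Regex(@"^[\p{L}\p{M}0-9_-]+$");
bool suggestedRejectsPipe = !suggested.IsMatch("a|b");

Console.WriteLine($"{pipeIsLiteral} {alternationRejectsPipe} {suggestedRejectsPipe}");
```

A character class is already an "any one of these" construct, which is why no explicit or operator is needed between its members.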
It seems to me you want to use
@"^[\p{L}\p{M}0-9_-]+$"
It will match any string of one or more letters, diacritics, ASCII digits, _ or - chars.
See the regex demo.
Note that in case you want to allow any Unicode digit chars, you may even use
@"^[\w\p{M}-]+$"
See another demo.
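Both suggested patterns can be tried against the input from the question. This is a hedged sketch written as a plain console program (C# 9 or later top-level statements) rather than an xUnit test, so it runs without extra packages; the letters-only pattern and the Arabic-Indic digit sample are added here for comparison and are not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

// Input from the question, including the Hindi word.
var input = "12345ABCDPrüfungテスト中文테스트إسرائيل" + "इम्तहान";

// Letters, ASCII digits, '_' and '-' only: no combining marks allowed.
var lettersOnly = new Regex(@"^[\p{L}0-9_-]+$");
// First suggested pattern: letters, any marks, ASCII digits, '_' and '-'.
var withMarks = new Regex(@"^[\p{L}\p{M}0-9_-]+$");
// Second suggested pattern: \w (which includes Unicode digits) plus any marks and '-'.
var wordAndMarks = new Regex(@"^[\w\p{M}-]+$");

bool failsWithoutMarks = !lettersOnly.IsMatch(input);
bool passesWithMarks = withMarks.IsMatch(input);
bool passesWithWord = wordAndMarks.IsMatch(input);

// Arabic-Indic digits 1, 2, 3 (U+0661..U+0663) are decimal digits outside 0-9.
var arabicIndicDigits = "\u0661\u0662\u0663";
bool asciiDigitsOnly = !withMarks.IsMatch(arabicIndicDigits);
bool unicodeDigitsAllowed = wordAndMarks.IsMatch(arabicIndicDigits);

Console.WriteLine($"{failsWithoutMarks} {passesWithMarks} {passesWithWord}");
Console.WriteLine($"{asciiDigitsOnly} {unicodeDigitsAllowed}");
```

In an xUnit project the same checks could be expressed with Assert.True inside a [Fact] method, as in the question.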
edited Nov 15 '18 at 18:59
answered Nov 15 '18 at 18:54
Wiktor Stribiżew