Secure unicode regex in PHP for text field submissions
I am currently using PHP form-processing code that accepts text field submissions. My code is this:
function checkInput($f) {
$f = strtr($f, array('Š' => 'S','Ž' => 'Z','š' => 's','ž' => 'z','Ÿ' => 'Y','À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A','Ç' => 'C','È' => 'E'));
$f = strtr($f, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));
$f = preg_replace(array('#( ){2,}#', '#(\.){4,}#', '/[^\w-_.:, ]+/'), array(' ', '...', '_'), $f);
return $f;
}
This code checks for accented characters and replaces them with the 'regular' unaccented characters. The preg_replace line then checks:
1. if there are 2 or more consecutive spaces; if so, replace them with 1 space;
2. if there are 4 or more consecutive dots; if so, replace them with 3 dots;
3. if there are any non-matching characters; if so, replace them with an underscore (_).
I want to support Unicode characters from other languages, for example Cyrillic. Is it enough to just add a u in the preg_replace line? Example:
$f = preg_replace(array('#( ){2,}#', '#(\.){4,}#', '/[^\w-_.:, ]+/u'), array(' ', '...', '_'), $f);
I am not sure if that is the way to go in terms of security. Please advise.
EDIT:
This regex seems to work: it restricts the allowed chars to those specified in the regex, but it does not allow non-Latin chars.
/^[a-z0-9.,:!?-_ ]+/iu
I want to allow chars: a through z (case insensitive), 0 through 9, . , : ! ? - _ white space and non-Latin chars.
EDIT2:
OK, now this seems to be working correctly in code:
$rgx = '/[^a-z0-9-_.:,!?\w ]+/iu';
$f = preg_replace($rgx, "", $f);
$f = preg_replace(array('#( ){2,}#', '#(\.){4,}#'), array(' ', '...'), $f);
return $f;
It allows the chars a - z, digits, - _ . : , ! ? and non-Latin chars, and it removes any restricted characters such as quotes " ' and semicolons ; to prevent SQL injections.
php regex
asked Nov 16 '18 at 10:46 by Nemdr (edited Nov 16 '18 at 12:54)
u will make \w Unicode aware and /[^\w]/u will match any char that is not a Unicode letter, digit or _ (and some other chars). You probably want to replace '/[^\w-_.:, ]+/u' with '/[^a-zA-Z0-9-_.:,\s]+/'
– Wiktor Stribiżew
Nov 16 '18 at 11:04
Thanks, I have tested it: it allows Cyrillic characters, but not Latin chars. I tested it here: rubular.com/r/yp6dhfehrl It is a Ruby site, but PHP should work the same, I think.
– Nemdr
Nov 16 '18 at 11:17
In your code, you are replacing with regex, and at rubular you are matching. PCRE is used in PHP regexps, not Onigmo (used in Rubular). Use regex101.com to test PHP regexps, it is the most user-friendly - IMHO - regex testing Web site for PCRE, JS, Python re and Go regexps.
– Wiktor Stribiżew
Nov 16 '18 at 11:20
Thanks, I have tested this regex on regex101, and this seems to be working: ^[a-zA-Z0-9-_.:,!?\w ]+/u
– Nemdr
Nov 16 '18 at 11:29
I want to allow the chars a-zA-Z0-9.,:!?_ - and white space, both Latin and non-Latin. Any char in the string that is not allowed (not matched) must be replaced with "" (removed from the string). This must be done across the entire string. I also want to replace 2 or more consecutive white spaces with 1 space, etc., like in the preg_replace example above. Example: "te-st in!p.ut" will not be changed. But: "test i@n';put" must be changed to: "test input". And: "test [a lot of space] input" changed to: "test input".
– Nemdr
Nov 16 '18 at 12:11
1 Answer
Allow me to clean up your Edit #2 patterns and suggest a single-call implementation.
Code: (Demo)
function sanitizer($string) {
return preg_replace(['~[^\p{L}\p{N}_.:,!? -]+~u', '~ \K +|\.{3}\K\.+~'], '', $string);
}
$strings = [
"1: Доброе утро - Dobraye ootro & Good morning",
"2: Добрый день => Dobriy den'....... (Good afternoon)"
];
foreach ($strings as $string) {
echo sanitizer($string);
echo "\n---\n";
}
Output:
1: Доброе утро - Dobraye ootro Good morning
---
2: Добрый день Dobriy den... Good afternoon
---
I could have written a single piped pattern for preg_replace(), but I wanted to make two passes over the string: first to remove any invalid characters, then to trim excessively long character sequences that may or may not have been formed by the first pass.
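To make the two passes easier to follow, here is a minimal sketch that runs them as separate calls. The function name sanitizeInTwoSteps() is made up for illustration, the patterns are the ones intended in sanitizer() above, and returning an empty string when preg_replace() fails is an assumption, not part of the original code.

```php
<?php
// Hypothetical helper: the same two passes as sanitizer(), written out step by step.
function sanitizeInTwoSteps(string $string): string
{
    // Pass 1: remove every run of characters that is not a Unicode letter,
    // a Unicode number, or one of: _ . : , ! ? space -
    $clean = preg_replace('~[^\p{L}\p{N}_.:,!? -]+~u', '', $string);

    // preg_replace() returns null on failure (for example, invalid UTF-8
    // with the u flag). Assumption: treat that as empty input.
    if ($clean === null) {
        return '';
    }

    // Pass 2: keep one space out of a run of spaces, and three dots out of
    // a run of four or more dots. Pass 1 may have created such runs.
    return preg_replace('~ \K +|\.{3}\K\.+~', '', $clean) ?? '';
}

echo sanitizeInTwoSteps("test i@n';put"), "\n"; // test input
echo sanitizeInTwoSteps("a &  b......"), "\n";  // a b...
```

Removing the & in the second call leaves three spaces in a row, which is exactly the kind of sequence the second pass is there to clean up.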
Noteworthy pattern changes:
- [a-zA-Z0-9_] is more simply written as \w; however, because you are using the u flag AND to prepare for PHP 7.3's strict adherence to PCRE2, it is better to write out: \p{L}\p{N}_
- Avoid writing unnecessary slashes before characters without special meaning; it only makes your pattern longer and harder to interpret. Characters that normally have special meaning (like *, ?, +, etc.) lose their special meaning inside a character class [ ... ].
- Move your hyphen to the front or back of your negated character class to avoid the possibility of writing a range of characters. (Because your - came after a range of characters 0-9, this was a non-issue, but it is good advice to remember as a matter of best practice.)
- \K means "forget the previously matched substring"; in other words, "start matching from here". This enables you to avoid a capture group and just truncate the unwanted characters by replacing the match with an empty string.
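As a small illustration of the last two points, the sketch below compares a capture-group replacement with the \K form, and shows how a hyphen in the middle of a character class can silently become a range. The sample strings and variable names are only for demonstration.

```php
<?php
// Two ways to collapse a run of spaces into a single space.

// Capture group: match the whole run and put one space back.
$withGroup = preg_replace('~( ) +~', '$1', "test      input");

// \K: leave the first space out of the match, replace the rest with nothing.
$withK = preg_replace('~ \K +~', '', "test      input");

var_dump($withGroup === $withK); // bool(true)
echo $withK, "\n";               // test input

// Hyphen placement: in [+-.] the hyphen creates the range + to . which
// also covers the comma; in [+.-] the trailing hyphen is a literal.
var_dump(preg_match('~[+-.]~', ',')); // int(1)
var_dump(preg_match('~[+.-]~', ',')); // int(0)
```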
P.S. You should still run your strtr() calls as in your original post.
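Putting the two together might look roughly like this. It is only a sketch: the transliteration map is shortened for illustration (the full arrays from the question would go in its place), the file is assumed to be saved as UTF-8, and falling back to an empty string when preg_replace() fails is an assumption.

```php
<?php
// Sketch: transliterate first, then sanitize.
function checkInput(string $f): string
{
    // Shortened map for illustration only; use the question's full arrays here.
    $map = ['Š' => 'S', 'š' => 's', 'ß' => 'ss', 'Æ' => 'AE', 'æ' => 'ae'];
    $f = strtr($f, $map);

    // Same two patterns as in sanitizer(): strip disallowed characters,
    // then trim runs of spaces and dots.
    $f = preg_replace(['~[^\p{L}\p{N}_.:,!? -]+~u', '~ \K +|\.{3}\K\.+~'], '', $f);

    // preg_replace() returns null on failure, e.g. for invalid UTF-8 input.
    return $f ?? '';
}

echo checkInput("Straße  & Æther...."), "\n"; // Strasse AEther...
```

Running strtr() first means the mapped characters arrive at the regex as plain Latin letters; anything not in the map, such as Cyrillic, is left alone and still passes through \p{L}.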
answered Nov 16 '18 at 15:21 by mickmackusa