Secure unicode regex in PHP for text field submissions

I am currently using PHP form-processing code that accepts text field submissions. My code is this:

function checkInput($f) {
    $f = strtr($f, array('Š' => 'S','Ž' => 'Z','š' => 's','ž' => 'z','Ÿ' => 'Y','À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A','Ç' => 'C','È' => 'E'));
    $f = strtr($f, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));
    $f = preg_replace(array('#( ){2,}#', '#(\.){4,}#', '/[^\w-_.:, ]+/'), array(' ', '...', '_'), $f);
    return $f;
}

This code replaces accented characters with their unaccented equivalents. The preg_replace line then does the following:

1. replaces 2 or more consecutive spaces with 1 space;

2. replaces 4 or more consecutive dots with 3 dots;

3. replaces any run of characters outside the allowed set with an underscore (_).

I want to support Unicode characters from other languages, for example Cyrillic. Is it enough to just add a u in the preg_replace line? Example:

$f = preg_replace(array('#( ){2,}#', '#(\.){4,}#', '/[^\w-_.:, ]+/u'), array(' ', '...', '_'), $f);

I am not sure if that is the way to go in terms of security. Please advise.

EDIT:

This regex seems to work: it restricts input to the characters listed in the class, but it does not allow non-Latin characters.

/^[a-z0-9.,:!?-_ ]+/iu

I want to allow chars: a through z (case insensitive), 0 through 9, . , : ! ? - _ white space and non-Latin chars.

EDIT2:

Ok now this seems to be working correctly in code:

$rgx = '/[^a-z0-9-_.:,!?\w ]+/iu';
$f = preg_replace($rgx, "", $f);
$f = preg_replace(array('#( ){2,}#', '#(\.){4,}#'), array(' ', '...'), $f);
return $f;

It allows the characters a-z, digits, - _ . : , ! ? and non-Latin characters, and it removes any other characters, such as quotes (" ') and semicolons (;), to prevent SQL injection.

asked Nov 16 '18 at 10:46 by Nemdr (edited Nov 16 '18 at 12:54)

The u flag makes \w Unicode-aware, so /[^\w]/u will match any char that is not a Unicode letter, digit or _ (and some other chars). You probably want to replace '/[^\w-_.:, ]+/u' with '/[^a-zA-Z0-9-_.:,\s]+/'

– Wiktor Stribiżew
Nov 16 '18 at 11:04

Thanks, I have tested it: it allows Cyrillic characters, but not Latin chars. I tested it here: rubular.com/r/yp6dhfehrl It is a Ruby site, but PHP should work the same, I think.

– Nemdr
Nov 16 '18 at 11:17

In your code, you are replacing with regex, and at rubular you are matching. PCRE is used in PHP regexps, not Onigmo (used in Rubular). Use regex101.com to test PHP regexps, it is the most user-friendly - IMHO - regex testing Web site for PCRE, JS, Python re and Go regexps.

– Wiktor Stribiżew
Nov 16 '18 at 11:20

Thanks, I have tested this regex on regex101, and this seems to be working: ^[a-zA-Z0-9-_.:,!?\w ]+/u

– Nemdr
Nov 16 '18 at 11:29

I want to allow these chars: a-zA-Z0-9.,:!?_ - (white space), both Latin and non-Latin. Any char in the string that is not allowed (not matched) must be replaced with "" (removed from the string). This must be done across the entire string. I also want to replace 2 or more consecutive white spaces with 1 space, etc., as in the preg_replace example above. Example: "te-st in!p.ut" will not be changed. But: "test i@n';put" must be changed to: "test input". And: "test [a lot of space] input" must be changed to: "test input".

– Nemdr
Nov 16 '18 at 12:11

Tags: php, regex

1 Answer

Allow me to clean up your Edit #2 patterns and suggest a single-call implementation.

Code:

function sanitizer($string) {
    return preg_replace(['~[^\p{L}\p{N}_.:,!? -]+~u', '~ \K +|\.{3}\K\.+~'], '', $string);
}

$strings = [
    "1: Доброе утро - Dobraye ootro &       Good morning",
    "2: Добрый день => Dobriy den'....... (Good afternoon)"
];

foreach ($strings as $string) {
    echo sanitizer($string);
    echo "\n---\n";
}

Output:

1: Доброе утро - Dobraye ootro Good morning
---
2: Добрый день Dobriy den... Good afternoon
---

I could have written a single piped pattern for preg_replace(), but I wanted to make two passes over the string: first to remove any invalid characters, then to shorten excessively long character sequences that may or may not have been formed by the first pass.
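To see why the order of the two passes matters, here is a minimal sketch. The input string is a made-up example, not taken from the question: removing the `&` leaves two spaces side by side, and only the second pattern cleans that up.

```php
<?php
// Hypothetical input: the '&' sits between two single spaces.
$input = 'Good & morning';

// Pass 1 only: dropping '&' leaves two adjacent spaces behind.
$afterFirst = preg_replace('~[^\p{L}\p{N}_.:,!? -]+~u', '', $input);

// Both passes: preg_replace() applies the patterns in array order,
// so the second pattern works on the result of the first.
$afterBoth = preg_replace(
    ['~[^\p{L}\p{N}_.:,!? -]+~u', '~ \K +|\.{3}\K\.+~'],
    '',
    $input
);

var_dump($afterFirst); // string(13) "Good  morning"
var_dump($afterBoth);  // string(12) "Good morning"
```

Had the space-collapsing pattern run first, the double space created by the character removal would have survived.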

Noteworthy pattern changes:

[a-zA-Z0-9_] is more simply written as \w; however, because you are using the u flag, and to prepare for PHP 7.3's strict adherence to PCRE2, it is better to write out: \p{L}\p{N}_

Avoid writing unnecessary backslashes before characters without special meaning; they only make your pattern longer and harder to interpret. Characters that normally have special meaning (like *, ?, +, etc.) lose their special meaning inside a character class [ ... ].

Move your hyphen to the front or back of your negated character class to avoid accidentally writing a range of characters. (Because your - came after the range 0-9, this was a non-issue, but it is good advice to remember as a matter of best practice.)

\K means "forget the previously matched substring", in other words, "start matching from here". This lets you avoid a capture group and simply truncate the unwanted characters by replacing the match with an empty string.
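The notes above can be checked with a short sketch. The test strings and variable names are illustrative assumptions; the patterns are reduced versions of the ones in the answer.

```php
<?php
// 1. Unicode property classes with the u flag keep Cyrillic letters intact
//    while still removing punctuation that is not listed in the class.
$kept = preg_replace('~[^\p{L}\p{N}_ ]+~u', '', 'Доброе утро!');

// 2. A hyphen between two characters forms a range: ?-_ spans 0x3F to 0x5F,
//    which silently includes '@'. A trailing hyphen is a literal hyphen.
$rangeHit   = preg_match('~[?-_]~', '@');  // 1: '@' falls inside the range
$literalHit = preg_match('~[?_-]~', '@');  // 0: only '?', '_' and '-' match

// 3. \K versus a capture group: both trim a run of dots down to three.
$withGroup = preg_replace('~(\.{3})\.+~', '$1', 'den.......');
$withK     = preg_replace('~\.{3}\K\.+~', '', 'den.......');

var_dump($kept);                 // string(21) "Доброе утро"
var_dump($rangeHit, $literalHit); // int(1), int(0)
var_dump($withGroup, $withK);     // both "den..."
```

The capture-group and \K versions give the same result here; \K simply avoids having to re-insert the kept part through a backreference.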

P.S. You should still run your strtr() calls, as in your original post.
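As a rough sketch of how the two steps might be combined: the function name cleanInput and the shortened replacement map are assumptions for illustration, and the full tables from the question would go in their place.

```php
<?php
// Hypothetical wrapper; the name and the shortened map are illustrative only.
function cleanInput(string $value): string
{
    // Transliterate first, using a few pairs from the question's tables.
    $value = strtr($value, ['Ç' => 'C', 'È' => 'E', 'ß' => 'ss', 'æ' => 'ae']);

    // Then strip disallowed characters and shorten long runs.
    // preg_replace() returns null on error (for example, when the subject
    // is not valid UTF-8 and the u flag is set), hence the fallback.
    return preg_replace(
        ['~[^\p{L}\p{N}_.:,!? -]+~u', '~ \K +|\.{3}\K\.+~'],
        '',
        $value
    ) ?? '';
}

echo cleanInput("Ça  va?....'"), "\n"; // Ca va?...
```

Returning an empty string on failure is just one possible choice; rejecting the submission outright may suit a form handler better.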

answered Nov 16 '18 at 15:21 by mickmackusa
