Split Regex, return only characters, digits and underscores. Perl
I am attempting to break up the line
    #!/usr/bin/perl -w
with the following code:
    use strict;
    use warnings;
    my %words;
    while (my $line = <>)
    {
        foreach my $word (split /:|,\s*|\/|!|#|-/, $line)
        {
            $words{$word}++;
        }
    }
    foreach my $word (keys %words)
    {
        print "$word: $words{$word}\n";
    }
Is there an easier way to have split return only runs of letters, digits and underscores, rather than listing all of these delimiters?
The output I am trying to get is:
usr: 1
bin: 1
perl: 1
regex perl
asked Nov 12 at 17:47
Conman
(This was closed as a duplicate of a question whose answer is split ' ', which is not appropriate here. Re-opened.)
– ikegami
Nov 12 at 17:58
2 Answers
Don't split, extract.
    ++$words{$_} for $line =~ /\w+/g;
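The one-liner above replaces the inner split loop from the question. A minimal sketch of how it might look in context, assuming a helper named count_words (a hypothetical name, not from the thread) that takes lines as arguments instead of reading them from <>:

```perl
use strict;
use warnings;

# count_words is a hypothetical helper for illustration; the original
# script reads its lines from <> instead of taking them as arguments.
sub count_words {
    my @lines = @_;
    my %words;
    for my $line (@lines) {
        # In list context, a /g match returns every run of word
        # characters (letters, digits, underscore) found in the line.
        ++$words{$_} for $line =~ /\w+/g;
    }
    return \%words;
}

my $counts = count_words("#!/usr/bin/perl -w\n");
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Note that with this pattern the w from -w is counted as a word too, which is what the follow-up comments address.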
answered Nov 12 at 17:49
ikegami
So much more compact, love it. However, I can't figure out a way to output only words longer than one character (so that the w from -w is discarded).
– Conman
Nov 12 at 17:53
\w\w+ or simply \w{2,}
– ikegami
Nov 12 at 17:53
Perfect, thank you so much! Can you use extraction to convert all of the letters to lowercase as well?
– Conman
Nov 12 at 17:59
No, but neither could split. Just do it outside (++$words{lc($_)}).
– ikegami
Nov 12 at 18:02
Ahh okay no worries! I was just seeing to what extent I could work this whole "extraction" method. Sorry about that
– Conman
Nov 12 at 18:32
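The two suggestions from this comment thread (\w{2,} for a minimum length, lc() applied outside the match) can be combined. A rough sketch, assuming a hypothetical helper named count_long_words that takes lines as arguments:

```perl
use strict;
use warnings;

# count_long_words is a hypothetical name used only for this sketch.
sub count_long_words {
    my @lines = @_;
    my %words;
    for my $line (@lines) {
        # \w{2,} matches runs of two or more word characters, so the
        # lone "w" from "-w" is skipped; lc() folds case before counting.
        ++$words{ lc($_) } for $line =~ /\w{2,}/g;
    }
    return \%words;
}

my $counts = count_long_words("#!/usr/bin/Perl -w\n", "perl -w PERL\n");
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

The case folding happens on each extracted word, not inside the regex, which matches the advice to "do it outside".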
You can also do this with split and the negated word character class:
    foreach my $word (split /\W+/, $line) {
        $words{$word}++;
    }
But note that since your string starts with non-word characters, the first element it returns is the empty string from the beginning of the string.
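To make the leading empty field visible, and to show one possible way of dropping it, here is a small sketch. The helper name split_words and the grep { length } filter are assumptions for illustration, not part of the answer:

```perl
use strict;
use warnings;

# split_words is a hypothetical helper for illustration.
sub split_words {
    my ($line) = @_;
    # A delimiter at the very start produces an empty first field;
    # grep { length } drops it. Trailing empty fields are removed by
    # split itself.
    return grep { length } split /\W+/, $line;
}

my $line  = "#!/usr/bin/perl -w";
my @raw   = split /\W+/, $line;    # first element is the empty string
my @words = split_words($line);
print scalar(@raw), " fields before filtering, ", scalar(@words), " after\n";
```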
Another tool for this task (though better suited to prose than to code and filenames) is the Unicode word boundary, which uses Unicode rules for where words start and end, and takes into account things like apostrophes being part of words (can't). To use it, you'd first split your input into a list containing both words and non-words, and then pick out the words (the easiest way is probably to keep any elements that contain at least one word character):
    foreach my $word (grep { m/\w/ } split /\b{wb}/, $line) {
        $words{$word}++;
    }
The \b{wb} regex sequence requires Perl 5.24+.
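A short sketch of the word-boundary approach on a piece of prose, assuming a hypothetical helper named wb_words. The sample sentence is made up for illustration:

```perl
use v5.24;    # \b{wb} as used here needs Perl 5.24 or newer
use warnings;

# wb_words is a hypothetical helper for illustration.
sub wb_words {
    my ($text) = @_;
    # Splitting on the zero-width Unicode word boundary yields both the
    # words and the separators between them; grep keeps only the pieces
    # that contain at least one word character.
    return grep { /\w/ } split /\b{wb}/, $text;
}

say for wb_words("I can't stop");
```

Here can't stays in one piece, whereas a plain \w+ match would break it at the apostrophe.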
edited Nov 12 at 23:30
answered Nov 12 at 22:02
Grinnz
That doesn't handle strings that don't start with a word (such as the OP's example). /// Also, this approach doesn't lend itself to improving the definition of a word. All else being equal, a solution that can evolve is better.
– ikegami
Nov 12 at 22:53
I agree, but it's an option, and it is still extensible in some ways. For instance, you can change the definition of "word character" to include the colon: split /[^\w:]+/, $line
– Grinnz
Nov 12 at 23:14
There are of course Unicode considerations to \w and \W, but until the code in question is decoding its input from UTF-8, I'm assuming it's not a factor for this input...
– Grinnz
Nov 12 at 23:16
That's hardly the way word parsers evolve, though. First, there's 's, ...
– ikegami
Nov 12 at 23:18
@ikegami If you really want to do that right, there's the unicode word boundary in recent Perls...
– Grinnz
Nov 12 at 23:22
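As an illustration of how an extraction pattern might evolve to cope with 's, here is a hedged sketch. The regex and the helper name extract_words are assumptions made for this example and were not proposed by either commenter:

```perl
use strict;
use warnings;

# extract_words and its pattern are assumptions made for this sketch.
sub extract_words {
    my ($text) = @_;
    # A word is a run of word characters, optionally followed by one or
    # more groups of an apostrophe plus further word characters, so
    # "dog's" and "can't" stay whole while quoting apostrophes are left out.
    return $text =~ /\w+(?:'\w+)*/g;
}

print join(", ", extract_words("the dog's bone, 'quoted' words, can't")), "\n";
```

The definition of a word lives entirely in the one pattern, so it can be refined without touching the counting code.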