Split Regex, return only characters, digits and underscores. Perl

I am trying to break up the line

#!/usr/bin/perl -w

with the following code:

use strict;
use warnings;

my %words;

while (my $line = <>)
{
    foreach my $word (split /:|,\s*|\/|!|#|-/, $line)
    {
        $words{$word}++;
    }
}

foreach my $word (keys %words)
{
    print "$word: $words{$word}\n";
}

Is there an easier way to have split return only the runs of letters, digits and underscores, rather than listing all of these delimiters?

The output I am trying to get is:

usr: 1
bin: 1
perl: 1

asked Nov 12 at 17:47

Conman

194

(This was closed as a duplicate of a question whose answer is split ' ', which is not appropriate here. Re-opened.)
– ikegami
Nov 12 at 17:58


regex perl


2 Answers

Don't split, extract.

++$words{$_} for $line =~ /\w+/g;
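A minimal sketch of how that one-liner might sit in a complete script. The `count_words` helper name and the hard-coded sample line are assumptions for illustration; the question's code reads its lines from `<>` instead.

```perl
use strict;
use warnings;

# Hypothetical helper: count every run of word characters (\w+) in the
# given lines and return a hash reference of word => count.
sub count_words {
    my @lines = @_;
    my %words;
    for my $line (@lines) {
        # In list context, a /g match returns all matches, so there are
        # no empty fields to filter out.
        ++$words{$_} for $line =~ /\w+/g;
    }
    return \%words;
}

my $counts = count_words("#!/usr/bin/perl -w\n");
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

For the sample line this prints bin, perl, usr and w, each with a count of 1; the lone w comes from the -w switch.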

answered Nov 12 at 17:49

ikegami

261k11176396

So much more compact, love it. However, I can't figure a way to only output words greater than 1 character (so we would discard the -w).
– Conman
Nov 12 at 17:53

\w\w+ or simply \w{2,}
– ikegami
Nov 12 at 17:53

Perfect, thank you so much! Can you use extraction to convert all of the letters to lowercase as well?
– Conman
Nov 12 at 17:59

No, but neither could split. Just do it outside (++$words{lc($_)}).
– ikegami
Nov 12 at 18:02

Ahh okay no worries! I was just seeing to what extent I could work this whole "extraction" method. Sorry about that
– Conman
Nov 12 at 18:32
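The two suggestions from the comments above (`\w{2,}` for a minimum length and `lc` for case folding) can be combined roughly as follows. The `count_long_words` name and the sample input are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: count words of at least two word characters,
# folding case so that "Perl" and "perl" share one entry.
sub count_long_words {
    my @lines = @_;
    my %words;
    for my $line (@lines) {
        # \w{2,} skips one-character matches such as the "w" in "-w".
        ++$words{ lc $_ } for $line =~ /\w{2,}/g;
    }
    return \%words;
}

my $counts = count_long_words("#!/usr/bin/perl -w\n", "Perl and PERL\n");
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Lowercasing happens on the hash key rather than in the regex, since a match can only return text that is already in the string.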


You can also do this with split and the negated word character class:

foreach my $word (split /\W+/, $line) {
  $words{$word}++;
}

But note that since your string starts with non-word characters, the first element split returns will be an empty string.
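A small sketch of that leading empty field and one possible way to drop it, assuming the question's sample line as input. Filtering with `grep { length }` is an illustrative choice here, not something the answer prescribes.

```perl
use strict;
use warnings;

my $line = "#!/usr/bin/perl -w\n";

# The line starts with non-word characters, so split yields a leading
# empty field; trailing empty fields are dropped by split itself.
my @fields = split /\W+/, $line;

# Keep only the non-empty fields before counting.
my %words;
$words{$_}++ for grep { length } @fields;

print scalar(@fields), " fields, ", scalar(keys %words), " words\n";
```

This prints "5 fields, 4 words": the empty string plus usr, bin, perl and w.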

Another tool for this task (though better suited to prose than to code and filenames) is the Unicode word boundary, which uses Unicode rules for where words start and end and takes into account things like apostrophes being part of words (can't). To use it, you first split your input into a list containing both words and non-words, and then pick out the words (the easiest way is probably to keep any element that contains at least one word character):

foreach my $word (grep { m/\w/ } split /\b{wb}/, $line) {
  $words{$word}++;
}

The \b{wb} regex sequence requires Perl 5.24+.
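A short sketch of the word-boundary approach on a piece of prose. The `wb_words` helper name and the sample sentence are assumptions for illustration.

```perl
use 5.024;    # refuse to run on Perls older than the version stated above
use strict;
use warnings;

# Hypothetical helper: split at Unicode word boundaries, then keep only
# the pieces that contain at least one word character.
sub wb_words {
    my ($text) = @_;
    return grep { /\w/ } split /\b{wb}/, $text;
}

my %words;
$words{$_}++ for wb_words("I can't stop, can't I?");
print "$_: $words{$_}\n" for sort keys %words;
```

The point of interest is that can't should stay in one piece, whereas /\w+/ or /\W+/ would break it into can and t.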

edited Nov 12 at 23:30

answered Nov 12 at 22:02

Grinnz

1,877311

That doesn't handle strings that don't start with a word (such as the OP's example). /// Also, this approach doesn't lend itself to improving the definition of a word. All else being equal, a solution that can evolve is better.
– ikegami
Nov 12 at 22:53

I agree, but it's an option, and it is still extensible in some ways. For instance, if you change the definition of "word character" to include ':': split /[^\w:]+/, $line
– Grinnz
Nov 12 at 23:14

There are of course Unicode considerations to \w and \W, but until the code in question is decoding its input from UTF-8, I'm assuming it's not a factor for this input...
– Grinnz
Nov 12 at 23:16

That's hardly the way word parsers evolve, though. First, there's 's,...
– ikegami
Nov 12 at 23:18

@ikegami If you really want to do that right, there's the unicode word boundary in recent Perls...
– Grinnz
Nov 12 at 23:22



