Split Regex, return only characters, digits and underscores. Perl












2














Attempting to break up the line



#!/usr/bin/perl -w



with the following code



use strict;
use warnings;

my %words;

while (my $line = <>)
{
    foreach my $word (split /:|,\s*|\/|!|#|-/, $line)
    {
        $words{$word}++;
    }
}

foreach my $word (keys %words)
{
    print "$word: $words{$word}\n";
}


Is there an easier way to have split return only the words (runs of letters, digits and underscores), rather than listing all of these delimiters?



Attempting to get the output



usr: 1
bin: 1
perl: 1































  • (This was closed as a duplicate to a question whose answer is split ' ', which is not appropriate here. Re-opened.)
    – ikegami
    Nov 12 at 17:58






















regex perl
















asked Nov 12 at 17:47 by Conman
























2 Answers


















5














Don't split, extract.



++$words{$_} for $line =~ /\w+/g;


























  • So much more compact, love it. However, I can't figure a way to only output words greater than 1 character (so we would discard the -w).
    – Conman
    Nov 12 at 17:53










  • \w\w+ or simply \w{2,}
    – ikegami
    Nov 12 at 17:53












  • Perfect, thank you so much! Can you use extraction to convert all of the letters to lowercase as well?
    – Conman
    Nov 12 at 17:59










  • No, but neither could split. Just do it outside (++$words{lc($_)}).
    – ikegami
    Nov 12 at 18:02












  • Ahh okay no worries! I was just seeing to what extent I could work this whole "extraction" method. Sorry about that
    – Conman
    Nov 12 at 18:32
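Putting the answer and the comments above together, a minimal sketch of the whole counter might look like this. The sub name count_words and the sample input are made up for illustration, and it assumes you want words of two or more \w characters counted case-insensitively.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original posts): count words of two or
# more word characters, folding case with lc as suggested in the comments.
sub count_words {
    my @lines = @_;
    my %words;
    for my $line (@lines) {
        # A /g match in list context returns every match, so no split is needed.
        ++$words{ lc $_ } for $line =~ /\w{2,}/g;
    }
    return \%words;
}

my $counts = count_words("#!/usr/bin/perl -w\n");
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

For the shebang line this should list bin, perl and usr once each; the single-character w from -w is dropped by the {2,} quantifier.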



















1














You can also do this with split and the negated word character class:



foreach my $word (split /\W+/, $line) {
    $words{$word}++;
}


But note that since your string starts with non-word characters, the first element split returns is the empty string before that leading delimiter.
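To make the empty leading field concrete, here is a small sketch. The helper name split_words is hypothetical, and filtering with grep { length } is just one way to drop the empty string.

```perl
use strict;
use warnings;

# Hypothetical helper: split on runs of non-word characters and drop the
# empty field that appears when the line starts with a delimiter.
sub split_words {
    my ($line) = @_;
    return grep { length } split /\W+/, $line;
}

my $line = "#!/usr/bin/perl -w\n";
my @raw  = split /\W+/, $line;    # first element is the empty string
my @kept = split_words($line);    # usr, bin, perl, w

printf "raw: %d fields, kept: %d fields\n", scalar @raw, scalar @kept;
```

Trailing empty fields are removed by split on its own, so only the leading one needs filtering.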



Another tool for this task (but more suited to prose than code and filenames) is the Unicode word boundary, which uses Unicode rules for where words start and end, and takes into account things like apostrophes being part of words (can't). To use it, you first split your input into a list containing both words and non-words, and then pick out the words (the easiest way is probably to keep any elements that contain at least one word character):



foreach my $word (grep { m/\w/ } split /\b{wb}/, $line) {
    $words{$word}++;
}


The \b{wb} regex sequence requires Perl 5.24+.
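As a rough illustration of the apostrophe handling, the sketch below wraps the same split/grep in a hypothetical helper named prose_words. The sample sentence is invented, and the version guard follows the 5.24+ requirement stated above.

```perl
use v5.24;    # per the answer, \b{wb} needs Perl 5.24+
use strict;
use warnings;

# Hypothetical helper: split at Unicode word boundaries, then keep only the
# pieces that contain at least one word character.
sub prose_words {
    my ($text) = @_;
    return grep { /\w/ } split /\b{wb}/, $text;
}

say for prose_words("I can't split this, can I?");
```

Here can't should come back as a single word, whereas /\w+/g would yield can and t separately.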





























  • That doesn't handle strings that don't start with a word (such as the OP's example). /// Also, this approach doesn't lend itself to improving the definition of a word. All else being equal, a solution that can evolve is better.
    – ikegami
    Nov 12 at 22:53












  • I agree, but it's an option, and it is still extensible in some ways. For instance, if you change the definition of "word character" to include ':': split /[^\w:]+/, $line
    – Grinnz
    Nov 12 at 23:14












  • There are of course Unicode considerations to \w and \W, but until the code in question is decoding its input from UTF-8, I'm assuming it's not a factor for this input...
    – Grinnz
    Nov 12 at 23:16










  • That's hardly the way word parsers evolve, though. First, there's 's,...
    – ikegami
    Nov 12 at 23:18










  • @ikegami If you really want to do that right, there's the unicode word boundary in recent Perls...
    – Grinnz
    Nov 12 at 23:22
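For reference, the widened character class from Grinnz's comment could be tried like this. The helper name split_keep_colons and the sample line are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: treat ':' as a word character so that package names
# such as List::Util stay in one piece; empty fields are dropped.
sub split_keep_colons {
    my ($line) = @_;
    return grep { length } split /[^\w:]+/, $line;
}

print "$_\n" for split_keep_colons("use List::Util qw(first);");
```

Note that this keeps every colon, so input like "key: value" would give "key:" as a word, which is the kind of limitation ikegami's comments point at.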





















First answer ("Don't split, extract."): answered Nov 12 at 17:49 by ikegami




















Second answer (split with the negated word character class): edited Nov 12 at 23:30, answered Nov 12 at 22:02 by Grinnz











