Python - pyparsing unicode characters
I tried using w = Word(printables), but it isn't working. How should I write the spec for this? 'w' is meant to match Hindi characters (UTF-8).
The code specifies the grammar and parses the input accordingly. A sample input line and the grammar for it:
    671.assess :: अहसास ::2
    x=number + "." + src + "::" + w + "::" + number + "." + number
If the input contains only English characters, it works, so the code is correct for ASCII input but fails for Unicode input. That is, the code works when the line has the form
    671.assess :: ahsaas ::2
i.e. it parses words written in English, but I am not sure how to parse and then print characters in Unicode. I need this for English-Hindi word alignment.
The Python code looks like this:
    # -*- coding: utf-8 -*-
    from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit

    # grammar
    src = Word(printables)
    trans = Word(printables)
    number = Word(nums)
    x = number + "." + src + "::" + trans + "::" + number + "." + number

    # parsing for eng-dict
    efiledata = open('b1aop_or_not_word.txt').read()
    # scanString yields (tokens, start, end) tuples, which the loop below expects
    eresults = x.scanString(efiledata)
    edict1 = {}
    edict2 = {}
    counter = 0
    xx = list()
    for result in eresults:
        trans = ""  # translation string
        ew = ""  # english word
        xx = result[0]
        ew = xx[2]
        trans = xx[4]
        edict1 = {ew: trans}
        edict2.update(edict1)
    print len(edict2)  # no of entries in the english dictionary
    print "edict2 has been created"
    print "english dictionary", edict2

    # parsing for hin-dict
    hfiledata = open('b1aop_or_not_word.txt').read()
    hresults = x.scanString(hfiledata)
    hdict1 = {}
    hdict2 = {}
    counter = 0
    for result in hresults:
        trans = ""  # translation string
        hw = ""  # hin word
        xx = result[0]
        hw = xx[2]
        trans = xx[4]
        #print trans
        hdict1 = {trans: hw}
        hdict2.update(hdict1)
    print len(hdict2)  # no of entries in the hindi dictionary
    print "hdict2 has been created"
    print "hindi dictionary", hdict2
    '''
    #######################################################################################################################
    def translate(d, ow, hinlist):
        if ow in d.keys():  # ow=old word d=dict
            print ow, "exists in the dictionary keys"
            transes = d[ow]
            transes = transes.split()
            print "possible transes for", ow, " = ", transes
            for word in transes:
                if word in hinlist:
                    print "trans for", ow, " = ", word
                    return word
            return None
        else:
            print ow, "absent"
            return None

    f = open('bidir','w')
    #lines = ["'
    #5# 10 # and better performance in business in turn benefits consumers . # 0 0 0 0 0 0 0 0 0 0
    #5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI . # 0 0 0 0 0 0 0 0 0 0 0
    #'"]
    data = open('bi_full_2','rb').read()
    lines = data.split('!@#$%')
    loc = 0
    for line in lines:
        eng, hin = [subline.split(' # ')
                    for subline in line.strip('\n').split('\n')]
        for transdict, source, dest in [(edict2, eng, hin),
                                        (hdict2, hin, eng)]:
            sourcethings = source[2].split()
            for word in source[1].split():
                tl = dest[1].split()
                otherword = translate(transdict, word, tl)
                loc = source[1].split().index(word)
                if otherword is not None:
                    otherword = otherword.strip()
                    print word, ' <-> ', otherword, 'meaning=good'
                    if otherword in dest[1].split():
                        print word, ' <-> ', otherword, 'trans=good'
                        sourcethings[loc] = str(
                            dest[1].split().index(otherword) + 1)
            source[2] = ' '.join(sourcethings)
        eng = ' # '.join(eng)
        hin = ' # '.join(hin)
        f.write(eng+'\n'+hin+'\n\n\n')
    f.close()
    '''
If an example input sentence pair in the source file is:
    1# 5 # modern markets : confident consumers # 0 0 0 0 0
    1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 0 0 0 0 0 0
    !@#$%
the output would look like this:
    1# 5 # modern markets : confident consumers # 1 2 3 4 5
    1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0
    !@#$%
Output explanation:
This achieves bidirectional alignment.
The first English word, 'modern', maps to the first Hindi word, 'AddhUnIk', and vice versa. Punctuation characters are also treated as words, since they are an integral part of the bidirectional mapping. Thus the Hindi word '.' has a null alignment (0): it maps to nothing in the English sentence, which has no full stop.
The third line of the output is a delimiter that separates sentence pairs when you are aligning a number of sentences.
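To make the expected alignment concrete, here is a minimal Python 3 sketch of the mapping described above. The function name `align` and the dictionary entries are hypothetical, made up to fit the example sentences; the script in the question builds its dictionaries from the parsed files instead.

```python
def align(source_words, dest_words, dictionary):
    """Return, for each source word, the 1-based position of its
    translation in dest_words, or 0 when no translation is found."""
    positions = []
    for word in source_words:
        # A dictionary entry may list several space-separated candidates.
        candidates = dictionary.get(word, "").split()
        match = next((c for c in candidates if c in dest_words), None)
        positions.append(dest_words.index(match) + 1 if match else 0)
    return positions


# Hypothetical dictionary entries, chosen to match the example sentences.
eng_to_hin = {
    "modern": "AddhUnIk",
    "markets": "baajaar",
    ":": ":",
    "confident": "AshHvsHt",
    "consumers": "upbhOkHtaa",
}
hin_to_eng = {hin: eng for eng, hin in eng_to_hin.items()}

eng = "modern markets : confident consumers".split()
hin = "AddhUnIk baajaar : AshHvsHt upbhOkHtaa .".split()

eng_alignment = align(eng, hin, eng_to_hin)
hin_alignment = align(hin, eng, hin_to_eng)
print(" ".join(map(str, eng_alignment)))  # 1 2 3 4 5
print(" ".join(map(str, hin_alignment)))  # 1 2 3 4 5 0
```

The Hindi '.' has no dictionary entry, so it gets the null alignment 0, as in the example output.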
What modification should I make for this to work if the Hindi sentences are in Unicode (UTF-8) format?
python unicode nlp pyparsing
Please edit this question and make use of proper formatting so that the question is readable. – Ignacio Vazquez-Abrams, Feb 26 '10 at 5:12
asked Feb 26 '10 at 3:52 by boddhisattva; edited Apr 27 '10 at 22:17 by ThomasH
2 Answers
Pyparsing's printables only contains characters in the ASCII range. You want printables in the full Unicode range, like this:
    import sys
    unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                                 if not unichr(c).isspace())
Now you can define trans using this more complete set of non-space characters:
    trans = Word(unicodePrintables)
I was unable to test against your Hindi test string, but I think this will do the trick.
If you are using Python 3, there is no separate unichr function and no xrange generator; just use:
    unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode)
                                if not chr(c).isspace())
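As a quick check of why the ASCII set fails on the Hindi word, the following Python 3 sketch compares the two character sets using only the standard library. It assumes that pyparsing's `printables` is built from the non-whitespace characters of `string.printable`; the variable names are illustrative.

```python
import string
import sys

# Assumption: this mirrors how pyparsing builds its ASCII-only `printables`.
ascii_printables = "".join(c for c in string.printable if not c.isspace())

# Every non-whitespace code point, in the spirit of unicodePrintables.
unicode_printables = "".join(
    chr(c) for c in range(sys.maxunicode + 1) if not chr(c).isspace()
)

hindi_word = "अहसास"
ascii_covers = all(ch in ascii_printables for ch in hindi_word)
unicode_covers = all(ch in unicode_printables for ch in hindi_word)
print(ascii_covers)    # False
print(unicode_covers)  # True
```

Because a Word stops at the first character outside its set, the ASCII-only set cannot match even the first Devanagari character.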
EDIT:
With the release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.
    import pyparsing as pp
    pp.Word(pp.pyparsing_unicode.printables)
    pp.Word(pp.pyparsing_unicode.Devanagari.printables)
    pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
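Putting this together with the grammar from the question, a minimal Python 3 sketch might look like the following. It assumes pyparsing 2.3.0 or later is installed, and it parses the sample line as given in the question (which ends after a single trailing number); the names `entry` and `tokens` are illustrative.

```python
import pyparsing as pp

number = pp.Word(pp.nums)
src = pp.Word(pp.printables)  # ASCII source word
trans = pp.Word(pp.pyparsing_unicode.Devanagari.printables)  # Hindi word
entry = number + "." + src + "::" + trans + "::" + number

# The input must already be text (str), not UTF-8 encoded bytes.
tokens = list(entry.parseString("671.assess :: अहसास ::2"))
print(tokens)
```

Note that each Word is greedy up to the next character outside its set, so the fields here rely on the whitespace around "::" to end the words. In pyparsing 3.x the same method is also available as `parse_string`.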
Thank you for your answer sir..:) – boddhisattva, Mar 11 '10 at 5:25
this answer is long since obsolete: unicode no longer is 16 bit, and looping everything isn’t performant at all. – flying sheep, Sep 15 '15 at 12:33
@flyingsheep - good tip, updated to use sys.maxunicode instead of a hard-coded constant, so it will track with Python's sys module. As for looping everything, this bit only runs once, when initially defining a parser, and when used to create a pyparsing Word, gets stored as a set(), so parse-time performance is still quite good. – PaulMcG, Jul 21 '16 at 15:20
This would include a lot of unprintable characters, such as u''. – user2357112, Nov 16 '18 at 19:39
The latest version of pyparsing includes a number of Unicode ranges. All control characters below u'20' are not included. – PaulMcG, Nov 16 '18 at 22:42
As a general rule, do not process encoded bytestrings: make them into proper unicode strings (by calling their .decode method) as soon as possible, do all of your processing always on unicode strings, then, if you have to for I/O purposes, .encode them back into whatever bytestring encoding you require.
If you're talking about literals, as it seems you are in your code, the "as soon as possible" is at once: use u'...' to express your literals. In a more general case, where you're forced to do I/O in encoded form, it's immediately after input (just as it's immediately before output if you need to perform output in a specific encoded form).
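A minimal sketch of that decode-early, encode-late pattern, written for Python 3, where the text type is `str` and the byte-string type is `bytes` (in Python 2 they are `unicode` and `str`). The sample bytes stand in for data read from a binary file; the variable names are illustrative.

```python
# Bytes as they would arrive from a file opened in binary mode.
raw = "671.assess :: अहसास ::2".encode("utf-8")

text = raw.decode("utf-8")          # decode once, right after input
word = text.split("::")[1].strip()  # all processing happens on text
out = word.encode("utf-8")          # encode once, right before output

print(len(word), len(out))  # 5 15
```

The Hindi word is 5 code points as text but 15 bytes once encoded, which is why character-based tools such as a parser should see the decoded form.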
answered Feb 26 '10 at 6:08 by Alex Martelli
Hello Sir, thank you for your answer. What you said in the second paragraph applies exactly to my case. I tried it on the following line of the code: trans = u'Word(printables)', but I couldn't achieve the desired output. Could you please correct me if I made the modification on the wrong line? After this change I get the error 'Expecting printables at that position' for the lines which define the grammar. – boddhisattva, Feb 26 '10 at 9:06
@mgj, don't assign a unicode string literal to trans, that makes no sense. Just ensure printables is a unicode object (not a utf8-encoded byte string! -- nor a byte string with any other encoding!), and use trans = Word(printables). If your file is utf-8 encoded, or encoded with any other encoding, decode it by using codecs.open from the codecs module, not the built-in open as you're doing, so that each line is a unicode object, not a byte string (in whatever encoding). – Alex Martelli, Feb 26 '10 at 15:11
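For reading the dictionary file, the codecs.open suggestion can be sketched as follows. The file name and contents are hypothetical and are written to a temporary directory only so that the sketch is self-contained; in Python 3 the built-in open(path, encoding="utf-8") gives the same decoded lines.

```python
import codecs
import os
import tempfile

# Hypothetical sample file, created here only for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample_dict.txt")
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write("671.assess :: अहसास ::2\n")

# codecs.open decodes while reading, so each line is already text.
with codecs.open(path, "r", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)
```

Each element of `lines` can then be passed straight to the parser, with no byte strings involved.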
add a comment |
Hello Sir..:) Thank you for your answer.. whatever you have said in the 2nd para is exactly applicable to my case.. I tried this thing in the following line of the code: trans = u'Word(printables)' and I couldn achieve the desired output. Could you please correct me if I have made the modification in the wrong line, as after making this change the error is coming ' Expecting printables at that position ' with respect to the lines which defines the grammmar.
– boddhisattva
Feb 26 '10 at 9:06
@mgj, don't assign a unicode string literal totrans
, that makes no sense. Just ensureprintables
is a unicode object (not a utf8-encoded byte string! -- nor a byte string with any other encoding!), and usetrans = Word(printables)
. If your file is utf-8 encoded, or encoded with any other encoding, decode it by usingcodecs.open
from thecodecs
module, not the built-inopen
as you're doing, so that eachline
is a unicode object, not a byte string (in whatever encoding).
– Alex Martelli
Feb 26 '10 at 15:11
Pyparsing's printables only covers characters in the ASCII range. You want printables in the full Unicode range, like this:
import sys
unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                             if not unichr(c).isspace())
Now you can define trans using this more complete set of non-space characters:
trans = Word(unicodePrintables)
I was unable to test against your Hindi test string, but I think this will do the trick.
If you are using Python 3, there is no separate unichr function and no xrange generator, so just use:
import sys
unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode)
                            if not chr(c).isspace())
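To see what the wider character set buys you, here is a small Python 3 check using only the standard library. It assumes string.printable is a fair stand-in for an ASCII-only character set; the variable names are mine, and the Hindi word is the one from the question.

```python
import string
import sys

# Every character that is not whitespace, as in the Python 3 snippet above.
unicode_printables = ''.join(
    chr(c) for c in range(sys.maxunicode) if not chr(c).isspace()
)

allowed = set(unicode_printables)   # set membership is what matters at parse time
ascii_only = set(string.printable)  # assumption: stands in for an ASCII-only set
hindi_word = "अहसास"

print("covered by the Unicode set:", all(ch in allowed for ch in hindi_word))
print("covered by the ASCII set:", any(ch in ascii_only for ch in hindi_word))
print("space excluded:", " " not in allowed)
```

None of the Devanagari characters appear in the ASCII set, which is consistent with the original grammar failing on the Hindi word. Whitespace stays out of the wider set, so a word built from it would still stop at the space before "::".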
EDIT:
With the release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.
import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
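As a rough illustration of how the Devanagari range could slot into the grammar from the question, here is a sketch that assumes pyparsing 2.3.0 or later is installed. The grammar is shortened to match the sample line (it ends after the single trailing number), and the names entry, english and hindi are mine.

```python
# -*- coding: utf-8 -*-
import pyparsing as pp

number = pp.Word(pp.nums)
src = pp.Word(pp.printables)                                 # ASCII source word
trans = pp.Word(pp.pyparsing_unicode.Devanagari.printables)  # Hindi translation

# Shortened form of the question's grammar, matching the sample line only.
entry = number + "." + src + "::" + trans + "::" + number

tokens = entry.parseString(u"671.assess :: अहसास ::2").asList()
english, hindi = tokens[2], tokens[4]
print(english, hindi)
```

The token positions 2 and 4 mirror the xx[2] and xx[4] lookups in the question's loop. Note that src stops at whitespace only, so this relies on the space before "::" in the input.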
Thank you for your answer sir..:)
– boddhisattva
Mar 11 '10 at 5:25
this answer is long since obsolete: unicode no longer is 16 bit, and looping everything isn’t performant at all.
– flying sheep
Sep 15 '15 at 12:33
@flyingsheep - good tip, updated to use sys.maxunicode instead of a hard-coded constant, so it will track with Python's sys module. As for looping everything, this bit only runs once, when initially defining a parser, and when used to create a pyparsing Word, gets stored as a set(), so parse-time performance is still quite good.
– PaulMcG
Jul 21 '16 at 15:20
This would include a lot of unprintable characters, such as u''.
– user2357112
Nov 16 '18 at 19:39
The latest version of pyparsing includes a number of Unicode ranges. All control characters below u'20' are not included.
– PaulMcG
Nov 16 '18 at 22:42
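The concern about unprintable characters can be checked directly in Python 3. The sketch below compares the whitespace-only filter with a stricter one that also requires str.isprintable(). The stricter filter is my own assumption for illustration, not something pyparsing does, and the helper name non_space_chars is hypothetical.

```python
import sys

def non_space_chars(strict=False):
    """Return all non-whitespace characters; with strict=True,
    also drop anything str.isprintable() rejects (controls, surrogates, ...)."""
    return ''.join(
        ch for ch in map(chr, range(sys.maxunicode + 1))
        if not ch.isspace() and (not strict or ch.isprintable())
    )

loose = set(non_space_chars())
strict_set = set(non_space_chars(strict=True))
hindi_word = "अहसास"

print("NUL in loose set:", "\x00" in loose)
print("NUL in strict set:", "\x00" in strict_set)
print("Hindi word still covered:", all(ch in strict_set for ch in hindi_word))
```

The NUL character is not whitespace, so the whitespace-only filter keeps it, while the stricter filter drops it and still keeps the Devanagari letters and vowel signs.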
edited Nov 17 '18 at 2:27
answered Feb 26 '10 at 9:43
PaulMcG
Please edit this question and make use of proper formatting so that the question is readable.
– Ignacio Vazquez-Abrams
Feb 26 '10 at 5:12