Python - pyparsing unicode characters












12















I tried using w = Word(printables), but it isn't working. How should I write the spec for this? 'w' is meant to match Hindi characters (UTF-8).



The code specifies the grammar and parses accordingly.



671.assess  :: अहसास  ::2
x=number + "." + src + "::" + w + "::" + number + "." + number


When the input contains only English characters it works, so the grammar is correct for ASCII input, but it fails on Unicode input.

That is, the code works when the input has the form
671.assess :: ahsaas ::2

In other words, it parses words written in English, but I am not sure how to parse and then print Unicode characters. I need this for English-Hindi word alignment.



The python code looks like this:



# -*- coding: utf-8 -*-
from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit

# grammar
src = Word(printables)
trans = Word(printables)
number = Word(nums)
x = number + "." + src + "::" + trans + "::" + number + "." + number

# parsing for eng-dict
efiledata = open('b1aop_or_not_word.txt').read()
# scanString yields (tokens, start, end) tuples, as the loop below expects
eresults = x.scanString(efiledata)
edict1 = {}
edict2 = {}
counter = 0
xx = list()
for result in eresults:
    trans = ""  # translation string
    ew = ""  # english word
    xx = result[0]
    ew = xx[2]
    trans = xx[4]
    edict1 = {ew: trans}
    edict2.update(edict1)
print len(edict2)  # no of entries in the english dictionary
print "edict2 has been created"
print "english dictionary", edict2

# parsing for hin-dict
hfiledata = open('b1aop_or_not_word.txt').read()
hresults = x.scanString(hfiledata)
hdict1 = {}
hdict2 = {}
counter = 0
for result in hresults:
    trans = ""  # translation string
    hw = ""  # hin word
    xx = result[0]
    hw = xx[2]
    trans = xx[4]
    # print trans
    hdict1 = {trans: hw}
    hdict2.update(hdict1)

print len(hdict2)  # no of entries in the hindi dictionary
print "hdict2 has been created"
print "hindi dictionary", hdict2
'''
#######################################################################################################################

def translate(d, ow, hinlist):
    if ow in d.keys():  # ow=old word d=dict
        print ow, "exists in the dictionary keys"
        transes = d[ow]
        transes = transes.split()
        print "possible transes for", ow, " = ", transes
        for word in transes:
            if word in hinlist:
                print "trans for", ow, " = ", word
                return word
        return None
    else:
        print ow, "absent"
        return None

f = open('bidir','w')
#lines = ["'
#5# 10 # and better performance in business in turn benefits consumers . # 0 0 0 0 0 0 0 0 0 0
#5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI . # 0 0 0 0 0 0 0 0 0 0 0
#'"]
data = open('bi_full_2','rb').read()
lines = data.split('!@#$%')
loc = 0
for line in lines:
    eng, hin = [subline.split(' # ')
                for subline in line.strip('\n').split('\n')]

    for transdict, source, dest in [(edict2, eng, hin),
                                    (hdict2, hin, eng)]:
        sourcethings = source[2].split()
        for word in source[1].split():
            tl = dest[1].split()
            otherword = translate(transdict, word, tl)
            loc = source[1].split().index(word)
            if otherword is not None:
                otherword = otherword.strip()
                print word, ' <-> ', otherword, 'meaning=good'
                if otherword in dest[1].split():
                    print word, ' <-> ', otherword, 'trans=good'
                    sourcethings[loc] = str(
                        dest[1].split().index(otherword) + 1)

        source[2] = ' '.join(sourcethings)

    eng = ' # '.join(eng)
    hin = ' # '.join(hin)
    f.write(eng+'\n'+hin+'\n\n\n')
f.close()
'''


If an example input sentence pair in the source file is:



1# 5 # modern markets : confident consumers  # 0 0 0 0 0 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 0 0 0 0 0 0
!@#$%


the output would look like this:



1# 5 # modern markets : confident consumers  # 1 2 3 4 5 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0
!@#$%


Output explanation:
This achieves bidirectional alignment.
The first English word, 'modern', maps to the first Hindi word, 'AddhUnIk', and vice versa. Punctuation characters are also treated as words, since they are an integral part of the bidirectional mapping. Thus the Hindi word '.' has a null alignment (0): it maps to nothing in the English sentence, which has no full stop.
The third line of the output is a delimiter that separates sentence pairs when you are aligning a number of sentences.
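
To make the record format described above concrete, here is a minimal sketch (Python 3) that reads one aligned sentence pair. The helper names parse_record and aligned_pairs are hypothetical and not part of the script; the sketch assumes fields are separated by ' # ' and that alignment numbers are 1-based positions, with 0 meaning a null alignment.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers for illustration only; not part of the original script.

def parse_record(line):
    """Split 'id # sentence # alignment' into (id, words, links)."""
    ident, sentence, alignment = [part.strip() for part in line.split(" # ")]
    words = sentence.split()
    links = [int(n) for n in alignment.split()]
    return ident, words, links


def aligned_pairs(source, target):
    """Pair each source word with its aligned target word (None for 0)."""
    _, src_words, src_links = source
    _, tgt_words, _ = target
    # A link of 0 is a null alignment; any other value is a 1-based position.
    return [(word, tgt_words[link - 1] if link else None)
            for word, link in zip(src_words, src_links)]


eng = parse_record("1# 5 # modern markets : confident consumers  # 1 2 3 4 5 ")
hin = parse_record("1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0")

for word, partner in aligned_pairs(hin, eng):
    print(word, "<->", partner)
```

The last Hindi token, '.', pairs with None because its alignment number is 0.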



What modification should I make for this to work when the Hindi sentences are in Unicode (UTF-8) format?















    Please edit this question and make use of proper formatting so that the question is readable.

    – Ignacio Vazquez-Abrams
    Feb 26 '10 at 5:12


















python unicode nlp pyparsing














edited Apr 27 '10 at 22:17









ThomasH











asked Feb 26 '10 at 3:52









boddhisattva




















2 Answers


















7














As a general rule, do not process encoded bytestrings: make them into proper unicode strings (by calling their .decode method) as soon as possible, do all of your processing always on unicode strings, then, if you have to for I/O purposes, .encode them back into whatever bytestring encoding you require.



If you're talking about literals, as it seems you are in your code, the "as soon as possible" is at once: use u'...' to express your literals. In a more general case, where you're forced to do I/O in encoded form, it's immediately after input (just as it's immediately before output if you need to perform output in a specific encoded form).
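
As a minimal sketch of the "decode early, encode late" rule described above (written for Python 3; the function names decode_input and encode_output are hypothetical, and UTF-8 is assumed as the encoding):

```python
# -*- coding: utf-8 -*-

def decode_input(raw_bytes, encoding="utf-8"):
    """Turn encoded bytes into text once, at the input boundary."""
    return raw_bytes.decode(encoding)


def encode_output(text, encoding="utf-8"):
    """Turn text back into bytes only when it has to be written out."""
    return text.encode(encoding)


raw = u"अहसास".encode("utf-8")  # what a file or socket would hand over
word = decode_input(raw)         # all further processing uses this text

print(len(raw))   # length in bytes
print(len(word))  # length in code points
```

The two lengths differ because each Devanagari character takes several bytes in UTF-8, which is one reason character sets and string methods behave unexpectedly on the encoded form.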






























  • Hello Sir, thank you for your answer. What you said in the second paragraph applies exactly to my case. I tried it on the following line of the code: trans = u'Word(printables)', but I couldn't get the desired output. Could you please correct me if I made the change on the wrong line? After this change I get the error 'Expecting printables at that position' for the lines that define the grammar.

    – boddhisattva
    Feb 26 '10 at 9:06











  • @mgj, don't assign a unicode string literal to trans, that makes no sense. Just ensure printables is a unicode object (not a utf8-encoded byte string! -- nor a byte string with any other encoding!), and use trans = Word(printables). If your file is utf-8 encoded, or encoded with any other encoding, decode it by using codecs.open from the codecs module, not the built-in open as you're doing, so that each line is a unicode object, not a byte string (in whatever encoding).

    – Alex Martelli
    Feb 26 '10 at 15:11
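
A minimal sketch of reading the file as text rather than bytes, along the lines of the comment above. It uses io.open instead of codecs.open (a substitution on my part: io.open takes the same encoding argument, exists on Python 2.6+, and is the built-in open on Python 3). The file name sample_dict.txt and the helper read_lines are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Return the file's lines as text, decoded while reading."""
    with io.open(path, encoding=encoding) as handle:
        return [line.rstrip(u"\n") for line in handle]


# Write a small sample file so the sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_dict.txt")
with io.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(u"671.assess :: अहसास ::2\n")

lines = read_lines(sample_path)
print(len(lines), "line(s) read as text")
```

Every element of lines is then a text (unicode) object, so it can be passed directly to the parser.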



















26














Pyparsing's printables only deals with strings in the ASCII range of characters. You want printables in the full Unicode range, like this:



import sys

unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                             if not unichr(c).isspace())


Now you can define trans using this more complete set of non-space characters:



trans = Word(unicodePrintables)


I was unable to test against your Hindi test string, but I think this will do the trick.
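
For reference, a minimal sketch (Python 3) of a grammar with a Unicode-aware trans applied to the sample line from the question. To keep it small it narrows the character set to the Devanagari block U+0900..U+097F, which is my assumption rather than part of the answer, and it drops the trailing "." + number so that the grammar matches the sample as posted. The names DEVANAGARI, build_grammar and parse_entry are hypothetical.

```python
# -*- coding: utf-8 -*-
try:
    import pyparsing as pp  # third-party package
except ImportError:         # keeps the sketch importable without pyparsing
    pp = None

# Assumption: the Devanagari block is enough for the Hindi words in the data.
DEVANAGARI = u"".join(chr(c) for c in range(0x0900, 0x0980))


def build_grammar():
    """Grammar for entries of the form 'N.src :: trans ::N'."""
    number = pp.Word(pp.nums)
    src = pp.Word(pp.printables)  # ASCII source word, as in the question
    trans = pp.Word(DEVANAGARI)   # Hindi word made of Devanagari characters
    return number + "." + src + "::" + trans + "::" + number


def parse_entry(line):
    """Return the list of tokens for one dictionary entry."""
    return build_grammar().parseString(line).asList()
```

Calling parse_entry(u"671.assess  :: अहसास  ::2") should return the entry's tokens with the Hindi word at index 4, the position the question's script reads it from. Note that src only stops at whitespace, so this relies on the space before "::" in the input.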



If you are using Python 3, there is no separate unichr function and no xrange generator, so just use:

import sys

unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode)
                            if not chr(c).isspace())


EDIT:



With the recent release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.



import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)































  • Thank you for your answer sir..:)

    – boddhisattva
    Mar 11 '10 at 5:25











  • this answer is long since obsolete: unicode no longer is 16 bit, and looping everything isn’t performant at all.

    – flying sheep
    Sep 15 '15 at 12:33






  • 2





    @flyingsheep - good tip, updated to use sys.maxunicode instead of a hard-coded constant, so it will track with Python's sys module. As for looping everything, this bit only runs once, when initially defining a parser, and when used to create a pyparsing Word, gets stored as a set(), so parse-time performance is still quite good.

    – PaulMcG
    Jul 21 '16 at 15:20











  • This would include a lot of unprintable characters, such as u''.

    – user2357112
    Nov 16 '18 at 19:39











  • The latest version of pyparsing includes a number of Unicode ranges. All control characters below u'20' are not included.

    – PaulMcG
    Nov 16 '18 at 22:42
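
On the point raised in the comments that "not whitespace" still admits unprintable characters: one possible refinement, sketched here for Python 3 as an assumption and not as what pyparsing itself does, is to filter on the Unicode general category and drop everything in the C* (control, format, surrogate, private use, unassigned) and Z* (separator) groups. The names visible_chars and VISIBLE are hypothetical.

```python
# -*- coding: utf-8 -*-
import sys
import unicodedata


def visible_chars():
    """Join all code points whose general category is neither C* nor Z*."""
    return u"".join(
        ch for ch in map(chr, range(sys.maxunicode + 1))
        if unicodedata.category(ch)[0] not in ("C", "Z")
    )


# Built once at start-up; a set makes membership checks cheap.
VISIBLE = set(visible_chars())

print(u"\x00" in VISIBLE)  # a control character
print(u"a" in VISIBLE)     # an ordinary letter
```

One caveat: format characters such as the zero-width joiner are in category Cf and are excluded by this filter, which may matter for some Indic text.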





































answered Feb 26 '10 at 6:08









Alex Martelli













  • Hello Sir..:) Thank you for your answer.. whatever you have said in the 2nd para is exactly applicable to my case.. I tried this thing in the following line of the code: trans = u'Word(printables)' and I couldn achieve the desired output. Could you please correct me if I have made the modification in the wrong line, as after making this change the error is coming ' Expecting printables at that position ' with respect to the lines which defines the grammmar.

    – boddhisattva
    Feb 26 '10 at 9:06











  • @mgj, don't assign a unicode string literal to trans, that makes no sense. Just ensure printables is a unicode object (not a utf8-encoded byte string! -- nor a byte string with any other encoding!), and use trans = Word(printables). If your file is utf-8 encoded, or encoded with any other encoding, decode it by using codecs.open from the codecs module, not the built-in open as you're doing, so that each line is a unicode object, not a byte string (in whatever encoding).

    – Alex Martelli
    Feb 26 '10 at 15:11














































Pyparsing's printables contains only characters in the ASCII range, so Word(printables) cannot match Hindi text. You want the printable characters of the full Unicode range, like this:



import sys
unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                             if not unichr(c).isspace())


Now you can define trans using this more complete set of non-space characters:



trans = Word(unicodePrintables)


I was unable to test against your Hindi test string, but I think this will do the trick.



If you are using Python 3, there is no separate unichr function and no xrange generator; just use:

import sys
unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode)
                            if not chr(c).isspace())
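One caveat: filtering with isspace() only drops whitespace, so control characters such as NUL remain in the resulting string. A possible refinement, sketched here for Python 3, filters by Unicode general category instead. The helper name unicode_printables and the choice of excluded categories are my own assumptions, not part of pyparsing.

```python
import sys
import unicodedata

# Categories treated as non-printable in this sketch (an assumption):
# Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned,
# Zs/Zl/Zp space, line and paragraph separators.
EXCLUDED = {"Cc", "Cf", "Cs", "Co", "Cn", "Zs", "Zl", "Zp"}


def unicode_printables(start=0, stop=sys.maxunicode + 1):
    """Return the characters in [start, stop) whose category is not excluded."""
    return "".join(
        chr(c)
        for c in range(start, stop)
        if unicodedata.category(chr(c)) not in EXCLUDED
    )


# Restricting the range keeps the string small; 0x0900-0x097F is the
# main Devanagari block.
ascii_printables = unicode_printables(0x00, 0x80)
devanagari_printables = unicode_printables(0x0900, 0x0980)
```

Either string can be passed to Word(...) in place of unicodePrintables. Note that excluding the Cf category also drops the zero-width joiners that some Indic text uses, so adjust EXCLUDED if your data contains them.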


EDIT:



As of pyparsing 2.3.0, new namespace classes provide printables, alphas, nums, and alphanums for various Unicode language ranges.



import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
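To show how this fits the grammar from the question, here is a small sketch that parses the sample line. It assumes pyparsing 2.3.0 or later is installed and that a record looks like the sample (number, ".", English word, "::", Hindi word, "::", number). Using pp.alphas for the English word is my own simplification.

```python
import pyparsing as pp

# Assumed record layout, based on the sample line in the question:
#   <number>.<English word> :: <Hindi word> ::<number>
number = pp.Word(pp.nums)
src = pp.Word(pp.alphas)  # English word (simplified to ASCII letters)
trans = pp.Word(pp.pyparsing_unicode.Devanagari.printables)  # Hindi word
entry = number + "." + src + "::" + trans + "::" + number

sample = u"671.assess  :: \u0905\u0939\u0938\u093e\u0938  ::2"
tokens = entry.parseString(sample)

# The literals "." and "::" are kept as tokens, so the words sit at
# indexes 2 and 4, as in the original code.
english, hindi = tokens[2], tokens[4]
print(english, hindi)
```

Devanagari.printables is used rather than Devanagari.alphas because Hindi words contain vowel signs (such as the one in the sample word) that are not classified as alphabetic.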































  • Thank you for your answer sir..:)

    – boddhisattva
    Mar 11 '10 at 5:25











  • This answer is long since obsolete: Unicode is no longer 16-bit, and looping over every code point isn't performant at all.

    – flying sheep
    Sep 15 '15 at 12:33






  • @flyingsheep - good tip, updated to use sys.maxunicode instead of a hard-coded constant, so it will track with Python's sys module. As for looping everything, this bit only runs once, when initially defining a parser, and when used to create a pyparsing Word, gets stored as a set(), so parse-time performance is still quite good.

    – PaulMcG
    Jul 21 '16 at 15:20











  • This would include a lot of unprintable characters, such as u''.

    – user2357112
    Nov 16 '18 at 19:39











  • The latest version of pyparsing includes a number of Unicode ranges. All control characters below u'\x20' are not included.

    – PaulMcG
    Nov 16 '18 at 22:42






































edited Nov 17 '18 at 2:27

























answered Feb 26 '10 at 9:43









PaulMcG































