Python - pyparsing unicode characters


I tried using w = Word(printables), but it isn't working. How should I write the spec for this? 'w' is meant to match Hindi characters (UTF-8).

The code specifies the grammar and parses the input accordingly. A sample input line:

671.assess  :: अहसास  ::2

The grammar is:

x=number + "." + src + "::" + w + "::" + number + "." + number

If the input contains only English characters, parsing works, so the code is correct for ASCII input, but it does not work for Unicode input.

That is, the code works when the input has the form
671.assess :: ahsaas ::2

i.e. it parses words written in ASCII, but I am not sure how to parse and then print Unicode characters. I need this for English-Hindi word alignment.

The Python code looks like this:

# -*- coding: utf-8 -*-
from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit

# grammar
src = Word(printables)
trans = Word(printables)
number = Word(nums)
x = number + "." + src + "::" + trans + "::" + number + "." + number

# parsing for eng-dict
efiledata = open('b1aop_or_not_word.txt').read()
# scanString yields one (tokens, start, end) tuple per match,
# which is what the loop below indexes into
eresults = x.scanString(efiledata)
edict2 = {}
for result in eresults:
  xx = result[0]  # tokens of this match
  ew = xx[2]  # english word
  trans = xx[4]  # translation string
  edict2[ew] = trans
print len(edict2)  # no of entries in the english dictionary
print "edict2 has been created"
print "english dictionary", edict2

# parsing for hin-dict
hfiledata = open('b1aop_or_not_word.txt').read()
hresults = x.scanString(hfiledata)
hdict2 = {}
for result in hresults:
  xx = result[0]  # tokens of this match
  hw = xx[2]  # hin word
  trans = xx[4]  # translation string
  #print trans
  hdict2[trans] = hw
print len(hdict2)  # no of entries in the hindi dictionary
print "hdict2 has been created"
print "hindi dictionary", hdict2

'''
#######################################################################################################################

def translate(d, ow, hinlist):
    if ow in d.keys():  # ow=old word d=dict
        print ow, "exists in the dictionary keys"
        transes = d[ow]
        transes = transes.split()
        print "possible transes for", ow, " = ", transes
        for word in transes:
            if word in hinlist:
                print "trans for", ow, " = ", word
                return word
        return None
    else:
        print ow, "absent"
        return None

f = open('bidir','w')
#lines = ["'
#5# 10 # and better performance in business in turn benefits consumers .  # 0 0 0 0 0 0 0 0 0 0 
#5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI .  # 0 0 0 0 0 0 0 0 0 0 0 
#'"]
data=open('bi_full_2','rb').read()
lines = data.split('!@#$%')
loc=0
for line in lines:
    eng, hin = [subline.split(' # ')
                for subline in line.strip('\n').split('\n')]

    for transdict, source, dest in [(edict2, eng, hin),
                                    (hdict2, hin, eng)]:
        sourcethings = source[2].split()
        for word in source[1].split():
            tl = dest[1].split()
            otherword = translate(transdict, word, tl)
            loc = source[1].split().index(word)
            if otherword is not None:
                otherword = otherword.strip()
                print word, ' <-> ', otherword, 'meaning=good'
                if otherword in dest[1].split():
                    print word, ' <-> ', otherword, 'trans=good'
                    sourcethings[loc] = str(
                        dest[1].split().index(otherword) + 1)

        source[2] = ' '.join(sourcethings)

    eng = ' # '.join(eng)
    hin = ' # '.join(hin)
    f.write(eng+'\n'+hin+'\n\n\n')
f.close()
'''

If an example input for the source file is:

1# 5 # modern markets : confident consumers  # 0 0 0 0 0 

1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 0 0 0 0 0 0 

!@#$%

the output would look like this:

1# 5 # modern markets : confident consumers  # 1 2 3 4 5 

1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 1 2 3 4 5 0 

!@#$%

Output explanation:
This achieves bidirectional alignment.
It means the first word of the English sentence, 'modern', maps to the first word of the Hindi sentence, 'AddhUnIk', and vice versa. Here even punctuation characters are treated as words, as they are also an integral part of the bidirectional mapping. Thus the Hindi word '.' has a null alignment (0): it maps to nothing in the English sentence, which has no full stop.
The third line in the output is a delimiter that separates sentence pairs when you are aligning a number of sentences.
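
The index computation described above can be sketched as follows. This is a minimal illustration in Python 3, not the code from the question: the function name align and the two small dictionaries are hypothetical, made up to reproduce the sample output.

```python
def align(source_tokens, dest_tokens, dictionary):
    """Return, for each source token, the 1-based position of its
    translation in dest_tokens, or 0 when no translation is found."""
    positions = []
    for token in source_tokens:
        position = 0  # 0 means "null alignment"
        # A dictionary entry may list several space-separated candidates.
        for candidate in dictionary.get(token, "").split():
            if candidate in dest_tokens:
                position = dest_tokens.index(candidate) + 1
                break
        positions.append(position)
    return positions


english = "modern markets : confident consumers".split()
hindi = "AddhUnIk baajaar : AshHvsHt upbhOkHtaa .".split()

# Hypothetical dictionaries, only large enough for this sentence pair.
en_to_hi = {
    "modern": "AddhUnIk",
    "markets": "baajaar",
    ":": ":",
    "confident": "AshHvsHt",
    "consumers": "upbhOkHtaa",
}
hi_to_en = {hindi_word: english_word for english_word, hindi_word in en_to_hi.items()}

english_alignment = align(english, hindi, en_to_hi)
hindi_alignment = align(hindi, english, hi_to_en)
print(english_alignment)
print(hindi_alignment)
```

The Hindi '.' has no dictionary entry, so it receives 0, matching the last column of the second output line.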

What modification should I make for this to work if I have the Hindi sentences in Unicode (UTF-8) format?

edited Apr 27 '10 at 22:17

ThomasH

13.3k85053

asked Feb 26 '10 at 3:52

boddhisattva

2,45294268

1

Please edit this question and make use of proper formatting so that the question is readable.

– Ignacio Vazquez-Abrams
Feb 26 '10 at 5:12

python unicode nlp pyparsing


2 Answers

As a general rule, do not process encoded bytestrings: make them into proper unicode strings (by calling their .decode method) as soon as possible, do all of your processing always on unicode strings, then, if you have to for I/O purposes, .encode them back into whatever bytestring encoding you require.

If you're talking about literals, as it seems you are in your code, the "as soon as possible" is at once: use u'...' to express your literals. In a more general case, where you're forced to do I/O in encoded form, it's immediately after input (just as it's immediately before output if you need to perform output in a specific encoded form).
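
A minimal sketch of that "decode on input, encode on output" rule, written for Python 3 (io.open also exists in Python 2.6+, where it returns unicode objects). The helper names read_text and write_text and the temporary file are assumptions made for the illustration; they are not part of the question's code.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Decode at the input boundary: the caller gets text, never bytes."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_text(path, text, encoding="utf-8"):
    """Encode at the output boundary: text goes in, bytes reach the disk."""
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")  # throwaway demo file
    write_text(path, "671.assess  :: अहसास  ::2\n")
    text = read_text(path)

# All processing happens on the decoded string.
hindi_word = text.split("::")[1].strip()
code_points = len(hindi_word)                 # characters in the word
utf8_bytes = len(hindi_word.encode("utf-8"))  # bytes in its UTF-8 form
print(code_points, utf8_bytes)
```

The two lengths differ because each Devanagari code point here takes three bytes in UTF-8, which is why matching against the raw bytes behaves differently from matching against decoded text.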

answered Feb 26 '10 at 6:08

Alex Martelli

637k12910461286

Hello Sir :) Thank you for your answer. What you said in the second paragraph applies exactly to my case. I tried it in the following line of the code: trans = u'Word(printables)', but I couldn't achieve the desired output. Could you please correct me if I made the modification in the wrong line? After this change I get the error 'Expecting printables at that position' for the lines that define the grammar.

– boddhisattva
Feb 26 '10 at 9:06

@mgj, don't assign a unicode string literal to trans, that makes no sense. Just ensure printables is a unicode object (not a utf8-encoded byte string! -- nor a byte string with any other encoding!), and use trans = Word(printables). If your file is utf-8 encoded, or encoded with any other encoding, decode it by using codecs.open from the codecs module, not the built-in open as you're doing, so that each line is a unicode object, not a byte string (in whatever encoding).

– Alex Martelli
Feb 26 '10 at 15:11


Pyparsing's printables only deals with strings in the ASCII range of characters. You want printables in the full Unicode range, like this:

import sys

unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                                        if not unichr(c).isspace())

Now you can define trans using this more complete set of non-space characters:

trans = Word(unicodePrintables)

I was unable to test against your Hindi test string, but I think this will do the trick.

If you are using Python 3, there is no separate unichr function and no xrange generator, so just use:

import sys

unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode)
                                        if not chr(c).isspace())
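
The set built this way still contains control and unassigned code points. One possible refinement, sketched here for Python 3 with the standard unicodedata module, is to filter by Unicode general category instead. The function name build_unicode_printables and the choice of excluded categories are assumptions for illustration, not something pyparsing prescribes.

```python
import sys
import unicodedata


def build_unicode_printables():
    """Collect every code point that is neither a separator (Z*) nor in
    an 'other' category (C*: control, format, surrogate, private use,
    unassigned)."""
    chars = []
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        if ch.isspace():
            continue
        if unicodedata.category(ch)[0] in ("C", "Z"):
            continue
        chars.append(ch)
    return "".join(chars)


# Runs once at start-up; it scans the whole code space, so it is not instant.
filtered_printables = build_unicode_printables()
print(len(filtered_printables))
```

Note that format characters such as ZERO WIDTH JOINER (category Cf) are dropped by this filter. If the Hindi input uses them inside words, they would need to be added back to the string before passing it to Word.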

EDIT:

With the recent release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.

import pyparsing as pp

pp.Word(pp.pyparsing_unicode.printables)

pp.Word(pp.pyparsing_unicode.Devanagari.printables)

pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
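
Applied to the sample line from the question, this could look like the sketch below. It assumes pyparsing 2.3.0 or later is installed and that the input is already a decoded string. The grammar is shortened to end in a single number so that it matches the sample line; the name entry is made up for the illustration.

```python
import pyparsing as pp

number = pp.Word(pp.nums)
src = pp.Word(pp.printables)  # ASCII source word
trans = pp.Word(pp.pyparsing_unicode.Devanagari.printables)  # Hindi word

# Shortened variant of the question's grammar (assumption: one trailing number).
entry = number + "." + src + "::" + trans + "::" + number

tokens = entry.parseString("671.assess  :: अहसास  ::2")
english_word = tokens[2]
hindi_word = tokens[4]
print(len(tokens), english_word)
```

Restricting trans to the Devanagari range also makes the grammar reject an ASCII word in that position, which may or may not be wanted; pp.pyparsing_unicode.printables accepts both.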

edited Nov 17 '18 at 2:27

answered Feb 26 '10 at 9:43

PaulMcG

47k969111

Thank you for your answer sir..:)

– boddhisattva
Mar 11 '10 at 5:25

this answer is long since obsolete: unicode no longer is 16 bit, and looping everything isn’t performant at all.

– flying sheep
Sep 15 '15 at 12:33

2

@flyingsheep - good tip, updated to use sys.maxunicode instead of a hard-coded constant, so it will track with Python's sys module. As for looping everything, this bit only runs once, when initially defining a parser, and when used to create a pyparsing Word, gets stored as a set(), so parse-time performance is still quite good.

– PaulMcG
Jul 21 '16 at 15:20

This would include a lot of unprintable characters, such as u''.

– user2357112
Nov 16 '18 at 19:39

The latest version of pyparsing includes a number of Unicode ranges. All control characters below u'20' are not included.

– PaulMcG
Nov 16 '18 at 22:42

|
show 1 more comment

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f2339386%2fpython-pyparsing-unicode-characters%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

answered Feb 26 '10 at 6:08

Alex Martelli

637k12910461286

Hello Sir..:) Thank you for your answer.. whatever you have said in the 2nd para is exactly applicable to my case.. I tried this thing in the following line of the code: trans = u'Word(printables)' and I couldn achieve the desired output. Could you please correct me if I have made the modification in the wrong line, as after making this change the error is coming ' Expecting printables at that position ' with respect to the lines which defines the grammmar.

– boddhisattva
Feb 26 '10 at 9:06

@mgj, don't assign a unicode string literal to trans, that makes no sense. Just ensure printables is a unicode object (not a utf8-encoded byte string! -- nor a byte string with any other encoding!), and use trans = Word(printables). If your file is utf-8 encoded, or encoded with any other encoding, decode it by using codecs.open from the codecs module, not the built-in open as you're doing, so that each line is a unicode object, not a byte string (in whatever encoding).

– Alex Martelli
Feb 26 '10 at 15:11

add a comment |

answered Feb 26 '10 at 6:08

Alex Martelli

637k12910461286

Hello Sir..:) Thank you for your answer.. whatever you have said in the 2nd para is exactly applicable to my case.. I tried this thing in the following line of the code: trans = u'Word(printables)' and I couldn achieve the desired output. Could you please correct me if I have made the modification in the wrong line, as after making this change the error is coming ' Expecting printables at that position ' with respect to the lines which defines the grammmar.

– boddhisattva
Feb 26 '10 at 9:06

@mgj, don't assign a unicode string literal to trans, that makes no sense. Just ensure printables is a unicode object (not a utf8-encoded byte string! -- nor a byte string with any other encoding!), and use trans = Word(printables). If your file is utf-8 encoded, or encoded with any other encoding, decode it by using codecs.open from the codecs module, not the built-in open as you're doing, so that each line is a unicode object, not a byte string (in whatever encoding).

– Alex Martelli
Feb 26 '10 at 15:11

add a comment |

answered Feb 26 '10 at 6:08

Alex Martelli

637k12910461286

answered Feb 26 '10 at 6:08

Alex Martelli

637k12910461286

answered Feb 26 '10 at 6:08

Alex Martelli

637k12910461286

answered Feb 26 '10 at 6:08

Alex Martelli

637k12910461286

answered Feb 26 '10 at 6:08

Alex Martelli

637k12910461286

Hello Sir..:) Thank you for your answer.. whatever you have said in the 2nd para is exactly applicable to my case.. I tried this thing in the following line of the code: trans = u'Word(printables)' and I couldn achieve the desired output. Could you please correct me if I have made the modification in the wrong line, as after making this change the error is coming ' Expecting printables at that position ' with respect to the lines which defines the grammmar.

– boddhisattva
Feb 26 '10 at 9:06

@mgj, don't assign a unicode string literal to trans, that makes no sense. Just ensure printables is a unicode object (not a utf8-encoded byte string! -- nor a byte string with any other encoding!), and use trans = Word(printables). If your file is utf-8 encoded, or encoded with any other encoding, decode it by using codecs.open from the codecs module, not the built-in open as you're doing, so that each line is a unicode object, not a byte string (in whatever encoding).

– Alex Martelli
Feb 26 '10 at 15:11

add a comment |

Hello Sir..:) Thank you for your answer.. whatever you have said in the 2nd para is exactly applicable to my case.. I tried this thing in the following line of the code: trans = u'Word(printables)' and I couldn achieve the desired output. Could you please correct me if I have made the modification in the wrong line, as after making this change the error is coming ' Expecting printables at that position ' with respect to the lines which defines the grammmar.

– boddhisattva
Feb 26 '10 at 9:06

@mgj, don't assign a unicode string literal to trans, that makes no sense. Just ensure printables is a unicode object (not a utf8-encoded byte string! -- nor a byte string with any other encoding!), and use trans = Word(printables). If your file is utf-8 encoded, or encoded with any other encoding, decode it by using codecs.open from the codecs module, not the built-in open as you're doing, so that each line is a unicode object, not a byte string (in whatever encoding).

– Alex Martelli
Feb 26 '10 at 15:11

Hello Sir..:) Thank you for your answer.. whatever you have said in the 2nd para is exactly applicable to my case.. I tried this thing in the following line of the code: trans = u'Word(printables)' and I couldn achieve the desired output. Could you please correct me if I have made the modification in the wrong line, as after making this change the error is coming ' Expecting printables at that position ' with respect to the lines which defines the grammmar.

– boddhisattva
Feb 26 '10 at 9:06

@mgj, don't assign a unicode string literal to trans, that makes no sense. Just ensure printables is a unicode object (not a utf8-encoded byte string! -- nor a byte string with any other encoding!), and use trans = Word(printables). If your file is utf-8 encoded, or encoded with any other encoding, decode it by using codecs.open from the codecs module, not the built-in open as you're doing, so that each line is a unicode object, not a byte string (in whatever encoding).

– Alex Martelli
Feb 26 '10 at 15:11

add a comment |

Pyparsing's printables only deals with strings in the ASCII range of characters. You want printables in the full Unicode range, like this:

unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode) 

                                        if not unichr(c).isspace())

Now you can define trans using this more complete set of non-space characters:

trans = Word(unicodePrintables)

I was unable to test against your Hindi test string, but I think this will do the trick.

(If you are using Python 3, then there is no separate unichr function, and no xrange generator, just use:

unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode) 

                                        if not chr(c).isspace())

EDIT:

With the recent release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.

import pyparsing as pp

pp.Word(pp.pyparsing_unicode.printables)

pp.Word(pp.pyparsing_unicode.Devanagari.printables)

pp.Word(pp.pyparsing_unicode.देवनागरी.printables)

edited Nov 17 '18 at 2:27

answered Feb 26 '10 at 9:43

PaulMcG


Thank you for your answer sir..:)

– boddhisattva
Mar 11 '10 at 5:25

this answer is long since obsolete: unicode no longer is 16 bit, and looping everything isn’t performant at all.

– flying sheep
Sep 15 '15 at 12:33


@flyingsheep - good tip, updated to use sys.maxunicode instead of a hard-coded constant, so it will track with Python's sys module. As for looping everything, this bit only runs once, when initially defining a parser, and when used to create a pyparsing Word, gets stored as a set(), so parse-time performance is still quite good.

– PaulMcG
Jul 21 '16 at 15:20

This would include a lot of unprintable characters, such as u''.

– user2357112
Nov 16 '18 at 19:39

The latest version of pyparsing includes a number of Unicode ranges. All control characters below u'20' are not included.

– PaulMcG
Nov 16 '18 at 22:42


