group and classify words as well as characters
I need to split on slash and then report tags. This is the hunspell dictionary format. I tried to find a class on GitHub that would do this, but could not find one.
# vi test.txt
test/S
boy
girl/SE
home/
house/SE123
man/E
country
wind/ES
The code:
from collections import defaultdict

myl = defaultdict(list)
with open('test.txt') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            myl[tags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
        except:
            pass
output:
defaultdict(list,
{'S': ['test', 'test', 'girl', 'house', 'wind'],
'SE': ['girl'],
'E': ['girl', 'house', 'man', 'man', 'wind'],
'': ['home'],
'SE123': ['house'],
'1': ['house'],
'2': ['house'],
'3': ['house'],
'ES': ['wind']})
The SE group should have 3 words: 'girl', 'wind' and 'house'. There should be no ES group, because it is the same set as SE, and SE123 should remain as is. How do I achieve this?
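As a first step toward that grouping, one option (a minimal sketch, not a full solution) is to sort each tag string before using it as a dictionary key, so that ES and SE collapse into one group. Note that sorting alone does not pull 'house' (SE123) into the shared group; that needs subset expansion.

```python
from collections import defaultdict

# Minimal sketch: sort each tag string before using it as a key,
# so 'SE' and 'ES' map to the same group. Sample data inlined
# instead of reading test.txt, for illustration only.
lines = ['test/S', 'boy', 'girl/SE', 'home/', 'house/SE123',
         'man/E', 'country', 'wind/ES']

myl = defaultdict(list)
for line in lines:
    word, sep, tags = line.partition('/')
    if sep:                                 # skip untagged words
        myl[''.join(sorted(tags))].append(word)

# 'girl' and 'wind' now share the key 'ES' (the sorted form of 'SE');
# 'house' stays under its full sorted tag set '123ES' unless
# subsets are expanded as well.
```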
Update:
I have managed to add bigrams, but how do I add 3-, 4- and 5-grams?
from collections import defaultdict
import nltk

myl = defaultdict(list)
with open('hi_IN.dic') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            ntags = ''.join(sorted(tags))
            myl[ntags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
            bigrm = list(nltk.bigrams([i for i in tags]))
            nlist = [x + y for x, y in bigrm]
            for t1 in nlist:
                t1a = ''.join(sorted(t1))
                myl[t1a].append(l.split('/')[0])
        except:
            pass
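The bigram step can be generalized without nltk: itertools.combinations over the sorted tag string yields every subset of length 1 to n in one pass, which covers trigrams, quadgrams and "skipgrams" alike. A hedged sketch using the question's sample data:

```python
from collections import defaultdict
from itertools import combinations

# Sketch generalizing the bigram step: combinations of the sorted
# tag string give every subset of length 1..n, so tri-grams and
# longer (including non-adjacent "skipgrams") fall out of one loop.
lines = ['girl/SE', 'house/SE123', 'wind/ES']

myl = defaultdict(list)
for line in lines:
    word, _, tags = line.partition('/')
    stags = ''.join(sorted(tags))           # canonical tag order
    for n in range(1, len(stags) + 1):
        for combo in combinations(stags, n):
            myl[''.join(combo)].append(word)
```

Because 'house' (SE123) contributes every subset of its tags, it now appears under 'ES' alongside 'girl' and 'wind'. Beware the exponential blowup: a word with n tags lands in 2**n - 1 groups.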
I guess it would help if I sort the tags at source:
with open('test1.txt', 'w') as nf:
    with open('test.txt') as f:
        for l in f:
            l = l.rstrip()
            try:
                tags = l.split('/')[1]
            except IndexError:
                nline = l
            else:
                ntags = ''.join(sorted(tags))
                nline = l.split('/')[0] + '/' + ntags
            nf.write(nline + '\n')
This will create a new file test1.txt with sorted tags. But the problem of trigrams and above is still not resolved.
I downloaded a sample file:
!wget https://raw.githubusercontent.com/wooorm/dictionaries/master/dictionaries/en-US/index.dic
The report using the "grep" command is correct.
!grep 'P.*U' index1.dic
CPU/M
GPU
aware/PU
cleanly/PRTU
common/PRTUY
conscious/PUY
easy/PRTU
faithful/PUY
friendly/PRTU
godly/PRTU
grateful/PUY
happy/PRTU
healthy/PRTU
holy/PRTU
kind/PRTUY
lawful/PUY
likely/PRTU
lucky/PRTU
natural/PUY
obtrusive/PUY
pleasant/PTUY
prepared/PU
reasonable/PU
responsive/PUY
righteous/PU
scrupulous/PUY
seemly/PRTU
selfish/PUY
timely/PRTU
truthful/PUY
wary/PRTU
wholesome/PU
willing/PUY
worldly/PTU
worthy/PRTU
The Python report using bigrams on the sorted-tags file does not contain all the words mentioned above.
myl['PU']
['aware',
'aware',
'conscious',
'faithful',
'grateful',
'lawful',
'natural',
'obtrusive',
'prepared',
'prepared',
'reasonable',
'reasonable',
'responsive',
'righteous',
'righteous',
'scrupulous',
'selfish',
'truthful',
'wholesome',
'wholesome',
'willing']
Tags: python, nltk, hunspell
This question has an open bounty worth +50
reputation from shantanuo ending tomorrow.
This question has not received enough attention.
It seems more manageable if you split this into 2 steps, in my opinion: 1. Create the set of keys for each tag. 2. Loop through your list of value/tag pairs and append the value to each of the keys for that tag.
– kerwei
Nov 1 at 8:45
Is 5-gram the maximum that you need, or would you need 6-grams if there were 6 tags? Do you also need skipgrams, like S1, S2? Do you need all combinations of the tags?
– Tomáš Přinda
Nov 9 at 3:43
Yes, I need all combinations of tags. That includes skipgrams and 5-grams, 6-grams, depending upon the number of tags used. In other words, I need a report of how the words are tagged, e.g. from this file: github.com/wooorm/dictionaries/blob/master/dictionaries/en-US/…
– shantanuo
yesterday
asked Nov 1 at 7:57 by shantanuo (edited yesterday)
1 Answer
If I understand it correctly, this is more a matter of constructing a data structure that, for a given tag, produces the correct list. We can do this by building a dictionary that only takes singular tags into account. Later, when a person queries on multiple tags, we calculate the intersection. This keeps the representation compact, and makes it easy to query: for example, asking for all the elements with tag AC will list elements with tag ABCD, ACD, ZABC, etc.
We can thus construct a parser:
from collections import defaultdict

class Hunspell(object):

    def __init__(self, data):
        self.data = data

    def __getitem__(self, tags):
        if not tags:
            return self.data.get(None, ())
        elements = [self.data.get(tag, ()) for tag in tags]
        data = set.intersection(*map(set, elements))
        return [e for e in self.data.get(tags[0], ()) if e in data]

    @staticmethod
    def load(f):
        data = defaultdict(list)
        for line in f:
            try:
                element, tags = line.rstrip().split('/', 1)
                for tag in tags:
                    data[tag].append(element)
                data[None].append(element)
            except ValueError:
                pass  # element with no tags
        return Hunspell(dict(data))
The list processing at the end of __getitem__ is done to retrieve the elements in the correct order.
We can then load the file into memory with:
>>> with open('test.txt') as f:
... h = Hunspell.load(f)
and query it for arbitrary keys:
>>> h['SE']
['girl', 'house', 'wind']
>>> h['ES']
['girl', 'house', 'wind']
>>> h['1']
['house']
>>> h['']
['test', 'girl', 'home', 'house', 'man', 'wind']
>>> h['S3']
['house']
>>> h['S2']
['house']
>>> h['SE2']
['house']
>>> h[None]
['test', 'girl', 'home', 'house', 'man', 'wind']
>>> h['4']
[]
Querying for non-existing tags results in an empty list. Here we thus postpone the "intersection" process until the moment of the call. We could in fact generate all possible intersections in advance, but that would result in a large data structure, and perhaps a large amount of work.
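To quantify that caveat about precomputing: a word with n tags belongs to 2**n - 1 non-empty tag combinations, so eager materialization grows quickly with the number of tags per word. A small sketch (my own illustration, not part of the answer):

```python
from itertools import combinations

# Sketch quantifying the precomputation cost: counting the non-empty
# tag subsets a single word would be filed under.
def n_subsets(tags):
    return sum(1 for n in range(1, len(tags) + 1)
               for _ in combinations(tags, n))

assert n_subsets('PU') == 3        # P, U, PU
assert n_subsets('PRTUY') == 31    # 2**5 - 1
```

The lazy intersection in the answer avoids this entirely, at the cost of a set intersection per lookup.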
This is correct. But I will wait for some time before accepting the answer, because it is a little difficult to understand :)
– shantanuo
yesterday
answered yesterday by Willem Van Onsem