group and classify words as well as characters











I need to split each line on the slash and then report the tags. This is the hunspell dictionary format. I tried to find a class on GitHub that would do this, but could not find one.



# vi test.txt
test/S
boy
girl/SE
home/
house/SE123
man/E
country
wind/ES


The code:



from collections import defaultdict
myl = defaultdict(list)

with open('test.txt') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            myl[tags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
        except IndexError:  # line has no '/'
            pass


output:



defaultdict(list,
{'S': ['test', 'test', 'girl', 'house', 'wind'],
'SE': ['girl'],
'E': ['girl', 'house', 'man', 'man', 'wind'],
'': ['home'],
'SE123': ['house'],
'1': ['house'],
'2': ['house'],
'3': ['house'],
'ES': ['wind']})


The SE group should have three words: 'girl', 'wind' and 'house'. There should be no ES group, because it is the same as SE, while SE123 should remain as-is. How do I achieve this?





Update:



I have managed to add bigrams, but how do I add 3-, 4- and 5-grams?



from collections import defaultdict
import nltk
myl = defaultdict(list)

with open('hi_IN.dic') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            ntags = ''.join(sorted(tags))
            myl[ntags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
            bigrm = list(nltk.bigrams(tags))
            nlist = [x + y for x, y in bigrm]
            for t1 in nlist:
                t1a = ''.join(sorted(t1))
                myl[t1a].append(l.split('/')[0])
        except IndexError:  # line has no '/'
            pass




I guess it would help if I sorted the tags at the source:



with open('test1.txt', 'w') as nf:
    with open('test.txt') as f:
        for l in f:
            l = l.rstrip()
            try:
                tags = l.split('/')[1]
            except IndexError:
                nline = l
            else:
                ntags = ''.join(sorted(tags))
                nline = l.split('/')[0] + '/' + ntags
            nf.write(nline + '\n')


This will create a new file, test1.txt, with sorted tags. But the problem of trigrams and beyond is still not resolved.
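A sketch of one way to cover trigrams and beyond (a hypothetical helper, not from the question; it assumes duplicate tags on a line can be dropped): replace nltk.bigrams with itertools.combinations over the sorted tag string, so every word is indexed under all of its tag pairs, triples, and so on, up to the full tag string.

```python
from collections import defaultdict
from itertools import combinations

def build_index(lines):
    """Index each word under every sorted combination of its tags."""
    index = defaultdict(list)
    for line in lines:
        word, _, tags = line.partition('/')
        tags = ''.join(sorted(set(tags)))  # sort once, drop duplicate tags
        # note: a word with k distinct tags lands under 2**k - 1 keys
        for n in range(1, len(tags) + 1):
            for combo in combinations(tags, n):
                index[''.join(combo)].append(word)
    return index

idx = build_index(['test/S', 'girl/SE', 'house/SE123', 'man/E', 'wind/ES'])
```

Keys are stored in sorted order, so both SE and ES map to the key 'ES' (query with ''.join(sorted(tags))).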





I downloaded a sample file:



!wget https://raw.githubusercontent.com/wooorm/dictionaries/master/dictionaries/en-US/index.dic



The report produced by the "grep" command is correct:



!grep 'P.*U' index1.dic

CPU/M
GPU
aware/PU
cleanly/PRTU
common/PRTUY
conscious/PUY
easy/PRTU
faithful/PUY
friendly/PRTU
godly/PRTU
grateful/PUY
happy/PRTU
healthy/PRTU
holy/PRTU
kind/PRTUY
lawful/PUY
likely/PRTU
lucky/PRTU
natural/PUY
obtrusive/PUY
pleasant/PTUY
prepared/PU
reasonable/PU
responsive/PUY
righteous/PU
scrupulous/PUY
seemly/PRTU
selfish/PUY
timely/PRTU
truthful/PUY
wary/PRTU
wholesome/PU
willing/PUY
worldly/PTU
worthy/PRTU


The Python report using bigrams on the sorted-tags file does not contain all of the words listed above:



myl['PU']

['aware',
'aware',
'conscious',
'faithful',
'grateful',
'lawful',
'natural',
'obtrusive',
'prepared',
'prepared',
'reasonable',
'reasonable',
'responsive',
'righteous',
'righteous',
'scrupulous',
'selfish',
'truthful',
'wholesome',
'wholesome',
'willing']
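A minimal illustration of why the bigram approach misses words (my own example, not from the question): adjacent bigrams only ever pair neighbouring characters, so for a word tagged PRTU the pair PU is never generated, while unordered combinations (skip-grams) include it.

```python
from itertools import combinations

tags = 'PRTU'  # a sorted tag string, as in the generated file
bigrams = [a + b for a, b in zip(tags, tags[1:])]    # adjacent pairs only
pairs = [''.join(c) for c in combinations(tags, 2)]  # all unordered pairs
# 'PU' is absent from bigrams but present in pairs
```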









This question has an open bounty worth +50
reputation from shantanuo ending tomorrow.


This question has not received enough attention.
















  • It seems more manageable to split this into two steps, in my opinion: 1. Create the set of keys for each tag. 2. Loop through your list of word/tag entries and append the word to each of the keys for that tag.
    – kerwei
    Nov 1 at 8:45










  • Is 5-gram the maximum that you need, or would you need 6-grams if there were 6 tags? Do you also need skip-grams, like S1, S2? Do you need all combinations of the tags?
    – Tomáš Přinda
    Nov 9 at 3:43










  • Yes, I need all combinations of tags. That includes skip-grams and 5-grams, 6-grams, depending on the number of tags used. In other words, I need a report of how the words are tagged, e.g. from this file: github.com/wooorm/dictionaries/blob/master/dictionaries/en-US/…
    – shantanuo
    yesterday















python nltk hunspell






edited yesterday; asked Nov 1 at 7:57 by shantanuo





1 Answer
If I understand correctly, this is more a matter of constructing a data structure that, for a given tag, constructs the correct list. We can do this by building a dictionary that only takes singular tags into account. Later, when a query uses multiple tags, we calculate the intersection. This keeps the representation compact and makes it easy to extract, for example, all the elements with tag AC: that query will list elements tagged ABCD, ACD, ZABC, etc.



We can thus construct a parser:



from collections import defaultdict

class Hunspell(object):

    def __init__(self, data):
        self.data = data

    def __getitem__(self, tags):
        if not tags:
            return self.data.get(None, ())

        elements = [self.data.get(tag, ()) for tag in tags]
        data = set.intersection(*map(set, elements))
        return [e for e in self.data.get(tags[0], ()) if e in data]

    @staticmethod
    def load(f):
        data = defaultdict(list)
        for line in f:
            try:
                element, tags = line.rstrip().split('/', 1)
                for tag in tags:
                    data[tag].append(element)
                data[None].append(element)
            except ValueError:
                pass  # element with no tags
        return Hunspell(dict(data))


The list processing at the end of __getitem__ is done to retrieve the elements in the correct order.
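A small illustration of why that final filter is needed (my own example, not from the answer): set.intersection discards insertion order, so re-walking the first tag's list restores the original file order.

```python
# words tagged S and E respectively, in file order
tagged_s = ['test', 'girl', 'house', 'wind']
tagged_e = ['girl', 'house', 'man', 'wind']

common = set(tagged_s) & set(tagged_e)          # order not guaranteed
ordered = [w for w in tagged_s if w in common]  # file order restored
```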



We can then load the file into memory with:



>>> with open('test.txt') as f:
... h = Hunspell.load(f)


and query it for arbitrary keys:



>>> h['SE']
['girl', 'house', 'wind']
>>> h['ES']
['girl', 'house', 'wind']
>>> h['1']
['house']
>>> h['']
['test', 'girl', 'home', 'house', 'man', 'wind']
>>> h['S3']
['house']
>>> h['S2']
['house']
>>> h['SE2']
['house']
>>> h[None]
['test', 'girl', 'home', 'house', 'man', 'wind']
>>> h['4']



Querying for a non-existing tag results in an empty list. We thus postpone the "intersection" work until lookup time. We could instead generate all possible intersections up front, but that would produce a large data structure, and perhaps a large amount of wasted work.
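For contrast, a sketch of that eager variant (hypothetical names, built from an illustrative subset of the single-tag dictionary): with k distinct tags it materialises up to 2**k - 1 combination keys, which is why the lazy lookup usually scales better.

```python
from itertools import combinations

# single-tag dictionary for test.txt (illustrative subset)
single = {
    'S': ['test', 'girl', 'house', 'wind'],
    'E': ['girl', 'house', 'man', 'wind'],
    '1': ['house'],
}

eager = {}
for n in range(1, len(single) + 1):
    for combo in combinations(sorted(single), n):
        common = set.intersection(*(set(single[t]) for t in combo))
        # keep the order of the first tag's list, as __getitem__ does
        eager[''.join(combo)] = [w for w in single[combo[0]] if w in common]
```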






  • This is correct. But I will wait for some time before accepting the answer, because it is a little difficult to understand :)
    – shantanuo
    yesterday











answered yesterday by Willem Van Onsem







