group and classify words as well as characters
I need to split on slash and then report tags. This is the hunspell dictionary format. I tried to find a class on GitHub that would do this, but could not find one.
# vi test.txt
test/S
boy
girl/SE
home/
house/SE123
man/E
country
wind/ES
The code:
from collections import defaultdict

myl = defaultdict(list)
with open('test.txt') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            myl[tags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
        except:
            pass
output:
defaultdict(list,
{'S': ['test', 'test', 'girl', 'house', 'wind'],
'SE': ['girl'],
'E': ['girl', 'house', 'man', 'man', 'wind'],
'': ['home'],
'SE123': ['house'],
'1': ['house'],
'2': ['house'],
'3': ['house'],
'ES': ['wind']})
The SE group should have 3 words: 'girl', 'wind' and 'house'. There should be no ES group, because it is the same set as SE, and SE123 should remain as is. How do I achieve this?
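As a first step toward that grouping, one option (a minimal sketch, not a full solution) is to sort each tag string before using it as a dictionary key, so that ES and SE collapse into one group. Note that sorting alone does not pull 'house' (SE123) into the shared group; that needs subset expansion.

```python
from collections import defaultdict

# Minimal sketch: sort each tag string before using it as a key,
# so 'SE' and 'ES' map to the same group. Sample data inlined
# instead of reading test.txt, for illustration only.
lines = ['test/S', 'boy', 'girl/SE', 'home/', 'house/SE123',
         'man/E', 'country', 'wind/ES']

myl = defaultdict(list)
for line in lines:
    word, sep, tags = line.partition('/')
    if sep:                                 # skip untagged words
        myl[''.join(sorted(tags))].append(word)

# 'girl' and 'wind' now share the key 'ES' (the sorted form of 'SE');
# 'house' stays under its full sorted tag set '123ES' unless
# subsets are expanded as well.
```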
Update:
I have managed to add bigrams, but how do I add 3-, 4- and 5-grams?
from collections import defaultdict
import nltk

myl = defaultdict(list)
with open('hi_IN.dic') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            ntags = ''.join(sorted(tags))
            myl[ntags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
            bigrm = list(nltk.bigrams([i for i in tags]))
            nlist = [x + y for x, y in bigrm]
            for t1 in nlist:
                t1a = ''.join(sorted(t1))
                myl[t1a].append(l.split('/')[0])
        except:
            pass
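The bigram step can be generalized without nltk: itertools.combinations over the sorted tag string yields every subset of length 1 to n in one pass, which covers trigrams, quadgrams and "skipgrams" alike. A hedged sketch using the question's sample data:

```python
from collections import defaultdict
from itertools import combinations

# Sketch generalizing the bigram step: combinations of the sorted
# tag string give every subset of length 1..n, so tri-grams and
# longer (including non-adjacent "skipgrams") fall out of one loop.
lines = ['girl/SE', 'house/SE123', 'wind/ES']

myl = defaultdict(list)
for line in lines:
    word, _, tags = line.partition('/')
    stags = ''.join(sorted(tags))           # canonical tag order
    for n in range(1, len(stags) + 1):
        for combo in combinations(stags, n):
            myl[''.join(combo)].append(word)
```

Because 'house' (SE123) contributes every subset of its tags, it now appears under 'ES' alongside 'girl' and 'wind'. Beware the exponential blowup: a word with n tags lands in 2**n - 1 groups.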
I guess it would help if I sort the tags at source:
with open('test1.txt', 'w') as nf:
    with open('test.txt') as f:
        for l in f:
            l = l.rstrip()
            try:
                tags = l.split('/')[1]
            except IndexError:
                nline = l
            else:
                ntags = ''.join(sorted(tags))
                nline = l.split('/')[0] + '/' + ntags
            nf.write(nline + '\n')
This will create a new file test1.txt with sorted tags. But the problem of trigrams and above is still not resolved.
I downloaded a sample file:
!wget https://raw.githubusercontent.com/wooorm/dictionaries/master/dictionaries/en-US/index.dic
The report using the "grep" command is correct.
!grep 'P.*U' index1.dic
CPU/M
GPU
aware/PU
cleanly/PRTU
common/PRTUY
conscious/PUY
easy/PRTU
faithful/PUY
friendly/PRTU
godly/PRTU
grateful/PUY
happy/PRTU
healthy/PRTU
holy/PRTU
kind/PRTUY
lawful/PUY
likely/PRTU
lucky/PRTU
natural/PUY
obtrusive/PUY
pleasant/PTUY
prepared/PU
reasonable/PU
responsive/PUY
righteous/PU
scrupulous/PUY
seemly/PRTU
selfish/PUY
timely/PRTU
truthful/PUY
wary/PRTU
wholesome/PU
willing/PUY
worldly/PTU
worthy/PRTU
The Python report using bigrams on the sorted-tags file does not contain all the words mentioned above.
myl['PU']
['aware',
'aware',
'conscious',
'faithful',
'grateful',
'lawful',
'natural',
'obtrusive',
'prepared',
'prepared',
'reasonable',
'reasonable',
'responsive',
'righteous',
'righteous',
'scrupulous',
'selfish',
'truthful',
'wholesome',
'wholesome',
'willing']
Tags: python, nltk, hunspell
This question has an open bounty worth +50
reputation from shantanuo ending tomorrow.
This question has not received enough attention.
It seems more manageable if you split this into 2 steps, in my opinion: 1. Create the set of keys for each tag. 2. Loop through your list of value/tag pairs and append the value to each of the keys for that tag.
– kerwei
Nov 1 at 8:45
Is 5-gram the maximum that you need, or would you need 6-grams if there were 6 tags? Do you also need skipgrams, like S1, S2? Do you need all combinations of the tags?
– Tomáš Přinda
Nov 9 at 3:43
Yes, I need all combinations of tags. That includes skipgrams and 5-grams, 6-grams, depending upon the number of tags used. In other words, I need a report of how the words are tagged, e.g. from this file: github.com/wooorm/dictionaries/blob/master/dictionaries/en-US/…
– shantanuo
yesterday
asked Nov 1 at 7:57 by shantanuo (edited yesterday)
1 Answer
If I understand it correctly, this is more a matter of constructing a data structure that, for a given tag, produces the correct list. We can do this by building a dictionary that only takes singular tags into account. Later, when a person queries on multiple tags, we calculate the intersection. This keeps the representation compact, and makes it easy to query: for example, asking for all the elements with tag AC will list elements with tag ABCD, ACD, ZABC, etc.
We can thus construct a parser:
from collections import defaultdict

class Hunspell(object):

    def __init__(self, data):
        self.data = data

    def __getitem__(self, tags):
        if not tags:
            return self.data.get(None, ())
        elements = [self.data.get(tag, ()) for tag in tags]
        data = set.intersection(*map(set, elements))
        return [e for e in self.data.get(tags[0], ()) if e in data]

    @staticmethod
    def load(f):
        data = defaultdict(list)
        for line in f:
            try:
                element, tags = line.rstrip().split('/', 1)
                for tag in tags:
                    data[tag].append(element)
                data[None].append(element)
            except ValueError:
                pass  # element with no tags
        return Hunspell(dict(data))
The list processing at the end of __getitem__ is done to retrieve the elements in the correct order.
We can then load the file into memory with:
>>> with open('test.txt') as f:
... h = Hunspell.load(f)
and query it for arbitrary keys:
>>> h['SE']
['girl', 'house', 'wind']
>>> h['ES']
['girl', 'house', 'wind']
>>> h['1']
['house']
>>> h['']
['test', 'girl', 'home', 'house', 'man', 'wind']
>>> h['S3']
['house']
>>> h['S2']
['house']
>>> h['SE2']
['house']
>>> h[None]
['test', 'girl', 'home', 'house', 'man', 'wind']
>>> h['4']
[]
Querying for non-existing tags results in an empty list. Here we thus postpone the "intersection" process until the moment of the call. We could in fact generate all possible intersections in advance, but that would result in a large data structure, and perhaps a large amount of work.
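To quantify that caveat about precomputing: a word with n tags belongs to 2**n - 1 non-empty tag combinations, so eager materialization grows quickly with the number of tags per word. A small sketch (my own illustration, not part of the answer):

```python
from itertools import combinations

# Sketch quantifying the precomputation cost: counting the non-empty
# tag subsets a single word would be filed under.
def n_subsets(tags):
    return sum(1 for n in range(1, len(tags) + 1)
               for _ in combinations(tags, n))

assert n_subsets('PU') == 3        # P, U, PU
assert n_subsets('PRTUY') == 31    # 2**5 - 1
```

The lazy intersection in the answer avoids this entirely, at the cost of a set intersection per lookup.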
This is correct. But I will wait for some time before accepting the answer, because it is a little difficult to understand :)
– shantanuo
yesterday
answered yesterday by Willem Van Onsem