python - How can I make this Python 2.6 function work with Unicode?
I've got a function, which I modified from material in chapter 1 of the online NLTK book. It's been very useful to me, but, despite reading the chapter on Unicode, I feel as lost as before.
def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    tokens = nltk.wordpunct_tokenize(rawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab
When I tried it the other day on Also Sprach Zarathustra, it clobbered the words with an umlaut over the o's and u's. I'm sure some of you will know why that happened. I'm also sure that it's quite easy to fix. I know that it has to do with calling a function that re-encodes the tokens into unicode strings. If so, it seems to me that it might not happen inside that function definition at all, but here, where I prepare to write to file:
def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = '\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0
I heard that what I had to do was encode the string into unicode after reading it from the file. I tried amending the function like so:
def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab
But that brought this error when I used it on Hungarian. When I used it on German, I had no errors.
>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 9, in openbookreturnvocab
    nltktext = nltk.Text(tokens)
  File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
    self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)
I fixed the function that files the data like so:
def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0
However, that brought this error when I tried to file the German:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 23, in jotindex
    filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>>
...which is what you get when you try to write the u'\n'.join'ed data.
>>> jottedf = u'\n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)
For each string that you read from your file, you can convert it to unicode by calling rawness.decode('utf-8'), if you have the text in UTF-8. You will end up with unicode objects. Also, I don't know what "jotted" is, but you may want to make sure it's a list of unicode objects and use u'\n'.join(jotted) instead.
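To make the round trip concrete, here is a minimal Python 2 sketch of bytes-in, unicode-out (the filename is just the one from your traceback; any UTF-8 file will do):

raw = open("lk1-les1").read()     # str: raw UTF-8 bytes from disk
uni = raw.decode('utf-8')         # unicode object
print type(raw), type(uni)        # <type 'str'> <type 'unicode'>
print uni.encode('utf-8') == raw  # True: encode reverses the decode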
Update:
It appears that the NLTK library doesn't like unicode objects. Fine, then you have to make sure that you are using str instances with UTF-8 encoded text. Try using this:
tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])
and this:
jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))
But if jotted is really a list of UTF-8-encoded str instances, then you don't need this and the following should be enough:
jottedf = '\n'.join(jotted)
filemydata.write(jottedf)
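Alternatively, if jotted is a list of unicode objects, you could let codecs.open handle the encoding so you never call .encode() by hand. A sketch of your jotindex rewritten that way (assuming readmethod is a write mode such as 'w'):

import codecs

def jotindex(jotted, filename, readmethod):
    # codecs.open returns a file object that encodes unicode to UTF-8 on write
    filemydata = codecs.open(filename, readmethod, encoding='utf-8')
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0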
By the way, it looks as though NLTK isn't very cautious with respect to unicode and encoding (at least, in the demos). Better be careful and check that it has processed your tokens correctly. Also, and this may be what caused you to get errors with the Hungarian text and not the German text, check your encodings.
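If you want to verify an input file before tokenizing it, a small helper like this (the name check_utf8 is hypothetical, not from any library) will tell you whether the bytes decode cleanly as UTF-8:

def check_utf8(filename):
    # Try to decode the whole file; report the first offending byte if it fails.
    raw = open(filename).read()
    try:
        raw.decode('utf-8')
        print filename, "decodes cleanly as UTF-8"
    except UnicodeDecodeError as err:
        print filename, "is not valid UTF-8:", err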