python - How can I make this Python 2.6 function work with Unicode?


I've got a function, modified from material in chapter 1 of the online NLTK book. It's been very useful to me but, despite having read the chapter on Unicode, I feel as lost as before.

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    tokens = nltk.wordpunct_tokenize(rawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

When I tried it the other day on Also sprach Zarathustra, it clobbered words with an umlaut on the o's and u's. I'm sure some of you will know why that happened, and I'm sure it's quite easy to fix. I know it has to do with calling a function that re-encodes the tokens into Unicode strings. If so, it seems to me that it might not need to happen inside that function definition at all, but here, where I prepare to write to file:

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = '\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0

I heard that I had to decode the string into Unicode after reading it from the file, so I tried amending the function like so:

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

But that brought up an error when I used it on Hungarian. When I used it on German, I had no errors.

>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 9, in openbookreturnvocab
    nltktext = nltk.Text(tokens)
  File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
    self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)

I fixed the function that files the data like so:

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0

However, that brought up an error when I tried to file the German:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 23, in jotindex
    filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>>

...which is what I get when I try to write the u'\n'.join'ed data.

>>> jottedf = u'\n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)

For each string that you read from the file, you can convert it to Unicode by calling rawness.decode('utf-8'), provided you have the text in UTF-8. You will end up with unicode objects. Also, I don't know what "jotted" is, but you may want to make sure it's a list of unicode objects and use u'\n'.join(jotted) instead.
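A minimal sketch of that round trip, assuming the files really are UTF-8 (codecs.open is the stdlib way in Python 2.6 to have the decoding done for you while reading):

import codecs

# decode by hand after reading raw bytes
rawness = open(book).read()            # str: raw UTF-8 bytes
unirawness = rawness.decode('utf-8')   # unicode object

# or let codecs.open decode as it reads
unirawness = codecs.open(book, 'r', encoding='utf-8').read()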

Update:

It appears that the NLTK library doesn't like unicode objects. Fine, then you have to make sure that you are using str instances with UTF-8 encoded text. Try using this:

tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])

and this:

jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))
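Both of your errors are the same Python 2 behaviour in disguise: str() on a unicode object and write() of a unicode object to a plain file each imply an .encode('ascii') under the hood. The map(str, tokens[:8]) in your NLTK traceback is exactly the first case:

>>> str(u'\xe1')   # what map(str, tokens[:8]) does to each unicode token
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)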

But if jotted really is a list of UTF-8-encoded str, then you don't need the encoding step, and this should be enough:

jottedf = '\n'.join(jotted)
filemydata.write(jottedf)
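Alternatively, if you keep everything as unicode from end to end, a codecs-based variant of your jotindex (just a sketch, stdlib only) pushes the encoding into the file object itself:

import codecs

def jotindex(jotted, filename, readmethod):
    # codecs.open encodes unicode to UTF-8 as it writes
    filemydata = codecs.open(filename, readmethod, encoding='utf-8')
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0

Just make sure jotted holds unicode objects in that case; handing UTF-8-encoded str to a codecs file triggers an implicit ASCII decode and you are back where you started.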

By the way, it looks as though NLTK isn't very cautious with respect to Unicode and encodings (at least, in the demos). Better be careful and check that it has processed your tokens correctly. Also, and this may be why you got errors with the Hungarian text and not with the German text, check your encodings.
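A quick way to check is simply to attempt the decode; for example, using the filename from your traceback, a Hungarian file saved as Latin-2 instead of UTF-8 would fail here:

data = open('lk1-les1').read()
try:
    data.decode('utf-8')
    print 'decodes cleanly as UTF-8'
except UnicodeDecodeError, exc:
    print 'not UTF-8:', exc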

