Python - Character encoding is violated


I am trying to parse a file encoded in UTF-8. No operation has a problem apart from writing the file (or at least I think so). A minimal working example follows:

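from lxml import etree

# Parse the input with lxml's lenient HTML parser (no encoding given)
parser = etree.HTMLParser()
tree = etree.parse('example.txt', parser)

# Write the parsed tree back out
tree.write('aaaaaaaaaaaaaaaaa.html')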

example.txt:

<html>
    <body>
        <invalid html here/>
        <interesting attrib1="yes">
            <group>
                <line>
                    δεδομένα1
                </line>
            </group>
            <group>
                <line>
                    δεδομένα2
                </line>
            </group>
            <group>
                <line>
                    δεδομένα3
                </line>
            </group>
        </interesting>
    </body>
</html>

I am aware of a similar previous question, but it does not solve my problem: the result is wrong whether I leave the output encoding unspecified or use utf8 or iso-8859-7.

I have concluded that the file is in UTF-8, since it displays correctly in Chrome when I choose that encoding. My editor (Kate) agrees.

I get no runtime error, but the output is not what I want. Example output of tree.write('aaaaaaaaaaaaaaaaa.html', encoding='utf-8'):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>         <invalid html="" here=""/><interesting attrib1="yes"><group><line>                     δεδομένα1                 </line></group><group><line>                     δεδομένα2                 </line></group><group><line>                     δεδομένα3                 </line></group></interesting></body></html>

The obvious problem is that HTMLParser treats the input file as ANSI by default, i.e. the UTF-8 bytes are misinterpreted as 8-bit character codes. You can pass the encoding to fix this:

parser = etree.HTMLParser(encoding="utf-8")

If you want to check what is meant by misinterpretation, let Python print repr(tree.xpath("//line")[0].text), with and without HTMLParser's encoding parameter.
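The misinterpretation can also be reproduced without lxml. The following is only a sketch using the standard library: it assumes the 8-bit encoding the parser falls back to is ISO-8859-1 (Latin-1), which is an assumption for illustration and may differ between libxml2 versions. The variable names are made up for this example.

```python
# One line of text from example.txt
text = "δεδομένα1"

# The bytes as they are stored on disk in a UTF-8 file:
# each Greek letter takes two bytes, the digit takes one.
raw = text.encode("utf-8")

# Misinterpretation: every byte is treated as one 8-bit character.
# ISO-8859-1 is assumed here purely for illustration.
misread = raw.decode("latin-1")

# Correct interpretation: the bytes are decoded as UTF-8.
decoded = raw.decode("utf-8")

print(len(text), len(raw), len(misread))
print(repr(misread))   # garbled: one character per byte
print(repr(decoded))   # the original Greek text
```

The misread string has as many characters as the file has bytes, which is why every Greek letter turns into two unrelated characters in the output. Passing the encoding to the parser avoids this step entirely instead of trying to repair the text afterwards.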

