python - Character encoding is violated
I am trying to parse a file encoded in UTF-8. No operation has a problem apart from writing the file (or at least I think so). A minimum working example follows:
    from lxml import etree

    parser = etree.HTMLParser()
    tree = etree.parse('example.txt', parser)
    tree.write('aaaaaaaaaaaaaaaaa.html')

example.txt:
    <html>
        <body>
            <invalid html here/>
            <interesting attrib1="yes">
                <group>
                    <line>
                        δεδομένα1
                    </line>
                </group>
                <group>
                    <line>
                        δεδομένα2
                    </line>
                </group>
                <group>
                    <line>
                        δεδομένα3
                    </line>
                </group>
            </interesting>
        </body>
    </html>

I am aware of a similar previous question, but it did not solve my problem: the result is the same whether I leave the output encoding unspecified or use utf8 or iso-8859-7.
I have concluded that the file is in UTF-8, since it displays correctly in Chrome when I choose that encoding. My editor (Kate) agrees.
I get no runtime error, but the output is not what I want. Example output with tree.write('aaaaaaaaaaaaaaaaa.html', encoding='utf-8'):
<!doctype html public "-//w3c//dtd html 4.0 transitional//en" "http://www.w3.org/tr/rec-html40/loose.dtd"> <html><body>         <invalid html="" here=""/><interesting attrib1="yes"><group><line>                     δεδομÎνα1                 </line></group><group><line>                     δεδομÎνα2                 </line></group><group><line>                     δεδομÎνα3                 </line></group></interesting></body></html>      
The obvious problem is that HTMLParser treats the input file as ANSI by default, i.e. the UTF-8 bytes are misinterpreted as 8-bit character codes, one character per byte. You can pass the encoding to fix this:
    parser = etree.HTMLParser(encoding="utf-8")

If you want to check what I meant by misinterpretation, let Python print repr(tree.xpath("//line")[0].text), with and without HTMLParser's encoding parameter.
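The misinterpretation itself can be reproduced without lxml. The following is only a sketch of the mechanism, not of what the parser does internally: it assumes the single-byte encoding involved is ISO-8859-1, which may not be the parser's actual default, and the variable names are made up for illustration.

```python
# The word from example.txt, as it would be stored on disk in UTF-8.
text = "δεδομένα1"
utf8_bytes = text.encode("utf-8")

# Wrong: treat every byte as one character. ISO-8859-1 is an assumption
# here; it stands in for whatever 8-bit encoding the parser falls back to.
misread = utf8_bytes.decode("iso-8859-1")

# Right: decode the bytes with the encoding they were written in.
correct = utf8_bytes.decode("utf-8")

print(repr(misread))   # one character per byte, i.e. mojibake
print(repr(correct))   # the original Greek text

# Because ISO-8859-1 maps every byte to a character, no information is
# lost and the damage can be undone by reversing the wrong step.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == text)
```

Each Greek letter occupies two bytes in UTF-8, so the misread string comes out longer than the original. That is also why writing it back with encoding='utf-8' does not help: by then the tree already holds the wrong characters.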