python - Character encoding is violated
I am trying to parse a file encoded in UTF-8. No operation causes a problem apart from writing the file (or at least I think so). A minimal working example follows:
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse('example.txt', parser)
tree.write('aaaaaaaaaaaaaaaaa.html')
example.txt:
<html>
  <body>
    <invalid html here/>
    <interesting attrib1="yes">
      <group>
        <line> δεδομένα1 </line>
      </group>
      <group>
        <line> δεδομένα2 </line>
      </group>
      <group>
        <line> δεδομένα3 </line>
      </group>
    </interesting>
  </body>
</html>
I am aware of a similar previous question, but its answers did not solve my problem: the output is wrong whether I leave the output encoding unspecified or set it to utf8 or iso-8859-7.
I have concluded that the file is in UTF-8, since it displays correctly in Chrome when I choose that encoding. My editor (Kate) agrees.
I get no runtime error, but the output is not what I want. Example output of tree.write('aaaaaaaaaaaaaaaaa.html', encoding='utf-8'):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body> <invalid html="" here=""/><interesting attrib1="yes"><group><line> δεδομÎνα1 </line></group><group><line> δεδομÎνα2 </line></group><group><line> δεδομÎνα3 </line></group></interesting></body></html>
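The obvious problem is that HTMLParser treats the input file as ANSI by default, i.e. the UTF-8 bytes are misinterpreted as 8-bit character codes, so each two-byte Greek letter turns into two unrelated characters. You can pass the encoding to fix this: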
parser = etree.HTMLParser(encoding="utf-8")
If you want to check what is meant by misinterpretation, let Python print repr(tree.xpath("//line")[0].text) without HTMLParser's encoding parameter.
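The misinterpretation can also be reproduced without lxml. The following is only a sketch using the standard library, and it assumes that the "ANSI" default behaves like Latin-1 (one byte per character); the variable names are made up for illustration and are not part of lxml.

```python
# Sketch: reproduce the mojibake with the standard library only.
# Assumption: the parser's single-byte default behaves like Latin-1.
text = "δεδομένα1"

# What is actually stored in example.txt: UTF-8 bytes
# (two bytes per Greek letter, one byte for the digit).
utf8_bytes = text.encode("utf-8")

# What a parser sees if it reads every byte as one 8-bit character.
misread = utf8_bytes.decode("latin-1")

# Writing the misread string out as UTF-8 encodes the damage a second time.
double_encoded = misread.encode("utf-8")

print(repr(misread))
print(len(text), "characters became", len(misread))

# Reversing the wrong step recovers the original text, which confirms
# that the bytes were fine and only their interpretation was wrong.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == text)
```

Because the input bytes themselves are intact, telling the parser the right encoding up front, as in the HTMLParser(encoding="utf-8") line, avoids the problem instead of repairing it afterwards.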