python - Find all tags with a specific attribute value
How can I iterate over all tags that have a specific attribute with a specific value? For instance, let's say I need data1, data2, etc. only.
<html>
<body>
<invalid html here/>
<dont care> ... </dont care>
<invalid html here too/>
<interesting attrib1="naah, not this"> ... </interesting tag>
<interesting attrib1="yes, want">
<group> <line> data </line> </group>
<group> <line> data1 <line> </group>
<group> <line> data2 <line> </group>
</interesting>
</body>
</html>
I tried BeautifulSoup, but it can't parse this file. lxml's parser, though, seems to work:
from io import StringIO
from lxml import etree

broken_html = get_sanitized_data(site)
parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result)
I'm not familiar with the API, and I could not figure out how to use either getiterator or xpath.
Here's one way, using lxml and the XPath 'descendant::*[@attrib1="yes, want"]'. The XPath tells lxml to look at all the descendants of the current node and return those whose attrib1 attribute is equal to "yes, want".
import io
import lxml.html as lh

content = '''
<html>
<body>
<invalid html here/>
<dont care> ... </dont care>
<invalid html here too/>
<interesting attrib1="naah, not this"> ... </interesting tag>
<interesting attrib1="yes, want">
<group> <line> data </line> </group>
<group> <line> data1 <line> </group>
<group> <line> data2 <line> </group>
</interesting>
</body>
</html>
'''

# lxml's HTML parser recovers from the broken markup instead of failing.
doc = lh.parse(io.StringIO(content))
# Select every descendant element whose attrib1 equals "yes, want".
tags = doc.xpath('descendant::*[@attrib1="yes, want"]')
print(tags)
# [<Element interesting at b767e14c>]
for tag in tags:
    # encoding="unicode" returns str rather than bytes on Python 3.
    print(lh.tostring(tag, encoding="unicode"))
# Prints something like (exact whitespace may vary):
# <interesting attrib1="yes, want"><group><line>
# data
# </line></group><group><line>
# data1
# <line></line></line></group><group><line>
# data2
# <line></line></line></group></interesting>
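
The question ultimately asks for the text inside each line element (data1, data2, and so on), so one possible next step is sketched here. It assumes a simplified, well-formed stand-in for the markup so that the result is predictable; with the broken original, the unclosed line tags nest inside each other and the extracted text may need extra cleanup. The second half shows the same attribute filter written with iter(), which replaces the deprecated getiterator() mentioned in the question.

```python
import lxml.html as lh

# Assumption: a cleaned-up, well-formed version of the question's markup.
content = """
<html><body>
<interesting attrib1="naah, not this"><group><line>skip</line></group></interesting>
<interesting attrib1="yes, want">
  <group><line> data </line></group>
  <group><line> data1 </line></group>
  <group><line> data2 </line></group>
</interesting>
</body></html>
"""

doc = lh.fromstring(content)

# XPath: every <line> anywhere below an element with the wanted attrib1 value.
lines = doc.xpath('//*[@attrib1="yes, want"]//line')
texts = [line.text_content().strip() for line in lines]
print(texts)  # ['data', 'data1', 'data2']

# The same attribute filter without XPath: walk all elements with iter().
matches = [el for el in doc.iter() if el.get("attrib1") == "yes, want"]
print([el.tag for el in matches])  # ['interesting']
```

XPath keeps the selection in a single expression, while the iter() loop is easier to extend with arbitrary Python conditions; which one reads better depends on how complex the filter becomes.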