Python - Find all tags with a specific attribute value
How can I iterate over all tags that have a specific attribute with a specific value? For instance, let's say I need data1, data2, etc. only.
    <html>
        <body>
            <invalid html here/>
            <dont care> ... </dont care>
            <invalid html here too/>
            <interesting attrib1="naah, not this"> ... </interesting tag>
            <interesting attrib1="yes, want">
                <group>
                    <line>
                        data
                    </line>
                </group>
                <group>
                    <line>
                        data1
                    <line>
                </group>
                <group>
                    <line>
                        data2
                    <line>
                </group>
            </interesting>
        </body>
    </html>

I tried BeautifulSoup, but it can't parse this file. lxml's parser, though, seems to work:
    broken_html = get_sanitized_data(site)

    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(broken_html), parser)

    result = etree.tostring(tree.getroot(), pretty_print=True, method="html")

    print(result)

I am not familiar with its API, and I could not figure out how to use either getiterator or xpath.
Here's one way, using lxml and the XPath 'descendant::*[@attrib1="yes, want"]'. The XPath tells lxml to look at all the descendants of the current node and return those with an attrib1 attribute equal to "yes, want".
    import lxml.html as lh
    import cStringIO

    content='''
    <html>
        <body>
            <invalid html here/>
            <dont care> ... </dont care>
            <invalid html here too/>
            <interesting attrib1="naah, not this"> ... </interesting tag>
            <interesting attrib1="yes, want">
                <group>
                    <line>
                        data
                    </line>
                </group>
                <group>
                    <line>
                        data1
                    <line>
                </group>
                <group>
                    <line>
                        data2
                    <line>
                </group>
            </interesting>
        </body>
    </html>
    '''
    doc=lh.parse(cStringIO.StringIO(content))
    tags=doc.xpath('descendant::*[@attrib1="yes, want"]')
    print(tags)
    # [<Element interesting at b767e14c>]
    for tag in tags:
        print(lh.tostring(tag))
    # <interesting attrib1="yes, want"><group><line>
    #                     data
    #                 </line></group><group><line>
    #                     data1
    #                 <line></line></line></group><group><line>
    #                     data2
    #                 <line></line></line></group></interesting>
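To get from the matched tag to the data itself, one option is to walk its descendants with `iter()` (which lxml recommends over the older `getiterator`) and collect the text. The following is only a sketch: it assumes Python 3, assumes the wanted data is the text directly inside each `<line>` element, and uses `lh.fromstring` instead of parsing from a file-like object. The helper name `line_texts` is made up for illustration, and the attribute value is passed as an XPath variable rather than pasted into the expression.

```python
import lxml.html as lh

# Same sample markup as in the question, including its deliberately broken tags.
content = '''
<html>
    <body>
        <invalid html here/>
        <dont care> ... </dont care>
        <invalid html here too/>
        <interesting attrib1="naah, not this"> ... </interesting tag>
        <interesting attrib1="yes, want">
            <group>
                <line>
                    data
                </line>
            </group>
            <group>
                <line>
                    data1
                <line>
            </group>
            <group>
                <line>
                    data2
                <line>
            </group>
        </interesting>
    </body>
</html>
'''


def line_texts(html, value):
    """Return the stripped text of every <line> under tags whose attrib1 equals value.

    Hypothetical helper, not part of lxml.
    """
    doc = lh.fromstring(html)
    # $value is an XPath variable; lxml binds it from the keyword argument.
    tags = doc.xpath('descendant::*[@attrib1=$value]', value=value)
    texts = []
    for tag in tags:
        # iter('line') visits every <line> descendant in document order.
        for line in tag.iter('line'):
            # The unclosed <line> tags leave empty nested elements behind;
            # their text is None or whitespace, so skip them.
            text = (line.text or '').strip()
            if text:
                texts.append(text)
    return texts


print(line_texts(content, "yes, want"))
# ['data', 'data1', 'data2']
```

Passing the value as `$value` avoids quoting problems if the attribute value itself contains quotes.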