Python - Find all tags with a specific attribute value


How can I iterate over all tags that have a specific attribute with a specific value? For instance, let's say we need data1, data2, etc. only.

<html>
    <body>
        <invalid html here/>
        <dont care> ... </dont care>
        <invalid html here too/>
        <interesting attrib1="naah, not this"> ... </interesting tag>
        <interesting attrib1="yes, want">
            <group>
                <line>
                    data
                </line>
            </group>
            <group>
                <line>
                    data1
                <line>
            </group>
            <group>
                <line>
                    data2
                <line>
            </group>
        </interesting>
    </body>
</html>

I tried BeautifulSoup, but it can't parse this file. lxml's parser, though, seems to work:

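from io import StringIO
from lxml import etree

# get_sanitized_data() and site are my own helper and input (not shown here)
broken_html = get_sanitized_data(site)

parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)

result = etree.tostring(tree.getroot(), pretty_print=True, method="html")

print(result)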

I am not familiar with its API, and I could not figure out how to use either getiterator or xpath.

Here's one way, using lxml and the XPath 'descendant::*[@attrib1="yes, want"]'. The XPath tells lxml to look at all the descendants of the current node and return those whose attrib1 attribute is equal to "yes, want".

import io
import lxml.html as lh

content = '''
<html>
    <body>
        <invalid html here/>
        <dont care> ... </dont care>
        <invalid html here too/>
        <interesting attrib1="naah, not this"> ... </interesting tag>
        <interesting attrib1="yes, want">
            <group>
                <line>
                    data
                </line>
            </group>
            <group>
                <line>
                    data1
                <line>
            </group>
            <group>
                <line>
                    data2
                <line>
            </group>
        </interesting>
    </body>
</html>
'''

# lxml's HTML parser recovers from the invalid markup
doc = lh.parse(io.StringIO(content))
# Select every descendant element whose attrib1 equals "yes, want"
tags = doc.xpath('descendant::*[@attrib1="yes, want"]')
print(tags)
# [<Element interesting at b767e14c>]
for tag in tags:
    # encoding="unicode" returns str rather than bytes on Python 3
    print(lh.tostring(tag, encoding="unicode"))
# <interesting attrib1="yes, want"><group><line>
#                     data
#                 </line></group><group><line>
#                     data1
#                 <line></line></line></group><group><line>
#                     data2
#                 <line></line></line></group></interesting>
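
The question asks for the data values rather than the matched element itself. Here is a minimal sketch of one way to pull them out, assuming Python 3 and a trimmed copy of the sample markup (the unclosed <line> tags are kept on purpose). The variable names values and values_iter are only illustrative. The first half extends the XPath approach with text(); the second half does the same walk with Element.iter(), which is the current spelling of the getiterator mentioned in the question.

```python
import lxml.html as lh

# Trimmed copy of the sample markup; the <line> tags are left unclosed on purpose.
content = '''
<html>
    <body>
        <interesting attrib1="naah, not this"> ... </interesting>
        <interesting attrib1="yes, want">
            <group>
                <line>
                    data
                </line>
            </group>
            <group>
                <line>
                    data1
                <line>
            </group>
            <group>
                <line>
                    data2
                <line>
            </group>
        </interesting>
    </body>
</html>
'''

doc = lh.fromstring(content)

# XPath version: text nodes of every <line> under the matching element.
# Whitespace-only nodes are dropped after stripping.
values = []
for tag in doc.xpath('descendant::*[@attrib1="yes, want"]'):
    for text in tag.xpath('.//line/text()'):
        text = text.strip()
        if text:
            values.append(text)

# iter() version: walk the tree and compare the attribute by hand.
values_iter = []
for el in doc.iter('interesting'):
    if el.get('attrib1') == 'yes, want':
        for line in el.iter('line'):
            text = (line.text or '').strip()
            if text:
                values_iter.append(text)

print(values)
print(values_iter)
```

On this sample both lists should come out as ['data', 'data1', 'data2']. Because the parser has to guess where the unclosed <line> tags end, empty nested <line> elements appear in the tree, which is why the sketch skips blank text instead of assuming one value per element.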
