Foreign Characters in Android & Java
I am trying to download and parse a webpage with foreign (Chinese) characters. I'm not sure whether I should use "utf-8" or something else. None of these seems to work for me. I used the sample Wiktionary code for getUrlContent().
public void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.main);
    mText = (TextView) findViewById(R.id.textview1);
    huaren.prepareUserAgent(this);
    String test = new String("fail");

    try {
        test = getUrlContent("http://huaren.us/");
    } catch (ApiException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    byte[] b = new byte[100000];

    try {
        b = test.getBytes("utf-8");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    char[] charArr = (new String(b)).toCharArray();
    CharSequence seq = java.nio.CharBuffer.wrap(charArr);

    mText.setText(charArr, 0, 1000);//.setText(seq);
}

protected static synchronized String getUrlContent(String url) throws ApiException {
    if (sUserAgent == null) {
        throw new ApiException("User-Agent string must be prepared");
    }

    // Create client and set our specific user-agent string
    HttpClient client = new DefaultHttpClient();
    HttpGet request = new HttpGet(url);
    request.setHeader("User-Agent", sUserAgent);

    try {
        HttpResponse response = client.execute(request);

        // Check if server response is valid
        StatusLine status = response.getStatusLine();
        if (status.getStatusCode() != HTTP_STATUS_OK) {
            throw new ApiException("Invalid response from server: " +
                    status.toString());
        }

        // Pull content stream from response
        HttpEntity entity = response.getEntity();
        InputStream inputStream = entity.getContent();

        ByteArrayOutputStream content = new ByteArrayOutputStream();

        // Read response into a buffered stream
        int readBytes = 0;
        while ((readBytes = inputStream.read(sBuffer)) != -1) {
            content.write(sBuffer, 0, readBytes);
        }

        // Return result from buffered stream
        return new String(content.toByteArray(), "utf-8");
    } catch (IOException e) {
        throw new ApiException("Problem communicating with API", e);
    }
}
The charset is defined in the page itself:
    <meta http-equiv="content-type" content="text/html; charset=gb2312" />
In general, there are 3 ways to specify the encoding of an HTTP-served HTML page:
Content-Type header of HTTP
    Content-Type: text/html; charset=utf-8
Encoding pseudo-attribute in the XML declaration
    <?xml version="1.0" encoding="utf-8" ?>
Meta tag inside the head
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
See Character encodings for details.
So you should try to evaluate each possible declaration in order to find the appropriate encoding. You can try to parse the page as UTF-8 and restart if you encounter a content-type declaration in a meta tag.
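A rough sketch of that approach in plain Java is shown here. The class CharsetSniffer and its methods findCharset and decode are hypothetical names invented for illustration; they are not part of the Wiktionary sample or of any library. The sketch assumes the declaration appears within the first 4096 bytes of the page and uses a simple regular expression rather than a real HTML parser, so treat it as a starting point only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (illustration only): picks a charset from the
// declarations described above and decodes the raw page bytes with it.
final class CharsetSniffer {
    // Matches charset=NAME (HTTP header or meta tag) or encoding="NAME" (XML declaration).
    private static final Pattern DECLARATION = Pattern.compile(
            "(?:charset|encoding)\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first declared charset name in the text, or null if none is found.
    static String findCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = DECLARATION.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Decodes the body using the HTTP header first, then a declaration inside
    // the page, and falls back to UTF-8 when nothing usable is declared.
    static String decode(byte[] body, String contentTypeHeader) {
        String name = findCharset(contentTypeHeader);
        if (name == null) {
            // ISO-8859-1 maps every byte to one char, so the ASCII-only
            // declaration can be read before the real encoding is known.
            // Assumption: the declaration sits in the first 4096 bytes.
            int len = Math.min(body.length, 4096);
            name = findCharset(new String(body, 0, len, StandardCharsets.ISO_8859_1));
        }
        Charset charset = StandardCharsets.UTF_8;
        if (name != null) {
            try {
                charset = Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the UTF-8 fallback.
            }
        }
        return new String(body, charset);
    }
}
```

In getUrlContent() this would presumably replace the final new String(content.toByteArray(), "utf-8") with something like CharsetSniffer.decode(content.toByteArray(), headerValue), where headerValue is the Content-Type header of the response (or null if it is absent). For the page in the question, the meta tag would select gb2312, provided the runtime supports that charset.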