Foreign characters in Android & Java


I am trying to download and parse a webpage with foreign (Chinese) characters. I'm not sure whether I should use "UTF-8" or something else. None of the encodings I tried seems to work for me. I used the sample Wiktionary code for getUrlContent().

public void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.main);
    mText = (TextView) findViewById(R.id.textView1);
    Huaren.prepareUserAgent(this);
    String test = new String("fail");

    try {
        test = getUrlContent("http://huaren.us/");
    } catch (ApiException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    byte[] b = new byte[100000];

    try {
        b = test.getBytes("utf-8");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    char[] charArr = (new String(b)).toCharArray();
    CharSequence seq = java.nio.CharBuffer.wrap(charArr);

    mText.setText(charArr, 0, 1000);//.setText(seq);
}

protected static synchronized String getUrlContent(String url) throws ApiException {
    if (sUserAgent == null) {
        throw new ApiException("User-Agent string must be prepared");
    }

    // Create client and set our specific user-agent string
    HttpClient client = new DefaultHttpClient();
    HttpGet request = new HttpGet(url);
    request.setHeader("User-Agent", sUserAgent);

    try {
        HttpResponse response = client.execute(request);

        // Check if server response is valid
        StatusLine status = response.getStatusLine();
        if (status.getStatusCode() != HTTP_STATUS_OK) {
            throw new ApiException("Invalid response from server: " +
                    status.toString());
        }

        // Pull content stream from response
        HttpEntity entity = response.getEntity();
        InputStream inputStream = entity.getContent();

        ByteArrayOutputStream content = new ByteArrayOutputStream();

        // Read response into a buffered stream
        int readBytes = 0;
        while ((readBytes = inputStream.read(sBuffer)) != -1) {
            content.write(sBuffer, 0, readBytes);
        }

        // Return result from buffered stream (always decoded as UTF-8 here)
        return new String(content.toByteArray(), "utf-8");
    } catch (IOException e) {
        throw new ApiException("Problem communicating with API", e);
    }
}

The charset is declared in the page itself:

<meta http-equiv="content-type" content="text/html; charset=gb2312" />  
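Since the page declares gb2312, its bytes have to be decoded with that charset rather than UTF-8. The following is only a minimal sketch of the difference, assuming the runtime ships a GB2312 charset (standard JDKs and Android normally do); the class name DecodeDemo and the helper decode are made up for illustration and are not part of the question's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    /** Decodes raw response bytes with the charset the page declares (hypothetical helper). */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        // Simulate what the server sends: the text encoded as GB2312 bytes.
        byte[] raw = original.getBytes(Charset.forName("GB2312"));

        // Decoding with the wrong charset yields replacement characters.
        String wrong = new String(raw, StandardCharsets.UTF_8);
        // Decoding with the declared charset restores the text.
        String right = decode(raw, "GB2312");

        System.out.println(original.equals(wrong)); // false
        System.out.println(original.equals(right)); // true
    }
}
```

In the question's getUrlContent(), this would correspond to replacing the hard-coded "utf-8" in the final new String(...) call with the charset the page actually uses.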

In general, there are three ways to specify the encoding of an HTML page served over HTTP:

The Content-Type header of the HTTP response

Content-Type: text/html; charset=utf-8

The encoding pseudo-attribute in the XML declaration

<?xml version="1.0" encoding="utf-8" ?> 

A meta tag inside the head element

<meta http-equiv="content-type" content="text/html;charset=utf-8" /> 

See Character encodings for details.

So you should try to evaluate each possible declaration in order to find the appropriate encoding. You could try to parse the page as UTF-8 and restart if you encounter a content-type declaration in a meta tag.
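The lookup described above could be sketched roughly as follows. This is a variation rather than the exact restart approach: instead of parsing as UTF-8 and starting over, it pre-scans the first bytes as ISO-8859-1 (which keeps the ASCII markup readable) to find a declaration before decoding. The class name CharsetSniffer, the 2048-byte scan window, the regular expressions, and the precedence (HTTP header, then XML declaration, then meta tag, then UTF-8 as fallback) are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Matches charset=... in a Content-Type header value or in a meta tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    // Matches encoding="..." in an XML declaration.
    private static final Pattern XML_ENCODING =
            Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first supported charset the pattern finds in text, or null. */
    private static Charset find(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        try {
            if (m.find() && Charset.isSupported(m.group(1))) {
                return Charset.forName(m.group(1));
            }
        } catch (IllegalArgumentException e) {
            // Syntactically illegal charset name: treat as "not declared".
        }
        return null;
    }

    /** Picks a charset: header first, then XML declaration, then meta tag, else UTF-8. */
    static Charset detect(byte[] raw, String contentTypeHeader) {
        Charset cs = find(CHARSET, contentTypeHeader);
        if (cs != null) {
            return cs;
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        int n = Math.min(raw.length, 2048);
        String head = new String(raw, 0, n, StandardCharsets.ISO_8859_1);
        cs = find(XML_ENCODING, head);
        if (cs != null) {
            return cs;
        }
        cs = find(CHARSET, head);
        return cs != null ? cs : StandardCharsets.UTF_8;
    }

    /** Decodes the whole body with whatever charset detect() chose. */
    static String decode(byte[] raw, String contentTypeHeader) {
        return new String(raw, detect(raw, contentTypeHeader));
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"content-type\" "
                + "content=\"text/html; charset=gb2312\" /></head>"
                + "<body>\u4e2d\u6587</body></html>";
        byte[] raw = html.getBytes(Charset.forName("GB2312"));
        // No header available, so the meta tag decides the charset.
        System.out.println(decode(raw, null).equals(html));
    }
}
```

With the question's code, the raw bytes would come from content.toByteArray() and the header value from the response's Content-Type header, if the server sends one.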

