Foreign Characters in Android & Java
I am trying to download and parse a webpage with foreign (Chinese) characters. I'm not sure whether I should use "utf-8" or something else. None of these seems to work for me. I used the sample Wiktionary code for getUrlContent():
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);
        mText = (TextView) findViewById(R.id.textView1);
        Huaren.prepareUserAgent(this);
        String test = new String("fail");
        try {
            test = getUrlContent("http://huaren.us/");
        } catch (ApiException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        byte[] b = new byte[100000];
        try {
            b = test.getBytes("utf-8");
        } catch (UnsupportedEncodingException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        char[] charArr = (new String(b)).toCharArray();
        CharSequence seq = java.nio.CharBuffer.wrap(charArr);
        mText.setText(charArr, 0, 1000);//.setText(seq);
    }

    protected static synchronized String getUrlContent(String url) throws ApiException {
        if (sUserAgent == null) {
            throw new ApiException("User-Agent string must be prepared");
        }
        // Create client and set our specific user-agent string
        HttpClient client = new DefaultHttpClient();
        HttpGet request = new HttpGet(url);
        request.setHeader("User-Agent", sUserAgent);
        try {
            HttpResponse response = client.execute(request);
            // Check if server response is valid
            StatusLine status = response.getStatusLine();
            if (status.getStatusCode() != HTTP_STATUS_OK) {
                throw new ApiException("Invalid response from server: " + status.toString());
            }
            // Pull content stream from response
            HttpEntity entity = response.getEntity();
            InputStream inputStream = entity.getContent();
            ByteArrayOutputStream content = new ByteArrayOutputStream();
            // Read response into a buffered stream
            int readBytes = 0;
            while ((readBytes = inputStream.read(sBuffer)) != -1) {
                content.write(sBuffer, 0, readBytes);
            }
            // Return result from buffered stream
            return new String(content.toByteArray(), "utf-8");
        } catch (IOException e) {
            throw new ApiException("Problem communicating with API", e);
        }
    }
The charset is defined in the page itself:
<meta http-equiv="content-type" content="text/html; charset=gb2312" />
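To illustrate why decoding this page as UTF-8 fails, here is a minimal sketch. The class name `WrongCharsetDemo` and the sample string are made up for illustration, and it assumes the runtime ships a GB2312 charset (full JDKs and Android normally do). Bytes encoded as GB2312 only round-trip when decoded with the same charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    // Decode raw bytes with an explicitly chosen charset.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be available
        String original = "\u4e2d\u6587";           // two Chinese characters
        byte[] raw = original.getBytes(gb2312);     // what a GB2312 server would send

        // Decoding with the wrong charset yields replacement characters.
        System.out.println(original.equals(decode(raw, StandardCharsets.UTF_8)));
        // Decoding with the declared charset restores the text.
        System.out.println(original.equals(decode(raw, gb2312)));
    }
}
```

Once the bytes have been decoded with the wrong charset, calling `getBytes("utf-8")` on the resulting string, as in `onCreate` above, cannot bring the lost characters back; the fix has to happen where the raw bytes are first turned into a `String`.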
In general, there are three ways to specify the encoding of an HTML page served over HTTP:
The Content-Type header of the HTTP response:
Content-Type: text/html; charset=utf-8
The encoding pseudo-attribute in the XML declaration:
<?xml version="1.0" encoding="utf-8" ?>
A meta tag inside the head element:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
See Character encodings for details.
So you should try to evaluate each possible declaration in order to find the appropriate encoding. You can try to parse the page as UTF-8 and restart if you encounter a content-type declaration in a meta tag.
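One way this could look in code is sketched below. It is only an illustration under several assumptions: the class `CharsetSniffer` and its methods are hypothetical names, not part of the Wiktionary sample or of any library. Instead of parsing as UTF-8 and restarting, it takes a slightly different route: it looks at the Content-Type header first, then scans the first few kilobytes of the body (decoded as ISO-8859-1, which maps every byte to a character) for a `charset=` or `encoding=` declaration, and falls back to UTF-8 if nothing usable is found.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Matches charset=... (header, meta tag) and encoding=... (XML declaration).
    private static final Pattern DECLARATION = Pattern.compile(
            "(?:charset|encoding)\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first declared charset name in the text, or null if none.
    public static String findCharsetName(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = DECLARATION.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Header wins; otherwise sniff the start of the body; otherwise UTF-8.
    public static Charset detect(String contentTypeHeader, byte[] body) {
        String name = findCharsetName(contentTypeHeader);
        if (name == null) {
            int len = Math.min(body.length, 4096);
            // ISO-8859-1 never fails to decode, so ASCII markup stays readable.
            name = findCharsetName(new String(body, 0, len, StandardCharsets.ISO_8859_1));
        }
        try {
            return name != null ? Charset.forName(name) : StandardCharsets.UTF_8;
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: fall back to UTF-8.
            return StandardCharsets.UTF_8;
        }
    }

    public static String decode(String contentTypeHeader, byte[] body) {
        return new String(body, detect(contentTypeHeader, body));
    }
}
```

In `getUrlContent` this would roughly mean keeping `content.toByteArray()` as raw bytes and passing them, together with the value of the response's Content-Type header (or null if it is absent), to `CharsetSniffer.decode` instead of hard-coding "utf-8". The regular expression is deliberately loose and may pick up an unrelated `encoding=` attribute near the top of a page, so treat it as a starting point rather than a complete detector.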