C++ - Character sets - Not clear
The standard defines:
- the basic source character set
- the basic execution character set and its wide-character counterpart
It also defines the 'execution character set' and its wide-character counterpart as follows:
$2.2/3 - "The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific."
Q1. I don't think I understand this completely, particularly the last statement. Any pointers on this aspect?
Further,
$3.9.1 - "Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set."
Q2. In 3.9.1, does the phrase 'basic character set' mean 'basic execution character set'?
You need to distinguish between the source character set, the execution character set, the wide execution character set, and their basic versions:
The basic source character set:
§2.1.1: The basic source character set consists of 96 characters […]
This character set has exactly 96 characters. They fit into 7 bits. Characters like @ are not included.
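To make the 96 members concrete, here is a minimal sketch that lists them and checks membership. The list follows the C++03/C++11 wording (91 graphical characters plus space, horizontal tab, vertical tab, form feed and new-line); the names `kBasicSourceCharacters` and `in_basic_source_set` are made up for this illustration.

```cpp
#include <string>

// The 96 members of the basic source character set as listed in the
// C++03/C++11 wording: 91 graphical characters plus 5 whitespace characters.
const std::string kBasicSourceCharacters =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
    " \t\v\f\n";

// Hypothetical helper: true if c is a member of the basic source character set.
bool in_basic_source_set(char c) {
    return kBasicSourceCharacters.find(c) != std::string::npos;
}
```

With this, `in_basic_source_set(';')` is true while `in_basic_source_set('@')` is false, which matches the statement above.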
Let's make up some example binary representations for a few basic source characters. They can be completely arbitrary, and there is no need for them to correspond to ASCII values.
    a -> 0000000
    b -> 0100100
    c -> 0011101
The basic execution character set …
§2.1.3: The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits.
As stated, the basic execution character set contains all members of the basic source character set. It still doesn't include any other character like @. The basic execution character set can have a different binary representation than the basic source character set.
As stated, the basic execution character set also contains representations for carriage return, the null character, and a few other control characters.
    a         -> 10110101010
    b         -> 00001000101   <- basic source character set
    c         -> 10101011111
    ----------------------------------------------------------
    null      -> 00000000000
    backspace -> 11111100011
If the basic execution character set is 11 bits long (like in this example), the char data type shall be large enough to store 11 bits, but it may be longer.
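The size requirement can be checked against `CHAR_BIT` from `<climits>`. This is only a sketch; `char_can_hold_bits` is a hypothetical helper, not something the standard provides.

```cpp
#include <climits>

// The standard requires a char to have at least 8 bits (CHAR_BIT >= 8),
// and sizeof(char) is 1 by definition.
static_assert(CHAR_BIT >= 8, "char has at least 8 bits");
static_assert(sizeof(char) == 1, "char is the unit of sizeof");

// Hypothetical helper: can a char hold a character set whose members
// need the given number of bits?
constexpr bool char_can_hold_bits(int bits_needed) {
    return bits_needed <= CHAR_BIT;
}
```

On an implementation where `CHAR_BIT` is 8, `char_can_hold_bits(11)` would be false, so the made-up 11-bit set from the example could only exist on an implementation with a wider char.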
… and the basic execution wide character set:
The basic execution wide character set is used for wide characters (wchar_t). It is basically the same as the basic execution character set, but it can have different binary representations as well.
    a         -> 1011010101010110101010
    b         -> 0000100010110101011111   <- basic source character set
    c         -> 1010100101101000011011
    ---------------------------------------------------------------------
    null      -> 0000000000000000000000
    backspace -> 1111110001100000000001
The only fixed member is the null character, which needs to be a sequence of 0 bits.
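The all-zero-bits rule for the null character can be observed by looking at the object representation. A small sketch, with `all_bits_zero` being a hypothetical helper written for this answer:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: true if every byte of the object representation
// of value is zero.
template <typename T>
bool all_bits_zero(const T& value) {
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T));
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        if (bytes[i] != 0) {
            return false;
        }
    }
    return true;
}
```

Both `all_bits_zero('\0')` and `all_bits_zero(L'\0')` hold, whereas any other member such as `'a'` has at least one bit set.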
Converting between basic character sets:
§2.1.1.5: Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (2.13.2, 2.13.4).
When a C++ source file is compiled, each character of the source character set is converted into the basic execution (wide) character set.
Example:
    const char* string0 = "ba\bc";
    const wchar_t* string1 = L"ba\bc";
Since string0 is a normal character string, it is converted to the basic execution character set, and string1 is converted to the basic execution wide character set.
    string0 -> 00001000101 10110101010 11111100011 10101011111
    string1 -> 0000100010110101011111 1011010101010110101010 // continued
               1111110001100000000001 1010100101101000011011
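On a real compiler you can inspect what the literals were converted to. The sketch below collects the numeric value of each element; `code_units` is a hypothetical helper, and the actual values depend on the implementation's execution character sets.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the numeric value of every element of a
// string literal, excluding the terminating null character.
template <typename CharT, std::size_t N>
std::vector<unsigned long> code_units(const CharT (&literal)[N]) {
    std::vector<unsigned long> values;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        values.push_back(static_cast<unsigned long>(literal[i]));
    }
    return values;
}
```

Calling `code_units("ba\bc")` and `code_units(L"ba\bc")` both yield four values, one per character, with the third one being whatever value the implementation assigns to backspace in the respective set.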
Something about file encodings:
There are several kinds of file encodings. For example, ASCII is 7 bits long. Windows-1252 is 8 bits long (also known as ANSI). ASCII doesn't contain any non-English characters. ANSI contains some European characters like ä Ö ä Õ ø.
Newer file encodings like UTF-8 or UTF-32 can contain characters of any language. In UTF-8, characters are variable in length. In UTF-32, every character is 32 bits long.
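To illustrate the variable length of UTF-8, here is a small sketch that computes how many bytes a Unicode code point occupies in UTF-8 (in UTF-32 it is always 4 bytes). The function name `utf8_length` is invented for this example.

```cpp
// Hypothetical helper: number of bytes UTF-8 needs for one Unicode code
// point, or 0 if the value is not a valid code point to encode.
int utf8_length(char32_t code_point) {
    if (code_point <= 0x7F) return 1;    // ASCII range
    if (code_point <= 0x7FF) return 2;   // e.g. U+00E4 (a with diaeresis)
    if (code_point >= 0xD800 && code_point <= 0xDFFF) return 0;  // surrogates
    if (code_point <= 0xFFFF) return 3;  // e.g. U+20AC (euro sign)
    if (code_point <= 0x10FFFF) return 4;
    return 0;                            // beyond the Unicode range
}
```

So a plain `a` takes one byte, `ä` takes two, and characters outside the Basic Multilingual Plane take four.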
File encoding requirements:
Most compilers offer a command line switch to specify the file encoding of the source file.
A C++ source file needs to be encoded in a file encoding that has a representation of the basic source character set. For example: the file encoding of the source file needs to have a representation of the ; character.
If you can't type the character ; within the encoding chosen as the encoding of the source file, that encoding is not suitable as a C++ source file encoding.
Non-basic character sets:
Characters not included in the basic source character set belong to the source character set. The source character set is equivalent to the file encoding.
For example: the @ character is not included in the basic source character set, but it may be included in the source character set. The chosen file encoding of the input source file might contain a representation of @. If it doesn't contain a representation for @, you can't use the character @ within strings.
Characters not included in the basic execution (wide) character set belong to the execution (wide) character set.
Remember that the compiler converts characters from the source character set to the execution character set and the execution wide character set. Therefore, there needs to be a way to convert these characters.
For example: if you specify Windows-1252 as the encoding of the source character set and specify ASCII as the execution character set, there is no way to convert this string:
    const char* string0 = "String with European characters ö, Ä, ô, Ð.";
These characters cannot be represented in ASCII.
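The failure can be mimicked in code. The sketch below is not what a compiler does internally; it is a hypothetical helper `to_ascii` that relies only on the fact that Windows-1252 and ASCII agree on the bytes 0x00 to 0x7F and that everything above has no ASCII equivalent.

```cpp
#include <string>

// Hypothetical helper: tries to re-encode Windows-1252 bytes as 7-bit ASCII.
// Returns false as soon as a byte without an ASCII equivalent is found.
bool to_ascii(const std::string& windows1252, std::string& ascii_out) {
    ascii_out.clear();
    for (unsigned char byte : windows1252) {
        if (byte > 0x7F) {
            return false;  // e.g. 0xF6, which is o with diaeresis in Windows-1252
        }
        ascii_out.push_back(static_cast<char>(byte));
    }
    return true;
}
```

A string made only of English letters converts fine, while a single byte such as `"\xF6"` makes the conversion fail.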
Specifying character sets:
Here are some examples of how to specify the character sets using gcc. The default values are included.
    -finput-charset=UTF-8         <- source character set
    -fexec-charset=UTF-8          <- execution character set
    -fwide-exec-charset=UTF-32    <- execution wide character set
With UTF-8 and UTF-32 as the default encodings, C++ source files can contain strings with characters of any language. UTF-8 characters can be converted both ways without problems.
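The effect of a UTF-8 or UTF-32 execution character set can be shown without relying on compiler switches by using the C++11 `u8` and `U` literal prefixes, which are not part of the answer above and are used here only for illustration. `literal_length` is a hypothetical helper that counts the elements of a literal.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in a string literal, not
// counting the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t literal_length(const CharT (&)[N]) {
    return N - 1;
}

// U+00F6 (o with diaeresis), written as a universal-character-name so the
// example does not depend on the encoding of this source file.
static_assert(literal_length(u8"\u00f6") == 2, "two code units in UTF-8");
static_assert(literal_length(U"\u00f6") == 1, "one code unit in UTF-32");
```

The same character therefore occupies two elements in a UTF-8 string and a single element in a UTF-32 string.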
The extended character set:
§1.1.3: multibyte character, a sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment. The extended character set is a superset of the basic character set (2.2).
Multibyte characters are longer than an entry of the normal characters. They contain an escape sequence marking them as multibyte characters.
Multibyte characters are processed according to the locale set in the user's runtime environment. These multibyte characters are converted at runtime to the encoding set in the user's environment.
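The runtime conversion can be sketched with `std::mbstowcs` from `<cstdlib>`, which interprets the input according to the currently active C locale. The wrapper `widen` is a hypothetical helper; note that it returns an empty string both for empty input and on failure.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: converts a multibyte string to a wide string using
// the currently active C locale. Returns an empty string on failure.
std::wstring widen(const std::string& multibyte) {
    // A wide string never has more elements than the input has bytes.
    std::vector<wchar_t> buffer(multibyte.size() + 1);
    std::size_t written =
        std::mbstowcs(buffer.data(), multibyte.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        return std::wstring();  // invalid multibyte sequence for this locale
    }
    return std::wstring(buffer.data(), written);
}
```

By default a program runs in the "C" locale; calling `std::setlocale(LC_ALL, "")` beforehand selects the locale of the user's environment, which then decides how bytes outside the basic character set are interpreted.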