convert raw bytes to Unicode strings

Does standard C++ have any methods to do this? I'd like to convert raw 
bytes to utf-8. Thanks for any tips.
Reply byte8bits (134) 7/21/2008 1:24:50 PM

brad wrote:
> Does standard C++ have any methods to do this? I'd like to convert raw 
> bytes to utf-8. Thanks for any tips.

What is the difference between "raw bytes" and "utf-8"?

V
-- 
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Reply v.Abazarov (13255) 7/21/2008 1:47:43 PM


Victor Bazarov wrote:
> What is the difference between "raw bytes" and "utf-8"?
> 
> V

raw bytes are not character streams. They do not conform to the concept 
of a char. grep a binary file for a string, then grep a text file for a 
string to gain a better understanding of this difference.
Reply byte8bits (134) 7/21/2008 1:51:26 PM

brad <byte8bits@gmail.com> writes:

> Victor Bazarov wrote:
>> What is the difference between "raw bytes" and "utf-8"?
>> V
>
> raw bytes are not character streams. They do not conform to the
> concept of a char. grep a binary file for a string, then grep a text
> file for a string to gain a better understanding of this difference.

But when you take a string containing characters, and you encode it
into a sequence of UTF-8 bytes, you don't get a string, but a sequence
of bytes. 

What is the difference between these bytes and your "raw" bytes?

Do you know what UTF-8 is? (Read at least the Wikipedia article about it.)


Anyways, there's no standard C++ function to do what you want.  You
could use an external library like libiconv, or just write the utf-8
encoding/decoding algorithm in C++ yourself.
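
For the "write it yourself" route, the encoding half can be sketched roughly as follows. This is only a sketch: `append_utf8` is a made-up name, not a standard function, and it assumes a compiler with `char32_t` (C++11 or later) and code points that are already available as integers.

```cpp
#include <string>

// Hypothetical helper (not part of the standard library): append the UTF-8
// encoding of one Unicode code point to `out`. Returns false for values
// that are not valid scalar values (surrogates, or above U+10FFFF).
bool append_utf8(char32_t cp, std::string& out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogates
    if (cp > 0x10FFFF) return false;                // outside Unicode

    if (cp < 0x80) {                                // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                      // 3 bytes: 1110xxxx + 2 continuation
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                        // 4 bytes: 11110xxx + 3 continuation
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Calling it with U+20AC (the euro sign) appends the three bytes E2 82 AC.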

-- 
__Pascal Bourguignon__
Reply pjb (7645) 7/21/2008 2:11:22 PM

brad wrote:
> Victor Bazarov wrote:
>> What is the difference between "raw bytes" and "utf-8"?
>>
>> V
> 
> raw bytes are not character streams. They do not conform to the concept 
> of a char. grep a binary file for a string, then grep a text file for a 
> string to gain a better understanding of this difference.

In C++ a byte is a char.  The type 'char' is an integral type "large 
enough to store any member of the implementation's basic character set". 
  There is no separate "concept of a char" from that, at least in C++.

C++ has no specific provisions for UTF-8.  There is the class 'codecvt' 
(actually a class template), that the Standard says "is for use when 
converting from one codeset to another".  Perhaps you should look into 
that...

V
-- 
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Reply v.Abazarov (13255) 7/21/2008 2:12:07 PM

brad wrote:
> Does standard C++ have any methods to do this? I'd like to convert raw
> bytes to utf-8. Thanks for any tips.

  I think you are confusing two different (although related) concepts.

  A "unicode string" and a "utf-8 string" are two different things.

  The former is a string where each character represents a unicode
character. Usually that means that every character must be 4 bytes long
in order to be able to store any unicode value. (Although I'm not sure
if there's an existing convention for this. I'm not exactly sure what's
the "standard" width for a unicode wide character.)

  A "utf-8 string" is a string which has been utf-8-encoded. This means
that each "character" in the string is of variable length. Each
character may be between 1 and 4 bytes in size. (This means, among other
things, that random access is not possible. That is, you can't get the
nth character in constant time, but you must traverse the string from
the beginning if you want to do so.)
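
  To make the traversal concrete, here is a minimal sketch of looking up the nth character. The name `utf8_offset_of` is invented for illustration, and the function assumes the input is already well-formed UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the n-th code point (0-based) in a
// UTF-8 string, or std::string::npos if the string has fewer code points.
// Assumes `s` is well-formed UTF-8; nothing is validated here.
std::size_t utf8_offset_of(const std::string& s, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte starts
        // a new code point.
        bool starts_code_point =
            (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (starts_code_point) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return std::string::npos;
}
```

  The loop has to visit every byte before the one it is looking for, which is why the lookup is linear rather than constant time.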

  You should be clearer about what it is that you want. Let me guess:
You have a utf-8-encoded input, and you want to decode it to a string
containing wide characters.
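
  If that guess is right, a hand-written decoder could look something like the sketch below. `decode_utf8` is a hypothetical name, and `std::u32string` stands in for "a string containing wide characters" on the assumption that a C++11 compiler is available; `wchar_t` may be only 16 bits on some platforms.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// Returns false on malformed input (invalid lead byte, truncated sequence,
// overlong form, surrogate, or value above U+10FFFF).
bool decode_utf8(const std::string& in, std::u32string& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;    // continuation bytes expected after the lead
        char32_t min_value;   // smallest value allowed for this length

        if (b < 0x80)                { cp = b;        extra = 0; min_value = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min_value = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min_value = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min_value = 0x10000; }
        else return false;    // stray continuation byte or invalid lead byte

        if (in.size() - i - 1 < extra) return false;    // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;       // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_value) return false;               // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        if (cp > 0x10FFFF) return false;                // outside Unicode

        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

  Unlike the utf-8 form, the decoded string does allow constant-time access to the nth code point, at the cost of four bytes per element.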
Reply nospam270 (2853) 7/21/2008 2:37:22 PM

On Jul 21, 4:37 pm, Juha Nieminen <nos...@thanks.invalid> wrote:
> brad wrote:
> > Does standard C++ have any methods to do this? I'd like to
> > convert raw bytes to utf-8. Thanks for any tips.

>   I think you are confusing two different (although related)
>   concepts.

I think he's confusing a lot of things.  Raw bytes underlie all
data in the computer; UTF-8 strings sit in raw bytes, as do
double, and anything else you can think of.  The idea of
"converting" raw bytes into anything is patently absurd.

> A "unicode string" and a "utf-8 string" are two different
> things.

The first is less precise than the second.  UTF-8 is only one
possible encoding form of Unicode.

> The former is a string where each character represents a
> unicode character. Usually that means that every character
> must be 4 bytes long in order to be able to store any unicode
> value. (Although I'm not sure if there's an existing
> convention for this. I'm not exactly sure what's the
> "standard" width for a unicode wide character.)

Strictly speaking: "Unicode" is a mapping between "characters"
and integral values.  Unicode also defines several encoding
formats, ways of encoding these integral values in machine words
of various lengths: UTF-8, UTF-16 and UTF-32.  In contexts where
byte order matters (e.g. byte oriented transmission mediums),
you can append an LE or BE after UTF-16 or UTF-32, to be more
precise.
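
To illustrate the LE/BE distinction, here is a rough sketch that writes one code point out as UTF-16 bytes in either order.  The names `ByteOrder` and `append_utf16` are invented for the example, and the function assumes it is handed a valid scalar value (no surrogates, nothing above U+10FFFF).

```cpp
#include <string>

enum class ByteOrder { Big, Little };

// Hypothetical helper: append one code point to `out` as UTF-16 bytes in
// the requested byte order. Assumes `cp` is a valid Unicode scalar value.
void append_utf16(char32_t cp, ByteOrder order, std::string& out)
{
    char16_t units[2];
    int count;
    if (cp < 0x10000) {
        units[0] = static_cast<char16_t>(cp);   // one 16-bit code unit
        count = 1;
    } else {
        char32_t v = cp - 0x10000;              // 20 bits, split 10 + 10
        units[0] = static_cast<char16_t>(0xD800 + (v >> 10));   // high surrogate
        units[1] = static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // low surrogate
        count = 2;
    }
    for (int i = 0; i < count; ++i) {
        char hi = static_cast<char>(units[i] >> 8);
        char lo = static_cast<char>(units[i] & 0xFF);
        if (order == ByteOrder::Big) { out += hi; out += lo; }
        else                         { out += lo; out += hi; }
    }
}
```

The code units are the same in both cases; only the order of the two bytes within each unit changes.  U+20AC comes out as 20 AC in UTF-16BE and AC 20 in UTF-16LE.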

> A "utf-8 string" is a string which has been utf-8-encoded.

And thus, is a Unicode string.

> This means that each "character" in the string is of variable
> length.

That's more or less true of every encoding format.

> Each character may be between 1 and 4 bytes in size.

Each code point may be between 1 and 4 bytes.  A character
may use several code points in its representation.
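
A small sketch of that last point, using "é", which can be written either as one precomposed code point or as "e" followed by a combining accent.  The helper name `count_code_points` is made up, and it assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}

// Both strings represent the same character, "é".
const std::string precomposed = "\xC3\xA9";   // U+00E9
const std::string decomposed  = "e\xCC\x81";  // U+0065 followed by U+0301
```

The precomposed form is one code point in two bytes; the decomposed form is two code points in three bytes, yet a reader sees a single character either way.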

--
James Kanze (GABI Software)             email:james.kanze@gmail.com
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Reply james.kanze (9593) 7/21/2008 6:18:01 PM
