Does standard C++ have any methods to do this? I'd like to convert raw
bytes to utf-8. Thanks for any tips.
byte8bits (134), 7/21/2008 1:24:50 PM

brad wrote:
> Does standard C++ have any methods to do this? I'd like to convert raw
> bytes to utf-8. Thanks for any tips.
What is the difference between "raw bytes" and "utf-8"?
V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
v.Abazarov (13255), 7/21/2008 1:47:43 PM

Victor Bazarov wrote:
> What is the difference between "raw bytes" and "utf-8"?
>
> V
raw bytes are not character streams. They do not conform to the concept
of a char. grep a binary file for a string, then grep a text file for a
string to gain a better understanding of this difference.
byte8bits (134), 7/21/2008 1:51:26 PM

brad <byte8bits@gmail.com> writes:
> Victor Bazarov wrote:
>> What is the difference between "raw bytes" and "utf-8"?
>> V
>
> raw bytes are not character streams. They do not conform to the
> concept of a char. grep a binary file for a string, then grep a text
> file for a string to gain a better understanding of this difference.
But when you take a string containing characters, and you encode it
into a sequence of UTF-8 bytes, you don't get a string, but a sequence
of bytes.
What is the difference between these bytes and your "raw" bytes?
Do you know what UTF-8 is? (At least read the Wikipedia article about it.)
Anyway, there's no standard C++ function to do what you want. You
could use an external library like libiconv, or just write the utf-8
encoding/decoding algorithm in C++ yourself.
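
As a rough illustration of the "write it yourself" option, here is a minimal sketch of the encoding half only. The name `append_utf8` is made up for this example, and it assumes the input is already a Unicode code point held in a `std::uint32_t`; decoding and error recovery are left out.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a standard function): append the UTF-8 encoding
// of one Unicode code point to `out`.
void append_utf8(std::string& out, std::uint32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not a Unicode scalar value");
    }
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

For example, appending U+00E9 to an empty string yields the two bytes 0xC3 0xA9.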
--
__Pascal Bourguignon__
pjb (7645), 7/21/2008 2:11:22 PM

brad wrote:
> Victor Bazarov wrote:
>> What is the difference between "raw bytes" and "utf-8"?
>>
>> V
>
> raw bytes are not character streams. They do not conform to the concept
> of a char. grep a binary file for a string, then grep a text file for a
> string to gain a better understanding of this difference.
In C++ a byte is a char. The type 'char' is an integral type "large
enough to store any member of the implementation's basic character set".
There is no separate "concept of a char" from that, at least in C++.
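
To illustrate the point that text in C++ is just a sequence of `char` bytes, here is a small sketch that renders the bytes of a string in hex. `hex_dump` is a hypothetical helper, not a standard function, and the sample input in the note below is assumed to be UTF-8.

```cpp
#include <cstddef>
#include <string>

// sizeof(char) is 1 by definition: a char is the unit C++ calls a byte.
static_assert(sizeof(char) == 1, "a char is one byte by definition");

// Hypothetical helper: render each char of `s` as two lowercase hex digits,
// separated by spaces. Assumes 8-bit bytes.
std::string hex_dump(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

With this sketch, `hex_dump("h\xC3\xA9")` gives `68 c3 a9`: the same `std::string` holds both the ASCII letter and the two-byte UTF-8 sequence.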
C++ has no specific provisions for UTF-8. There is the class 'codecvt'
(actually a class template), that the Standard says "is for use when
converting from one codeset to another". Perhaps you should look into
that...
V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
v.Abazarov (13255), 7/21/2008 2:12:07 PM

brad wrote:
> Does standard C++ have any methods to do this? I'd like to convert raw
> bytes to utf-8. Thanks for any tips.
I think you are confusing two different (although related) concepts.
A "unicode string" and a "utf-8 string" are two different things.
The former is a string where each character represents a unicode
character. Usually that means that every character must be 4 bytes long
in order to be able to store any unicode value. (Although I'm not sure
if there's an existing convention for this. I'm not exactly sure what's
the "standard" width for a unicode wide character.)
A "utf-8 string" is a string which has been utf-8-encoded. This means
that each "character" in the string is of variable length. Each
character may be between 1 and 4 bytes in size. (This means, among other
things, that random access is not possible. That is, you can't get the
nth character in constant time, but you must traverse the string from
the beginning if you want to do so.)
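
The linear traversal described above can be sketched as follows. `utf8_offset_of` is a hypothetical name, the input is assumed to be valid UTF-8, and "nth character" is taken here to mean the nth code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the n-th code point (0-based) in a
// string assumed to hold valid UTF-8. Returns std::string::npos when the
// string has fewer than n + 1 code points.
std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a
        // new code point.
        const bool is_lead = (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (is_lead) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return std::string::npos;
}
```

The loop has to look at every byte before the target, which is why the lookup is O(n) rather than constant time.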
You should be clearer about what it is that you want. Let me guess:
You have a utf-8-encoded input, and you want to decode it to a string
containing wide characters.
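
If that guess is right, the decoding step might look roughly like this. It is only a sketch: `decode_utf8` is a made-up name, `std::u32string` (C++11) stands in for "a string containing wide characters", and throwing on malformed input is one possible policy among several.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on malformed input.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte gives the sequence length and the top payload bits.
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates, and values past U+10FFFF.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        out += cp;
        i += len;
    }
    return out;
}
```

Each element of the result is one code point, so indexing into the `std::u32string` is constant time, at the cost of four bytes per code point.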
nospam270 (2853), 7/21/2008 2:37:22 PM

On Jul 21, 4:37 pm, Juha Nieminen <nos...@thanks.invalid> wrote:
> brad wrote:
> > Does standard C++ have any methods to do this? I'd like to
> > convert raw bytes to utf-8. Thanks for any tips.
> I think you are confusing two different (although related)
> concepts.
I think he's confusing a lot of things. Raw bytes underlie all
data in the computer; UTF-8 strings sit in raw bytes, as do
doubles, and anything else you can think of. The idea of
"converting" raw bytes into anything is patently absurd.
> A "unicode string" and a "utf-8 string" are two different
> things.
The first is less precise than the second. UTF-8 is only one
possible encoding form of Unicode.
> The former is a string where each character represents a
> unicode character. Usually that means that every character
> must be 4 bytes long in order to be able to store any unicode
> value. (Although I'm not sure if there's an existing
> convention for this. I'm not exactly sure what's the
> "standard" width for a unicode wide character.)
Strictly speaking: "Unicode" is a mapping between "characters"
and integral values. Unicode also defines several encoding
forms, ways of encoding these integral values in machine words
of various lengths: UTF-8, UTF-16 and UTF-32. In contexts where
byte order matters (e.g. byte-oriented transmission media),
you can append LE or BE after UTF-16 or UTF-32 to be more
precise.
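
A small sketch may help show where byte order enters the picture. The helpers `to_utf16` and `to_bytes` are hypothetical names for this example, and the code point passed in is assumed to be a valid scalar value.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes `cp` is a valid Unicode scalar value.
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);                    // one 16-bit unit
    } else {
        const char32_t v = cp - 0x10000;                     // 20 bits remain
        out += static_cast<char16_t>(0xD800 | (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));  // low surrogate
    }
    return out;
}

// Hypothetical helper: serialize 16-bit units to bytes. Only at this step
// does the LE/BE distinction matter.
std::vector<std::uint8_t> to_bytes(const std::u16string& units, bool big_endian) {
    std::vector<std::uint8_t> bytes;
    for (char16_t u : units) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (big_endian) { bytes.push_back(hi); bytes.push_back(lo); }
        else            { bytes.push_back(lo); bytes.push_back(hi); }
    }
    return bytes;
}
```

For U+20AC the single code unit is 0x20AC; written as UTF-16BE it becomes the bytes 0x20 0xAC, and as UTF-16LE the bytes 0xAC 0x20. UTF-8 has no such variants because its unit is already a byte.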
> A "utf-8 string" is a string which has been utf-8-encoded.
And thus, is a Unicode string.
> This means that each "character" in the string is of variable
> length.
That's more or less true of every encoding format.
> Each character may be between 1 and 4 bytes in size.
Each code point may be between 1 and 4 bytes. A character
may use several code points in its representation.
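
To make the distinction concrete, here is a sketch that counts code points in a UTF-8 string. `count_code_points` is a hypothetical helper and assumes valid UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a string assumed to be valid
// UTF-8, by counting every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

The letter "é" can be written as the single code point U+00E9 (bytes 0xC3 0xA9), or as "e" followed by the combining acute accent U+0301 (bytes 0x65 0xCC 0x81). The first form is one code point in two bytes, the second is two code points in three bytes, yet a reader sees the same character in both cases.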
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
james.kanze (9593), 7/21/2008 6:18:01 PM