


How to use String#split to split a mixed-encoding string (part encoded in GBK, part encoded in UTF-8)

[Note:  parts of this message were removed to make it a legal post.]

Dear Buddies,

Yesterday I sent a mail asking how to make split ignore invalid UTF-8 byte
sequences. I have since checked the string I want to parse in Java and found
that part of it is encoded in GBK and part of it is encoded in
UTF-8.

I am wondering whether I can still split the string with the split
method, and then call force_encoding on the parts that might be
encoded in GBK to resolve the problem.

Is there a way to do so without getting the "invalid byte
sequence" error?

Thanks.

Best wishes,
Stanley Xu

Stanley
3/23/2011 3:53:38 AM
comp.lang.ruby


On Wed, Mar 23, 2011 at 4:53 AM, Stanley Xu <wenhao.xu@gmail.com> wrote:
> Yesterday, I sent a mail of let the split ignore the error utf-8 bytes
> sequences. And I checked the string I wanted to parse in Java and found out
> that the string is encoded in gbk and part of the string is encoded in
> utf-8.
>
> I am wondering if I could find a way to still split the string by split
> method, and then I could try to force_encoding part of the string that might
> encoded in gbk and resolve the problem.
>
> I am wondering if there is a way I could do so without the "invalid bytes
> sequence" error?

A string with mixed encodings is difficult to handle.  I think you
have these options:

1. Ensure that the string does *not* contain mixed encodings (this
would be the first and best choice IMHO).

2. If you can't, because you get the data from somewhere else, use
encoding BINARY as a detour:

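# Treat the string as raw bytes so split does not validate it as UTF-8.
mixed_content.force_encoding(Encoding::BINARY)
chunks = mixed_content.split(/\t/)
# Re-tag each chunk with its real encoding (the bytes are not changed).
chunks[0].force_encoding(Encoding::UTF_8)
chunks[1].force_encoding(Encoding::GBK)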

or

mixed_content.force_encoding(Encoding::BINARY)
# Destructure directly when exactly two fields are expected.
a, b = mixed_content.split(/\t/)
a.force_encoding(Encoding::UTF_8)
b.force_encoding(Encoding::GBK)
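
For the case where you do not know in advance which field uses which
encoding, the same BINARY detour can be combined with a validity check.
The following is only a sketch under assumptions: split_mixed is a
hypothetical helper name, the sample record is made up, and treating
every field that is not well-formed UTF-8 as GBK is a guess, not a
detection (some GBK byte sequences also happen to be valid UTF-8).

```ruby
# Hypothetical helper: split raw bytes on a separator, then tag each field.
# Fields that are well-formed UTF-8 stay UTF-8; anything else is ASSUMED
# to be GBK and is transcoded to UTF-8.
def split_mixed(raw, sep = /\t/)
  # dup so the caller's string keeps its own encoding tag (and may be frozen)
  bytes = raw.dup.force_encoding(Encoding::BINARY)
  bytes.split(sep).map do |field|
    field.force_encoding(Encoding::UTF_8)
    next field if field.valid_encoding?
    # encode raises an Encoding error if the bytes are not valid GBK either
    field.force_encoding(Encoding::GBK).encode(Encoding::UTF_8)
  end
end

# Made-up sample record: first field UTF-8, second field GBK,
# concatenated at the byte level (String#b returns a BINARY copy).
record = "你好".b + "\t".b + "世界".encode(Encoding::GBK).b
fields = split_mixed(record)
puts fields.inspect
```

Splitting on a tab at the byte level is safe here because the tab byte
(0x09) cannot occur inside a multi-byte GBK or UTF-8 character.  That is
not true for every separator: the second byte of a GBK character can
fall in the printable ASCII range, so a separator such as "|" could
match in the middle of a character.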

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Robert
3/23/2011 9:32:49 AM

Thanks a lot, Robert. Your solution really helps.

Best wishes,
Stanley Xu




Stanley
3/23/2011 2:06:01 PM