Iterating over a String



It has bugged me that the for:each syntax would not let me write code
of the form:

String categories = "amq";
....
for ( char category: categories )


However, you can write this:


String categories = "amq";

....

final char[] cats = categories.toCharArray();
for ( char category : cats )

What is your opinion?  Would you prefer it, or an indexing loop with
charAt?

The indexing loop lets you look back and forward, which the for:each
does not.
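
A minimal sketch of the two styles, for comparison. The class name
CategoryLoops and both helper methods are made up for illustration;
only toCharArray() and charAt() come from the question above.

```java
final class CategoryLoops {
    // for-each over a copy of the chars: concise, but the body
    // cannot see the neighbours of the current char
    static int countChar(String s, char wanted) {
        int count = 0;
        for (char c : s.toCharArray()) {
            if (c == wanted) {
                count++;
            }
        }
        return count;
    }

    // indexed loop: charAt(i + 1) lets the body look ahead
    static int countDoubled(String s) {
        int count = 0;
        for (int i = 0; i + 1 < s.length(); i++) {
            if (s.charAt(i) == s.charAt(i + 1)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countChar("amq", 'm'));
        System.out.println(countDoubled("xyzzy"));
    }
}
```

Note that toCharArray() copies the string's contents, so the for-each
form costs an extra allocation that the charAt loop avoids.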
-- 
Roedy Green Canadian Mind Products
http://mindprod.com

Without deviation from the norm, progress is not possible. 
~ Frank Zappa (born: 1940-12-21 died: 1993-12-04 at age: 52)
Reply see_website (4855) 11/14/2009 5:54:14 PM

Roedy Green wrote:
> It has bugged me that the for:each syntax would not let me write code
> of the form:
> 
> String categories = "amq";
> ...
> for ( char category: categories )

     Bugs me, too.  It seems so obvious to have String (more
generally, CharSequence) implement Iterable, but ...

-- 
Eric Sosman
esosman@ieee-dot-org.invalid
Reply Eric 11/14/2009 6:18:16 PM


Roedy Green wrote:
> It has bugged me that the for:each syntax would not let me write code
> of the form:
> 
> String categories = "amq";
> ....
> for ( char category: categories )


I've opined here that I'd like a shorter form for iterating with a 
simple integer.

   for( int i : categories.length() ) {
     char c = categories.charAt(i);
     ....
   }

Not exactly what you are asking for but I thought I'd toss my two bits in.
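
The proposed syntax is not valid Java, but something close can be
approximated with a small Iterable. This is only a sketch: the class
Range and its upTo method are hypothetical names, not part of the JDK,
and every index is boxed to an Integer on the way through.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class Range implements Iterable<Integer> {
    private final int end;

    private Range(int end) {
        this.end = end;
    }

    // hypothetical helper: yields 0, 1, ..., end - 1
    static Range upTo(int end) {
        return new Range(end);
    }

    @Override
    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < end;
            }

            @Override
            public Integer next() {
                if (next >= end) {
                    throw new NoSuchElementException();
                }
                return next++;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    public static void main(String[] args) {
        String categories = "amq";
        for (int i : Range.upTo(categories.length())) {
            System.out.println(i + ": " + categories.charAt(i));
        }
    }
}
```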

Reply markspace 11/14/2009 6:45:31 PM

Roedy Green wrote:
> It has bugged me that the for:each syntax would not let me write code
> of the form:
> 
> String categories = "amq";
> ...
> for ( char category: categories )
> 
> 
> However, you can write this:
> 
> 
> String categories = "amq";
> 
> ...
> 
> final char[] cats = categories.toCharArray();
> for ( char category : cats )
> 
> What is your opinion?  Would you prefer it, or an indexing loop with
> charAt?
> 
> The indexing loop lets you look back and forward, which the for:each
> does not.

Here's a utility class that makes it easy to apply the for each loop to
a String, if that is what you want to do. I use the for loop whenever it
works without stretching. See the main method at the end for an example
of using the class.

import java.util.Iterator;
import java.util.NoSuchElementException;

public class IterableString implements Iterable<Character> {
   private String data;

   /**
    * Create an Iterable for the specified String
    *
    * @param data
    *          The String to iterate.
    */
   public IterableString(String data) {
     super();
     this.data = data;
   }

   @Override
   public Iterator<Character> iterator() {
     return new StringIterator(data);
   }

   private static class StringIterator implements Iterator<Character> {

     private String data;
     private int index = 0;

     public StringIterator(String data) {
       this.data = data;
     }

     @Override
     public boolean hasNext() {
       return index < data.length();
     }

     @Override
     public Character next() {
       if (index < data.length()) {
         Character result = data.charAt(index);
         index++;
         return result;
       } else {
         throw new NoSuchElementException();
       }
     }

     @Override
     public void remove() {
       throw new UnsupportedOperationException("No remove from String");
     }

   }

   /** Demonstration method */
   public static void main(String[] args) {
     String testData = "xyzzy";
     for(char c : new IterableString(testData)){
       System.out.println(c);
     }
   }

}
Reply Patricia 11/14/2009 9:14:02 PM

Roedy Green wrote:
> It has bugged me that the for:each syntax would not let me write code
> of the form:
> 
> String categories = "amq";
> ....
> for ( char category: categories )
> 
> 
> However, you can write this:
> 
> 
> String categories = "amq";
> 
> ....
> 
> final char[] cats = categories.toCharArray();
> for ( char category : cats )
> 
> What is your opinion?  Would you prefer it, or an indexing loop with
> charAt?
> 
> The indexing loop lets you look back and forward, which the for:each
> does not.
What about 64bit codepoints? Wouldn't you rather iterate over codepoints 
than characters?

Patricia gave a good wrapper class for doing what you requested, but I 
suggest adapting it to support Integer codepoints.
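
One way that adaptation might look, as a sketch only. The class name
CodePointIterable is an assumption; the structure follows the wrapper
posted earlier in the thread, with codePointAt() and
Character.charCount() doing the stepping.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class CodePointIterable implements Iterable<Integer> {
    private final String data;

    CodePointIterable(String data) {
        this.data = data;
    }

    @Override
    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int index = 0; // index in chars, not in code points

            @Override
            public boolean hasNext() {
                return index < data.length();
            }

            @Override
            public Integer next() {
                if (index >= data.length()) {
                    throw new NoSuchElementException();
                }
                int codePoint = data.codePointAt(index);
                // advance by 1 char for a BMP code point, 2 for a surrogate pair
                index += Character.charCount(codePoint);
                return codePoint;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException("No remove from String");
            }
        };
    }

    public static void main(String[] args) {
        for (int cp : new CodePointIterable("\uD834\uDD1E G")) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```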
Reply Daniel 11/16/2009 8:42:43 PM

On Mon, 16 Nov 2009 12:42:43 -0800, Daniel Pitts
<newsgroup.spamfilter@virtualinfinity.net> wrote, quoted or indirectly
quoted someone who said :

>What about 64bit codepoints? Wouldn't you rather iterate over codepoints 
>than characters?

if it were either/or I would say no.  I don't have any application for
32-bit Unicode yet and don't foresee it in my lifetime.
-- 
Roedy Green Canadian Mind Products
http://mindprod.com

Without deviation from the norm, progress is not possible. 
~ Frank Zappa (born: 1940-12-21 died: 1993-12-04 at age: 52)
Reply Roedy 11/17/2009 1:29:14 AM

Roedy Green wrote:
> On Mon, 16 Nov 2009 12:42:43 -0800, Daniel Pitts
> <newsgroup.spamfilter@virtualinfinity.net> wrote, quoted or indirectly
> quoted someone who said :
> 
>> What about 64bit codepoints? Wouldn't you rather iterate over codepoints 
>> than characters?
> 
> if it were either/or I would say no.  I don't have any application for
> 32-bit Unicode yet and don't foresee it in my lifetime.

Gadzooks! Do you mean ...

* ASCII is enough.
* ISO 8859-1 is enough.
* Unicode Base Multilingual Plane is enough.
* Something else?

Unicode isn't a 32 bit character set, it's a 21 bit character set[1], 
though one *encoding* of Unicode is 32-bits - UTF-32.


[1] http://unicode.org/faq/utf_bom.html#gen0
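
The 21-bit figure can be checked against the constants Java itself
exposes. A small sketch; the class name CodePointRange and the helper
bitsNeeded are made up for the example.

```java
final class CodePointRange {
    // number of bits needed to represent a non-negative int value
    static int bitsNeeded(int value) {
        return 32 - Integer.numberOfLeadingZeros(value);
    }

    public static void main(String[] args) {
        System.out.printf("max code point: U+%X%n", Character.MAX_CODE_POINT);
        System.out.println("bits needed:    " + bitsNeeded(Character.MAX_CODE_POINT));
        System.out.println("bits in a char: " + Character.SIZE);
    }
}
```

Java stores a code point in an int simply because that is the smallest
primitive type wide enough to hold 21 bits.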
-- 
RGB
Reply RedGrittyBrick 11/17/2009 9:31:25 AM

RedGrittyBrick wrote:
> 
> Roedy Green wrote:
>> On Mon, 16 Nov 2009 12:42:43 -0800, Daniel Pitts
>> <newsgroup.spamfilter@virtualinfinity.net> wrote, quoted or indirectly
>> quoted someone who said :
>>
>>> What about 64bit codepoints? Wouldn't you rather iterate over 
>>> codepoints than characters?
>>
>> if it were either/or I would say no.  I don't have any application for
>> 32-bit Unicode yet and don't foresee it in my lifetime.
> 
> Gadzooks! Do you mean ...

> * Something else?


This I think.

If you read the quotes above, you'll notice that Daniel wrote "64 bit 
codepoints."  I think that's roughly twice as many bits as even the 
Unicode Consortium has dreamed of using, and more than twice the 
required 21 bits currently required for the whole she-bang, as you point 
out.
Reply markspace 11/17/2009 3:05:01 PM

markspace wrote:
> RedGrittyBrick wrote:
>>
>> Roedy Green wrote:
>>> On Mon, 16 Nov 2009 12:42:43 -0800, Daniel Pitts
>>> <newsgroup.spamfilter@virtualinfinity.net> wrote, quoted or indirectly
>>> quoted someone who said :
>>>
>>>> What about 64bit codepoints? Wouldn't you rather iterate over 
>>>> codepoints than characters?
>>>
>>> if it were either/or I would say no.  I don't have any application for
>>> 32-bit Unicode yet and don't foresee it in my lifetime.
>>
>> Gadzooks! Do you mean ...
> 
>> * Something else?
> 
> 
> This I think.
> 
> If you read the quotes above, you'll notice that Daniel wrote "64 bit 
> codepoints."  I think that's roughly twice as many bits as even the 
> Unicode Consortium has dreamed of using, and more than twice the 
> required 21 bits currently required for the whole she-bang, as you point 
> out.

I made a mistake, in a state of cold-medicine-induced delirium ;-)  I 
meant to say 32bit codepoints, as opposed to 16bit chars.

It doesn't matter if *you* think you need to support it, your clients 
will need you to support it one day, randomly, out of the blue.  When 
your program crashes, or does the wrong thing, it will look bad.  Even 
if you are able to repair it quickly.  It is better to not have to 
repair it at all.
Reply Daniel 11/18/2009 12:19:00 AM

On Tue, 17 Nov 2009 09:31:25 +0000, RedGrittyBrick
<RedGrittyBrick@spamweary.invalid> wrote, quoted or indirectly quoted
someone who said :

>Unicode isn't a 32 bit character set, it's a 21 bit character set[1], 

inside java, with codepoints you treat it as 32-bit.

What are the codepoints above the 16 bit point?

Aegean numbers, Mormon Deseret, Cuneiform, Shavian, Osmanya
(Somalian), Byzantine music symbols, extended Chinese.

These are not the sorts of symbols used in business.  I am not likely
to ever use these. These are more for anthropologists.

The only plausible ones are the Mathematical Alphanumeric Symbols, which
really have no business being codepoints. They could just as easily be fonts.
http://www.unicode.org/charts/PDF/U1D400.pdf
-- 
Roedy Green Canadian Mind Products
http://mindprod.com

Without deviation from the norm, progress is not possible. 
~ Frank Zappa (born: 1940-12-21 died: 1993-12-04 at age: 52)
Reply Roedy 11/18/2009 12:40:22 AM

Roedy Green wrote:
> On Tue, 17 Nov 2009 09:31:25 +0000, RedGrittyBrick
> <RedGrittyBrick@spamweary.invalid> wrote, quoted or indirectly quoted
> someone who said :
> 
>> Unicode isn't a 32 bit character set, it's a 21 bit character set[1], 
> 
> inside java, with codepoints you treat it as 32-bit.
> 
> What are the codepoints above the 16 bit point?
> 
> Aegean numbers, Mormon Deseret, Cuneiform, Shavian, Osmanya
> (Somalian), Byzantine music symbols, extended Chinese.
> 
> These are not the sorts of symbols used in business. 

These are not *commonly* used in business.

Amazon is a business, it sells books on those subjects. Businesses like 
Amazon sometimes display extracts of the books they sell. The publishers 
of those books are also businesses. Well known businesses sometimes 
index huge numbers of books and make those indexes accessible to the 
public over the web[1].

> I am not likely to ever use these. These are more for anthropologists.

It may be true that you will never do business with anthropologists, 
booksellers or people whose job, hobbies or interests involve the 
writing systems you listed.

I'd prefer my use of Java not to limit my opportunities, no matter how 
unlikely they might seem to me today.

However, like you I think, I'm reluctant to jump through special hoops 
to achieve this.


[1] 
http://www.archive.org/search.php?query=subject%3A%22Cuneiform%20inscriptions%22
-- 
RGB
Reply RedGrittyBrick 11/18/2009 10:06:28 AM

RedGrittyBrick wrote:
> 
> Roedy Green wrote:
>> On Tue, 17 Nov 2009 09:31:25 +0000, RedGrittyBrick
>> <RedGrittyBrick@spamweary.invalid> wrote, quoted or indirectly quoted
>> someone who said :
>>
>>> Unicode isn't a 32 bit character set, it's a 21 bit character set[1], 
>>
>> inside java, with codepoints you treat it as 32-bit.
>>
>> What are the codepoints above the 16 bit point?
>>
>> Aegean numbers, Mormon Deseret, Cuneiform, Shavian, Osmanya
>> (Somalian), Byzantine music symbols, extended Chinese.
>>
>> These are not the sorts of symbols used in business. 
> 
> These are not *commonly* used in business.
> 
> Amazon is a business, it sells books on those subjects. Businesses like 
> Amazon sometimes display extracts of the books they sell. The publishers 
> of those books are also businesses. Well known businesses sometimes 
> index huge numbers of books and make those indexes accessible to the 
> public over the web[1].
> 
>> I am not likely to ever use these. These are more for anthropologists.
> 
> It may be true that you will never do business with anthropologists, 
> booksellers or people whose job, hobbies or interests involve the 
> writing systems you listed.
> 
> I'd prefer my use of Java not to limit my opportunities, no matter how 
> unlikely they might seem to me today.
> 
> However, like you I think, I'm reluctant to jump through special hoops 
> to achieve this.

My point would rather be that the moment you expose a text input field 
to end users, is the moment you must support (or at least reject instead 
of misleadingly accept) the entire Unicode range.

Users will install the cool cuneiform font and copy/paste cuneiform 
characters because they'll think it's cool or it helps them organize 
whatever they're inputting. Short version: because they can.

If I had to trust my clients on it, a lot of people have ancient capital 
Greek letters in their last name or phone number shortcode. Especially 
Capital Pi and Sigma.
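
The "reject instead of misleadingly accept" option can be sketched in a
few lines, assuming the code behind the field only copes with BMP
characters. The class and method names are hypothetical;
Character.isSurrogate(char) needs Java 7 or later.

```java
final class BmpCheck {
    // true if the string contains no surrogates at all, so that
    // one char always corresponds to one code point
    static boolean isBmpOnly(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (Character.isSurrogate(input.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isBmpOnly("amq"));
        System.out.println(isBmpOnly("\uD834\uDD1E G clef"));
    }
}
```

Capital Pi and Sigma pass this check, since Greek letters live in the
BMP; cuneiform does not.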

--
Mayeul
Reply Mayeul 11/18/2009 12:43:14 PM

Mayeul wrote:
> My point would rather be that the moment you expose a text input field 
> to end users, 

It's a good point but it doesn't apply to Roedy's original example of
   String categories = "amq";

> is the moment you must support (or at least reject instead 
> of misleadingly accept) the entire Unicode range.

You are right. This is where Java lets us down.

> Users will install the cool cuneiform font and copy/paste cuneiform 
> characters because they'll think it's cool or it helps them organize 
> whatever they're inputting. Short version: because they can.

I'm not familiar with handling surrogate pairs, presumably you write 
code like

  for (int i=0; i<userInput.codePointCount(0,userInput.length()); i++) {
    int codePoint = userInput.codePointAt(i);
    char[] chars = Character.toChars(codePoint);
    if (chars.length == 2) {
      // we have a surrogate pair for a character > \uFFFF
    } else {
      // we have a BMP character
    }
  }

-- 
RGB
Reply RedGrittyBrick 11/18/2009 3:03:17 PM

RedGrittyBrick wrote:
> 
> Mayeul wrote:
>> My point would rather be that the moment you expose a text input field 
>> to end users, 
> 
> It's a good point but it doesn't apply to Roedy's original example of
>   String categories = "amq";
> 
>> is the moment you must support (or at least reject instead of 
>> misleadingly accept) the entire Unicode range.
> 
> You are right. This is where Java lets us down.
> 
>> Users will install the cool cuneiform font and copy/paste cuneiform 
>> characters because they'll think it's cool or it helps them organize 
>> whatever they're inputting. Short version: because they can.
> 
> I'm not familiar with handling surrogate pairs, presumably you write 
> code like
> 
>  for (int i=0; i<userInput.codePointCount(0,userInput.length()); i++) {
>    int codePoint = userInput.codePointAt(i);
>    char[] chars = Character.toChars(codePoint);
>    if (chars.length == 2) {
>      // we have a surrogate pair for a character > \uFFFF
>    } else {
>      // we have a BMP character
>    }
>  }
> 

Oops, silly me, the parameter for codePointAt() is in terms of chars 
not, err, characters.

----------------------------------8<------------------------------------ 

         String userInput = "\uD834\uDD1E" /* U+1D11E */ + " G clef";
         System.out.printf("String '%s' has length %d %n",
                 userInput, userInput.length());

         for (int i=0; i<userInput.length(); i++) {
             int codePoint = userInput.codePointAt(i);
             char[] chars = Character.toChars(codePoint);
             if (chars.length == 2) {
                 // it's a surrogate pair
                 System.out.printf("%d: pair '%s' code-point %X %n",
                         i, new String(chars), codePoint);
                 i++;
             } else {
                 // it's a BMP character
                 System.out.printf("%d: character '%s' code-point %X %n",
                         i, chars[0], codePoint);
             }
         }
----------------------------------8<------------------------------------
String '? G clef' has length 9
0: pair '?' code-point 1D11E
2: character ' ' code-point 20
3: character 'G' code-point 47
4: character ' ' code-point 20
5: character 'c' code-point 63
6: character 'l' code-point 6C
7: character 'e' code-point 65
8: character 'f' code-point 66



-- 
RGB
Reply RedGrittyBrick 11/18/2009 3:27:21 PM

RedGrittyBrick wrote:
> I'm not familiar with handling surrogate pairs, presumably you write 
> code like
> 
>  for (int i=0; i<userInput.codePointCount(0,userInput.length()); i++) {
>    int codePoint = userInput.codePointAt(i);
>    char[] chars = Character.toChars(codePoint);
>    if (chars.length == 2) {
>      // we have a surrogate pair for a character > \uFFFF
>    } else {
>      // we have a BMP character
>    }
>  }
> 

I didn't mean to imply I don't know how to handle it.

The problem is more that in some circumstances where you apply 
character-by-character logic, you might need to /know/ you have to 
handle it.

Besides, and taking your following correction into account, you probably 
want to call Character.charCount() instead of Character.toChars()

-----------------------------------------------------
int codePoint;
for (int i=0; i<userInput.length(); i += Character.charCount(codePoint)) {
   codePoint = userInput.codePointAt(i);

   // Really, character-by-character code should be made
   // to work with codePoint and Character methods from here.
}
--------------------------------------------------------
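
For readers on Java 8 or later, which postdates this thread, the same
walk is available as String.codePoints(), returning an IntStream of
code points. A minimal sketch; the class name CodePointsDemo and the
helper codePointsOf are made up for the example.

```java
final class CodePointsDemo {
    // collects the code points of a string into an int array
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String userInput = "\uD834\uDD1E G clef";
        userInput.codePoints()
                 .forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```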

--
Mayeul
Reply Mayeul 11/18/2009 4:33:37 PM

On Wed, 18 Nov 2009 15:03:17 +0000, RedGrittyBrick
<RedGrittyBrick@spamweary.invalid> wrote, quoted or indirectly quoted
someone who said :

>> is the moment you must support (or at least reject instead 
>> of misleadingly accept) the entire Unicode range.
>
>You are right. This is where Java lets us down.

In Abundance, my own language, fields come with filters that describe
what chars they accept when keyed.  If you allow lower case letters
only, the upper case get translated to lower case as they are keyed,
and accented letters get their accents stripped. Invalid chars keyed
do nothing other than cause a short distinctive invalid char noise.

The type information is used to automatically generate prompt
information about what is acceptable.

Java is still downright backward when it comes to data entry. Even
KEYPUNCHES were more programmer and user friendly.

Abundance (circa 1980) did data entry, enforcing low/high bounds,
valid phone numbers, postal codes, zips, provinces, countries, states,
currency, dates, optional/mandatory, check digits, credit card
numbers,  fields, ... without any programming other than specifying
the field type and bounds. 

The world seems to have lost interest in high speed data entry.
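
A rough Java approximation of the lower-case-only filter described
above, as a sketch under stated assumptions: it filters a whole string
after the fact rather than per keystroke, it treats only a to z as
valid, and the class and method names are invented. It is not how
Abundance implements the feature.

```java
import java.text.Normalizer;
import java.util.Locale;

final class LowerCaseFilter {
    static String filter(String keyed) {
        // decompose accented letters into base letter plus combining
        // marks, then drop the marks
        String stripped = Normalizer.normalize(keyed, Normalizer.Form.NFD)
                                    .replaceAll("\\p{M}", "");
        String lower = stripped.toLowerCase(Locale.ROOT);
        StringBuilder accepted = new StringBuilder();
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            if (c >= 'a' && c <= 'z') {
                accepted.append(c);
            }
            // any other char is silently dropped here; a real data
            // entry field would signal the rejection to the user
        }
        return accepted.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("\u00C9a B1z"));
    }
}
```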

 
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
Finding a bug is a sign you were asleep at the switch when coding. Stop debugging, and go back over your code line by line.
Reply Roedy 11/20/2009 2:31:11 AM

On Wed, 18 Nov 2009 10:06:28 +0000, RedGrittyBrick
<RedGrittyBrick@spamweary.invalid> wrote, quoted or indirectly quoted
someone who said :

>
>I'd prefer my use of Java not to limit my opportunities, no matter how 
>unlikely they might seem to me today.

If you check back to how this strand got going, it was about the need
for 16, 32 or 64 bit String Iterators. I stated, if I could only have
one, it would be 16 bit. I then justified my choice. I am not
preaching deliberate ignorance on codepoints.

I am probably one of the few people who have poked around in that part
of the codespace, motivated by nothing more than raw curiosity.

Something I have noticed in my meanderings is that alphabets are NOT
designed to make the letters as visually distinct as possible. They
are primarily designed to visually blend together in an aesthetically
pleasing way.

Many alphabets have letters almost identical. I am curious to learn
how that came about. It seems like an odd thing to do in designing an
alphabet.
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
Finding a bug is a sign you were asleep at the switch when coding. Stop debugging, and go back over your code line by line.
Reply Roedy 11/20/2009 2:31:11 AM

Roedy Green wrote:
> On Wed, 18 Nov 2009 10:06:28 +0000, RedGrittyBrick
> <RedGrittyBrick@spamweary.invalid> wrote, quoted or indirectly quoted
> someone who said :
> 
>> I'd prefer my use of Java not to limit my opportunities, no matter how 
>> unlikely they might seem to me today.
> 
> If you check back to how this strand got going, it was about the need
> for 16, 32 or 64 bit String Iterators. I stated, if I could only have
> one, it would be 16 bit. I then justified my choice. I am not
> preaching deliberate ignorance on codepoints.

If I could only have one, I would have one that hides from me the number 
of bits and the encoding used by the underlying representation.

e.g.

   for (UnicodeCharacter char: astring) ...

   UnicodeCharacter char = String.UnicodeCharacterAtOrdinal(7);

Where 7 is the eighth Unicode Character in the string regardless of how 
many 8-bit or 16-bit units precede it in the underlying representation.

-- 
RGB
Reply RedGrittyBrick 11/21/2009 2:22:44 PM

Roedy Green wrote:
> 
> Something I have noticed in my meanderings is that alphabets are NOT
> designed to make the letters as visually distinct as possible. They
> are primarily designed to visually blend together in an aesthetically
> pleasing way.
> 
> Many alphabets have letters almost identical. I am curious to learn
> how that came about. It seems like an odd thing to do in designing an
> alphabet.

We are using the word alphabet rather loosely to mean the basic letter 
shapes used in modern alphabetic writing systems used for human 
communications. I imagine most commonly used alphabets don't get 
designed, they evolve, split and coalesce in an unplanned way over 
millennia.

If they were designed, it was for a different purpose (tallying, 
accounting, administration) than that for which we mostly use them.

I like http://www.textism.com/writing/?id=1
but Wikipedia is probably more enlightening:
http://en.wikipedia.org/wiki/History_of_the_alphabet
or
http://whyfiles.org/079writing/index.html

I imagine that similar looking letters can arise in the same way that 
homonyms arise. The shape of letters and the pronunciation and spelling 
of distinct words changes over hundreds of years. Eventually we have 
visual representations that now look the same (or very similar) but have 
differing meanings.

AIUI, we don't read letter-shapes. We mostly read word-shapes.

-- 
RGB
Reply RedGrittyBrick 11/21/2009 2:55:17 PM

RedGrittyBrick wrote:
> If I could only have one, I would have one that hides from me the number 
> of bits and the encoding used by the underlying representation.
> 
> e.g.
> 
>   for (UnicodeCharacter char: astring) ...
> 
>   UnicodeCharacter char = String.UnicodeCharacterAtOrdinal(7);
> 
> Where 7 is the eighth Unicode Character in the string regardless of how 
> many 8-bit or 16-bit units precede it in the underlying representation.

String#codePointAt( int index )
String#codePointCount( int beginIndex, int endIndex )

provide crude tools to sort of do that.

-- 
Lew
Reply Lew 11/21/2009 5:55:59 PM

RedGrittyBrick wrote:
>> If I could only have one, I would have one that hides from me the 
>> number of bits and the encoding used by the underlying representation.
>>
>> e.g.
>>
>>   for (UnicodeCharacter char: astring) ...
>>
>>   UnicodeCharacter char = String.UnicodeCharacterAtOrdinal(7);
>>
>> Where 7 is the eighth Unicode Character in the string regardless of 
>> how many 8-bit or 16-bit units precede it in the underlying 
>> representation.

Lew wrote:
> String#codePointAt( int index )
> String#codePointCount( int beginIndex, int endIndex )

and String#offsetByCodePoints(int index, int codePointOffset)

> provide crude tools to sort of do that.
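
Combined, those methods can stand in for the wished-for
UnicodeCharacterAtOrdinal. A minimal sketch; the class NthCodePoint and
the method codePointAtOrdinal are hypothetical names, and the result is
a plain int code point rather than a dedicated character type.

```java
final class NthCodePoint {
    // returns the code point at the given 0-based ordinal, counted in
    // code points rather than chars; throws IndexOutOfBoundsException
    // if the ordinal is out of range
    static int codePointAtOrdinal(String s, int ordinal) {
        int charIndex = s.offsetByCodePoints(0, ordinal);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String text = "\uD834\uDD1E G clef";
        System.out.printf("U+%04X%n", codePointAtOrdinal(text, 0));
        System.out.printf("U+%04X%n", codePointAtOrdinal(text, 2));
    }
}
```

The lookup scans from the start of the string each time, so it is
linear in the ordinal, which is part of why these tools are crude.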

-- 
Lew
Reply Lew 11/21/2009 5:59:53 PM
