parse two field file



Which way would you guys recommend to best parse a multiline file which contains
two fields separated by a tab? In this case it's the
Linux /proc/filesystems file, a sample of which I have included below:

nodev	usbfs
	ext3
nodev	fuse
	vfat
	ntfs
nodev	binfmt_misc
	udf
	iso9660

The first field can be "empty", in which case the line starts with the
single tab character. The separator is a tab.

Is sscanf best suited to this? Or use strtok/strtok_r?

The field I am really interested in is the second one : any hints & tips
appreciated as to do this in the most efficient manor.

-- 
Reply rgrdev (1814) 12/17/2006 12:10:16 AM

Richard wrote:
> 
> Which way would you guys recommened to best parse a multiline file
> which contains two fields seperated by a tab. In this case its the
> linux/proc/filesystems file a sample of which I have included below:
> 
> nodev   usbfs
>         ext3
> nodev   fuse
>         vfat
>         ntfs
> nodev   binfmt_misc
>         udf
>         iso9660
> 
> The first field can be "empty" and concist of only a single tab
> character. The seperator is a tab.
> 
> Is sscanf best suited to this? Or use strtok/strtok_r?
> 
> The field I am really interested in is the second one : any hints
> & tips appreciated as to do this in the most efficient manor.

Use toksplit. Call with tokchar set to '\t'. Std C code follows:

/* ------- file toksplit.h ----------*/
#ifndef H_toksplit_h
#  define H_toksplit_h

#  ifdef __cplusplus
      extern "C" {
#  endif

#include <stddef.h>

/* copy over the next token from an input string, after
   skipping leading blanks (or other whitespace?).  The
   token is terminated by the first appearance of tokchar,
   or by the end of the source string.

   The caller must supply sufficient space in token to
   receive any token.  Otherwise tokens will be truncated.

   Returns: a pointer past the terminating tokchar.

   This will happily return an infinity of empty tokens if
   called with src pointing to the end of a string.  Tokens
   will never include a copy of tokchar.

   released to Public Domain, by C.B. Falconer.
   Published 2006-02-20.  Attribution appreciated.
*/

const char *toksplit(const char *src,      /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,  /* receiver of parsed token */
                     size_t lgh);  /* length token can receive */
                                   /* not including final '\0' */

#  ifdef __cplusplus
      }
#  endif
#endif
/* ------- end file toksplit.h ----------*/

/* ------- file toksplit.c ----------*/
#include "toksplit.h"

/* copy over the next token from an input string, after
   skipping leading blanks (or other whitespace?).  The
   token is terminated by the first appearance of tokchar,
   or by the end of the source string.

   The caller must supply sufficient space in token to
   receive any token.  Otherwise tokens will be truncated.

   Returns: a pointer past the terminating tokchar.

   This will happily return an infinity of empty tokens if
   called with src pointing to the end of a string.  Tokens
   will never include a copy of tokchar.

   A better name would be "strtkn", except that is reserved
   for the system namespace.  Change to that at your risk.

   released to Public Domain, by C.B. Falconer.
   Published 2006-02-20.  Attribution appreciated.
   Revised   2006-06-13
*/

const char *toksplit(const char *src,      /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,  /* receiver of parsed token */
                     size_t lgh)   /* length token can receive */
                                   /* not including final '\0' */
{
   if (src) {
      while (' ' == *src) src++;

      while (*src && (tokchar != *src)) {
         if (lgh) {
            *token++ = *src;
            --lgh;
         }
         src++;
      }
      if (*src && (tokchar == *src)) src++;
   }
   *token = '\0';
   return src;
} /* toksplit */

#ifdef TESTING
#include <stdio.h>

#define ABRsize 6 /* length of acceptable token abbreviations */

/* ---------------- */

static void showtoken(int i, char *tok)
{
   putchar(i + '1'); putchar(':');
   puts(tok);
} /* showtoken */

/* ---------------- */

int main(void)
{
   char teststring[] = "This is a test, ,,   abbrev, more";

   const char *t, *s = teststring;
   int         i;
   char        token[ABRsize + 1];

   puts(teststring);
   t = s;
   for (i = 0; i < 4; i++) {
      t = toksplit(t, ',', token, ABRsize);
      showtoken(i, token);
   }

   puts("\nHow to detect 'no more tokens' while truncating");
   t = s; i = 0;
   while (*t) {
      t = toksplit(t, ',', token, 3);
      showtoken(i, token);
      i++;
   }

   puts("\nUsing blanks as token delimiters");
   t = s; i = 0;
   while (*t) {
      t = toksplit(t, ' ', token, ABRsize);
      showtoken(i, token);
      i++;
   }
   return 0;
} /* main */

#endif
/* ------- end file toksplit.c ----------*/
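
For the /proc/filesystems case, a call sequence along these lines might do. This is only a sketch: fs_name() and FSNAME_MAX are hypothetical names that are not part of toksplit, and toksplit() itself is repeated so the fragment compiles on its own. Note that toksplit only skips leading blanks, not tabs, so a line that starts with a tab yields an empty first token, and that the last token still carries the newline left by fgets().

```c
#include <stddef.h>
#include <string.h>

/* Copy of toksplit() from the listing above, repeated so that this
   sketch is self-contained. */
const char *toksplit(const char *src, char tokchar, char *token, size_t lgh)
{
   if (src) {
      while (' ' == *src) src++;
      while (*src && (tokchar != *src)) {
         if (lgh) {
            *token++ = *src;
            --lgh;
         }
         src++;
      }
      if (*src && (tokchar == *src)) src++;
   }
   *token = '\0';
   return src;
}

#define FSNAME_MAX 63   /* assumed upper bound on a field length */

/* Hypothetical helper: copy field 2 of one /proc/filesystems line
   into out, which must have room for FSNAME_MAX + 1 chars.
   Returns 1 if a non-empty name was found, 0 otherwise. */
int fs_name(const char *line, char *out)
{
   char first[FSNAME_MAX + 1];
   const char *rest;

   rest = toksplit(line, '\t', first, FSNAME_MAX); /* "nodev" or ""   */
   toksplit(rest, '\t', out, FSNAME_MAX);          /* the name        */
   out[strcspn(out, "\r\n")] = '\0';               /* drop the newline */
   return out[0] != '\0';
}
```

A line without any tab puts everything into the first token and leaves the second one empty, so fs_name() returns 0 for it.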

-- 
Chuck F (cbfalconer at maineline dot net)
   Available for consulting/temporary embedded and systems.
   <http://cbfalconer.home.att.net>

Reply cbfalconer (19183) 12/17/2006 3:00:02 AM




"Richard" <rgrdev@gmail.com> wrote in message 
news:jvtzzvbcwn.fsf@gmail.com...
>
> Which way would you guys recommened to best parse a multiline file which 
> contains
> two fields seperated by a tab. In this case its the
> linux/proc/filesystems file a sample of which I have included below:
>
> nodev usbfs
> ext3
> nodev fuse
> vfat
> ntfs
> nodev binfmt_misc
> udf
> iso9660
>
> The first field can be "empty" and concist of only a single tab
> character. The seperator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?
>
> The field I am really interested in is the second one : any hints & tips
> appreciated as to do this in the most efficient manor.
>
The input format is slightly quirky, so the best solution is to call fgets() 
to read a line and then parse it yourself.

int checkheader(char *str)

can check whether the string is a header or not by looking for the tab or 
counting whitespace.

parseheader(char *str, char *field1, char *field2)

will pull out the fields for you. Make sure you reject over-long strings.
Then the data fields only contain one string.

However,

void trim(char *str)

which removes leading and trailing whitespace, is a good function to have.

So too is

int checkblank(char *str)

which checks for strings which consist entirely of whitespace characters.
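
The two utility functions could look roughly like this. It is a minimal sketch rather than anyone's actual code: the const qualifier on checkblank() and the choice to treat an empty string as blank are assumptions.

```c
#include <ctype.h>
#include <string.h>

/* Remove leading and trailing whitespace from str, in place. */
void trim(char *str)
{
    char *start = str;
    size_t len;

    while (isspace((unsigned char)*start))
        start++;
    len = strlen(start);
    while (len > 0 && isspace((unsigned char)start[len - 1]))
        len--;
    memmove(str, start, len);   /* regions may overlap, so not memcpy */
    str[len] = '\0';
}

/* Return 1 if str is empty or consists entirely of whitespace
   characters, 0 otherwise. */
int checkblank(const char *str)
{
    while (*str != '\0') {
        if (!isspace((unsigned char)*str))
            return 0;
        str++;
    }
    return 1;
}
```

After trim(), a line such as "\text3\n" is just "ext3", while "nodev\tusbfs\n" becomes "nodev\tusbfs" with the separating tab still in place.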
-- 
www.personal.leeds.ac.uk/~bgy1mm
freeware games to download.


Reply regniztar (3128) 12/17/2006 9:54:08 AM

"Malcolm" <regniztar@btinternet.com> writes:

> "Richard" <rgrdev@gmail.com> wrote in message 
> news:jvtzzvbcwn.fsf@gmail.com...
>>
>> Which way would you guys recommened to best parse a multiline file which 
>> contains
>> two fields seperated by a tab. In this case its the
>> linux/proc/filesystems file a sample of which I have included below:
>>
>> nodev usbfs
>> ext3
>> nodev fuse
>> vfat
>> ntfs
>> nodev binfmt_misc
>> udf
>> iso9660
>>
>> The first field can be "empty" and concist of only a single tab
>> character. The seperator is a tab.
>>
>> Is sscanf best suited to this? Or use strtok/strtok_r?
>>
>> The field I am really interested in is the second one : any hints & tips
>> appreciated as to do this in the most efficient manor.
>>
> The input format is slightly quirky, so the best solution is to call fgets() 
> to read a line and then parse it yourself.
>
> int checkheader(char *str)
>
> can check whether the string is a header or not by looking for the tab or 
> counting whitespace.
>
> parseheader(char *str, char *field1, char *field2)
>
> will pull out the fields for you. make sure you reject over-long strings.
> Then the data fields only contain one string.
>
> However
>
> void trim(char *str)
>
> which removes leading and trailing whitespace is a good function to have.
>
> so too is
> int checkblank(char *str)
>
> which checks for strings which consist entirely of whitespace
> characters.

I just did sscanf("%s%s",f1,f2) in the end.
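
One thing to watch with that shortcut: %s skips leading white space, so on a line whose first field is empty the name ends up in f1 and sscanf() returns 1 rather than 2. A sketch that checks the return value and bounds the conversions might look like this (fs_name_sscanf() and the 63-character limit are made up for illustration):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the filesystem name from one line into
   name, which must have room for 64 chars.
   Returns 1 on success, 0 if the line holds no fields at all. */
int fs_name_sscanf(const char *line, char *name)
{
    char f1[64], f2[64];
    int n = sscanf(line, "%63s%63s", f1, f2);

    if (n == 2)
        strcpy(name, f2);   /* "nodev<TAB>name": name is the 2nd word */
    else if (n == 1)
        strcpy(name, f1);   /* "<TAB>name": the only word is the name */
    else
        return 0;           /* blank line or empty string */
    return 1;
}
```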

-- 
Reply rgrdev (1814) 12/17/2006 3:24:15 PM

Richard wrote:
> Which way would you guys recommened to best parse a multiline file which contains
> two fields seperated by a tab. In this case its the
> linux/proc/filesystems file a sample of which I have included below:
> 
> nodev	usbfs
> 	ext3
> nodev	fuse
> 	vfat
> 	ntfs
> nodev	binfmt_misc
> 	udf
> 	iso9660
> 
> The first field can be "empty" and concist of only a single tab
> character. The seperator is a tab.
> 
> Is sscanf best suited to this? Or use strtok/strtok_r?

     strtok(..., "\t") will give the same result for "\tfoo"
and "\t\tfoo\t" and "foo".  If you *know* that the input has
two tab-separated fields and that only the first (never the
second) can be empty, you can get this to work: If strtok()
finds two fields they are #1 and #2, but if it finds only
one it is #2 with #1 empty.
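
     A rough sketch of that field-counting idea, for illustration only: second_by_count() is a made-up name, and "\n" is added to the delimiter set (an assumption) so that the newline left by fgets() is dropped.

```c
#include <string.h>

/* Hypothetical helper: return a pointer to field #2 inside line, or
   NULL if the line has no fields.  strtok() overwrites delimiters in
   line, so line must be writable.  *has_first is set to 1 if two
   fields were found, 0 if only one was. */
char *second_by_count(char *line, int *has_first)
{
    char *a = strtok(line, "\t\n");
    char *b = (a != NULL) ? strtok(NULL, "\t\n") : NULL;

    if (a == NULL)
        return NULL;
    *has_first = (b != NULL);
    return (b != NULL) ? b : a;   /* lone field is taken to be #2 */
}
```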

     However, it makes me queasy to put that much faith in an
input source I don't control programmatically.  Who knows?
Maybe in six months somebody will extend the format, adding
an optional third field.  If that happened, then the field-
counting approach would misinterpret "\tfoo\tbar" as if it
were "foo\tbar".  It would be better to adopt a method that
would complain about "\tfoo\tbar" than to be fooled by it.

     fgets() plus sscanf() is a possibility, but it's a bit
tricky to use: The obvious "%s\t%s" will not do what you
want.  (The first "%s" will skip any leading white space,
leaving you in the same hole as the strtok() approach, and
the "\t" will match any amount of any kind of white space,
tabs or other.)  Something like "%[^\t]%*1[\t]%s" would do
a little better, but still wouldn't be fully satisfactory:
It would match the prefix of "foo\tbar baz goozle frobnitz"
without any warning of the trailing junk.  You could use
"%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
consumed the entire string ...

     ... but wouldn't it be simpler just to pick the line
apart for yourself?  Read it in with fgets(), use strchr()
to find the first tab (syntax error if there isn't one), and
the first (possibly empty) field is everything from the start
to just before the tab.  Then start just after the tab and use
strchr() again to find the terminating '\n'; the second field
is everything from just after the tab to just before the '\n'
(syntax error if its length is zero).  You can use strcspn()
to check that the second field contains no white space and
squawk if it does (somebody added a third field you don't
understand).
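
     The steps above can be sketched as follows.  The function name, the parameter list and the -1 error convention are all invented for illustration; the caller is assumed to have read the line with fgets().

```c
#include <string.h>

/* Hypothetical helper: split one line into its two tab-separated
   fields.  Returns 0 on success, -1 on any syntax error or if a
   field does not fit its buffer. */
int split_line(const char *line, char *f1, size_t f1size,
               char *f2, size_t f2size)
{
    const char *tab = strchr(line, '\t');
    const char *start, *nl;
    size_t len1, len2;

    if (tab == NULL)
        return -1;                    /* no tab at all */
    len1 = (size_t)(tab - line);      /* may be zero: empty field 1 */
    start = tab + 1;

    nl = strchr(start, '\n');
    if (nl == NULL)
        return -1;                    /* truncated or unterminated line */
    len2 = (size_t)(nl - start);
    if (len2 == 0)
        return -1;                    /* empty second field */

    /* The first white space after the tab must be the '\n' itself. */
    if (strcspn(start, " \t\r\n\f\v") != len2)
        return -1;                    /* somebody added a third field */

    if (len1 >= f1size || len2 >= f2size)
        return -1;                    /* over-long field */
    memcpy(f1, line, len1);
    f1[len1] = '\0';
    memcpy(f2, start, len2);
    f2[len2] = '\0';
    return 0;
}
```

     With this, "\tfoo\tbar\n" is rejected instead of being read as if it were "foo\tbar".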

> The field I am really interested in is the second one : any hints & tips
> appreciated as to do this in the most efficient manor.

     The "most efficient manor" is the house of Usher.  Resist
this unnecessary impulse for efficiency, lest your program meet
the same fate as did that storied manse.

     (In other words: How long is this file, anyhow?  How many
times will you scan its contents?  If you sped up the scanning
by a factor of four hundred twenty gazillion, how much faster
would the program as a whole run?  If you give your SUV a coat
of wax, will you improve its fuel economy by making it slipperier
or harm it by adding weight?)

-- 
Eric Sosman
esosman@acm-dot-org.invalid
Reply esosman (1335) 12/17/2006 3:37:28 PM

On Sun, 17 Dec 2006 01:10:16 +0100, Richard <rgrdev@gmail.com> wrote:
> Which way would you guys recommened to best parse a multiline file
> which contains two fields seperated by a tab. In this case its the
> linux/proc/filesystems file a sample of which I have included below:
>
> nodev	usbfs
> 	ext3
> nodev	fuse
> 	vfat
> 	ntfs
> nodev	binfmt_misc
> 	udf
> 	iso9660
>
> The first field can be "empty" and concist of only a single tab
> character. The seperator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?

strtok() is not so nice, because it modifies the string you pass to it
(it overwrites each delimiter it finds with '\0').  I would probably use
strcspn() for this, with something like:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXLINE 256

    static void doline(char *buf, size_t bufsize);

    int
    main(void)
    {
        char buf[MAXLINE];
        FILE *fp;

        /*
         * Add code here that opens /proc/filesystems file, instead of using
         * `stdin' as the input file.
         */
        fp = stdin;

        clearerr(fp);
        while (fgets(buf, sizeof buf, fp) != NULL) {
            doline(buf, sizeof buf);
        }
        if (ferror(fp) != 0) {
            perror("fgets");
            exit(EXIT_FAILURE);
        }
        /*
         * Add code here that closes the open file referenced by `fp'.
         */

        return EXIT_SUCCESS;
    }

    static void
    doline(char *buf, size_t bufsize)
    {
        char *field;
        size_t pos, pos2, fieldsize;

        assert(buf != NULL && bufsize > 0);
        (void)bufsize;

        pos = strcspn(buf, "\t");
        if (buf[pos] == '\0') {
            fprintf(stderr,
                "warning: no TAB in `%s', skipping this line\n", buf);
            return;
        }
        pos2 = strcspn(buf + pos + 1, "\t");

        fieldsize = pos2 + 1;
        field = malloc(fieldsize);
        if (field == NULL) {
            perror("malloc");
            return;
        }
        strncpy(field, buf + pos + 1, fieldsize - 1);
        field[fieldsize - 1] = '\0';
        field[strcspn(field, "\n\r")] = '\0';
        printf("%s\n", field);
        free(field);
    }

The trick is to use strcspn() to find the 'part' of the original
string which you are interested in, and then you can do whatever you
like with this part.  In this particular program, I temporarily
allocate a new string buffer, copy the original contents into this new
buffer, print the buffer and release its memory.  Any other way you can
think of to use this substring is fine too :)

Reply keramida (459) 12/26/2006 3:14:22 AM

On Sun, 17 Dec 2006 10:37:28 -0500, Eric Sosman
<esosman@acm-dot-org.invalid> wrote:

> Richard wrote:
> > Which way would you guys recommened to best parse a multiline file which contains
> > two fields seperated by a tab. <snip>
>      strtok(..., "\t") will [lose empty fields]

Right.

>      fgets() plus sscanf() is a possibility, but it's a bit
> tricky to use: The obvious "%s\t%s" will not do what you
> want.  (The first "%s" will skip any leading white space,
> leaving you in the same hole as the strtok() approach, and
> the "\t" will match any amount of any kind of white space,
> tabs or other.)  Something like "%[^\t]%*1[\t]%s" would do
> a little better, but still wouldn't be fully satisfactory:

Not enough better. If the first field is empty and thus the first
%[^\t] matches nothing, *scanf stops and doesn't do the %*1[\t]s.

This is effectively the same problem of the people who periodically
try to use {,f}scanf to replace <ILLEGAL> fflush (input) </>.
(Some people, including IIRC Dan Pop, have recommended e.g. 
  if( scanf ("%*[^\n]%*1[\n]") < 2 ) getchar ();
but I consider that too much uglier than the obvious, though slightly
longer and possibly slightly less efficient
  while( (ch = getchar()) != EOF && ch != '\n' ) ;
etc.)

Plus unbounded %[...] or %s risks buffer overflow and resulting UB.
You should specify a length at most one less than the buffer size.
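
Putting those two points together, a bounded sscanf() version that copes with the empty first field might look something like this. It is a sketch only: scan_line() and the 63-character field limit are invented here, and the empty-field case is handled by hand because %[^\t] cannot match nothing.

```c
#include <stdio.h>

/* Hypothetical helper: f1 and f2 must each have room for 64 chars.
   Returns 0 if the whole line was consumed as exactly two fields,
   -1 otherwise. */
int scan_line(const char *line, char *f1, char *f2)
{
    int used = -1;   /* stays -1 if the scan stops before %n */

    if (line[0] == '\t') {
        /* Empty first field: step over the tab ourselves. */
        f1[0] = '\0';
        line += 1;
        if (sscanf(line, "%63s %n", f2, &used) != 1)
            return -1;
    } else if (sscanf(line, "%63[^\t]%*1[\t]%63s %n",
                      f1, f2, &used) != 2) {
        return -1;
    }
    /* The blank before %n eats the trailing newline; anything left
       after that is trailing junk. */
    return (used >= 0 && line[used] == '\0') ? 0 : -1;
}
```

This is still more lenient than a hand-written parser, since %s skips leading white space and so lets a doubled tab through.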

> It would match the prefix of "foo\tbar baz goozle frobnitz"
> without any warning of the trailing junk.  You could use
> "%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
> consumed the entire string ...
> 
>      ... but wouldn't it be simpler just to pick the line
> apart for yourself?  Read it in with fgets(), use strchr()
> to find the first tab <snip>

Yes.

>      The "most efficient manor" is the house of Usher.  Resist
> this unnecessary impulse for efficiency, lest your program meet
> the same fate as did that storied manse.
> 
Yes. Or even the hundred-year shay, IIRC grade school. <G>

- David.Thompson1 at worldnet.att.net
Reply david.thompson1 (1042) 1/3/2007 7:42:00 PM
