Very fast delimited record parsing with boost

  • Follow


We need to process a very large amount of delimited variable length
ASCII data in files as large as 3-4 gigs. We need a high performance
parser for this and as always, we have no money to buy one. We are ok
with building one as long as that can be done quick enough and I was
wondering if Boost has a panacea for us. Can anyone help with their
ideas / experience.

I am also very open to any suggestions outside Boost. Any outline on
how to build such a parser would be very welcome. If some comparative
performance figures can be mentioned, it would be of tremendous help.
Any fast C++ library would be of help.

We develop a market analytics tool on HP-UX and Linux on 32/64 bits.

Cheers,
Andy


      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

0
Reply garth_rockett (50) 8/30/2005 1:22:00 PM

Hi Andy,

Could be a classic case of premature optimisation:  if you're
processing 3-4 gigs per file, you may find that the disk I/O is the
bottleneck, and if so it's likely to be several orders of magnitude
slower than any parser (that wasn't written either as a joke and/or
designed by a large committee of middle and senior managers).  Quite
honestly, parsing delimited records is childs play: do it with
streaming operators or even (s/f)scanf to get you started (depending on
whether you use memory mapping for file I/O or not), and optimise if
needed.  If performance turns out to be an issue, and it's not disk
I/O, then my guess is you'll find your data processing takes more time
than your parsing.

Regards,

Tony


      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

0
Reply Tony 8/30/2005 9:59:28 PM


"Andy" <garth_rockett@yahoo.com> writes:

> We need to process a very large amount of delimited variable length
> ASCII data in files as large as 3-4 gigs. We need a high performance
> parser for this and as always, we have no money to buy one. We are ok
> with building one as long as that can be done quick enough and I was
> wondering if Boost has a panacea for us. Can anyone help with their
> ideas / experience.

Did you cross-post your question on one of the boost mailing lists?

http://www.boost.org/more/mailing_lists.htm

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com

      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

0
Reply David 8/30/2005 10:04:56 PM

Andy wrote:
> We need to process a very large amount of delimited variable length
> ASCII data in files as large as 3-4 gigs. We need a high performance
> parser for this and as always, we have no money to buy one. We are ok
> with building one as long as that can be done quick enough and I was
> wondering if Boost has a panacea for us. Can anyone help with their
> ideas / experience.
> 
> I am also very open to any suggestions outside Boost. Any outline on
> how to build such a parser would be very welcome. If some comparative
> performance figures can be mentioned, it would be of tremendous help.
> Any fast C++ library would be of help.
> 
> We develop a market analytics tool on HP-UX and Linux on 32/64 bits.

I'd use Flex, which you will have *free* with your Linux distro.

	http://www.gnu.org/software/flex/

It is table-driven, and extremely fast.  It also works with Bison - 
which is also table-driven and fast, but by the sound of things you 
probably need Flex without Bison.

You may run into performance issues with Spirit - I mean it's fast 
enough for most things but it uses recursive descent which is slower 
than table-driver parsers.  Your bottleneck may be the module consuming 
the data.

Calum

      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

0
Reply Calum 8/30/2005 10:27:13 PM

"Andy" <garth_rockett@yahoo.com> wrote in message 
news:1125400901.786820.191440@g47g2000cwa.googlegroups.com...
> We need to process a very large amount of delimited variable length
> ASCII data in files as large as 3-4 gigs. We need a high performance
> parser for this and as always, we have no money to buy one. We are ok
> with building one as long as that can be done quick enough and I was
> wondering if Boost has a panacea for us. Can anyone help with their
> ideas / experience.

You don't need no parser generator. In fact all you need is to get familiar with 
istream::ignore() function.
We routinely parse multi-gigabyte data files, and the parser (even for complex 
syntax) was never a bottleneck. It may seem counterintuitive, but the bigger the 
data size, the smaller is the contribution of the parser, because usually 
parsing portion has a linear complexity, while most non-trivial data processing 
algorithms tend to have super-linear complexity.

- gene 


      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

0
Reply Gene 8/31/2005 8:42:10 AM

Calum Grant wrote:
> Andy wrote:

> I'd use Flex, which you will have *free* with your Linux distro.

>       http://www.gnu.org/software/flex/

> It is table-driven, and extremely fast.

This depends on the tables -- if the regular expressions are
complex, either the tables are very big (causing page faults),
or they are compressed somehow (which slows things down).

> It also works with Bison - which is also table-driven and
> fast, but by the sound of things you probably need Flex
> without Bison.

> You may run into performance issues with Spirit - I mean it's
> fast enough for most things but it uses recursive descent
> which is slower than table-driver parsers.

First, as you say, it sounds like he needs Flex.  Supposing he
needs something that complicated at all -- my impression is that
std::find might even be enough.  I'm not familiar with Spirit,
but if it generates a recursive descent parser, it's closer to
Bison than to Flex (and my experience is that hand written
recursive descent parsers are often faster than what Bison
generates).  If std::find is not enough, he might want to look
into boost::regex, although I suspect that there, he will pay a
significant performance price for generality that he probably
doesn't need.

FWIW: my regular expression class builds a DFA lazily; it also
has a function to build the complete DFA at the start, and
functions to extract the DFA elements, in order to put them into
a separate pre-compiled table, which can then be loaded by
another program.  It pays for this capability by being a lot
less flexible than boost::regex, and I've often found myself
regretting that I cannot (yet) replace it with the Boost version
in my own code.  But it could be that Andy has stumbled on one
of the rare cases where my class would be preferable.

> Your bottleneck may be the module consuming the data.

IO *is* likely to be a bottleneck.  Probably the bottleneck.
For starters, he'd probably gain by doing raw, system level IO;
for sequencial reading, this may even be faster than memory
mapping.  I had a similar problem once in the distant past,
where a program had to read lines of text (processed as lines)
very, very quickly.  The solution I used was to read the file in
large blocks, scanning from the start to find the '\n', and
processing in place.  (The "readLine" function return a pointer
to where the line was located in the buffer.)  If I didn't find
a '\n', I copied what was left of the buffer down to the
beginning, and read exactly behind it.  On the system I was
using at the time, it also turned out that it was faster if 1) I
ensured that the target address for the system reads was
properly aligned, and 2) all system reads had a size which was a
multiple of the segment size on the disk.  (For example: if I
had 13 bytes left without a '\n', I would copy them to the
address buffer + 3, then read buffer size - disk segment size of
data at the address buffer + 16.  Also, I simply punted on lines
which were longer than the buffer size, and refused to process
them.  Given that the actual lines were normally less than 80
characters, and never more than 136, and the buffer size was
something like 16 KBytes -- very big back then, this was a
reasonable restriction.)

--
James Kanze                                           GABI Software
Conseils en informatique orient�e objet/
                    Beratung in objektorientierter Datenverarbeitung
9 place S�mard, 78210 St.-Cyr-l'�cole, France, +33 (0)1 30 23 00 34


      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

0
Reply kanze 8/31/2005 10:40:01 AM

Have a look at the split_iterator in the boost string_algo library.

Unfortunatly it only works with forward iterators.


      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

0
Reply wpcmame 8/31/2005 10:42:16 AM

Gene, Your idea is excellent. I proposed a similar idea in response to
a different OP who had a similar question on Aug 19, 2005 but the OP
never responded.

"slyi, Actually the std::istringstream class can handle more than
white-space character delimiters, as my earlier example attempted to
show:

istringstream iss(line);
while (iss)
{
   iss >> word;
   iss.ignore(1, ' '); // the delimiter does not have to be white space

}


The ignore member function takes two arguments, the first is the number

of characters to be extracted and ignored and the second is the
delimiter character. In effect, the class istringstream gives you an
elegant tokenizer DFA for free so that you don't have to write your own

tokenizer C++ class."


  Gene, I am wondering about what you mean by super-complexity? Is that
like 0(n log n) or 0(n^3)? Thank you.


      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

0
Reply Frank 8/31/2005 2:55:17 PM

"Frank Chang" <FrankChang91@gmail.com> wrote in message 
news:1125492229.689926.263740@g47g2000cwa.googlegroups.com...
....
>  Gene, I am wondering about what you mean by super-complexity? Is that
> like 0(n log n) or 0(n^3)? Thank you.

Superlinear complexity means that it grows faster than linear. When your 
application is processing gigabytes of data, you can't really afford anything 
worse than O(n log n) or you will be out of business in a matter of a few years.

- gene 


      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

0
Reply Gene 8/31/2005 9:01:18 PM

kanze@gabi-soft.fr wrote:
> Calum Grant wrote:

>>Your bottleneck may be the module consuming the data.
> 
> 
> IO *is* likely to be a bottleneck.  Probably the bottleneck.

Yes - this is generally the case.  Without knowing more about the 
problem, it's not possible to know whether the bottleneck will be IO in 
or IO out.  If the task is to find the sum of a large table of numbers, 
then that will be input-bound.  If the results are reformatted and 
written to a database, then the bottleneck will be in the output.

I've had very good experience with memory-mapped files (e.g. 
http://lightwave2.com/persist) where the performance is identical to 
normal RAM.  So I would say that this would be the ultimate way to go 
the squeeze that extra little bit of performance out of the parser. 
Instead of parsing a file, point the parser to the memory-mapped data. 
However with 3-4 Gigs it's not going to fit into a 32-bit address space, 
so you'd need to parse the file in chunks.

Cheers, Calum

> For starters, he'd probably gain by doing raw, system level IO;
> for sequencial reading, this may even be faster than memory
> mapping.  I had a similar problem once in the distant past,
> where a program had to read lines of text (processed as lines)
> very, very quickly.  The solution I used was to read the file in
> large blocks, scanning from the start to find the '\n', and
> processing in place.  (The "readLine" function return a pointer
> to where the line was located in the buffer.)  If I didn't find
> a '\n', I copied what was left of the buffer down to the
> beginning, and read exactly behind it.  On the system I was
> using at the time, it also turned out that it was faster if 1) I
> ensured that the target address for the system reads was
> properly aligned, and 2) all system reads had a size which was a
> multiple of the segment size on the disk.  (For example: if I
> had 13 bytes left without a '\n', I would copy them to the
> address buffer + 3, then read buffer size - disk segment size of
> data at the address buffer + 16.  Also, I simply punted on lines
> which were longer than the buffer size, and refused to process
> them.  Given that the actual lines were normally less than 80
> characters, and never more than 136, and the buffer size was
> something like 16 KBytes -- very big back then, this was a
> reasonable restriction.)
> 
> --
> James Kanze                                           GABI Software
> Conseils en informatique orient�e objet/
>                     Beratung in objektorientierter Datenverarbeitung
> 9 place S�mard, 78210 St.-Cyr-l'�cole, France, +33 (0)1 30 23 00 34

      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

0
Reply Calum 9/1/2005 9:16:13 AM

9 Replies
777 Views

(page loaded in 0.61 seconds)

Similiar Articles:













7/22/2012 11:15:11 PM


Reply: