I need to parse a file containing about 40,000 records. Each record
would look like
string1|string2|number1|number1|float
I need to read each field of a record, process some information
based on the data, and store everything in an internal data
structure.
What is the most efficient way to do this? Is it better to use an
istream, read each line into a std::string, and then parse each string?
Or should I use char arrays and C-style parsing?
Should I first load the file into a buffer and then parse the large
buffer? Or should I parse and process the file line by line?
Help!!
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
coding | 11/23/2009 11:47:32 PM
coding junkie wrote:
> I need to parse a file containing about 40,000 records. Each record
> would look like
>
> string1|string2|number1|number1|float
>
> I need to read each field in this string and to process some
> information based on this data and store all this data in an internal
> data structure.
>
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process wach line by lineof the file.
I answered a similar question to this on StackOverflow a while back
(which of course I can't find at the moment) by posting the results of
some tests I did that showed that reading into a large buffer was about
twice as fast as reading line by line into a string. You will have to
decide if a 2x increase in code speed (which isn't all that much) is
worth the effort. Personally, I would get the application working with
the simplest code possible and only look at optimising reads if the
performance is actually unacceptable in practice.
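For anyone who wants to repeat this kind of comparison, here is a minimal sketch of the two reading strategies. The function names are made up for illustration and the bodies only count non-empty records; this is not the code from the tests mentioned above, and it implies nothing about the speed ratio on your machine.

```cpp
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Approach 1: read one line at a time into a std::string.
std::size_t countRecordsLineByLine(std::istream& in) {
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) ++count;   // a real program would parse the line here
    }
    return count;
}

// Approach 2: read the whole stream into one buffer, then walk the buffer.
std::size_t countRecordsFromBuffer(std::istream& in) {
    std::string buffer((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
    std::size_t count = 0;
    std::size_t start = 0;
    while (start < buffer.size()) {
        std::size_t end = buffer.find('\n', start);
        if (end == std::string::npos) end = buffer.size();
        if (end > start) ++count;     // [start, end) is one non-empty record
        start = end + 1;
    }
    return count;
}
```

Both functions take any std::istream, so the same code can be fed a std::ifstream for real data or a std::istringstream for a quick check.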
Neil Butterworth
Neil | 11/24/2009 7:37:08 AM
On Nov 23, 4:47 pm, coding junkie <coding.junkie...@gmail.com> wrote:
> I need to parse a file containing about 40,000 records. Each record
> would look like
>
> string1|string2|number1|number1|float
>
> I need to read each field in this string and to process some
> information based on this data and store all this data in an internal
> data structure.
>
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process wach line by lineof the file.
>
> Help!!
I suggest using std::getline() to read each line from an istream into
a std::string and then parsing the string into fields.
Reading the entire file into a buffer uses more memory than reading
line by line. Also, you will be able to catch line errors, such as the
wrong number of fields, as you go. Reading the entire file at once
makes these kinds of errors harder to catch.
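A rough sketch of that idea, assuming the five-field record layout from the question. The function name findBadLines is hypothetical. Note one limitation of splitting with std::getline: an empty trailing field (a line ending in '|') is not counted.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads the stream line by line and returns the 1-based numbers of the
// lines whose '|'-separated field count differs from expectedFields.
std::vector<std::size_t> findBadLines(std::istream& in,
                                      std::size_t expectedFields) {
    std::vector<std::size_t> bad;
    std::string line;
    std::size_t lineNo = 0;
    while (std::getline(in, line)) {
        ++lineNo;
        std::istringstream fieldStream(line);
        std::string field;
        std::size_t fieldCount = 0;
        // std::getline with a delimiter pulls out one field per call.
        while (std::getline(fieldStream, field, '|')) ++fieldCount;
        if (fieldCount != expectedFields) bad.push_back(lineNo);
    }
    return bad;
}
```

Because the line number is known at the point of failure, the error message can point the user at the exact record.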
Our group created a delimiter-separated file reader class which does
exactly this. Its constructor opened the file. It contained a
nextLine function which read in the next line, and accessor functions
that allowed getting each field as an int, double, or string either by
name (the first line in the file contained column headings) or by
position.
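A class along those lines might look roughly like this. It is a guess at the shape, not our actual code: the class name, the accessor names, and the error handling are all invented for the example.

```cpp
#include <cstddef>
#include <fstream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical reader for delimiter-separated files with a heading line.
class DelimitedFileReader {
public:
    explicit DelimitedFileReader(const std::string& path, char delim = '|')
        : in_(path.c_str()), delim_(delim) {
        if (!in_) throw std::runtime_error("cannot open " + path);
        // The first line holds the column headings; remember their positions.
        if (nextLine()) {
            for (std::size_t i = 0; i < fields_.size(); ++i)
                columns_[fields_[i]] = i;
        }
    }

    // Reads and splits the next line; returns false at end of file.
    bool nextLine() {
        std::string line;
        if (!std::getline(in_, line)) return false;
        fields_.clear();
        std::istringstream fieldStream(line);
        std::string field;
        while (std::getline(fieldStream, field, delim_)) fields_.push_back(field);
        return true;
    }

    // Access by position (throws std::out_of_range for a bad index).
    const std::string& getStringAt(std::size_t pos) const { return fields_.at(pos); }
    int getIntAt(std::size_t pos) const { return std::stoi(getStringAt(pos)); }
    double getDoubleAt(std::size_t pos) const { return std::stod(getStringAt(pos)); }

    // Access by column heading.
    const std::string& getString(const std::string& name) const {
        return getStringAt(indexOf(name));
    }
    int getInt(const std::string& name) const { return getIntAt(indexOf(name)); }
    double getDouble(const std::string& name) const { return getDoubleAt(indexOf(name)); }

private:
    std::size_t indexOf(const std::string& name) const {
        auto it = columns_.find(name);
        if (it == columns_.end()) throw std::out_of_range("unknown column " + name);
        return it->second;
    }

    std::ifstream in_;
    char delim_;
    std::vector<std::string> fields_;
    std::map<std::string, std::size_t> columns_;
};
```

Typical use is a loop of the form `while (reader.nextLine()) { ... reader.getInt("count") ... }`, where "count" stands for whatever heading the file actually uses.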
HTH
AnonMail2005 | 11/24/2009 7:52:04 AM
On Nov 23, 1:47 pm, coding junkie <coding.junkie...@gmail.com> wrote:
> I need to parse a file containing about 40,000 records. Each record
> would look like
>
> string1|string2|number1|number1|float
>
> I need to read each field in this string and to process some
> information based on this data and store all this data in an internal
> data structure.
>
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process wach line by lineof the file.
>
Get the code correct first, and then worry about "efficiency". And
BENCHMARK BENCHMARK BENCHMARK before you do "efficiency" tweaks.
I'd use your first method -- read line by line into a std::string.
If benchmarking/profiling then shows that 1) you're too slow, and 2)
the bottleneck is in the parsing, then -- and ONLY THEN -- should you
look at optimizing the parse.
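As a starting point for the benchmarking step, a small timing harness could look like the sketch below. All names are illustrative, the input is synthetic and held in memory so that disk speed is not part of the measurement, and a single run like this is only a rough indication; a profiler gives a better picture of where time goes.

```cpp
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>

// Runs the callable once and returns the elapsed wall-clock milliseconds.
template <typename F>
double elapsedMs(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Builds synthetic input: the same made-up record repeated `records` times.
std::string makeSampleData(std::size_t records) {
    std::string data;
    for (std::size_t i = 0; i < records; ++i) data += "alpha|beta|1|2|3.5\n";
    return data;
}

// The candidate being measured: line-by-line reading that counts records.
std::size_t countLines(const std::string& data) {
    std::istringstream in(data);
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) ++n;
    return n;
}
```

A call such as `elapsedMs([&] { n = countLines(data); })` times one candidate; swapping in a different lambda times another on the same data.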
red | 11/24/2009 7:52:48 AM
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process wach line by lineof the file.
>
In your case I'd probably just read the file line by line. I make my
case as follows.
1) You probably wouldn't notice the optimization. 40,000 records
really isn't all that much data, and whatever efficiency you might gain
from blobbing it up and trying to parse it all in RAM will be hard to
detect.
2) You may not get the optimization you think. The object of blobbing
it up in RAM is to get rid of the 40k function calls to read the line
and allocate the string for it. You may well wind up replacing those
40k function calls with some other function calls to break the memory
image up line by line. In any case, your file will likely be read in
blocks behind the scenes anyway. Check out what you can do by
changing your buffer sizes.
3) The line-by-line approach scales better. If perchance your input
file grows to 400 million rows, your utility would still work, and you
would not have to change a thing if you did it line by line.
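On the buffer size point in 2), one way to experiment is to hand the file stream a larger buffer of your own. This is only a sketch with an invented function name. It assumes pubsetbuf is called before the file is opened, which is where common implementations honor it; the standard leaves the effect implementation-defined, so time it before relying on it.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Counts the lines in a file while using a caller-chosen stream buffer size.
std::size_t countLinesWithBuffer(const std::string& path,
                                 std::size_t bufferBytes) {
    // Declared before the stream so that it outlives the stream.
    std::vector<char> buffer(bufferBytes);
    std::ifstream in;
    // Assumption: set the buffer before open(); the effect is
    // implementation-defined.
    in.rdbuf()->pubsetbuf(buffer.data(),
                          static_cast<std::streamsize>(buffer.size()));
    in.open(path.c_str());
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) ++n;   // still plain line-by-line reading
    return n;
}
```

The reading loop is unchanged from the simple line-by-line version, which is the appeal: the buffer size can be tuned without restructuring the parser.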
I've been playing with Boost Spirit to do file parsing. Do check it
out! The Boost folks have some giant brains; they are like the super
brilliant aliens in the original Star Trek, and all you and I can do
is beg for Quatloos and hope to use their stuff. Spirit is pretty cool
because it can handle vertical-bar-delimited records like yours pretty
easily, but it can also be used to write your own compiler. That's
pretty cool, I must say.
stork | 11/24/2009 7:54:32 AM
coding junkie wrote:
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
If your target is Linux, then mmap() might help. But I agree with the
others: profile and see where the bottleneck is.
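A minimal POSIX sketch of the mmap() route, with a made-up function name. It only counts records; a real parser would walk the fields between the same pointers. Whether it beats ordinary stream reads for a 40,000-record file is something only a measurement can tell.

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps a file read-only and counts its records (one per line).
// Returns -1 on any error. POSIX only.
long countRecordsMmap(const char* path) {
    const int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }   // mmap rejects length 0
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* mapped = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                      // the mapping stays valid
    if (mapped == MAP_FAILED) return -1;

    const char* cur = static_cast<const char*>(mapped);
    const char* const end = cur + size;
    long count = 0;
    while (cur < end) {
        const void* nl =
            std::memchr(cur, '\n', static_cast<std::size_t>(end - cur));
        ++count;                      // [cur, nl) or [cur, end) is one record
        if (nl == nullptr) break;     // final line had no trailing newline
        cur = static_cast<const char*>(nl) + 1;
    }
    munmap(mapped, size);
    return count;
}
```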
Vladimir | 11/24/2009 6:04:02 PM
coding junkie wrote:
> I need to parse a file containing about 40,000 records. Each record
> would look like
>
> string1|string2|number1|number1|float
>
> I need to read each field in this string and to process some
> information based on this data and store all this data in an internal
> data structure.
>
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process wach line by lineof the file.
>
> Help!!
>
It seems like boost::spirit has been popping up in the discussions
fairly frequently here.
I'd suggest using spirit for your needs here, as it seems the perfect
fit for it.
Not only will it make your code cleaner but it also has a chance to be
very efficient.
But please, check out the "Using boost::spirit in production code"
thread from Nov 10th before you decide to use it.
HTH,
Andy.
Andy | 11/24/2009 10:37:38 PM
On Nov 23, 4:47 pm, coding junkie <coding.junkie...@gmail.com> wrote:
> I need to parse a file containing about 40,000 records. Each record
> would look like
>
> string1|string2|number1|number1|float
>
> I need to read each field in this string and to process some
> information based on this data and store all this data in an internal
> data structure.
>
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process wach line by lineof the file.
>
Lately we have been going over our code base looking to
wring out better performance. One way to do that was to
parallelize some of our algorithms and take advantage of
multiple cores. I know C++ doesn't officially support
multiple cores, but there are some libraries that do.
We are using the Intel Threading Building Blocks library
and see drastic improvement in some areas, modest improvement in
others, and sometimes even a drop in performance.
Perhaps your best performance would come from reading the file on one
thread and passing the parsing to an idle core.
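To illustrate the shape of that idea without TBB, here is a sketch that uses std::async from C++11, which is newer than this thread. It is not the TBB code we run. The names are invented, the "parsing" is a stand-in that sums line lengths, and the lines are assumed to be already read into memory. On some toolchains the program must be built with thread support enabled (for example -pthread).

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Stand-in for real per-record work: sums the lengths of lines [first, last).
std::size_t parseChunk(const std::vector<std::string>& lines,
                       std::size_t first, std::size_t last) {
    std::size_t total = 0;
    for (std::size_t i = first; i < last; ++i) total += lines[i].size();
    return total;
}

// Splits the lines into roughly equal chunks and parses each chunk in its
// own task, then combines the partial results.
std::size_t parseInParallel(const std::vector<std::string>& lines,
                            std::size_t workers) {
    if (workers == 0) workers = 1;
    const std::size_t chunk = (lines.size() + workers - 1) / workers;
    std::vector<std::future<std::size_t>> pending;
    for (std::size_t first = 0; first < lines.size(); first += chunk) {
        const std::size_t last = std::min(first + chunk, lines.size());
        pending.push_back(std::async(std::launch::async, [&lines, first, last] {
            return parseChunk(lines, first, last);
        }));
    }
    std::size_t total = 0;
    for (auto& result : pending) total += result.get();   // wait and combine
    return total;
}
```

As with our own experience, the gain depends on how much work each record needs; for cheap parsing the task overhead can outweigh it.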
mzdude | 11/25/2009 7:52:43 PM
On Nov 24, 3:37 pm, Andy Venikov <swojchelo...@gmail.com> wrote:
> coding junkie wrote:
> > I need to parse a file containing about 40,000 records. Each record
> > would look like
>
> > string1|string2|number1|number1|float
>
> > I need to read each field in this string and to process some
> > information based on this data and store all this data in an internal
> > data structure.
>
> > What is the most efficient way to do this? Is it better to use istream
> > and read each line as std::strings and then parse each string? Or
> > should I use char arrays and use C-style parsing??
>
> > Should I first load the file into a buffer and then parse the large
> > buffer? Or parse and process wach line by lineof the file.
>
> > Help!!
>
> It seems like boost::spirit has been popping up in the discussions
> fairly frequently here.
>
> I'd suggest using spirit for your needs here, as it seems the perfect
> fit for it.
>
> Not only will it make your code cleaner but it also has a chance to be
> very efficient.
>
> But please, check out "Using boost::spirit in production code" thread
> from Nov 10th before you decide to use it.
>
> HTH,
> Andy.
I know you're not the only one who suggested boost spirit but I'm
replying to your post.
Personally, I think it's overkill for this case. The fields are
clearly delimited by pipe characters and are strings, ints, and
floats. Tokenizing is easily done with the various find functions
of std::string, and parsing ints and floats is very straightforward
with C++ standard library tools.
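Something like the following sketch is what I have in mind. The Record field names are placeholders, since the original post only says string1|string2|number1|number1|float, and std::stoi/std::stod (C++11) are one choice among the standard conversion tools. Note that they accept trailing characters after the number, so stricter validation would need an extra check.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Placeholder layout for one record: two strings, two ints, one float.
struct Record {
    std::string first;
    std::string second;
    int numberA;
    int numberB;
    double value;
};

// Splits a line on the delimiter using std::string::find.
// Empty fields, including a trailing one, are kept.
std::vector<std::string> tokenize(const std::string& line, char delim = '|') {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));
            return fields;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
}

// Converts one line into a Record; throws std::invalid_argument if the
// field count is wrong or a numeric field does not start with a number.
Record parseRecord(const std::string& line) {
    const std::vector<std::string> f = tokenize(line);
    if (f.size() != 5) throw std::invalid_argument("expected 5 fields: " + line);
    Record r;
    r.first = f[0];
    r.second = f[1];
    r.numberA = std::stoi(f[2]);
    r.numberB = std::stoi(f[3]);
    r.value = std::stod(f[4]);
    return r;
}
```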
HTH
AnonMail2005 | 11/25/2009 7:53:48 PM