Reading a Line from a file

I have a huge CSV file, and I would like to read only specific lines from it.

Is this possible? I have tried fgetl and fscanf, but they don't seem to address the problem very well.

For example, I have about 500,000 rows and an irregular number of columns. I know which row or rows I need the information from. Is there a way I can just jump to that line using any of the built-in commands?

The function fgetl does not work because it starts at the first line and reads on to the next.
Reply smith 8/25/2010 1:20:09 PM

Hi, have you tried csvread() with the Row and Column options specified?

Wayne
Reply Wayne 8/25/2010 1:35:06 PM

Hi Wayne,

Thanks. Unfortunately, the file contains both text and numeric data, and csvread works only on numeric files.
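
For a file that mixes text and numbers, one option that may be worth trying is textscan, which can skip a given number of lines before it starts reading. The following is only a sketch: the function name readLineWithTextscan is made up for illustration, and it assumes the file has at least n lines. textscan still has to scan past the skipped lines internally, so this is not a true jump to the line.

```matlab
function theLine = readLineWithTextscan(fileName, n)
% Return line n of a text file as a char vector (hypothetical helper).
% textscan skips n-1 lines via 'HeaderLines', then reads one line as text.
fid = fopen(fileName, 'r');
if fid == -1
    error('example:openFailed', 'Could not open %s', fileName);
end
% '%s' with a newline delimiter and no whitespace characters keeps the
% whole line, commas and spaces included, as a single string.
C = textscan(fid, '%s', 1, 'Delimiter', '\n', 'HeaderLines', n - 1, ...
    'Whitespace', '');
fclose(fid);
if isempty(C{1})
    theLine = '';        % nothing was read, e.g. the file is too short
else
    theLine = C{1}{1};   % raw text of line n, fields not yet split
end
end
```

The returned text could then be split on commas, for example with regexp(theLine, ',', 'split'), which copes with an irregular number of columns.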
Reply smith 8/25/2010 1:45:21 PM

smith Og wrote:
....

> For example, I have about 500,000 rows and an irregular number of 
> columns. I know which row or rows I need the information from. Is there 
> a way I can just jump to that line using any of the built-in commands?
> 
> The function fgetl does not work because it starts at the first line 
> and reads on to the next.

Well, the thing about sequential files is that they 
are...ummmmmh...._sequential_.

You could play games with assuming an average number of characters/line 
and fseek() based on that then find the next eol and go from there but 
then you're at the mercy of trying to figure out which actual line you 
are on since it is only an average.

If the data file will fit in memory you could suck it all in w/ a binary 
read and search for eol internally instead of reading a line at a 
time, but the file is fundamentally a variable-length record sequential 
file so you're at the mercy of having to deal with it in that form or 
change the file format.
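
As a rough illustration of the read-it-all-and-search idea, here is a minimal sketch. The function name lineFromMemory is hypothetical, and it assumes the file fits in memory and uses LF or CRLF line endings.

```matlab
function theLine = lineFromMemory(fileName, n)
% Read the whole file as bytes, locate the line feeds, and slice out line n.
% Hypothetical helper; assumes the file fits in memory.
fid = fopen(fileName, 'r');
if fid == -1
    error('example:openFailed', 'Could not open %s', fileName);
end
raw = fread(fid, Inf, '*uint8')';      % entire file as a uint8 row vector
fclose(fid);

eolPos    = find(raw == 10);           % byte positions of every line feed
lineStart = [1, eolPos + 1];           % first byte of each line
lineEnd   = [eolPos - 1, numel(raw)];  % last byte of each line, LF excluded

% Note: a trailing newline shows up here as one extra, empty final line.
if n > numel(lineStart)
    error('example:fileTooShort', 'The file did not contain %d lines', n);
end
theLine = char(raw(lineStart(n):lineEnd(n)));
theLine(theLine == 13) = [];           % drop carriage returns, if any
end
```

The file is still read from start to finish; the gain, if any, comes from doing one large read and a vectorized search rather than one fgetl call per line.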

--
Reply dpb 8/25/2010 2:22:29 PM

Are you using a very old version of Matlab on a DEC VAX computer with 
VMS that dates from before OpenVMS, and was the csv file created with a text 
editor or a program that was NOT written in C or C++? If that 
combination of circumstances does not happen to be the case, then sorry, 
your operating system itself has no internal markers as to how long "lines" 
are, and so there is no method other than starting at the beginning and 
reading until you get to the line you want.


[If you do happen to be using an old old Matlab on real VMS not OpenVMS, 
then you might be able to create some MEX that uses VMS's RMS (Record 
Management Services) calls in order to locate a line. There is no 
equivalent service in any modern operating system that I have been 
exposed to.]
Reply Walter 8/25/2010 2:30:02 PM

I'm not at MATLAB right now, but does it really take so long to read through 500k lines when you're not doing anything to most of them?

Could you post the result of the following for your data file:

fileName = 'myFile.csv'; % your file name here
fid = fopen(fileName);
currLine = 0;

tic
while fid ~= -1
  tmp = fgetl(fid);
  currLine = currLine + 1;
end
toc

fclose(fid);
Reply Andy 8/25/2010 2:36:20 PM

Andy, 
the solution is subtle; however, fid ~= -1 does not work. 
If I know the total number of lines, though, then it works:

fileName = 'myFile.csv'; % your file name here
fid = fopen(fileName);
for i = 1:totalNumberOfLines % total number of lines, known in advance
    tmp = fgetl(fid);
end
fclose(fid);

Then I can extract whichever lines I want within the for loop. 

Though I still believe MATLAB should have included a more direct approach to reading lines of interest.
Reply smith 8/25/2010 3:35:08 PM

Whoops.  That's what I get for posting code that I haven't checked in MATLAB.  It should have read something like:

fileName = 'myFile.csv'; % your file name here
fid = fopen(fileName);
currLine = 0;
tmp = fgetl(fid);

tic
while ischar(tmp) % fgetl returns -1 (a number) at end of file
  currLine = currLine + 1; % count the line just read
  tmp = fgetl(fid);
end
toc
toc

fclose(fid);

As for MATLAB including ways to directly read lines of interest, it seems to me there are as many ways as there are in any other programming language.  You can't get around the fact that there is just no way to identify a particular line of a csv file on disk without reading it in sequentially.  What you could do, assuming you have some other program or function that creates your data file regularly, is automatically run a script that reads through the file line by line (as above), extracts the few lines you care about, and stores them separately in a .mat file for your use later.  Of course this won't save time overall, but it will push this extra file reading to a time when you don't have to wait for it.
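
The preprocessing idea could look roughly like the sketch below. The function name extractLinesToMat and the variable names stored in the MAT-file are assumptions made for illustration, not anything from an existing toolbox.

```matlab
function extractLinesToMat(fileName, wantedLines, matFile)
% Scan the file once and save the requested lines to a MAT-file.
% Hypothetical helper; wantedLines is a vector of 1-based line numbers.
fid = fopen(fileName, 'r');
if fid == -1
    error('example:openFailed', 'Could not open %s', fileName);
end
wantedLines = unique(wantedLines(:)');   % sorted row vector, no repeats
savedLines  = cell(size(wantedLines));   % one cell per requested line
currLine = 0;
next     = 1;                            % index of the next line we wait for
tmp = fgetl(fid);
while ischar(tmp) && next <= numel(wantedLines)
    currLine = currLine + 1;
    if currLine == wantedLines(next)
        savedLines{next} = tmp;          % keep the raw text of this line
        next = next + 1;
    end
    tmp = fgetl(fid);
end
fclose(fid);
% Lines past the end of the file are left as empty cells.
save(matFile, 'wantedLines', 'savedLines');
end
```

Later, S = load(matFile) gives back S.wantedLines and S.savedLines without touching the large file again. The loop also stops as soon as the last requested line has been seen, so lines after it are never read.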
Reply Andy 8/25/2010 3:57:04 PM


"smith Og" <adeog@ymail.com> wrote in message 
news:i53d7c$j4$1@fred.mathworks.com...
> Andy, the solution is subtle, however, fid ~= -1 does not work.

That's correct.  You need that for another reason, though.

> however if I know the total number of lines, then it works.
>
> fileName = 'myFile.csv'; % your file name here
> fid = fopen(fileName);
> for i = 1:total number of lines:
> tmp = fgetl(fid)
> fclose(fid)
>
> then I can extract which lines I want, within the for loop.


filename = 'myFile.csv';
[fid, message] = fopen(filename);
if fid == -1
    error('example:failureOpeningFile', 'Error opening file: %s', message);
end
lineToBeRead = 1;
targetLine = 100000;
while ~feof(fid) && lineToBeRead < targetLine
    fgetl(fid); % can throw this line away
    lineToBeRead = lineToBeRead+1;
end

% We reached this point because either:
% 1) We've reached the end of the file
% 2) We've reached the target line
% Handle case 1 here:
if feof(fid)
    fclose(fid);
    error('example:fileTooShort', 'The file did not contain %d lines', targetLine);
end

% If you've reached this point, we're in case 2 and you're now ready to read
% the target line or lines.


-- 
Steve Lord
slord@mathworks.com
comp.soft-sys.matlab (CSSM) FAQ: http://matlabwiki.mathworks.com/MATLAB_FAQ
To contact Technical Support use the Contact Us link on 
http://www.mathworks.com 

Reply Steven_Lord 8/25/2010 4:51:05 PM

Thanks Andy. Deeply appreciated
Reply smith 8/25/2010 4:56:04 PM

smith Og wrote:
....
> though I still believe MATLAB should have included a more direct 
> approach to read lines of interest.

This indicates we have "a failure to communicate" problem....

As noted earlier, a csv file is a variable-length-record _SEQUENTIAL_ 
file.

Describe precisely how you would propose to determine how many bytes 
into that file (an essentially random number) a particular line of 
interest begins.

Once you have come to a conclusion on that question you should realize 
why the function you request doesn't exist.
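
To make the point concrete: the byte offset of each line is only known after the file has been read through once. If the same file is going to be queried many times, that one pass can record the offsets with ftell, and later reads can fseek straight to a line. A sketch under those assumptions, with two hypothetical helpers (buildLineIndex and readLineAt) that would each be saved in their own file:

```matlab
function offsets = buildLineIndex(fileName)
% One sequential pass: offsets(k) is the byte position where line k starts.
fid = fopen(fileName, 'r');
if fid == -1
    error('example:openFailed', 'Could not open %s', fileName);
end
offsets = zeros(1, 1024);      % grown in chunks to avoid per-line resizing
count = 0;
pos = ftell(fid);              % position before reading the first line
tmp = fgetl(fid);
while ischar(tmp)
    count = count + 1;
    if count > numel(offsets)
        offsets(2 * numel(offsets)) = 0;   % double the buffer
    end
    offsets(count) = pos;
    pos = ftell(fid);          % start of the next line
    tmp = fgetl(fid);
end
fclose(fid);
offsets = offsets(1:count);
end

function theLine = readLineAt(fileName, offsets, n)
% Jump straight to line n using a previously built index.
fid = fopen(fileName, 'r');
if fid == -1
    error('example:openFailed', 'Could not open %s', fileName);
end
fseek(fid, offsets(n), 'bof'); % offsets are measured from beginning of file
theLine = fgetl(fid);
fclose(fid);
end
```

The index is only valid as long as the file does not change, and building it costs one full read, so this pays off only when many lookups follow.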

--

Reply dpb 8/25/2010 5:12:00 PM

Thanks dpb, Steven, Walter.

For some reason I did not see your posts before I replied. Thanks for the enlightenment and the possible solution.
Reply smith 8/25/2010 6:47:08 PM

I tested this, and it is true that looping in Matlab to find the line is quite slow. For example, generating a million-line file and fetching the 350,000th line:

---
fileName = 'largefile.txt'; % your file name here
tgt = 350000; %number of the line here

fid = fopen(fileName);
currLine = 1;
tmp = 'a';

tic
while tmp ~= -1
  tmp = fgets(fid);
  
  if currLine == tgt
      break;
  end
  currLine = currLine + 1;
end
toc

fclose(fid);
theline = tmp;
---

This takes about 4 seconds on my computer. If this is the bottleneck in your program then you can always call some external file to do the job. For example this python script reads the nth line in a file:

---
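# arg 1: file name, arg 2: line number
import sys

try:
    linenum = int(sys.argv[2])
    f = open(sys.argv[1], 'rt')
    idx = 0
    theline = 1
    # stop at the requested line, or at end of file (readline returns '')
    while idx < linenum and theline:
        theline = f.readline()
        idx += 1
    print(theline)  # parentheses so this runs on Python 2 and Python 3
    sys.exit(0)
except Exception:
    print("Error")
    sys.exit(1)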
---

This script, saved as readnthline.py, can be run from matlab like so:
---
[flag,theline] = unix(sprintf('python readnthline.py %s %d',fileName,tgt));
---

This works just like the matlab script but on my computer takes only 0.2 seconds. Note that this will work unmodified on Linux and probably on Mac. On Windows you would have to have python installed and on your path and call dos instead of unix.
Reply Patrick 8/25/2010 6:51:04 PM

On 10-08-25 01:51 PM, Patrick Mineault wrote:

> This takes about 4 seconds on my computer. If this is the bottleneck in
> your program then you can always call some external file to do the job.
> For example this python script reads the nth line in a file:

The equivalent perl script would be

$linenum = pop(@ARGV);
while (<>) { $. == $linenum && chomp && print && exit 0 }
exit 1

The script, saved as readnthline.pl, can be run from matlab like so:

fileName = 'largefile.txt'; % your file name here
tgt = 350000; %number of the line here

[theline,flag] = perl(fileName, tgt);
Reply Walter 8/25/2010 7:07:43 PM

And, of course, Steven's solution encapsulated _IS_ that function... :)

It's just that TMW didn't ship it w/ base Matlab leaving it as an 
"exercise for the student".  <vbg>
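
For what that exercise might look like, here is a minimal sketch that wraps the skip loop posted earlier in the thread into a function. The name readNthLine and the choice to return the line as raw text are assumptions, not part of MATLAB.

```matlab
function theLine = readNthLine(fileName, targetLine)
% Return line number targetLine of a text file as a char vector.
% Hypothetical wrapper around the skip loop shown earlier in the thread.
[fid, message] = fopen(fileName);
if fid == -1
    error('example:failureOpeningFile', 'Error opening file: %s', message);
end
lineToBeRead = 1;
while ~feof(fid) && lineToBeRead < targetLine
    fgetl(fid);                        % skip lines before the target
    lineToBeRead = lineToBeRead + 1;
end
if feof(fid)
    % Ran out of file before reaching the target line.
    fclose(fid);
    error('example:fileTooShort', 'The file did not contain %d lines', targetLine);
end
theLine = fgetl(fid);                  % the line of interest
fclose(fid);
end
```

A call such as readNthLine('myFile.csv', 350000) would then return that row as text, ready to be split on commas. It still reads every earlier line, so the cost grows with the line number.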

--
Reply dpb 8/25/2010 7:08:32 PM

+2 geek points for perl. However the matlab part isn't right, it's actually:

 [theline,flag]=perl('readnthline.pl',fileName,num2str(tgt));

And that takes about .13 s versus .22 s for my Python solution. However the following mex file with some inline assembly is even faster ... (just joking)
Reply Patrick 8/25/2010 7:30:11 PM

On 10-08-25 02:30 PM, Patrick Mineault wrote:

> +2 geek points for perl. However the matlab part isn't right, it's
> actually:
>
> [theline,flag]=perl('readnthline.pl',fileName,num2str(tgt));

You are right, and I didn't know before that MATLAB perl() calls only accept 
strings... I guess I just never happened to run into that.

Normally if I have a reason to invoke perl, I do so outside of Matlab, 
probably doing all the pre-processing in one fell swoop and leaving a 
parsing-friendly input file with just the information I need.


Side note: perl is included with the Matlab distribution, which can make it 
handy for odd jobs like this one that aren't worth a serious invocation of python.
Reply Walter 8/25/2010 7:45:32 PM
