Parsing a potentially corrupted file

I'm looking for a reasonably "clean" way to parse a log file that potentially has incomplete records in it.

The basic structure of the file is a set of multi-line records. Each record starts with a series of fields delimited by [...] (the first of which is always a date), optionally separated by whitespace. Then there's a trailing "free text" field, optionally followed by a multi-line field delimited by [[...]]

So, example records might be

[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here

(a record delimited by the end of the line)

or

[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here [[Additional
data, potentially multiple lines

including blank lines
goes here
]]

The terminating ]] is on a line of its own.
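
(For concreteness, a well-formed record header can be matched with something like the sketch below - the field names are just labels I've picked for illustration, and the assumption of exactly six bracketed fields is a guess from the examples above:)

import re

RECORD_RE = re.compile(
    r"\[(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2})\]"
    r"\s*\[(?P<component>[^\]]*)\]"
    r"\s*\[(?P<level>[^\]]*)\]"
    r"\s*\[(?P<field4>[^\]]*)\]"
    r"\s*\[(?P<field5>[^\]]*)\]"
    r"\s*\[(?P<id>[^\]]*)\]"
    r"\s*(?P<text>.*)$")

m = RECORD_RE.match("[2016-11-30T20:04:08.000+00:00] [Component] [level] "
                    "[] [] [id] Description of the issue goes here")
print(m.group("timestamp"), m.group("component"), m.group("text"))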

This is a messy format to parse, but it's manageable. However, there's a catch. Because the logging software involved is broken, I can occasionally get a log record prematurely terminated with a new record starting mid-stream. So something like the following:

[2016-11-30T20:04:08.000+00:00] [Component] [le[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here

I'm struggling to find a "clean" way to parse this. I've managed a clumsy approach, by splitting the file contents on the pattern [ddd-dd-ddTdd:dd:dd.ddd+dd:dd] (the timestamp - I've never seen a case where this gets truncated) and then treating each entry as a record and parsing it individually. But the resulting code isn't exactly maintainable, and I'm looking for something cleaner.
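
(For reference, that splitting approach boils down to something like this sketch - an untested simplification of what I actually have:)

import re

TIMESTAMP_RE = re.compile(
    r"(\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}\])")

def split_records(text):
    # re.split with a capturing group keeps the timestamps in the result,
    # giving [before-first, ts1, rest1, ts2, rest2, ...]; stitch the pairs
    # back together so each chunk is one (possibly truncated) record.
    parts = TIMESTAMP_RE.split(text)
    return [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]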

Does anyone have any suggestions for a good way to parse this data?

Thanks,
Paul
Paul
12/14/2016 11:43:44 AM

On Wed, Dec 14, 2016 at 10:43 PM, Paul  Moore <p.f.moore@gmail.com> wrote:
> This is a messy format to parse, but it's manageable. However, there's a catch. Because the logging software involved is broken, I can occasionally get a log record prematurely terminated with a new record starting mid-stream. So something like the following:
>
> [2016-11-30T20:04:08.000+00:00] [Component] [le[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here
>
> I'm struggling to find a "clean" way to parse this. I've managed a clumsy approach, by splitting the file contents on the pattern [ddd-dd-ddTdd:dd:dd.ddd+dd:dd] (the timestamp - I've never seen a case where this gets truncated) and then treating each entry as a record and parsing it individually. But the resulting code isn't exactly maintainable, and I'm looking for something cleaner.
>

Is the "[Component]" section something you could verify? (That is - is
there a known list of components?) If so, I would include that as a
secondary check. Ditto anything else you can check (I'm guessing the
[level] is one of a small set of values too.) The logic would be
something like this:

Read line from file.
Verify line as a potential record:
    Assert that line begins with timestamp.
    Verify as many fields as possible (component, level, etc)
    Search line for additional timestamp.
    If additional timestamp found:
        Recurse. If verification fails, assume we didn't really have a
corrupted line.
        (Process partial line? Or discard?)
    If "[[" in line:
        Until line is "]]":
            Read line from file, append to description
            If timestamp found:
                Recurse. If verification succeeds, break out of loop.

 Unfortunately it's still not really clean; but that's the nature of
working with messy data. Coping with ambiguity is *hard*.
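
A rough (and untested) Python rendering of the verification part, with KNOWN_COMPONENTS / KNOWN_LEVELS standing in for whatever values you can actually pin down:

import re

# Stand-in sets; fill these with whatever you actually know about the format.
KNOWN_COMPONENTS = {"Component"}
KNOWN_LEVELS = {"level", "info", "warning", "error"}

TIMESTAMP_RE = re.compile(
    r"\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}\]")
FIELD_RE = re.compile(r"\s*\[([^\]]*)\]")

def looks_like_record_start(line):
    # True if the line plausibly begins a new record: it starts with a
    # timestamp and any fields we can see have believable values.
    m = TIMESTAMP_RE.match(line)
    if not m:
        return False
    fields = FIELD_RE.findall(line[m.end():])
    if len(fields) >= 2:
        return fields[0] in KNOWN_COMPONENTS and fields[1] in KNOWN_LEVELS
    return True  # too little left to check; give it the benefit of the doubt

def embedded_restart(line):
    # Position of a second timestamp inside the line (the corruption case),
    # or -1 if there isn't one.
    m = TIMESTAMP_RE.search(line, 1)
    return m.start() if m else -1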

ChrisA
Chris
12/14/2016 12:57:03 PM
On Wed, 14 Dec 2016 03:43:44 -0800, Paul  Moore wrote:

> I'm looking for a reasonably "clean" way to parse a log file that
> potentially has incomplete records in it.
> 
> The basic structure of the file is a set of multi-line records. Each
> record starts with a series of fields delimited by [...] (the first of
> which is always a date), optionally separated by whitespace. Then
> there's a trailing "free text" field, optionally followed by a
> multi-line field delimited by [[...]]
> 
> So, example records might be
> 
> [2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id]
> Description of the issue goes here
> 
> (a record delimited by the end of the line)
> 
> or
> 
> [2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id]
> Description of the issue goes here [[Additional data, potentially
> multiple lines
> 
> including blank lines goes here ]]
> 
> The terminating ]] is on a line of its own.
> 
> This is a messy format to parse, but it's manageable. However, there's a
> catch. Because the logging software involved is broken, I can
> occasionally get a log record prematurely terminated with a new record
> starting mid-stream. So something like the following:
> 
> [2016-11-30T20:04:08.000+00:00] [Component]
> [le[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id]
> Description of the issue goes here
> 
> I'm struggling to find a "clean" way to parse this. I've managed a
> clumsy approach, by splitting the file contents on the pattern
> [ddd-dd-ddTdd:dd:dd.ddd+dd:dd] (the timestamp - I've never seen a case
> where this gets truncated) and then treating each entry as a record and
> parsing it individually. But the resulting code isn't exactly
> maintainable, and I'm looking for something cleaner.
> 
> Does anyone have any suggestions for a good way to parse this data?
> 
> Thanks,
> Paul

First question: do you (or anyone you can contact) have any control over the logging application?

If so, the best approach would be to get the log file output fixed.

If not, then you will probably be stuck with a messy solution :-(



-- 
Sin has many tools, but a lie is the handle which fits them all.
alister
12/14/2016 1:38:38 PM
On Wednesday, 14 December 2016 12:57:23 UTC, Chris Angelico  wrote:
> Is the "[Component]" section something you could verify? (That is - is
> there a known list of components?) If so, I would include that as a
> secondary check. Ditto anything else you can check (I'm guessing the
> [level] is one of a small set of values too.)

Possibly, although this is to analyze the structure of a basically undocumented log format. So if I validate too tightly, I end up just checking my assumptions rather than checking the data :-(

> The logic would be
> something like this:
>
> Read line from file.
> Verify line as a potential record:
>     Assert that line begins with timestamp.
>     Verify as many fields as possible (component, level, etc)
>     Search line for additional timestamp.
>     If additional timestamp found:
>         Recurse. If verification fails, assume we didn't really have a
> corrupted line.
>         (Process partial line? Or discard?)
>     If "[[" in line:
>         Until line is "]]":
>             Read line from file, append to description
>             If timestamp found:
>                 Recurse. If verification succeeds, break out of loop.
>
>  Unfortunately it's still not really clean; but that's the nature of
> working with messy data. Coping with ambiguity is *hard*.

Yeah, that's essentially what I have now. As I say, it's working but nobody could really love it. But you're right, it's more the fault of the data than of the code.

One thought I had, which I might try, is to go with the timestamp as the one assumption I make of the data, and read the file in as, in effect, a text stream, spitting out a record every time I see something matching the [timestamp] pattern. Then parse record by record. Truncated records should either be obvious (because the delimited fields have start and end markers, so unmatched markers = truncated record) or acceptable (because undelimited fields are free text). I'm OK with ignoring the possibility that the free text contains something that looks like a timestamp.

The only problem with this approach is that I have more data than I'd really like to read into memory all at once, so I'd need to do some sort of streamed match/split processing. But thinking about it, that sounds like the sort of job a series of chained generators could manage. Maybe I'll look at that approach...
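
(Sketching it out, the generator end of that might look something like the following - completely untested, and the per-record parser it feeds is hypothetical:)

import re

TIMESTAMP_RE = re.compile(
    r"\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}\]")

def records(lines):
    # Generator: accumulate lines into the current record and start a new
    # one at every [timestamp], even one appearing mid-line (which is how
    # the prematurely terminated records show up).
    current = []
    for line in lines:
        starts = [m.start() for m in TIMESTAMP_RE.finditer(line)]
        if not starts:
            current.append(line)
            continue
        # Anything before the first timestamp still belongs to the
        # previous record.
        current.append(line[:starts[0]])
        if any(part.strip() for part in current):
            yield "".join(current)
        # Timestamps before the last one delimit records contained
        # entirely within this line (the truncated case).
        for begin, end in zip(starts, starts[1:]):
            yield line[begin:end]
        current = [line[starts[-1]:]]
    if any(part.strip() for part in current):
        yield "".join(current)

# Usage sketch - parse_one_record is hypothetical:
# with open("server.log") as f:
#     for rec in records(f):
#         parse_one_record(rec)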

Paul
Paul
12/14/2016 2:07:27 PM
On 2016-12-14 11:43, Paul Moore wrote:
> I'm looking for a reasonably "clean" way to parse a log file that potentially has incomplete records in it.
>
> The basic structure of the file is a set of multi-line records. Each record starts with a series of fields delimited by [...] (the first of which is always a date), optionally separated by whitespace. Then there's a trailing "free text" field, optionally followed by a multi-line field delimited by [[...]]
>
> So, example records might be
>
> [2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here
>
> (a record delimited by the end of the line)
>
> or
>
> [2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here [[Additional
> data, potentially multiple lines
>
> including blank lines
> goes here
> ]]
>
> The terminating ]] is on a line of its own.
>
> This is a messy format to parse, but it's manageable. However, there's a catch. Because the logging software involved is broken, I can occasionally get a log record prematurely terminated with a new record starting mid-stream. So something like the following:
>
> [2016-11-30T20:04:08.000+00:00] [Component] [le[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here
>
> I'm struggling to find a "clean" way to parse this. I've managed a clumsy approach, by splitting the file contents on the pattern [ddd-dd-ddTdd:dd:dd.ddd+dd:dd] (the timestamp - I've never seen a case where this gets truncated) and then treating each entry as a record and parsing it individually. But the resulting code isn't exactly maintainable, and I'm looking for something cleaner.
>
> Does anyone have any suggestions for a good way to parse this data?
>
I think I'd do something like this:

class TruncatedError(Exception):
    # Raised when a record looks as if it was cut short by the start of
    # the next record.
    pass

while have_more(input):
    # At the start of a record.
    timestamp = parse_timestamp(input)

    fields = []
    description = None
    additional = None

    try:
        for i in range(5):
            # A field shouldn't contain a '[', so if it sees one, it'll
            # push it back and return True for truncated.
            field, truncated = parse_field(input)
            fields.append(field)

            if truncated:
                raise TruncatedError()

        # The description shouldn't contain a timestamp, but if it does,
        # it'll push it back from that point and return True for truncated.
        description, truncated = parse_description(input)

        if truncated:
            raise TruncatedError()

        # The additional information shouldn't contain a timestamp, but if
        # it does, it'll push it back from that point and return True for
        # truncated.
        additional, truncated = parse_additional_information(input)

        if truncated:
            raise TruncatedError()
    except TruncatedError:
        process_record(timestamp, fields, description, additional,
                       truncated=True)
    else:
        process_record(timestamp, fields, description, additional)

MRAB
12/14/2016 7:39:40 PM
Paul  Moore <p.f.moore@gmail.com> writes:
> I'm looking for a reasonably "clean" way to parse a log file that
> potentially has incomplete records in it.

Basically trial and error.  Code something reasonable, run your program
til it crashes on a record that it doesn't know what to do with, add
code to deal with that, rinse and repeat.  I've done this kind of thing
multiple times.  You tend to get exponentially further along with each
run/crash/fix iteration til you get most of everything, though there
might still be an occasional hopeless record that you have to just log.
Paul
12/14/2016 10:43:53 PM