nawk: out of space in tostring on ... #2

  • Follow


I have a flat file with over 3 million rows. File is being processed
with nawk program that does some merging on the file and outputs the
result in a different file. After processing around 2.5 million rows,
nawk fails with:

nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||

What can cause this error? Looking at the source code, the error is
caused by failed C malloc function to obtain memory. I am using awk
arrays to do the merging - I am not sure if that has anything to do
with it.

By the way, nawk is run on Sun Solaris 8. Any help is appreciated.

0
Reply mmedic123 (2) 8/2/2007 5:51:57 PM

Mensur wrote:
> I have a flat file with over 3 million rows. File is being processed
> with nawk program that does some merging on the file and outputs the
> result in a different file. After processing around 2.5 million rows,
> nawk fails with:
> 
> nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
> 35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
> 
> What can cause this error? Looking at the source code, the error is
> caused by failed C malloc function to obtain memory. I am using awk
> arrays to do the merging - I am not sure if that has anything to do
> with it.
> 
> By the way, nawk is run on Sun Solaris 8. Any help is appreciated.
> 

It's hard to guess without seeing your awk script. In the meantime, try 
switching to gawk...

	Ed.
0
Reply Ed 8/2/2007 6:17:13 PM


On Aug 2, 2:17 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
> Mensur wrote:
> > I have a flat file with over 3 million rows. File is being processed
> > with nawk program that does some merging on the file and outputs the
> > result in a different file. After processing around 2.5 million rows,
> > nawk fails with:
>
> > nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
> > 35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
>
> > What can cause this error? Looking at the source code, the error is
> > caused by failed C malloc function to obtain memory. I am using awk
> > arrays to do the merging - I am not sure if that has anything to do
> > with it.
>
> > By the way, nawk is run on Sun Solaris 8. Any help is appreciated.
>
> It's hard to guess without seeing your awk script. In the meantime, try
> switching to gawk...
>
>         Ed.- Hide quoted text -
>
> - Show quoted text -

I am afraid switching to gawk is not possible as it is not installed
in our production version of Solaris 8 and I have no influence over
it. I will post the script which is quite complicated but here is the
gist:
There are two files that are read in - one is a list file of clients
who have a unique id but can have multiple rows per client (think
transactional client data like accounts, etc.) and a second file which
is a lookup file. They are both sorted so the IDs are in sync. For
every client in list file there is an entry in the lookup file. I use
entries in lookup file to merge multiple client rows into one row. For
example, list file:

UNIQUE ID | ACCOUNT TYPE #1| Account balance for ACCOUNT TYPE #1 |
ACCOUNT TYPE #1 | Account balance for ACCOUNT TYPE #1
------------------------------------
1|123| 150.01 ||||
1|124| 521.00 ||||
1|125| 587.00 ||||
1|||256| 5.00 ||
1|||257| 5.68 ||

This looks like typical transactional data, where each transactional
record is put on a different line.

Lookup file:
UNIQUE ID | ACCOUNT TYPE #1 | ACCOUNT TYPE #2
-------------------------------
1|125|257

Since one client can have multiple accounts for example, lookup tells
me which transactional record (account) to keep based on some business
rule such as keep account with highest balance. I read in all records
belonging to one client (when UNIQUE ID changes, I know I have all
records belonging to one client), I use information in the lookup file
to merge the records into one record. For example, result of above
files is:

UNIQUE ID | ACCOUNT TYPE #1| Account balance for ACCOUNT TYPE #1 |
ACCOUNT TYPE #1 | Account balance for ACCOUNT TYPE #1
------------------------------------
1|125| 587.00 |257| 5.68


I use nawk for this and read all client records into an array and then
merge them, but then, I do not clear (delete) the array but simply
reuse it for the next record (I keep track at all times how many
records are in the array). So I am wondering if that causes nawk not
to release the memory from the array and have a memory leak eventually
leading to memory exhaustion.

I said that the script is quite complicated and is around 700 lines so
I don't think that anyone want to read through that. Here is the
scaled down version:


BEGIN{
   # Set initial values of variables
   OFS=FS;
   ID_POS=1
   CURR_ID = -1;
   PREV_ID= - 1;
   DATA_SIZE = 0;
   DATA[1] = "";
}
{
   # Get line that was read in
   CURRENT_LINE = $0

   # Get current CINERGY ID
   CURR_ID = $CIN_POS

   # If Cinergy IDs do not match, we have read all records that belong
to a single client - MERGE
   if (PREV_ID != CURR_ID)
   {
      # Merge using lookup file
      merge()

      # Reset ID
      PREV_ID = CURR_ID
   }

    # Add currently read line to list of records that belong to same
client
    ++DATA_SIZE
    DATA[DATA_SIZE] = CURRENT_LINE
}
function merge()
{
   # Merging logic here.....

   # Once merging is done print merged record to file
   print DATA[1]

   # Reset data size to 0 to reuse the array
   DATA_SIZE = 0
}



0
Reply Mensur 8/2/2007 8:10:51 PM

Mensur wrote:
> On Aug 2, 2:17 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
>> Mensur wrote:
>>> I have a flat file with over 3 million rows. File is being processed
>>> with nawk program that does some merging on the file and outputs the
>>> result in a different file. After processing around 2.5 million rows,
>>> nawk fails with:
>>> nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
>>> 35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
>>> What can cause this error? Looking at the source code, the error is
>>> caused by failed C malloc function to obtain memory. I am using awk
>>> arrays to do the merging - I am not sure if that has anything to do
>>> with it.

It is not surprising that you script fails when
processing large files. It accumulates all lines
in the memory of the AWK intepreter:

> {
....
> 
>     # Add currently read line to list of records that belong to same
> client
>     ++DATA_SIZE
>     DATA[DATA_SIZE] = CURRENT_LINE
> }

Try to avoid storing all lines.
Ed's recommended using gawk. Using gawk when processing
large files is generally a good idea. gawk avoids memory
limitations where most other implementations _have_ limitations.
0
Reply Juergen 8/3/2007 7:18:25 AM

On Aug 3, 3:18 am, Juergen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
> Mensur wrote:
> > On Aug 2, 2:17 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
> >> Mensur wrote:
> >>> I have a flat file with over 3 million rows. File is being processed
> >>> with nawk program that does some merging on the file and outputs the
> >>> result in a different file. After processing around 2.5 million rows,
> >>> nawk fails with:
> >>> nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
> >>> 35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
> >>> What can cause this error? Looking at the source code, the error is
> >>> caused by failed C malloc function to obtain memory. I am using awk
> >>> arrays to do the merging - I am not sure if that has anything to do
> >>> with it.
>
> It is not surprising that you script fails when
> processing large files. It accumulates all lines
> in the memory of the AWK intepreter:
>
>
>
> > {
> ...
>
> >     # Add currently read line to list of records that belong to same
> > client
> >     ++DATA_SIZE
> >     DATA[DATA_SIZE] = CURRENT_LINE
> > }
>
> Try to avoid storing all lines.
> Ed's recommended using gawk. Using gawk when processing
> large files is generally a good idea. gawk avoids memory
> limitations where most other implementations _have_ limitations.

It is strange to me that it is happening because I do reset the
DATA_SIZE varibale to 0 after accumulating all records belonging to
one client. So in essence, I allocate maybe at most 50 array elements
(lines in the file) in memory and keep overwriting them. Therefore,
the memory should not grow more than it takes to store those 50 lines.
However, I looked at memory usage while the process is running, and it
keep growing constantly.

I would love to use gawk, but like I mentioned, our compnay does not
have it installed and I have no say so in that.

0
Reply Mensur 8/3/2007 1:03:47 PM

Mensur wrote:
> On Aug 3, 3:18 am, Juergen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
> wrote:
> 
>>Mensur wrote:
>>
>>>On Aug 2, 2:17 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
>>>
>>>>Mensur wrote:
>>>>
>>>>>I have a flat file with over 3 million rows. File is being processed
>>>>>with nawk program that does some merging on the file and outputs the
>>>>>result in a different file. After processing around 2.5 million rows,
>>>>>nawk fails with:
>>>>>nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
>>>>>35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
>>>>>What can cause this error? Looking at the source code, the error is
>>>>>caused by failed C malloc function to obtain memory. I am using awk
>>>>>arrays to do the merging - I am not sure if that has anything to do
>>>>>with it.
>>
>>It is not surprising that you script fails when
>>processing large files. It accumulates all lines
>>in the memory of the AWK intepreter:
>>
>>
>>
>>
>>>{
>>
>>...
>>
>>
>>>    # Add currently read line to list of records that belong to same
>>>client
>>>    ++DATA_SIZE
>>>    DATA[DATA_SIZE] = CURRENT_LINE
>>>}
>>
>>Try to avoid storing all lines.
>>Ed's recommended using gawk. Using gawk when processing
>>large files is generally a good idea. gawk avoids memory
>>limitations where most other implementations _have_ limitations.
> 
> 
> It is strange to me that it is happening because I do reset the
> DATA_SIZE varibale to 0 after accumulating all records belonging to
> one client. So in essence, I allocate maybe at most 50 array elements
> (lines in the file) in memory and keep overwriting them. Therefore,
> the memory should not grow more than it takes to store those 50 lines.
> However, I looked at memory usage while the process is running, and it
> keep growing constantly.
> 
> I would love to use gawk, but like I mentioned, our compnay does not
> have it installed and I have no say so in that.
> 

Can you come up with a small, complete script that reproduces the 
problem and post that?

	Ed.
0
Reply Ed 8/3/2007 1:38:54 PM

Mensur wrote:

> It is strange to me that it is happening because I do reset the
> DATA_SIZE varibale to 0 after accumulating all records belonging to
> one client. So in essence, I allocate maybe at most 50 array elements
> (lines in the file) in memory and keep overwriting them. Therefore,
> the memory should not grow more than it takes to store those 50 lines.

Yes, your script really looks like it should not
run into memory problems.

Try this:

function merge()
{
   # Merging logic here.....

   # Once merging is done print merged record to file
   print DATA[1]

   # Reset data size to 0 to reuse the array
   while (DATA_SIZE > 0)
      delete data[DATA_SIZE--]
}

> However, I looked at memory usage while the process is running, and it
> keep growing constantly.

Sounds like a memory leak. Consider using valgrind
to locate it.

> I would love to use gawk, but like I mentioned, our compnay does not
> have it installed and I have no say so in that.

You could build gawk locally in your home directory.
Just for comparison.
0
Reply ISO 8/3/2007 1:47:37 PM

On Aug 3, 9:47 am, J=FCrgen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
> Mensur wrote:
> > It is strange to me that it is happening because I do reset the
> > DATA_SIZE varibale to 0 after accumulating all records belonging to
> > one client. So in essence, I allocate maybe at most 50 array elements
> > (lines in the file) in memory and keep overwriting them. Therefore,
> > the memory should not grow more than it takes to store those 50 lines.
>
> Yes, your script really looks like it should not
> run into memory problems.
>
> Try this:
>
> function merge()
> {
>    # Merging logic here.....
>
>    # Once merging is done print merged record to file
>    print DATA[1]
>
>    # Reset data size to 0 to reuse the array
>    while (DATA_SIZE > 0)
>       delete data[DATA_SIZE--]
>
> }
> > However, I looked at memory usage while the process is running, and it
> > keep growing constantly.
>
> Sounds like a memory leak. Consider using valgrind
> to locate it.
>
> > I would love to use gawk, but like I mentioned, our compnay does not
> > have it installed and I have no say so in that.
>
> You could build gawk locally in your home directory.
> Just for comparison.

Thanks for all your help. I tried using this
for (ARRAY_INDEX in DATA)
{
    delete DATA[ARRAY_INDEX]
}

with no success. Memory still keeps growing. I can try using gawk to
see if there is difference but that really doesn't help me. Even if
the problem does not occur with gawk, I still cannot use it.

0
Reply Mensur 8/3/2007 2:59:30 PM

On Aug 3, 9:38 am, Ed Morton <mor...@lsupcaemnt.com> wrote:
> Mensur wrote:
> > On Aug 3, 3:18 am, Juergen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
> > wrote:
>
> >>Mensur wrote:
>
> >>>On Aug 2, 2:17 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
>
> >>>>Mensur wrote:
>
> >>>>>I have a flat file with over 3 million rows. File is being processed
> >>>>>with nawk program that does some merging on the file and outputs the
> >>>>>result in a different file. After processing around 2.5 million rows,
> >>>>>nawk fails with:
> >>>>>nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
> >>>>>35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
> >>>>>What can cause this error? Looking at the source code, the error is
> >>>>>caused by failed C malloc function to obtain memory. I am using awk
> >>>>>arrays to do the merging - I am not sure if that has anything to do
> >>>>>with it.
>
> >>It is not surprising that you script fails when
> >>processing large files. It accumulates all lines
> >>in the memory of the AWK intepreter:
>
> >>>{
>
> >>...
>
> >>>    # Add currently read line to list of records that belong to same
> >>>client
> >>>    ++DATA_SIZE
> >>>    DATA[DATA_SIZE] = CURRENT_LINE
> >>>}
>
> >>Try to avoid storing all lines.
> >>Ed's recommended using gawk. Using gawk when processing
> >>large files is generally a good idea. gawk avoids memory
> >>limitations where most other implementations _have_ limitations.
>
> > It is strange to me that it is happening because I do reset the
> > DATA_SIZE varibale to 0 after accumulating all records belonging to
> > one client. So in essence, I allocate maybe at most 50 array elements
> > (lines in the file) in memory and keep overwriting them. Therefore,
> > the memory should not grow more than it takes to store those 50 lines.
> > However, I looked at memory usage while the process is running, and it
> > keep growing constantly.
>
> > I would love to use gawk, but like I mentioned, our compnay does not
> > have it installed and I have no say so in that.
>
> Can you come up with a small, complete script that reproduces the
> problem and post that?
>
>         Ed.- Hide quoted text -
>
> - Show quoted text -

Here is the whole script. Like I said, it is really long. Again, thank
you guys for helping.


BEGIN{
   # Set initial values of variables
   OFS=FS;
   ERROR_CD = 0;
   CIN_POS = -1;
   MG_ID_POS = -1;
   CG_ID_POS = -1;
   AMS_CG_ID_POS = -1;
   AMS_FIRST_NAME_POS = -1;
   CURR_CIN_ID = -1;
   PREV_CIN_ID= - 1;
   DATA_SIZE = 0;
   DATA[1] = "";
   LOOKUP_CIN_ID = -1;
   LOOKUP_MG_ID = -1;
   LOOKUP_CG_ID = -1;

   # Process header record to read in header record, find positions of
variables, output header record and read in next line
   if (process_header_record() != 0)
   {
      log_message("Error inherited from process_header_record
function.")
      ERROR_CD = 1;
      exit 1
   }
}
#######################################################################
# MAIN PROCESSING
#######################################################################
{
   # IF error occured in BEGIN section, detect it here
   if (ERROR_CD != 0)
   {
      log_message("Error occured in BEGIN section. Exiting ....")
      exit 1
   }

   # Skip empty lines
   if (trim($0) == "")
   {
      log_message("Found empty record: " FNR ". Skipping.")
      next
   }

   # Log which record we are processing
   if (FNR % 10000 == 3)
   {
      log_message("Processing record: " FNR)
   }

   # Get line that was read in
   CURRENT_LINE = $0

   # Get current CINERGY ID
   CURR_CIN_ID = $CIN_POS

   # If Cinergy IDs do not match, we have read all records that belong
to a single client - MERGE
   if (PREV_CIN_ID != CURR_CIN_ID)
   {
      # IF either MG ID or CG ID are included on the file, we will use
lookup file to find
      # MG ID with earliest establishment date and CG ID with hjighest
assets
      if (MG_ID_POS != -1 || CG_ID_POS != -1)
      {
         # Merge using lookup file
         if(dedup_merge() != 0)
         {
            log_message("Error inherited from dedup_merge function.")
            ERROR_CD = 1
            exit 1
         }
      }
      else
      {
         # Merge by picking first valid value for each column
         if(merge_first_valid_value() != 0)
         {
            log_message("Error inherited from merge_first_valid_value
function.")
            ERROR_CD = 1
            exit 1
         }
      }

      # Reset Cinergy ID
      PREV_CIN_ID = CURR_CIN_ID
   }

    # Add currently read line to list of records that belong to same
client
    ++DATA_SIZE
    DATA[DATA_SIZE] = CURRENT_LINE
}
# END SECTION
END{
   # Detect error from MAIN section
   if (ERROR_CD != 0)
   {
      log_message("Error occured in MAIN section. Exiting...")
      exit 1
   }

   # Merge last client
   log_message("Merging last client...")

   # IF either MG ID or CG ID are included on the file, we will use
lookup file to find
   # MG ID with earliest establishment date and CG ID with hjighest
assets
   if (MG_ID_POS != -1 || CG_ID_POS != -1)
   {
      if(dedup_merge() != 0)
      {
         log_message("Error inherited from dedup_merge function.")
         exit 1
      }
   }
   else
   {
      # Merge by picking first valid value for each column
      if(merge_first_valid_value() != 0)
      {
         log_message("Error inherited from merge_first_valid_value
function.")
         exit 1
      }
   }
}

###########################################################################################################
#
# Function name:    process_header_record
#
# Purpose:          Function reads in the header record, extracts
field positions from the header, outputs header
#                   record to an output file. Then it reads another
record, initilized DATA array, DATA_SIZE
#                   and CURR_CIN_ID and PREV_CIN_ID variables.
#
# Parameters:       None
#
# Return:           Function returns 0 on success and 1 on failure
#
# Side effects:     $0, NF, FNR are reset
#
###########################################################################################################
function process_header_record()
{
   # Explicitly read in header record from campaign file
   RETURN_CODE = getline
   if (RETURN_CODE <= 0)
   {
     log_message("There are no more campaign records to read or error
reading campaign file!")
     return 1
   }

   # Get position of Cinergy Client Id field
   CIN_POS = get_field_pos("Cinergy Client Id")
   if (CIN_POS == -1)
   {
      log_message("Error trying to get position of Cinergy Client Id")
      return 1
   }

   # Get position of Marketing Group Identifier
   MG_ID_POS = get_field_pos("Marketing Group Identifier")
   if (MG_ID_POS == -1)
   {
      log_message("MG ID not present on the file.")
   }

   # Get position of Client Group ID
   Marketing Group Identifier
   CG_ID_POS = get_field_pos("Client Group ID")
   if (CG_ID_POS == -1)
   {
      log_message("CG ID not present on the file.")
   }

   # Get position of AMS Client Group ID
   AMS_CG_ID_POS = get_field_pos("AMS Client Group ID")
   if (AMS_CG_ID_POS == -1)
   {
      log_message("AMS CG ID not present on the file.")
   }

   # Get position of AMS Crew First Name
   AMS_FIRST_NAME_POS = get_field_pos("AMS Crew First Name")
   if (AMS_FIRST_NAME_POS == -1)
   {
      log_message("AMS Manager First Name not present on the file.")
   }

   # Output header record
   print $0 > OUT_FILE

   # Read in first client record from campaign file
   RETURN_CODE = getline
   if (RETURN_CODE <= 0)
   {
     log_message("There are no more campaign records to read or error
reading campaign file!")
     return 1
   }

   # Set variables
   CURR_CIN_ID = $CIN_POS
   PREV_CIN_ID = $CIN_POS
   DATA_SIZE=1
   DATA[1] = $0
}

###########################################################################################################
#
# Function name:    get_field_pos
#
# Purpose:          Function retruns the position of field FIELD_TEXT
from $0 or -1 if the field doesn't exist.
#
# Parameters:       FIELD_TEXT
#
# Return:           Function returns 0 on success and 1 on failure
#
# Side effects:     $0 must be set
#
###########################################################################################################
function get_field_pos(FIELD_TEXT)
{
   # Loop for each field in $0 and find position of field with text
FIELD_TEXT
   for (i = 1; i <= NF; ++i)
   {
      # If field exists, return position
      if (trim($i) == FIELD_TEXT)
      {
         return i
      }
   }
   return -1
}

###########################################################################################################
#
# Function name:    find_first_non_empty_line
#
# Purpose:          Function returns first non-empty line in DATA
array or empty string if all lines are
#                   empty.
#
# Parameters:       None
#
# Return:           Function returns 0 on success and 1 on failure
#
# Side effects:     DATA and DATA_SIZE must be set
#
###########################################################################################################
function find_first_non_empty_line()
{
   # Find first non-empty line
   for (i = 1; i <= DATA_SIZE; ++i)
   {
      if (DATA[i] != "")
      {
         return DATA[i]
      }
   }
   return ""
}

###########################################################################################################
#
# Function name:    merge_first_valid_value
#
# Purpose:          Merge columns in DATA by taking the first value
that is not empty, UNKNOWN or Default.
#
# Parameters:       None
#
# Return:           Function returns 0 on success and 1 on failure
#
# Side effects:     DATA and DATA_SIZE must be set
#
###########################################################################################################
function merge_first_valid_value()
{
   # If AMS is included, take a first valid AMS client
   if (AMS_FIRST_NAME_POS != -1)
   {
      # Find first valid AMS
      if (dedup_field_by_valid_value(AMS_FIRST_NAME_POS) != 0)
      {
         log_message("Error inherited from dedup_field_by_valid_value
function when deduping on AMS.")
         return 1
      }
   }

   # Read in and set first line
   FIRST_LINE = find_first_non_empty_line();
   if (FIRST_LINE == "")
   {
      log_message("Error reading line. All lines are empty!")
      return 1
   }
   $0 = FIRST_LINE

   # Get number of fields in the line
   FIELD_COUNT = NF

   # Look for each field in the line and replace any fields that have
empty, unknown or default value
   # from remaining records belonging to the same client
   for (i = 1; i <= FIELD_COUNT; ++i)
   {
      # Need to always set the first line as we are changing $0 down
in the code
      $0 = FIRST_LINE

      # If field is empty, UNKNOWN or default, find a replacement
      if (trim($i) == "" || tolower($i) == "unknown" || tolower($i) ==
"default")
      {
         # Set field to empty value in case no replacement is found
         $i = ""
         FIRST_LINE = $0

         # Find a replacement from remaining records
         for (j = 1; j <= DATA_SIZE; ++j)
         {
            # If replacement is empty, keep going
            if (DATA[j] == "")
            {
               continue
            }
            $0 = DATA[j]

            # Find a replacement
            if (trim($i) != "" && tolower($i) != "unknown" &&
tolower($i) != "default")
            {
              # Perfrom replace
              REPLACEMENT_FIELD=$i
              $0 = FIRST_LINE
              $i = REPLACEMENT_FIELD
              FIRST_LINE = $0
              break;
            }
         }
      }
   }

   # Once the replacements are made, output record
   output(FIRST_LINE)

   # Reset data size so it can be used again
   DATA_SIZE=0
}


###########################################################################################################
#
# Function name:    dedup_merge
#
# Purpose:          Merges client records using lookup file
#
# Parameters:       None
#
# Return:           Function returns 0 on success and 1 on failure
#
# Side effects:     DATA and DATA_SIZE must be set
#
###########################################################################################################
function dedup_merge()
{
   # Get next lookup values
   LAST_PROCESSED_LOOKUP_ID = LOOKUP_CIN_ID
   if (get_next_lookup_values() != 0)
   {
      log_message("Error inherited from get_next_lookup_values
function.")
      log_message("Last processed LOOKUP_CINERGY_ID is: "
LAST_PROCESSED_LOOKUP_ID)
      log_data_records()
      return 1
   }

   # If Cinergy IDs do not match, we are out of sync
   if (PREV_CIN_ID != LOOKUP_CIN_ID)
   {
      log_message("Cinergy ID from campaign file: "  PREV_CIN_ID "
does not match lookup file Cinergy ID: " LOOKUP_CIN_ID)
      log_message("Last processed LOOKUP_CINERGY_ID is: "
LAST_PROCESSED_LOOKUP_ID)
      log_message("This ERROR can occur for 4 different reasons:")
      log_message("1. Lookup file returned by the stored procedure
contains duplicate Cinergy Ids which in turn means that campaign
contains recipients with same Cinergy Ids (Campaign may contain Leads,
Purged or Expired clients). THIS IS SCENARIO IS VERY LIKELY. In this
case modify the campaign to exclude those clients.")
      log_message("2. Output format of the file contains a date
(client birth date or other date) that points to the date dimesnion
but the date value is past 12/31/2010. THIS IS SCENARIO IS VERY
LIKELY. ")
      log_message("3. Lookup file returned by the stored procedure
contains Cinergy Id that is not present in the export file.")
      log_message("4. Export file contains Cinergy Id that is not
present in the Lookup file returned by the stored procedure.")


      log_data_records()
      return 1
   }

   # Dedup on MG
   if (LOOKUP_MG_ID != "" && MG_ID_POS != -1)
   {
      if (dedup_field_by_value(MG_ID_POS, LOOKUP_MG_ID, "MG ID") != 0)
      {
         log_message("Error inherited from dedup_field_by_value
function when deduping on MG.")
         log_data_records()
         return 1
      }
   }

   # Dedup on CG
   if (LOOKUP_CG_ID != "" && CG_ID_POS != -1)
   {
      if (dedup_field_by_value(CG_ID_POS, LOOKUP_CG_ID, "CG ID") != 0)
      {
         log_message("Error inherited from dedup_field_by_value
function when deduping on CG.")
         log_data_records()
         return 1
      }
   }

   # Merge other records by picking first valid value
   if (merge_first_valid_value() != 0)
   {
      log_message("Error inherited from merge_first_valid_value.")
      return 1
   }

   return 0
}

###########################################################################################################
#
# Function name:    trim
#
# Purpose:          Trims string
#
# Parameters:       None
#
# Return:           Trimmed string
#
# Side effects:     None
#
###########################################################################################################
function trim(string)
{
    sub(/^[ \t]+/, "",  string); sub(/[ \t]+$/, "", string) #trim left
and right
    return string
}


###########################################################################################################
#
# Function name:    log_message
#
# Purpose:          Logs message to a log file
#
# Parameters:       None
#
# Return:           None
#
# Side effects:     None
#
###########################################################################################################
function log_message(string)
{
   "date +%m/%d/%Y_%H:%M:%S" | getline dt
   close("date +%m/%d/%Y_%H:%M:%S")
   print dt " " string >> LOG_FILE
}


###########################################################################################################
#
# Function name:    output
#
# Purpose:          outputs string to output file
#
# Parameters:       None
#
# Return:           None
#
# Side effects:     None
#
###########################################################################################################
function output(string)
{
    print string >> OUT_FILE
}


###########################################################################################################
#
# Function name:    get_next_lookup_values
#
# Purpose:          retrieved Cinergy ID, MG ID and CG ID from lookup
file
#
# Parameters:       None
#
# Return:           0 if success, 1 if failure
#
# Side effects:     Populates LOOKUP_CIN_ID, LOOKUP_MG_ID,
LOOKUP_CG_ID variables
#
###########################################################################################################
function get_next_lookup_values()
{
   # Read line from lookup file
   RETURN_CODE = getline LINE < LOOKUP_FILE
   if (RETURN_CODE <= 0)
   {
     log_message("There are no more lookup records to read or error
reading lookup file!")
     return 1
   }

   # get number of fields from read line and split it into fileds
   FIELD_COUNT = split(LINE, F, FS)
   if (FIELD_COUNT != 3)
   {
      # Invalid format of lookup file
      log_message("Record on lookup file: " LINE " does not have 3
fields!!!")
      return 1
   }

   # Assign values
   LOOKUP_CIN_ID = F[1]
   LOOKUP_MG_ID = F[2]
   LOOKUP_CG_ID = F[3]

   return 0
}


###########################################################################################################
#
# Function name:    dedup_field_by_value
#
# Purpose:          Removes array elements (DATA) that at field
position FIELD_POS do not match FIELD_VALUE.
#                   After, the removal is done, at least one array
element must remain.
#
# Parameters:       None
#
# Return:           0 if success, 1 if failure
#
# Side effects:     Modifies DATA array, modifies $0
#
###########################################################################################################
function dedup_field_by_value(FIELD_POS, FIELD_VALUE, FIELD_NAME)
{
   # Go through each array element
   VALUE_FOUND = "false"
   COUNTER = 0
   for (i = 1; i <= DATA_SIZE; ++i)
   {
      # Get array element
      LINE = DATA[i]
      $0 = LINE

      # If field is not empty and doesn't match FIELD_VALUE, remove it
(set it to empty string)
      if (trim($FIELD_POS) != "" && $FIELD_POS != FIELD_VALUE)
      {
         DATA[i] = ""
         ++COUNTER
      }
      else if ($FIELD_POS == FIELD_VALUE)
      {
         VALUE_FOUND = "true"
      }
   }

   # if all records were removed - error condition
   if (VALUE_FOUND != "true")
   {
      log_message("All records were removed while searching for field
value: "  FIELD_VALUE " of field: " FIELD_NAME " at position: "
FIELD_POS)
      return 1
   }

   return 0
}


###########################################################################################################
#
# Function name:    log_data_records
#
# Purpose:          outputs records that currently being processed to
log file.
#
# Parameters:       None
#
# Return:           0 if success, 1 if failure
#
# Side effects:     None
#
###########################################################################################################
function log_data_records()
{
   log_message("Currently processing following client records:")
   for (i = 1; i <= DATA_SIZE; ++i)
   {
      log_message(DATA[i])
   }
}


###########################################################################################################
#
# Function name:    dedup_field_by_valid_value
#
# Purpose:          Removes array elements (DATA) that at field
position FIELD_POS are empty, UNKNOWN or
#                   Default. After valid field is found, remove
remaining elements.
#
# Parameters:       None
#
# Return:           0 if success, 1 if failure
#
# Side effects:     Modifies DATA array, modifies $0
#
###########################################################################################################
function dedup_field_by_valid_value(FIELD_POS)
{
   # loop for each array element
   FOUND="false"
   for (i = 1; i <= DATA_SIZE; ++i)
   {
      # Get array element
      LINE = DATA[i]
      $0 = LINE

      # If array element is not empty, no valid field value is found
so far, and this field is valid - mark field as found
      if (DATA[i] != "" && FOUND != "true" && trim($FIELD_POS) != ""
&& tolower($FIELD_POS) != "unknown" && tolower($FIELD_POS) !=
"default")
      {
         FOUND = "true"
         continue
      }

      # If array element is not empty, and we already found a valid
value, remove array element
      if (DATA[i] != "" && FOUND == "true" && trim($FIELD_POS) != "")
      {

         DATA[i] = ""
      }
   }

   return 0
}








0
Reply Mensur 8/3/2007 3:01:30 PM

Mensur wrote:

> Here is the whole script. Like I said, it is really long. Again, thank
> you guys for helping.

There is one thing about your script I dont understand.
Where is OUT_FILE initialized ? I guess it never is.
0
Reply ISO 8/3/2007 5:07:02 PM

Mensur wrote:

>> You could build gawk locally in your home directory.
>> Just for comparison.
> 
> Thanks for all your help. I tried using this
> for (ARRAY_INDEX in DATA)
> {
>     delete DATA[ARRAY_INDEX]
> }

It was a good idea to try this.
Now you know the memory leak is not in data storage.

> with no success. Memory still keeps growing. I can try using gawk to
> see if there is difference but that really doesn't help me. Even if
> the problem does not occur with gawk, I still cannot use it.

If the problem does _not_ occur in gawk, then you
have found the bug: in your AWK implementation.
I think this is worth a try, even if you will never
be allowed to use gawk on Solaris.
0
Reply ISO 8/3/2007 5:10:12 PM

On Aug 3, 1:10 pm, J=FCrgen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
> Mensur wrote:
> >> You could build gawk locally in your home directory.
> >> Just for comparison.
>
> > Thanks for all your help. I tried using this
> > for (ARRAY_INDEX in DATA)
> > {
> >     delete DATA[ARRAY_INDEX]
> > }
>
> It was a good idea to try this.
> Now you know the memory leak is not in data storage.
>
> > with no success. Memory still keeps growing. I can try using gawk to
> > see if there is difference but that really doesn't help me. Even if
> > the problem does not occur with gawk, I still cannot use it.
>
> If the problem does _not_ occur in gawk, then you
> have found the bug: in your AWK implementation.
> I think this is worth a try, even if you will never
> be allowed to use gawk on Solaris.

I think I figured it out. It really is a nawk (gawk) memory leak when
using arrays and next, getline etc. This was found and fixed in 2002
but we just haven't gotten the upgrade. Thanks again to everyone for
helping.

0
Reply Mensur 8/3/2007 9:24:23 PM

In article <1186176263.570271.17750@q75g2000hsh.googlegroups.com>,
Mensur  <mmedic123@gmail.com> wrote:
....
>I think I figured it out. It really is a nawk (gawk) memory leak when
>using arrays and next, getline etc.

What do you mean by "nawk (gawk)"?
Is that much like saying "Ford (Chrysler)" ?

Or (to be more closely on-topic): "Solaris (Linux)" ?

>This was found and fixed in 2002
>but we just haven't gotten the upgrade. Thanks again to everyone for
>helping.

0
Reply gazelle 8/3/2007 10:17:18 PM

12 Replies
447 Views

(page loaded in 0.127 seconds)

Similiar Articles:


















7/26/2012 10:10:05 AM


Reply: