I have a flat file with over 3 million rows. File is being processed
with nawk program that does some merging on the file and outputs the
result in a different file. After processing around 2.5 million rows,
nawk fails with:
nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
What can cause this error? Looking at the source code, the error is
caused by failed C malloc function to obtain memory. I am using awk
arrays to do the merging - I am not sure if that has anything to do
with it.
By the way, nawk is run on Sun Solaris 8. Any help is appreciated.
|
|
0
|
|
|
|
Reply
|
mmedic123 (2)
|
8/2/2007 5:51:57 PM |
|
Mensur wrote:
> I have a flat file with over 3 million rows. File is being processed
> with nawk program that does some merging on the file and outputs the
> result in a different file. After processing around 2.5 million rows,
> nawk fails with:
>
> nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
> 35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
>
> What can cause this error? Looking at the source code, the error is
> caused by failed C malloc function to obtain memory. I am using awk
> arrays to do the merging - I am not sure if that has anything to do
> with it.
>
> By the way, nawk is run on Sun Solaris 8. Any help is appreciated.
>
It's hard to guess without seeing your awk script. In the meantime, try
switching to gawk...
Ed.
|
|
0
|
|
|
|
Reply
|
Ed
|
8/2/2007 6:17:13 PM
|
|
On Aug 2, 2:17 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
> Mensur wrote:
> > I have a flat file with over 3 million rows. File is being processed
> > with nawk program that does some merging on the file and outputs the
> > result in a different file. After processing around 2.5 million rows,
> > nawk fails with:
>
> > nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
> > 35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
>
> > What can cause this error? Looking at the source code, the error is
> > caused by failed C malloc function to obtain memory. I am using awk
> > arrays to do the merging - I am not sure if that has anything to do
> > with it.
>
> > By the way, nawk is run on Sun Solaris 8. Any help is appreciated.
>
> It's hard to guess without seeing your awk script. In the meantime, try
> switching to gawk...
>
> Ed.- Hide quoted text -
>
> - Show quoted text -
I am afraid switching to gawk is not possible as it is not installed
in our production version of Solaris 8 and I have no influence over
it. I will post the script which is quite complicated but here is the
gist:
There are two files that are read in - one is a list file of clients
who have a unique id but can have multiple rows per client (think
transactional client data like accounts, etc.) and a second file which
is a lookup file. They are both sorted so the IDs are in sync. For
every client in list file there is an entry in the lookup file. I use
entries in lookup file to merge multiple client rows into one row. For
example, list file:
UNIQUE ID | ACCOUNT TYPE #1| Account balance for ACCOUNT TYPE #1 |
ACCOUNT TYPE #1 | Account balance for ACCOUNT TYPE #1
------------------------------------
1|123| 150.01 ||||
1|124| 521.00 ||||
1|125| 587.00 ||||
1|||256| 5.00 ||
1|||257| 5.68 ||
This looks like typical transactional data, where each transactional
record is put on a different line.
Lookup file:
UNIQUE ID | ACCOUNT TYPE #1 | ACCOUNT TYPE #2
-------------------------------
1|125|257
Since one client can have multiple accounts for example, lookup tells
me which transactional record (account) to keep based on some business
rule such as keep account with highest balance. I read in all records
belonging to one client (when UNIQUE ID changes, I know I have all
records belonging to one client), I use information in the lookup file
to merge the records into one record. For example, result of above
files is:
UNIQUE ID | ACCOUNT TYPE #1| Account balance for ACCOUNT TYPE #1 |
ACCOUNT TYPE #1 | Account balance for ACCOUNT TYPE #1
------------------------------------
1|125| 587.00 |257| 5.68
I use nawk for this and read all client records into an array and then
merge them, but then, I do not clear (delete) the array but simply
reuse it for the next record (I keep track at all times how many
records are in the array). So I am wondering if that causes nawk not
to release the memory from the array and have a memory leak eventually
leading to memory exhaustion.
I said that the script is quite complicated and is around 700 lines so
I don't think that anyone want to read through that. Here is the
scaled down version:
BEGIN{
# Set initial values of variables
OFS=FS;
ID_POS=1
CURR_ID = -1;
PREV_ID= - 1;
DATA_SIZE = 0;
DATA[1] = "";
}
{
# Get line that was read in
CURRENT_LINE = $0
# Get current CINERGY ID
CURR_ID = $CIN_POS
# If Cinergy IDs do not match, we have read all records that belong
to a single client - MERGE
if (PREV_ID != CURR_ID)
{
# Merge using lookup file
merge()
# Reset ID
PREV_ID = CURR_ID
}
# Add currently read line to list of records that belong to same
client
++DATA_SIZE
DATA[DATA_SIZE] = CURRENT_LINE
}
function merge()
{
# Merging logic here.....
# Once merging is done print merged record to file
print DATA[1]
# Reset data size to 0 to reuse the array
DATA_SIZE = 0
}
|
|
0
|
|
|
|
Reply
|
Mensur
|
8/2/2007 8:10:51 PM
|
|
Mensur wrote:
> On Aug 2, 2:17 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
>> Mensur wrote:
>>> I have a flat file with over 3 million rows. File is being processed
>>> with nawk program that does some merging on the file and outputs the
>>> result in a different file. After processing around 2.5 million rows,
>>> nawk fails with:
>>> nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
>>> 35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
>>> What can cause this error? Looking at the source code, the error is
>>> caused by failed C malloc function to obtain memory. I am using awk
>>> arrays to do the merging - I am not sure if that has anything to do
>>> with it.
It is not surprising that you script fails when
processing large files. It accumulates all lines
in the memory of the AWK intepreter:
> {
....
>
> # Add currently read line to list of records that belong to same
> client
> ++DATA_SIZE
> DATA[DATA_SIZE] = CURRENT_LINE
> }
Try to avoid storing all lines.
Ed's recommended using gawk. Using gawk when processing
large files is generally a good idea. gawk avoids memory
limitations where most other implementations _have_ limitations.
|
|
0
|
|
|
|
Reply
|
Juergen
|
8/3/2007 7:18:25 AM
|
|
On Aug 3, 3:18 am, Juergen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
> Mensur wrote:
> > On Aug 2, 2:17 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
> >> Mensur wrote:
> >>> I have a flat file with over 3 million rows. File is being processed
> >>> with nawk program that does some merging on the file and outputs the
> >>> result in a different file. After processing around 2.5 million rows,
> >>> nawk fails with:
> >>> nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
> >>> 35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
> >>> What can cause this error? Looking at the source code, the error is
> >>> caused by failed C malloc function to obtain memory. I am using awk
> >>> arrays to do the merging - I am not sure if that has anything to do
> >>> with it.
>
> It is not surprising that you script fails when
> processing large files. It accumulates all lines
> in the memory of the AWK intepreter:
>
>
>
> > {
> ...
>
> > # Add currently read line to list of records that belong to same
> > client
> > ++DATA_SIZE
> > DATA[DATA_SIZE] = CURRENT_LINE
> > }
>
> Try to avoid storing all lines.
> Ed's recommended using gawk. Using gawk when processing
> large files is generally a good idea. gawk avoids memory
> limitations where most other implementations _have_ limitations.
It is strange to me that it is happening because I do reset the
DATA_SIZE varibale to 0 after accumulating all records belonging to
one client. So in essence, I allocate maybe at most 50 array elements
(lines in the file) in memory and keep overwriting them. Therefore,
the memory should not grow more than it takes to store those 50 lines.
However, I looked at memory usage while the process is running, and it
keep growing constantly.
I would love to use gawk, but like I mentioned, our compnay does not
have it installed and I have no say so in that.
|
|
0
|
|
|
|
Reply
|
Mensur
|
8/3/2007 1:03:47 PM
|
|
Mensur wrote:
> On Aug 3, 3:18 am, Juergen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
> wrote:
>
>>Mensur wrote:
>>
>>>On Aug 2, 2:17 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
>>>
>>>>Mensur wrote:
>>>>
>>>>>I have a flat file with over 3 million rows. File is being processed
>>>>>with nawk program that does some merging on the file and outputs the
>>>>>result in a different file. After processing around 2.5 million rows,
>>>>>nawk fails with:
>>>>>nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
>>>>>35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
>>>>>What can cause this error? Looking at the source code, the error is
>>>>>caused by failed C malloc function to obtain memory. I am using awk
>>>>>arrays to do the merging - I am not sure if that has anything to do
>>>>>with it.
>>
>>It is not surprising that you script fails when
>>processing large files. It accumulates all lines
>>in the memory of the AWK intepreter:
>>
>>
>>
>>
>>>{
>>
>>...
>>
>>
>>> # Add currently read line to list of records that belong to same
>>>client
>>> ++DATA_SIZE
>>> DATA[DATA_SIZE] = CURRENT_LINE
>>>}
>>
>>Try to avoid storing all lines.
>>Ed's recommended using gawk. Using gawk when processing
>>large files is generally a good idea. gawk avoids memory
>>limitations where most other implementations _have_ limitations.
>
>
> It is strange to me that it is happening because I do reset the
> DATA_SIZE varibale to 0 after accumulating all records belonging to
> one client. So in essence, I allocate maybe at most 50 array elements
> (lines in the file) in memory and keep overwriting them. Therefore,
> the memory should not grow more than it takes to store those 50 lines.
> However, I looked at memory usage while the process is running, and it
> keep growing constantly.
>
> I would love to use gawk, but like I mentioned, our compnay does not
> have it installed and I have no say so in that.
>
Can you come up with a small, complete script that reproduces the
problem and post that?
Ed.
|
|
0
|
|
|
|
Reply
|
Ed
|
8/3/2007 1:38:54 PM
|
|
Mensur wrote:
> It is strange to me that it is happening because I do reset the
> DATA_SIZE varibale to 0 after accumulating all records belonging to
> one client. So in essence, I allocate maybe at most 50 array elements
> (lines in the file) in memory and keep overwriting them. Therefore,
> the memory should not grow more than it takes to store those 50 lines.
Yes, your script really looks like it should not
run into memory problems.
Try this:
function merge()
{
# Merging logic here.....
# Once merging is done print merged record to file
print DATA[1]
# Reset data size to 0 to reuse the array
while (DATA_SIZE > 0)
delete data[DATA_SIZE--]
}
> However, I looked at memory usage while the process is running, and it
> keep growing constantly.
Sounds like a memory leak. Consider using valgrind
to locate it.
> I would love to use gawk, but like I mentioned, our compnay does not
> have it installed and I have no say so in that.
You could build gawk locally in your home directory.
Just for comparison.
|
|
0
|
|
|
|
Reply
|
ISO
|
8/3/2007 1:47:37 PM
|
|
On Aug 3, 9:47 am, J=FCrgen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
> Mensur wrote:
> > It is strange to me that it is happening because I do reset the
> > DATA_SIZE varibale to 0 after accumulating all records belonging to
> > one client. So in essence, I allocate maybe at most 50 array elements
> > (lines in the file) in memory and keep overwriting them. Therefore,
> > the memory should not grow more than it takes to store those 50 lines.
>
> Yes, your script really looks like it should not
> run into memory problems.
>
> Try this:
>
> function merge()
> {
> # Merging logic here.....
>
> # Once merging is done print merged record to file
> print DATA[1]
>
> # Reset data size to 0 to reuse the array
> while (DATA_SIZE > 0)
> delete data[DATA_SIZE--]
>
> }
> > However, I looked at memory usage while the process is running, and it
> > keep growing constantly.
>
> Sounds like a memory leak. Consider using valgrind
> to locate it.
>
> > I would love to use gawk, but like I mentioned, our compnay does not
> > have it installed and I have no say so in that.
>
> You could build gawk locally in your home directory.
> Just for comparison.
Thanks for all your help. I tried using this
for (ARRAY_INDEX in DATA)
{
delete DATA[ARRAY_INDEX]
}
with no success. Memory still keeps growing. I can try using gawk to
see if there is difference but that really doesn't help me. Even if
the problem does not occur with gawk, I still cannot use it.
|
|
0
|
|
|
|
Reply
|
Mensur
|
8/3/2007 2:59:30 PM
|
|
On Aug 3, 9:38 am, Ed Morton <mor...@lsupcaemnt.com> wrote:
> Mensur wrote:
> > On Aug 3, 3:18 am, Juergen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
> > wrote:
>
> >>Mensur wrote:
>
> >>>On Aug 2, 2:17 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
>
> >>>>Mensur wrote:
>
> >>>>>I have a flat file with over 3 million rows. File is being processed
> >>>>>with nawk program that does some merging on the file and outputs the
> >>>>>result in a different file. After processing around 2.5 million rows,
> >>>>>nawk fails with:
> >>>>>nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
> >>>>>35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
> >>>>>What can cause this error? Looking at the source code, the error is
> >>>>>caused by failed C malloc function to obtain memory. I am using awk
> >>>>>arrays to do the merging - I am not sure if that has anything to do
> >>>>>with it.
>
> >>It is not surprising that you script fails when
> >>processing large files. It accumulates all lines
> >>in the memory of the AWK intepreter:
>
> >>>{
>
> >>...
>
> >>> # Add currently read line to list of records that belong to same
> >>>client
> >>> ++DATA_SIZE
> >>> DATA[DATA_SIZE] = CURRENT_LINE
> >>>}
>
> >>Try to avoid storing all lines.
> >>Ed's recommended using gawk. Using gawk when processing
> >>large files is generally a good idea. gawk avoids memory
> >>limitations where most other implementations _have_ limitations.
>
> > It is strange to me that it is happening because I do reset the
> > DATA_SIZE varibale to 0 after accumulating all records belonging to
> > one client. So in essence, I allocate maybe at most 50 array elements
> > (lines in the file) in memory and keep overwriting them. Therefore,
> > the memory should not grow more than it takes to store those 50 lines.
> > However, I looked at memory usage while the process is running, and it
> > keep growing constantly.
>
> > I would love to use gawk, but like I mentioned, our compnay does not
> > have it installed and I have no say so in that.
>
> Can you come up with a small, complete script that reproduces the
> problem and post that?
>
> Ed.- Hide quoted text -
>
> - Show quoted text -
Here is the whole script. Like I said, it is really long. Again, thank
you guys for helping.
BEGIN{
# Set initial values of variables
OFS=FS;
ERROR_CD = 0;
CIN_POS = -1;
MG_ID_POS = -1;
CG_ID_POS = -1;
AMS_CG_ID_POS = -1;
AMS_FIRST_NAME_POS = -1;
CURR_CIN_ID = -1;
PREV_CIN_ID= - 1;
DATA_SIZE = 0;
DATA[1] = "";
LOOKUP_CIN_ID = -1;
LOOKUP_MG_ID = -1;
LOOKUP_CG_ID = -1;
# Process header record to read in header record, find positions of
variables, output header record and read in next line
if (process_header_record() != 0)
{
log_message("Error inherited from process_header_record
function.")
ERROR_CD = 1;
exit 1
}
}
#######################################################################
# MAIN PROCESSING
#######################################################################
{
# IF error occured in BEGIN section, detect it here
if (ERROR_CD != 0)
{
log_message("Error occured in BEGIN section. Exiting ....")
exit 1
}
# Skip empty lines
if (trim($0) == "")
{
log_message("Found empty record: " FNR ". Skipping.")
next
}
# Log which record we are processing
if (FNR % 10000 == 3)
{
log_message("Processing record: " FNR)
}
# Get line that was read in
CURRENT_LINE = $0
# Get current CINERGY ID
CURR_CIN_ID = $CIN_POS
# If Cinergy IDs do not match, we have read all records that belong
to a single client - MERGE
if (PREV_CIN_ID != CURR_CIN_ID)
{
# IF either MG ID or CG ID are included on the file, we will use
lookup file to find
# MG ID with earliest establishment date and CG ID with hjighest
assets
if (MG_ID_POS != -1 || CG_ID_POS != -1)
{
# Merge using lookup file
if(dedup_merge() != 0)
{
log_message("Error inherited from dedup_merge function.")
ERROR_CD = 1
exit 1
}
}
else
{
# Merge by picking first valid value for each column
if(merge_first_valid_value() != 0)
{
log_message("Error inherited from merge_first_valid_value
function.")
ERROR_CD = 1
exit 1
}
}
# Reset Cinergy ID
PREV_CIN_ID = CURR_CIN_ID
}
# Add currently read line to list of records that belong to same
client
++DATA_SIZE
DATA[DATA_SIZE] = CURRENT_LINE
}
# END SECTION
END{
# Detect error from MAIN section
if (ERROR_CD != 0)
{
log_message("Error occured in MAIN section. Exiting...")
exit 1
}
# Merge last client
log_message("Merging last client...")
# IF either MG ID or CG ID are included on the file, we will use
lookup file to find
# MG ID with earliest establishment date and CG ID with hjighest
assets
if (MG_ID_POS != -1 || CG_ID_POS != -1)
{
if(dedup_merge() != 0)
{
log_message("Error inherited from dedup_merge function.")
exit 1
}
}
else
{
# Merge by picking first valid value for each column
if(merge_first_valid_value() != 0)
{
log_message("Error inherited from merge_first_valid_value
function.")
exit 1
}
}
}
###########################################################################################################
#
# Function name: process_header_record
#
# Purpose: Function reads in the header record, extracts
field positions from the header, outputs header
# record to an output file. Then it reads another
record, initilized DATA array, DATA_SIZE
# and CURR_CIN_ID and PREV_CIN_ID variables.
#
# Parameters: None
#
# Return: Function returns 0 on success and 1 on failure
#
# Side effects: $0, NF, FNR are reset
#
###########################################################################################################
function process_header_record()
{
# Explicitly read in header record from campaign file
RETURN_CODE = getline
if (RETURN_CODE <= 0)
{
log_message("There are no more campaign records to read or error
reading campaign file!")
return 1
}
# Get position of Cinergy Client Id field
CIN_POS = get_field_pos("Cinergy Client Id")
if (CIN_POS == -1)
{
log_message("Error trying to get position of Cinergy Client Id")
return 1
}
# Get position of Marketing Group Identifier
MG_ID_POS = get_field_pos("Marketing Group Identifier")
if (MG_ID_POS == -1)
{
log_message("MG ID not present on the file.")
}
# Get position of Client Group ID
Marketing Group Identifier
CG_ID_POS = get_field_pos("Client Group ID")
if (CG_ID_POS == -1)
{
log_message("CG ID not present on the file.")
}
# Get position of AMS Client Group ID
AMS_CG_ID_POS = get_field_pos("AMS Client Group ID")
if (AMS_CG_ID_POS == -1)
{
log_message("AMS CG ID not present on the file.")
}
# Get position of AMS Crew First Name
AMS_FIRST_NAME_POS = get_field_pos("AMS Crew First Name")
if (AMS_FIRST_NAME_POS == -1)
{
log_message("AMS Manager First Name not present on the file.")
}
# Output header record
print $0 > OUT_FILE
# Read in first client record from campaign file
RETURN_CODE = getline
if (RETURN_CODE <= 0)
{
log_message("There are no more campaign records to read or error
reading campaign file!")
return 1
}
# Set variables
CURR_CIN_ID = $CIN_POS
PREV_CIN_ID = $CIN_POS
DATA_SIZE=1
DATA[1] = $0
}
###########################################################################################################
#
# Function name: get_field_pos
#
# Purpose: Function retruns the position of field FIELD_TEXT
from $0 or -1 if the field doesn't exist.
#
# Parameters: FIELD_TEXT
#
# Return: Function returns 0 on success and 1 on failure
#
# Side effects: $0 must be set
#
###########################################################################################################
function get_field_pos(FIELD_TEXT)
{
# Loop for each field in $0 and find position of field with text
FIELD_TEXT
for (i = 1; i <= NF; ++i)
{
# If field exists, return position
if (trim($i) == FIELD_TEXT)
{
return i
}
}
return -1
}
###########################################################################################################
#
# Function name: find_first_non_empty_line
#
# Purpose: Function returns first non-empty line in DATA
array or empty string if all lines are
# empty.
#
# Parameters: None
#
# Return: Function returns 0 on success and 1 on failure
#
# Side effects: DATA and DATA_SIZE must be set
#
###########################################################################################################
function find_first_non_empty_line()
{
# Find first non-empty line
for (i = 1; i <= DATA_SIZE; ++i)
{
if (DATA[i] != "")
{
return DATA[i]
}
}
return ""
}
###########################################################################################################
#
# Function name: merge_first_valid_value
#
# Purpose: Merge columns in DATA by taking the first value
that is not empty, UNKNOWN or Default.
#
# Parameters: None
#
# Return: Function returns 0 on success and 1 on failure
#
# Side effects: DATA and DATA_SIZE must be set
#
###########################################################################################################
function merge_first_valid_value()
{
# If AMS is included, take a first valid AMS client
if (AMS_FIRST_NAME_POS != -1)
{
# Find first valid AMS
if (dedup_field_by_valid_value(AMS_FIRST_NAME_POS) != 0)
{
log_message("Error inherited from dedup_field_by_valid_value
function when deduping on AMS.")
return 1
}
}
# Read in and set first line
FIRST_LINE = find_first_non_empty_line();
if (FIRST_LINE == "")
{
log_message("Error reading line. All lines are empty!")
return 1
}
$0 = FIRST_LINE
# Get number of fields in the line
FIELD_COUNT = NF
# Look for each field in the line and replace any fields that have
empty, unknown or default value
# from remaining records belonging to the same client
for (i = 1; i <= FIELD_COUNT; ++i)
{
# Need to always set the first line as we are changing $0 down
in the code
$0 = FIRST_LINE
# If field is empty, UNKNOWN or default, find a replacement
if (trim($i) == "" || tolower($i) == "unknown" || tolower($i) ==
"default")
{
# Set field to empty value in case no replacement is found
$i = ""
FIRST_LINE = $0
# Find a replacement from remaining records
for (j = 1; j <= DATA_SIZE; ++j)
{
# If replacement is empty, keep going
if (DATA[j] == "")
{
continue
}
$0 = DATA[j]
# Find a replacement
if (trim($i) != "" && tolower($i) != "unknown" &&
tolower($i) != "default")
{
# Perfrom replace
REPLACEMENT_FIELD=$i
$0 = FIRST_LINE
$i = REPLACEMENT_FIELD
FIRST_LINE = $0
break;
}
}
}
}
# Once the replacements are made, output record
output(FIRST_LINE)
# Reset data size so it can be used again
DATA_SIZE=0
}
###########################################################################################################
#
# Function name: dedup_merge
#
# Purpose: Merges client records using lookup file
#
# Parameters: None
#
# Return: Function returns 0 on success and 1 on failure
#
# Side effects: DATA and DATA_SIZE must be set
#
###########################################################################################################
function dedup_merge()
{
# Get next lookup values
LAST_PROCESSED_LOOKUP_ID = LOOKUP_CIN_ID
if (get_next_lookup_values() != 0)
{
log_message("Error inherited from get_next_lookup_values
function.")
log_message("Last processed LOOKUP_CINERGY_ID is: "
LAST_PROCESSED_LOOKUP_ID)
log_data_records()
return 1
}
# If Cinergy IDs do not match, we are out of sync
if (PREV_CIN_ID != LOOKUP_CIN_ID)
{
log_message("Cinergy ID from campaign file: " PREV_CIN_ID "
does not match lookup file Cinergy ID: " LOOKUP_CIN_ID)
log_message("Last processed LOOKUP_CINERGY_ID is: "
LAST_PROCESSED_LOOKUP_ID)
log_message("This ERROR can occur for 4 different reasons:")
log_message("1. Lookup file returned by the stored procedure
contains duplicate Cinergy Ids which in turn means that campaign
contains recipients with same Cinergy Ids (Campaign may contain Leads,
Purged or Expired clients). THIS IS SCENARIO IS VERY LIKELY. In this
case modify the campaign to exclude those clients.")
log_message("2. Output format of the file contains a date
(client birth date or other date) that points to the date dimesnion
but the date value is past 12/31/2010. THIS IS SCENARIO IS VERY
LIKELY. ")
log_message("3. Lookup file returned by the stored procedure
contains Cinergy Id that is not present in the export file.")
log_message("4. Export file contains Cinergy Id that is not
present in the Lookup file returned by the stored procedure.")
log_data_records()
return 1
}
# Dedup on MG
if (LOOKUP_MG_ID != "" && MG_ID_POS != -1)
{
if (dedup_field_by_value(MG_ID_POS, LOOKUP_MG_ID, "MG ID") != 0)
{
log_message("Error inherited from dedup_field_by_value
function when deduping on MG.")
log_data_records()
return 1
}
}
# Dedup on CG
if (LOOKUP_CG_ID != "" && CG_ID_POS != -1)
{
if (dedup_field_by_value(CG_ID_POS, LOOKUP_CG_ID, "CG ID") != 0)
{
log_message("Error inherited from dedup_field_by_value
function when deduping on CG.")
log_data_records()
return 1
}
}
# Merge other records by picking first valid value
if (merge_first_valid_value() != 0)
{
log_message("Error inherited from merge_first_valid_value.")
return 1
}
return 0
}
###########################################################################################################
#
# Function name: trim
#
# Purpose: Trims string
#
# Parameters: None
#
# Return: Trimmed string
#
# Side effects: None
#
###########################################################################################################
function trim(string)
{
sub(/^[ \t]+/, "", string); sub(/[ \t]+$/, "", string) #trim left
and right
return string
}
###########################################################################################################
#
# Function name: log_message
#
# Purpose: Logs message to a log file
#
# Parameters: None
#
# Return: None
#
# Side effects: None
#
###########################################################################################################
function log_message(string)
{
"date +%m/%d/%Y_%H:%M:%S" | getline dt
close("date +%m/%d/%Y_%H:%M:%S")
print dt " " string >> LOG_FILE
}
###########################################################################################################
#
# Function name: output
#
# Purpose: outputs string to output file
#
# Parameters: None
#
# Return: None
#
# Side effects: None
#
###########################################################################################################
function output(string)
{
print string >> OUT_FILE
}
###########################################################################################################
#
# Function name: get_next_lookup_values
#
# Purpose: retrieved Cinergy ID, MG ID and CG ID from lookup
file
#
# Parameters: None
#
# Return: 0 if success, 1 if failure
#
# Side effects: Populates LOOKUP_CIN_ID, LOOKUP_MG_ID,
LOOKUP_CG_ID variables
#
###########################################################################################################
function get_next_lookup_values()
{
# Read line from lookup file
RETURN_CODE = getline LINE < LOOKUP_FILE
if (RETURN_CODE <= 0)
{
log_message("There are no more lookup records to read or error
reading lookup file!")
return 1
}
# get number of fields from read line and split it into fileds
FIELD_COUNT = split(LINE, F, FS)
if (FIELD_COUNT != 3)
{
# Invalid format of lookup file
log_message("Record on lookup file: " LINE " does not have 3
fields!!!")
return 1
}
# Assign values
LOOKUP_CIN_ID = F[1]
LOOKUP_MG_ID = F[2]
LOOKUP_CG_ID = F[3]
return 0
}
###########################################################################################################
#
# Function name: dedup_field_by_value
#
# Purpose: Removes array elements (DATA) that at field
position FIELD_POS do not match FIELD_VALUE.
# After, the removal is done, at least one array
element must remain.
#
# Parameters: None
#
# Return: 0 if success, 1 if failure
#
# Side effects: Modifies DATA array, modifies $0
#
###########################################################################################################
function dedup_field_by_value(FIELD_POS, FIELD_VALUE, FIELD_NAME)
{
# Go through each array element
VALUE_FOUND = "false"
COUNTER = 0
for (i = 1; i <= DATA_SIZE; ++i)
{
# Get array element
LINE = DATA[i]
$0 = LINE
# If field is not empty and doesn't match FIELD_VALUE, remove it
(set it to empty string)
if (trim($FIELD_POS) != "" && $FIELD_POS != FIELD_VALUE)
{
DATA[i] = ""
++COUNTER
}
else if ($FIELD_POS == FIELD_VALUE)
{
VALUE_FOUND = "true"
}
}
# if all records were removed - error condition
if (VALUE_FOUND != "true")
{
log_message("All records were removed while searching for field
value: " FIELD_VALUE " of field: " FIELD_NAME " at position: "
FIELD_POS)
return 1
}
return 0
}
###########################################################################################################
#
# Function name: log_data_records
#
# Purpose: outputs records that currently being processed to
log file.
#
# Parameters: None
#
# Return: 0 if success, 1 if failure
#
# Side effects: None
#
###########################################################################################################
function log_data_records()
{
log_message("Currently processing following client records:")
for (i = 1; i <= DATA_SIZE; ++i)
{
log_message(DATA[i])
}
}
###########################################################################################################
#
# Function name: dedup_field_by_valid_value
#
# Purpose: Removes array elements (DATA) that at field
position FIELD_POS are empty, UNKNOWN or
# Default. After valid field is found, remove
remaining elements.
#
# Parameters: None
#
# Return: 0 if success, 1 if failure
#
# Side effects: Modifies DATA array, modifies $0
#
###########################################################################################################
function dedup_field_by_valid_value(FIELD_POS)
{
# loop for each array element
FOUND="false"
for (i = 1; i <= DATA_SIZE; ++i)
{
# Get array element
LINE = DATA[i]
$0 = LINE
# If array element is not empty, no valid field value is found
so far, and this field is valid - mark field as found
if (DATA[i] != "" && FOUND != "true" && trim($FIELD_POS) != ""
&& tolower($FIELD_POS) != "unknown" && tolower($FIELD_POS) !=
"default")
{
FOUND = "true"
continue
}
# If array element is not empty, and we already found a valid
value, remove array element
if (DATA[i] != "" && FOUND == "true" && trim($FIELD_POS) != "")
{
DATA[i] = ""
}
}
return 0
}
|
|
0
|
|
|
|
Reply
|
Mensur
|
8/3/2007 3:01:30 PM
|
|
Mensur wrote:
> Here is the whole script. Like I said, it is really long. Again, thank
> you guys for helping.
There is one thing about your script I dont understand.
Where is OUT_FILE initialized ? I guess it never is.
|
|
0
|
|
|
|
Reply
|
ISO
|
8/3/2007 5:07:02 PM
|
|
Mensur wrote:
>> You could build gawk locally in your home directory.
>> Just for comparison.
>
> Thanks for all your help. I tried using this
> for (ARRAY_INDEX in DATA)
> {
> delete DATA[ARRAY_INDEX]
> }
It was a good idea to try this.
Now you know the memory leak is not in data storage.
> with no success. Memory still keeps growing. I can try using gawk to
> see if there is difference but that really doesn't help me. Even if
> the problem does not occur with gawk, I still cannot use it.
If the problem does _not_ occur in gawk, then you
have found the bug: in your AWK implementation.
I think this is worth a try, even if you will never
be allowed to use gawk on Solaris.
|
|
0
|
|
|
|
Reply
|
ISO
|
8/3/2007 5:10:12 PM
|
|
On Aug 3, 1:10 pm, J=FCrgen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:
> Mensur wrote:
> >> You could build gawk locally in your home directory.
> >> Just for comparison.
>
> > Thanks for all your help. I tried using this
> > for (ARRAY_INDEX in DATA)
> > {
> > delete DATA[ARRAY_INDEX]
> > }
>
> It was a good idea to try this.
> Now you know the memory leak is not in data storage.
>
> > with no success. Memory still keeps growing. I can try using gawk to
> > see if there is difference but that really doesn't help me. Even if
> > the problem does not occur with gawk, I still cannot use it.
>
> If the problem does _not_ occur in gawk, then you
> have found the bug: in your AWK implementation.
> I think this is worth a try, even if you will never
> be allowed to use gawk on Solaris.
I think I figured it out. It really is a nawk (gawk) memory leak when
using arrays and next, getline etc. This was found and fixed in 2002
but we just haven't gotten the upgrade. Thanks again to everyone for
helping.
|
|
0
|
|
|
|
Reply
|
Mensur
|
8/3/2007 9:24:23 PM
|
|
In article <1186176263.570271.17750@q75g2000hsh.googlegroups.com>,
Mensur <mmedic123@gmail.com> wrote:
....
>I think I figured it out. It really is a nawk (gawk) memory leak when
>using arrays and next, getline etc.
What do you mean by "nawk (gawk)"?
Is that much like saying "Ford (Chrysler)" ?
Or (to be more closely on-topic): "Solaris (Linux)" ?
>This was found and fixed in 2002
>but we just haven't gotten the upgrade. Thanks again to everyone for
>helping.
|
|
0
|
|
|
|
Reply
|
gazelle
|
8/3/2007 10:17:18 PM
|
|
|
12 Replies
447 Views
(page loaded in 0.127 seconds)
|