|
|
Joining 2 Flat file and creating 3rd file
The scenario is like this:
I have 2 flat files:
File A: Field1 Field2 Field3 Field4
File B: Field1 Field2
I will have to eliminate those records from File A, where File
A.Field2=File B.Field1 and File A.Field3=File B.Field2
and create a new file File 3 with the remaining records.
Please help to do that.
Thanks
|
|
0
|
|
|
|
Reply
|
krishbanerjee
|
4/30/2004 1:55:49 AM |
|
krishbanerjee wrote:
> The scenario is like this:
>
> I have 2 flat files:
>
> File A: Field1 Field2 Field3 Field4
>
> File B: Field1 Field2
>
> I will have to eliminate those records from File A, where File
> A.Field2=File B.Field1 and File A.Field3=File B.Field2
> and create a new file File 3 with the remaining records.
awk 'NR == FNR { fld1[$1]=""; fld2[$2]=""; next }
! ($2 in fld1 && $3 in fld2) { print }' FileB FileA
Regards,
Ed.
|
|
0
|
|
|
|
Reply
|
Ed
|
4/30/2004 3:31:45 AM
|
|
Ed Morton <morton@lsupcaemnt.com> wrote in message >
> awk 'NR == FNR { fld1[$1]=""; fld2[$2]=""; next }
> ! ($2 in fld1 && $3 in fld2) { print }' FileB FileA
>
Unbelievable timing. I was going to post a very similar question.
This gave me enough ideas to get my task done. I had to take two
tab-delimited files, combine them and do some math with certain fields
(which has now led me to a rounding discrepency --but that's a
different post).
File1 has 3 fields, File2 has 34 fields. The first 2 fields in each
file are the same. I needed to insert File1.Field3 and 4 new fields
after File2.Field2 into the output.
Even though this scripts works for me, I'd appreciate some comments on
it, as I'm always trying to learn new things. Incidentally, File1 &
File2 are created from 2 other awk scripts processed against 2 input
files. So a shell script executes the awk scripts in order, creating
the appropriate output files for input into this script, then emails
the final results to me.
- Trevor
gawk script3.awk File1 File2
BEGIN { FS="\t" }
{ Index[NR]=$1 $2; Field3[NR]=$3
if ( NR > FNR && Index[FNR]==$1 $2 )
{ TotMsgCount=Field3[FNR]
NonSpamMsgCount=Field3[FNR]-$11
if ( TotMsgCount == 0 )
{ NonSpamPct = 0; SpamPct = 0 }
else
{ NonSpamPct = int(NonSpamMsgCount/TotMsgCount*1000)/1000;
SpamPct = int($11/TotMsgCount*1000)/1000 }
print $1 "\t" $2 "\t" TotMsgCount "\t" NonSpamMsgCount "\t"
NonSpamPct "\t" SpamPct "\t \t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7
"\t" $8 "\t" $9 "\t" $10 "\t" $11 "\t" $12 "\t" $13 "\t" $14 "\t" $15
"\t" $16 "\t" $17 "\t" $18 "\t" $19 "\t" $20 "\t" $21 "\t" $22 "\t"
$23 "\t" $24 "\t" $25 "\t" $26 "\t" $27 "\t" $28 "\t" $29 "\t" $30
"\t" $31 "\t" $32 "\t" $33 "\t" $34
}
}
END { }
|
|
0
|
|
|
|
Reply
|
trevor
|
5/4/2004 4:26:02 PM
|
|
In article <b289840b.0405040826.78acfead@posting.google.com>,
Trevor Vance <trevor.vance@verizon.net> wrote:
% File1 has 3 fields, File2 has 34 fields. The first 2 fields in each
% file are the same. I needed to insert File1.Field3 and 4 new fields
% after File2.Field2 into the output.
[...]
% BEGIN { FS="\t" }
% { Index[NR]=$1 $2; Field3[NR]=$3
It's better to have something like this:
FNR == NR { Field3[$1,$2] = $3; next }
after that, ever record will be from file 2. You can do something
likethis
($1 FIELDSEP $2) in Field3 {
TotMsgCount = Field3[$1, $2]
NonSpamMsgCount = TotMsgCount - $11
if ( TotMsgCount == 0 )
{ NonSpamPct = 0; SpamPct = 0 }
else
{ NonSpamPct = int(NonSpamMsgCount/TotMsgCount*1000)/1000;
SpamPct = int($11/TotMsgCount*1000)/1000 }
printf $1, $2, TotMsgCount, NonSpamMsgCount, NonSpamPct, SpamPct
for (i = 3; i <= NF; i++)
printf "\t%s", $i
printf "\n"
}
BEGIN { OFS = "\t" } # makes print x,y,z work
--
Patrick TJ McPhee
East York Canada
ptjm@interlog.com
|
|
0
|
|
|
|
Reply
|
ptjm
|
5/5/2004 4:15:59 AM
|
|
Trevor Vance wrote:
> Ed Morton <morton@lsupcaemnt.com> wrote in message >
>
>>awk 'NR == FNR { fld1[$1]=""; fld2[$2]=""; next }
>> ! ($2 in fld1 && $3 in fld2) { print }' FileB FileA
>>
>
> Unbelievable timing. I was going to post a very similar question.
> This gave me enough ideas to get my task done. I had to take two
> tab-delimited files, combine them and do some math with certain fields
> (which has now led me to a rounding discrepency --but that's a
> different post).
>
> File1 has 3 fields, File2 has 34 fields. The first 2 fields in each
> file are the same.
Exaclty the same? i.e. does each file have the same number of lines in
the same field1/field2 order? If that is the case then you can simplify
your script as below.
I needed to insert File1.Field3 and 4 new fields
> after File2.Field2 into the output.
>
> Even though this scripts works for me, I'd appreciate some comments on
> it, as I'm always trying to learn new things. Incidentally, File1 &
> File2 are created from 2 other awk scripts processed against 2 input
> files. So a shell script executes the awk scripts in order, creating
> the appropriate output files for input into this script, then emails
> the final results to me.
>
> - Trevor
>
> gawk script3.awk File1 File2
>
> BEGIN { FS="\t" }
> { Index[NR]=$1 $2; Field3[NR]=$3
> if ( NR > FNR && Index[FNR]==$1 $2 )
> { TotMsgCount=Field3[FNR]
> NonSpamMsgCount=Field3[FNR]-$11
> if ( TotMsgCount == 0 )
> { NonSpamPct = 0; SpamPct = 0 }
> else
> { NonSpamPct = int(NonSpamMsgCount/TotMsgCount*1000)/1000;
> SpamPct = int($11/TotMsgCount*1000)/1000 }
> print $1 "\t" $2 "\t" TotMsgCount "\t" NonSpamMsgCount "\t"
> NonSpamPct "\t" SpamPct "\t \t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7
> "\t" $8 "\t" $9 "\t" $10 "\t" $11 "\t" $12 "\t" $13 "\t" $14 "\t" $15
> "\t" $16 "\t" $17 "\t" $18 "\t" $19 "\t" $20 "\t" $21 "\t" $22 "\t"
> $23 "\t" $24 "\t" $25 "\t" $26 "\t" $27 "\t" $28 "\t" $29 "\t" $30
> "\t" $31 "\t" $32 "\t" $33 "\t" $34
> }
> }
> END { }
Assuming exactly the sanme field1/field2 line ordering, this should do
the same thing (untested):
gawk 'BEGIN{FS="\t";OFS=FS}
NR != FNR { Field3[FNR]=$3; next }
{ TotMsgCount=Field3[FNR]
NonSpamMsgCount=Field3[FNR]-$11
if ( TotMsgCount == 0 )
{ NonSpamPct = 0; SpamPct = 0 }
else
{ NonSpamPct = int(NonSpamMsgCount/TotMsgCount*1000)/1000
SpamPct = int($11/TotMsgCount*1000)/1000
# you can do the printing this way
# regardless of input line ordering:
$2 = $2 OFS TotMsgCount OFS NonSpamMsgCount OFS NonSpamPct OFS
SpamPct OFS " " OFS
print
}
}'
Regards,
Ed.
|
|
0
|
|
|
|
Reply
|
Ed
|
5/6/2004 8:39:59 PM
|
|
|
4 Replies
252 Views
(page loaded in 0.127 seconds)
|
|
|
|
|
|
|
|
|