Joining 2 Flat file and creating 3rd file

  • Follow


The scenario is like this:

I have 2 flat files:

File A:  Field1  Field2 Field3 Field4

File B:  Field1  Field2

I will have to eliminate those records from File A, where File
A.Field2=File B.Field1 and File A.Field3=File B.Field2
and create a new file File 3 with the remaining records.

Please help to do that.

Thanks

0
Reply krishbanerjee 4/30/2004 1:55:49 AM


krishbanerjee wrote:
> The scenario is like this:
> 
> I have 2 flat files:
> 
> File A:  Field1  Field2 Field3 Field4
> 
> File B:  Field1  Field2
> 
> I will have to eliminate those records from File A, where File
> A.Field2=File B.Field1 and File A.Field3=File B.Field2
> and create a new file File 3 with the remaining records.

awk 'NR == FNR { fld1[$1]=""; fld2[$2]=""; next }
	! ($2 in fld1 && $3 in fld2) { print }' FileB FileA

Regards,

	Ed.

0
Reply Ed 4/30/2004 3:31:45 AM


Ed Morton <morton@lsupcaemnt.com> wrote in message > 
> awk 'NR == FNR { fld1[$1]=""; fld2[$2]=""; next }
> 	! ($2 in fld1 && $3 in fld2) { print }' FileB FileA
> 
Unbelievable timing.  I was going to post a very similar question. 
This gave me enough ideas to get my task done.  I had to take two
tab-delimited files, combine them and do some math with certain fields
(which has now led me to a rounding discrepency --but that's a
different post).

File1 has 3 fields, File2 has 34 fields.  The first 2 fields in each
file are the same.  I needed to insert File1.Field3 and 4 new fields
after File2.Field2 into the output.

Even though this scripts works for me, I'd appreciate some comments on
it, as I'm always trying to learn new things.  Incidentally, File1 &
File2 are created from 2 other awk scripts processed against 2 input
files.  So a shell script executes the awk scripts in order, creating
the appropriate output files for input into this script, then emails
the final results to me.

  -  Trevor

gawk script3.awk File1 File2

BEGIN { FS="\t" }
 { Index[NR]=$1 $2; Field3[NR]=$3
   if ( NR > FNR && Index[FNR]==$1 $2 )
     { TotMsgCount=Field3[FNR]
       NonSpamMsgCount=Field3[FNR]-$11
       if ( TotMsgCount == 0 )
         { NonSpamPct = 0; SpamPct = 0 }
       else
         { NonSpamPct = int(NonSpamMsgCount/TotMsgCount*1000)/1000;
SpamPct = int($11/TotMsgCount*1000)/1000 }
       print $1 "\t" $2 "\t" TotMsgCount "\t" NonSpamMsgCount "\t"
NonSpamPct "\t" SpamPct "\t \t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7
"\t" $8 "\t" $9 "\t" $10 "\t" $11 "\t" $12 "\t" $13 "\t" $14 "\t" $15
"\t" $16 "\t" $17 "\t" $18 "\t" $19 "\t" $20 "\t" $21 "\t" $22 "\t"
$23 "\t" $24 "\t" $25 "\t" $26 "\t" $27 "\t" $28 "\t" $29 "\t" $30
"\t" $31 "\t" $32 "\t" $33 "\t" $34
     }
 }
END { }
0
Reply trevor 5/4/2004 4:26:02 PM

In article <b289840b.0405040826.78acfead@posting.google.com>,
Trevor Vance <trevor.vance@verizon.net> wrote:

% File1 has 3 fields, File2 has 34 fields.  The first 2 fields in each
% file are the same.  I needed to insert File1.Field3 and 4 new fields
% after File2.Field2 into the output.

[...]

% BEGIN { FS="\t" }
%  { Index[NR]=$1 $2; Field3[NR]=$3

It's better to have something like this:

 FNR == NR { Field3[$1,$2] = $3; next }

after that, ever record will be from file 2. You can do something
likethis

  ($1 FIELDSEP $2) in Field3 {
    TotMsgCount = Field3[$1, $2]
    NonSpamMsgCount = TotMsgCount - $11

    if ( TotMsgCount == 0 )
      { NonSpamPct = 0; SpamPct = 0 }
    else
      { NonSpamPct = int(NonSpamMsgCount/TotMsgCount*1000)/1000;
        SpamPct = int($11/TotMsgCount*1000)/1000 }


    printf $1, $2, TotMsgCount, NonSpamMsgCount, NonSpamPct, SpamPct
    for (i = 3; i <= NF; i++)
       printf "\t%s", $i
    printf "\n"
  }

  BEGIN { OFS = "\t" } # makes print x,y,z work

-- 

Patrick TJ McPhee
East York  Canada
ptjm@interlog.com
0
Reply ptjm 5/5/2004 4:15:59 AM


Trevor Vance wrote:
> Ed Morton <morton@lsupcaemnt.com> wrote in message > 
> 
>>awk 'NR == FNR { fld1[$1]=""; fld2[$2]=""; next }
>>	! ($2 in fld1 && $3 in fld2) { print }' FileB FileA
>>
> 
> Unbelievable timing.  I was going to post a very similar question. 
> This gave me enough ideas to get my task done.  I had to take two
> tab-delimited files, combine them and do some math with certain fields
> (which has now led me to a rounding discrepency --but that's a
> different post).
> 
> File1 has 3 fields, File2 has 34 fields.  The first 2 fields in each
> file are the same.

Exaclty the same? i.e. does each file have the same number of lines in 
the same field1/field2 order? If that is the case then you can simplify 
your script as below.

   I needed to insert File1.Field3 and 4 new fields
> after File2.Field2 into the output.
> 
> Even though this scripts works for me, I'd appreciate some comments on
> it, as I'm always trying to learn new things.  Incidentally, File1 &
> File2 are created from 2 other awk scripts processed against 2 input
> files.  So a shell script executes the awk scripts in order, creating
> the appropriate output files for input into this script, then emails
> the final results to me.
> 
>   -  Trevor
> 
> gawk script3.awk File1 File2
> 
> BEGIN { FS="\t" }
>  { Index[NR]=$1 $2; Field3[NR]=$3
>    if ( NR > FNR && Index[FNR]==$1 $2 )
>      { TotMsgCount=Field3[FNR]
>        NonSpamMsgCount=Field3[FNR]-$11
>        if ( TotMsgCount == 0 )
>          { NonSpamPct = 0; SpamPct = 0 }
>        else
>          { NonSpamPct = int(NonSpamMsgCount/TotMsgCount*1000)/1000;
> SpamPct = int($11/TotMsgCount*1000)/1000 }
>        print $1 "\t" $2 "\t" TotMsgCount "\t" NonSpamMsgCount "\t"
> NonSpamPct "\t" SpamPct "\t \t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7
> "\t" $8 "\t" $9 "\t" $10 "\t" $11 "\t" $12 "\t" $13 "\t" $14 "\t" $15
> "\t" $16 "\t" $17 "\t" $18 "\t" $19 "\t" $20 "\t" $21 "\t" $22 "\t"
> $23 "\t" $24 "\t" $25 "\t" $26 "\t" $27 "\t" $28 "\t" $29 "\t" $30
> "\t" $31 "\t" $32 "\t" $33 "\t" $34
>      }
>  }
> END { }

Assuming exactly the sanme field1/field2 line ordering, this should do 
the same thing (untested):

gawk 'BEGIN{FS="\t";OFS=FS}
  NR != FNR { Field3[FNR]=$3; next }
      { TotMsgCount=Field3[FNR]
        NonSpamMsgCount=Field3[FNR]-$11
        if ( TotMsgCount == 0 )
          { NonSpamPct = 0; SpamPct = 0 }
        else
          { NonSpamPct = int(NonSpamMsgCount/TotMsgCount*1000)/1000
            SpamPct = int($11/TotMsgCount*1000)/1000
            # you can do the printing this way
            # regardless of input line ordering:
	   $2 = $2 OFS TotMsgCount OFS NonSpamMsgCount OFS NonSpamPct OFS 
SpamPct OFS " " OFS
            print
          }
     }'

Regards,

	Ed.


0
Reply Ed 5/6/2004 8:39:59 PM

4 Replies
252 Views

(page loaded in 0.127 seconds)

Similiar Articles:













7/23/2012 7:31:52 PM


Reply: