preprocess a csv file to remove commas



A recent forum question I answered prompted me to post this note - the requirement was that the external table needed to have a number type column but the underlying data had commas in which meant it would not work. I initially thought some kind of number format could be applied in the external table definition but that did not work and the answer was to use a preprocessor - so here are the steps:

The csv file contains this data:

fielda|fieldb
somedata|1,000
somedata|1,000


We need to create a simple script that will strip out commas - for this we can use sed

I created a script (test.sh) with the following content

/bin/sed 's/,//g' $1

Note full path to sed is important as is the $1 and the file must have execute permissions set

So to now use this shell script we refer to it in the external table defintion as a preprocessor - so before the csv file is read the shell script runs against it - this does not alter the original file but the result of the shell script on the file is what is fed into oracle.

So to create the external table we now run:

DROP TABLE XT_TEST;

CREATE TABLE XT_test
                    ( fielda   varchar2(10),
                      fieldb   number
                    )
ORGANIZATION EXTERNAL
  (  TYPE ORACLE_LOADER
     DEFAULT DIRECTORY DATA_PUMP_DIR
     ACCESS PARAMETERS
       ( records delimited by newline

PREPROCESSOR DATA_PUMP_DIR:'test.sh'           badfile      DATA_PUMP_DIR:'testcsv.bad'
           logfile     DATA_PUMP_DIR:'testcsv.log'
           discardfile DATA_PUMP_DIR:'testcsv.dsc'
           skip 1
           fields  terminated by '|'
           lrtrim
         missing field values are null
                   ( FIELDA                                 CHAR(20) ,
                     FIELDB   
                    )
          )
     LOCATION (DATA_PUMP_DIR:'test.csv')
  )
REJECT LIMIT UNLIMITED NOMONITORING
/


Now when we fetch from the table the data is there (with no commas) and the table has a column of type number

SYS@EMZA3>select * from xt_test;

FIELDA         FIELDB
---------- ----------
somedata         1000
somedata         1000

SYS@EMZA3>desc xt_test
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 FIELDA                                             VARCHAR2(10)
 FIELDB                                             NUMBER


Comments