Re: Bulkloading using COPY - ignore duplicates? - Mailing list pgsql-hackers

From: Lee Kindness
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date:
Msg-id: 15290.57850.68004.391312@elsick.csl.co.uk
In response to: Re: Bulkloading using COPY - ignore duplicates? (Peter Eisentraut <peter_e@gmx.net>)
Responses: Bulkloading using COPY - ignore duplicates?; Re: Bulkloading using COPY - ignore duplicates?
List: pgsql-hackers
Peter Eisentraut writes:
> However, it seems to me that COPY ignoring duplicates can easily be
> done by preprocessing the input file.

Or by post-processing, like (error checking cut):
void import_shots(char *impfile, int lineshoot_id)
{
    EXEC SQL BEGIN DECLARE SECTION;
    char cmd[1024];     /* host variable for the SQL text; declaration trimmed from the original post */
    EXEC SQL END DECLARE SECTION;
    char tab_name[128];
    char tab_temp[128];

    frig_file(impfile); /* add the postgres header */
    sprintf(tab_name, "shot_%d", lineshoot_id);
    sprintf(tab_temp, "shot_%d_tmp", lineshoot_id);

    sprintf(cmd, "CREATE TEMPORARY TABLE %s AS SELECT * FROM shot", tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;
    EXEC SQL COMMIT WORK;       /* will not work without a commit here! */

    sprintf(cmd, "COPY BINARY %s FROM '%s'", tab_temp, impfile);
    append_page_alloc(cmd, tab_name, impfile, 1);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

    sprintf(cmd, "INSERT INTO %s SELECT DISTINCT ON (shot_time) * FROM %s", tab_name, tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

    sprintf(cmd, "DROP TABLE %s", tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

    EXEC SQL COMMIT WORK;
    remove(impfile);
}

However, this adds significant time to the import operation. Likewise, I
could loop over the input file first and hunt for duplicates (roughly as
sketched below), again with a performance hit.
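
For what it's worth, that pre-scan idea (which is essentially Peter's
preprocessing suggestion) would look roughly like the sketch below. This is
only an illustration, not code from our system: it assumes a plain
tab-delimited text dump keyed on its first column (shot_time in our schema)
rather than the COPY BINARY file used above, and dedupe_file/first_time are
made-up helper names.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536          /* arbitrary; sized for a few million shots */

struct seen {
    char        *key;
    struct seen *next;
};

static struct seen *buckets[NBUCKETS];

static unsigned int hash_key(const char *s)
{
    unsigned int h = 5381;

    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % NBUCKETS;
}

/* Returns 1 the first time a key is seen, 0 on any repeat. */
static int first_time(const char *key)
{
    unsigned int h = hash_key(key);
    struct seen *p;

    for (p = buckets[h]; p != NULL; p = p->next)
        if (strcmp(p->key, key) == 0)
            return 0;

    p = malloc(sizeof(*p));
    p->key = strdup(key);
    p->next = buckets[h];
    buckets[h] = p;
    return 1;
}

/* Copy infile to outfile, keeping only the first row for each key. */
int dedupe_file(const char *infile, const char *outfile)
{
    FILE  *in = fopen(infile, "r");
    FILE  *out = fopen(outfile, "w");
    char   line[8192];          /* assumes no row is longer than this */
    char   key[256];
    size_t n;

    if (in == NULL || out == NULL)
        return -1;

    while (fgets(line, sizeof(line), in) != NULL)
    {
        /* key = text up to the first tab (shot_time in this schema) */
        n = strcspn(line, "\t\n");
        if (n >= sizeof(key))
            n = sizeof(key) - 1;
        memcpy(key, line, n);
        key[n] = '\0';

        if (first_time(key))
            fputs(line, out);   /* keep first occurrence only */
    }

    fclose(in);
    fclose(out);
    return 0;                   /* seen-table memory deliberately not freed */
}

Either way the whole file gets read and hashed an extra time, which is
exactly the overhead I would like to avoid.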

My main point is that Postgres already checks for duplicates during the
COPY (that is how the error arises today), and it would add zero execution
time to simply ignore these duplicate rows instead of aborting. Obviously
this is a useful feature, otherwise Oracle, Ingres and other commercial
relational databases wouldn't offer similar functionality.

Yes, in an ideal world the input to COPY should be clean and consistent
with the defined indexes. However, that is only really the case when COPY
is used for database/table backup and restore. It misses the point that a
major use of COPY is as a speed optimisation for bulk inserts...

Lee.

