Re: Bulkloading using COPY - ignore duplicates? - Mailing list pgsql-hackers

From Lee Kindness
Subject Re: Bulkloading using COPY - ignore duplicates?
Date
Msg-id 15391.26134.348061.302157@elsick.csl.co.uk
Whole thread Raw
In response to Re: Bulkloading using COPY - ignore duplicates?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Bulkloading using COPY - ignore duplicates?
List pgsql-hackers
Tom Lane writes:> Lee Kindness <lkindness@csl.co.uk> writes:> > You're right - I was meaning 'SELECT DISTINCT ON ()'.
HoweverI'm only> > using it as an example of where the database is choosing (be it> > randomly) the data to discarded.>
Nota good example to support your argument.  The entire point of> DISTINCT ON (imho) is that the rows that are kept or
discardedare> *not* random, but can be selected by the user by specifying additional> sort columns.  DISTINCT ON would
bepretty useless if it weren't for> that flexibility.  The corresponding concept in COPY will need to> provide flexible
meansfor deciding which row to keep and which to> drop, else it'll be pretty useless.
 

At which point it becomes quicker to resort to INSERT...

Here's the crux question - how can I get management to go with
PostgreSQL when a core operation (import of data into a transient
database) is at least 6 times slower than the current version?

With a lot of work investigating the incoming data, the number of
incoming duplicates has been massively reduced by fixing/tackling at
source. However rouge values do still crop up (the data originates
from a real-time system with multiple hardware inputs from multiple
hardware vendors) and when they do (even just 1) the performance dies.
Add to this terrabytes of legacy data...

While you may see the option of ignoring duplicates in COPY as 'pretty
useless', it obviously has its place/use otherwise every other
database system wouldn't have support for it! (not that following the
pack is always a good idea)

In an ideal world 'COPY FROM' would only be used with data output by
'COPY TO' and it would be nice and sanitised. However in some fields
this often is not a possibility due to performance constraints!

Best regards,

-- Lee Kindness, Senior Software Engineer, Concept Systems Limited.http://services.csl.co.uk/
http://www.csl.co.uk/


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: FreeBSD/alpha
Next
From: Tom Lane
Date:
Subject: Re: Bulkloading using COPY - ignore duplicates?