Re: Performance issue with cross table updates - Mailing list pgsql-general

From David Johnston
Subject Re: Performance issue with cross table updates
Date
Msg-id 00f801cd8f7b$5e5e2530$1b1a6f90$@yahoo.com
Whole thread Raw
In response to Performance issue with cross table updates  (Craig Gibson <craiggib@gmail.com>)
List pgsql-general
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Craig Gibson
> Sent: Monday, September 10, 2012 12:34 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Performance issue with cross table updates
>
> Hi all
>
> I am no database wizard so I am hoping someone may be able to assist me :)
>
> I get a daily CSV file of 6.5 million records. I create a temporary table
and
> COPY them in. On completion I create an index on the mdnid column. This
> column is also indexed in table 2. This part is very fast. I had some
'checkpoint
> too often' issues, but that I have resolved.
>
> I then use the following procedure to update all the records, and if a
record
> does not exist, insert it instead:
>
> ................................
>
> From my understanding, a for loop is encapsulated in a cursor anyway so no
> need to do that. Am I fundamentally doing something wrong as the
> operation is slow as molasses? Maybe there is a better way altogether that
I
> have not thought of? The bottom line is that at no point can the e_lookup
> table be unavailable to clients, else I would have just done a simple drop
and
> rename post the COPY.
>

What version of PostgreSQL are you using.  Your ability to solve this in
more efficient ways is greatly enhanced the newer the version of PostgreSQL
you are using (writable CTE mostly).

That said, you really need to use "set logic" and not procedural logic.  In
this situation I would simply add another column to the import staging table
called "exists_on_master" or something similar.

I would then:

UPDATE staging SET exists_on_master = TRUE FROM live WHERE staging.id =
live.id;

I would then issue a:

UPDATE live SET  .... FROM staging WHERE staging.id = live.id;

Then I would execute:

INSERT INTO live (...) SELECT ... FROM staging WHERE NOT exists_on_master;

You don't really need the extra boolean on the staging table but it makes
the subsequent queries more explicit.  Though for 6.5 million records you
may want to try a couple of variations of this to see which performs best.

I am presuming that the live table does not change other than by this
synchronization query so you really shouldn't have any concurrency issues.

You should not be attempting to UPDATE a record that does not exist.
Identify all the ones you know will fail in advance and just perform the
insert.

David J.






pgsql-general by date:

Previous
From: Craig Gibson
Date:
Subject: Performance issue with cross table updates
Next
From: Tom Lane
Date:
Subject: Re: Performance issue with cross table updates