Thread: Performance issue with cross table updates

Performance issue with cross table updates

From: Craig Gibson
Hi all

I am no database wizard so I am hoping someone may be able to assist me :)

I get a daily CSV file of 6.5 million records. I create a temporary
table and COPY them in. On completion I create an index on the mdnid
column. This column is also indexed in table 2. This part is very
fast. I had some 'checkpoint too often' issues, but that I have
resolved.

I then use the following procedure to update all the records, and if a
record does not exist, insert it instead:

declare
    v_update e_lookup_load%ROWTYPE;

begin

-- walk the freshly loaded staging table row by row
for v_update in select * from e_lookup_load loop

-- try to update the matching live row first
update e_lookup set
       cur_balance    = v_update.cur_balance,
       last_usage     = v_update.last_usage,
       b_code         = v_update.b_code,
       u_date         = v_update.u_date,
       date_of_insert = v_update.date_of_insert
       where mdnid = v_update.mdnid;

-- no row matched: insert it instead
if NOT FOUND then
    insert into e_lookup(mdnid,cur_balance,last_usage,b_code,u_date,date_of_insert)
    values
    (v_update.mdnid,
    v_update.cur_balance,
    v_update.last_usage,
    v_update.b_code,
    v_update.u_date,
    v_update.date_of_insert);
end if;
end loop;

end;

From my understanding, a FOR loop is implemented with a cursor anyway,
so there is no need to declare one explicitly. Am I fundamentally doing
something wrong, as the operation is slow as molasses? Maybe there is a
better way altogether that I have not thought of? The bottom line is
that at no point can the e_lookup table be unavailable to clients, else
I would have just done a simple drop and rename post the COPY.

Kind Regards
Craig


Re: Performance issue with cross table updates

From: "David Johnston"
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Craig Gibson
> Sent: Monday, September 10, 2012 12:34 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Performance issue with cross table updates
>
> Hi all
>
> I am no database wizard so I am hoping someone may be able to assist me :)
>
> I get a daily CSV file of 6.5 million records. I create a temporary
> table and COPY them in. On completion I create an index on the mdnid
> column. This column is also indexed in table 2. This part is very
> fast. I had some 'checkpoint too often' issues, but that I have
> resolved.
>
> I then use the following procedure to update all the records, and if a
> record does not exist, insert it instead:
>
> ................................
>
> From my understanding, a FOR loop is implemented with a cursor anyway,
> so there is no need to declare one explicitly. Am I fundamentally doing
> something wrong, as the operation is slow as molasses? Maybe there is a
> better way altogether that I have not thought of? The bottom line is
> that at no point can the e_lookup table be unavailable to clients, else
> I would have just done a simple drop and rename post the COPY.
>

What version of PostgreSQL are you using?  The newer your version of
PostgreSQL, the more efficient your options for solving this become
(writable CTEs, mostly; see the sketch at the end of this message).

That said, you really need to use "set logic" and not procedural logic.  In
this situation I would simply add another column to the import staging table
called "exists_on_master" or something similar.

I would then:

UPDATE staging SET exists_on_master = TRUE
FROM live WHERE staging.id = live.id;

I would then issue a:

UPDATE live SET  .... FROM staging WHERE staging.id = live.id;

Then I would execute:

INSERT INTO live (...) SELECT ... FROM staging WHERE NOT exists_on_master;
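
Spelled out against your actual tables (I am assuming the staging table
is called e_lookup_load and that mdnid is the join key, per your
procedure), the three steps would look roughly like this:

-- after creating the staging table: add the flag column, false by default
ALTER TABLE e_lookup_load ADD COLUMN exists_on_master boolean NOT NULL DEFAULT FALSE;

-- 1. mark staging rows that already exist in the live table
UPDATE e_lookup_load s SET exists_on_master = TRUE
FROM e_lookup l WHERE s.mdnid = l.mdnid;

-- 2. set-based update of all matching live rows
UPDATE e_lookup l
SET cur_balance    = s.cur_balance,
    last_usage     = s.last_usage,
    b_code         = s.b_code,
    u_date         = s.u_date,
    date_of_insert = s.date_of_insert
FROM e_lookup_load s
WHERE l.mdnid = s.mdnid;

-- 3. insert the staging rows that had no match
INSERT INTO e_lookup (mdnid, cur_balance, last_usage, b_code, u_date, date_of_insert)
SELECT mdnid, cur_balance, last_usage, b_code, u_date, date_of_insert
FROM e_lookup_load
WHERE NOT exists_on_master;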

You don't really need the extra boolean on the staging table but it makes
the subsequent queries more explicit.  Though for 6.5 million records you
may want to try a couple of variations of this to see which performs best.

I am presuming that the live table does not change other than by this
synchronization query so you really shouldn't have any concurrency issues.

You should not be attempting to UPDATE a record that does not exist.
Identify all the ones you know will fail in advance and just perform the
insert.
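
If you are on 9.1 or later, a writable CTE can also fold the update and
the insert into a single statement. A sketch using the column names from
your procedure (adjust as needed):

WITH updated AS (
    UPDATE e_lookup l
    SET cur_balance    = s.cur_balance,
        last_usage     = s.last_usage,
        b_code         = s.b_code,
        u_date         = s.u_date,
        date_of_insert = s.date_of_insert
    FROM e_lookup_load s
    WHERE l.mdnid = s.mdnid
    RETURNING l.mdnid
)
-- rows updated above are excluded from the insert
INSERT INTO e_lookup (mdnid, cur_balance, last_usage, b_code, u_date, date_of_insert)
SELECT s.mdnid, s.cur_balance, s.last_usage, s.b_code, s.u_date, s.date_of_insert
FROM e_lookup_load s
WHERE NOT EXISTS (SELECT 1 FROM updated u WHERE u.mdnid = s.mdnid);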

David J.


Re: Performance issue with cross table updates

From: Tom Lane
Craig Gibson <craiggib@gmail.com> writes:
> I get a daily CSV file of 6.5 million records. I create a temporary
> table and COPY them in. On completion I create an index on the mdnid
> column. This column is also indexed in table 2. This part is very
> fast. I had some 'checkpoint too often' issues, but that I have
> resolved.

FWIW, you probably should also issue an ANALYZE on the temp table after
you've loaded it.  If you skip that, the planner is working blind as to
column statistics and is quite likely to choose inappropriate plans.
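
Concretely, the tail end of the load step would just gain one line (the
file path here is a placeholder):

COPY e_lookup_load FROM '/path/to/daily.csv' WITH (FORMAT csv);
CREATE INDEX e_lookup_load_mdnid_idx ON e_lookup_load (mdnid);
ANALYZE e_lookup_load;  -- temp tables are never analyzed automatically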

I concur with the other comment that you might be able to make this a
lot faster if you could convert it into a couple of multiple-row
commands instead of doing a manual loop.  But you'd need up-to-date
statistics to get a decent plan for that, too ...

            regards, tom lane


Re: Performance issue with cross table updates

From: Jeff Janes
On Mon, Sep 10, 2012 at 9:34 AM, Craig Gibson <craiggib@gmail.com> wrote:
> Hi all
>
> I am no database wizard so I am hoping someone may be able to assist me :)
>
> I get a daily CSV file of 6.5 million records. I create a temporary
> table and COPY them in. On completion I create an index on the mdnid
> column. This column is also indexed in table 2. This part is very
> fast. I had some 'checkpoint too often' issues, but that I have
> resolved.
>
> I then use the following procedure to update all the records, and if a
> record does not exist, insert it instead:
>
...
>
> From my understanding, a FOR loop is implemented with a cursor anyway,
> so there is no need to declare one explicitly. Am I fundamentally doing
> something wrong, as the operation is slow as molasses?

How big is the parent table?  Are you CPU limited or IO limited?

If you are not CPU limited, then I would guess that the indexes on the
parent table do not fit in RAM or shared_buffers and that maintaining
the indexes on the parent table during the updates/inserts is the
bottleneck.
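
A quick way to check (the index name is a guess; \di in psql will show
the real one):

SELECT pg_size_pretty(pg_relation_size('e_lookup_mdnid_idx'));  -- index size on disk
SHOW shared_buffers;                                            -- cache PostgreSQL manages itself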

> Maybe there is a better way altogether
> that I have not thought of? The bottom line is that at no point can
> the e_lookup table be unavailable to clients,

Does it always have to be available for updates, or just for selects?

> else I would have just
> done a simple drop and rename post the COPY.

Maybe you can just do an atomic rename.
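
DDL is transactional in PostgreSQL, so the swap itself can be atomic. A
sketch, assuming the daily data is loaded into a regular (non-temporary)
table named e_lookup_new with the same structure and indexes:

BEGIN;
ALTER TABLE e_lookup RENAME TO e_lookup_old;
ALTER TABLE e_lookup_new RENAME TO e_lookup;
COMMIT;
-- readers block only while the renames hold their locks;
-- drop e_lookup_old once nothing references it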

Cheers,

Jeff