Thread: What's the fastest way to do this?

What's the fastest way to do this?

From

Orion

Date:

08 November 2001, 15:28:08

I have several really big tables that have columns uniquely identified by
single or multiple rows.  [ I have about 25 tables, 10k to 500k rows
per table ]

Each day I get a flat file of updates.  I have no way of knowing which
lines in the file are new records and which are updates for existing
records.

I need a way to insert the new ones and update the old ones.   I have
a couple of ideas but none of them seem fast enough ( I will soon
be getting updates faster than I can feed them into the database ).

I am running postgres 7.1.3-1PDGD.

Here are my ideas...

---------------------------------------------------------

1)

select count(*) from table where id = X;
(if the count == 1)
update table set a = 1, b = 2, c = 3 where id = X;

2)

insert into table (id, a, b, c) values (X, 1, 2, 3);
(if there is an exception b/c of duplicate keys)
update table set a = 1, b = 2, c = 3 where id = X;

3)

delete from table where id = X;
(if the record is new nothing happens)
insert into table (id, a, b, c) values (X, 1, 2, 3);

----------------------------------------------------------

right now I am doing (3) but it is very slow - I
am getting about 8 records/second through and
yes my id column is indexed.  The other problem
with (3) is it requires more vacuuming.

I cant imagine that (1) would be any faster.

It would seem that (2) would be the fastest but some
of my tables key on multiple columns
(where id1 = X and id2 = Y) and thus cant use a primary
key to raise an exception.

I cant imagine I am the first person to try this...
What's the standard/best/fastest way to do this.

Thanks in advance for any advice!

        Orion

Re: What's the fastest way to do this?

From

Andrew Gould

Date:

08 November 2001, 17:18:20

Are you updating directly from the flat file?  Or are
you copying the new data into an indexed table that is
emptied and vacuumed after daily updates, and
updating/inserting from there?

Andrew Gould

--- Orion <o2@trustcommerce.com> wrote:
>
> I have several really big tables that have columns
> uniquely identified by
> single or multiple rows.  [ I have about 25 tables,
> 10k to 500k rows
> per table ]
>
> Each day I get a flat file of updates.  I have no
> way of knowing which
> lines in the file are new records and which are
> updates for existing
> records.
>
> I need a way to insert the new ones and update the
> old ones.   I have
> a couple of ideas but none of them seem fast enough
> ( I will soon
> be getting updates faster than I can feed them into
> the database ).
>
> I am running postgres 7.1.3-1PDGD.
>
> Here are my ideas...
>
>
---------------------------------------------------------
>
> 1)
>
> select count(*) from table where id = X;
> (if the count == 1)
> update table set a = 1, b = 2, c = 3 where id = X;
>
> 2)
>
> insert into table (id, a, b, c) values (X, 1, 2, 3);
> (if there is an exception b/c of duplicate keys)
> update table set a = 1, b = 2, c = 3 where id = X;
>
> 3)
>
> delete from table where id = X;
> (if the record is new nothing happens)
> insert into table (id, a, b, c) values (X, 1, 2, 3);
>
>
----------------------------------------------------------
>
> right now I am doing (3) but it is very slow - I
> am getting about 8 records/second through and
> yes my id column is indexed.  The other problem
> with (3) is it requires more vacuuming.
>
> I cant imagine that (1) would be any faster.
>
> It would seem that (2) would be the fastest but some
> of my tables key on multiple columns
> (where id1 = X and id2 = Y) and thus cant use a
> primary
> key to raise an exception.
>
> I cant imagine I am the first person to try this...
> What's the standard/best/fastest way to do this.
>
> Thanks in advance for any advice!
>
>         Orion
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster


__________________________________________________
Do You Yahoo!?
Find a job, post your resume.
http://careers.yahoo.com

Re: What's the fastest way to do this?

From

Orion

Date:

08 November 2001, 18:18:19

Andrew Gould wrote:

> Are you updating directly from the flat file?  Or are
> you copying the new data into an indexed table that is
> emptied and vacuumed after daily updates, and
> updating/inserting from there?
>
> Andrew Gould

Well to be honest I'll do it any way I can to make this go faster.

Right now I am updating directly from said file. For each line I read from
the file I need to insert or update the exsisting table depending on
weather or not the record is already there.

        Orion

Re: What's the fastest way to do this?

From

Einar Karttunen

Date:

09 November 2001, 01:58:23

On Thu, Nov 08, 2001 at 11:58:49AM -0800, Orion wrote:
>
> I have several really big tables that have columns uniquely identified by
> single or multiple rows.  [ I have about 25 tables, 10k to 500k rows
> per table ]
>
> Each day I get a flat file of updates.  I have no way of knowing which
> lines in the file are new records and which are updates for existing
> records.
>
> I need a way to insert the new ones and update the old ones.   I have
> a couple of ideas but none of them seem fast enough ( I will soon
> be getting updates faster than I can feed them into the database ).
>
Hello

I was facing a similar problem some time ago. My solution was to create
a temp table and COPY the new data to it. After that I deleted all records
in the original table which existed in the temporary table. Then I just
did a insert from a select * from the temp table. Of course with this
approach you have to lock the tables.

- Einar Karttunen

Re: What's the fastest way to do this?

From

"Colin 't Hart"

Date:

09 November 2001, 04:18:28

Einar Karttunen writes:

> > I have several really big tables that have columns uniquely identified
by
> > single or multiple rows.  [ I have about 25 tables, 10k to 500k rows
> > per table ]
> >
> > Each day I get a flat file of updates.  I have no way of knowing which
> > lines in the file are new records and which are updates for existing
> > records.
> >
> > I need a way to insert the new ones and update the old ones.   I have
> > a couple of ideas but none of them seem fast enough ( I will soon
> > be getting updates faster than I can feed them into the database ).
>
> I was facing a similar problem some time ago. My solution was to create
> a temp table and COPY the new data to it. After that I deleted all records
> in the original table which existed in the temporary table. Then I just
> did a insert from a select * from the temp table. Of course with this
> approach you have to lock the tables.
>
> - Einar Karttunen

Even faster is to:

1. load the data into a temporary table (even faster is to load into a
permanent
table -- just truncate it first in each run). Let's call this table
ImportTable.

2. update the existing records
   UPDATE rt
     SET a = t.a, b = x.b, c = x.c
     FROM RealTable AS rt, ImportTable AS x
     WHERE rt.pk = x.pk

3. insert the new records
   INSERT INTO RealTable(pk, a, b, c)
     SELECT pk, a, b, c
     FROM ImportTable AS x
     WHERE NOT EXISTS (SELECT 1 FROM RealTable WHERE pk = x.pk)

'pk' is the primary key of the table, or, at worst, a UNIQUEly INDEXed
column
combination.

This avoids the expensive DELETE operation (DBMSs are generally better at
INSERTing and UPDATEing than DELETEing, and remember that the DELETEs have
to
go through the transaction log).

Don't use cursors if you can help it -- cursors can be up to several orders
of
magnitude slower, and usually at least 4 times slower.

Using an import table allows you to sanitize the data by insert a step to do
this between steps 1 and 2.

Cheers,

Colin

Re: What's the fastest way to do this?

From

Orion

Date:

12 November 2001, 14:59:23

I tried your idea but I cant get it to work.

Here's the SQL I used:


CREATE TEMP table mfps_action_codes_394_tmp (
        code    text,
        description     text);
INSERT into mfps_action_codes_394_tmp (code,description)
        VALUES ('$','Dun Notice Printed');
INSERT into mfps_action_codes_394_tmp (code,description)
        VALUES ('&','Continuity Speedup');
INSERT into mfps_action_codes_394_tmp (code,description)
        VALUES ('*','Expiring CC Notification-Conty');
INSERT into mfps_action_codes_394_tmp (code,description)
        VALUES ('0','Return Authorization');
UPDATE mfps_action_codes_394
        SET description = x.description
        FROM mfps_action_codes_394 AS rt, mfps_action_codes_394_tmp
        AS x WHERE rt.code = x.code;
INSERT INTO mfps_action_codes_394 (code,description)
        SELECT code,description FROM mfps_action_codes_394_tmp
        AS x
        WHERE NOT EXISTS (SELECT 1 FROM mfps_action_codes_394
                WHERE code = x.code);

What ends up happening is that the UPDATE will change EVERY exsisting
description to 'Dun Notice Printed'

I can't find any documentation as to how to use the FROM keyword on the
UPDATE command lest I'm sure I'd be able to figure this out myself.

> Even faster is to:
>
> 1. load the data into a temporary table (even faster is to load into a
> permanent
> table -- just truncate it first in each run). Let's call this table
> ImportTable.
>
> 2. update the existing records
>    UPDATE rt
>      SET a = t.a, b = x.b, c = x.c
>      FROM RealTable AS rt, ImportTable AS x
>      WHERE rt.pk = x.pk
>
> 3. insert the new records
>    INSERT INTO RealTable(pk, a, b, c)
>      SELECT pk, a, b, c
>      FROM ImportTable AS x
>      WHERE NOT EXISTS (SELECT 1 FROM RealTable WHERE pk = x.pk)
>
>
> 'pk' is the primary key of the table, or, at worst, a UNIQUEly INDEXed
> column
> combination.
>
> This avoids the expensive DELETE operation (DBMSs are generally better at
> INSERTing and UPDATEing than DELETEing, and remember that the DELETEs have
> to
> go through the transaction log).
>
> Don't use cursors if you can help it -- cursors can be up to several
> orders of
> magnitude slower, and usually at least 4 times slower.
>
> Using an import table allows you to sanitize the data by insert a step to
> do this between steps 1 and 2.
>
>
> Cheers,
>
> Colin

Re: What's the fastest way to do this?

From

Orion

Date:

12 November 2001, 15:09:43

Ok, I've boiled this down to an easy to run test...

DROP TABLE test;

CREATE TABLE test (
        code    int4,
        description     text);

INSERT INTO test (code,description) VALUES (1,'OLD VALUE 1');
INSERT INTO test (code,description) VALUES (2,'OLD VALUE 2');

CREATE TEMP TABLE test_tmp (
        code    int4,
        description     text);

INSERT INTO test_tmp (code,description) VALUES (1,'NEW VALUE 1');
INSERT INTO test_tmp (code,description) VALUES (2,'NEW VALUE 2');
INSERT INTO test_tmp (code,description) VALUES (3,'NEW ENTRY 1');
INSERT INTO test_tmp (code,description) VALUES (4,'NEW ENTRY 2');

UPDATE test
        SET description = x.description
        FROM test AS rt, test_tmp AS x
        WHERE rt.code = x.code;

INSERT INTO test (code,description)
        SELECT code,description
        FROM test_tmp AS x
        WHERE NOT EXISTS
        (SELECT 1 FROM test WHERE code = x.code);

SELECT * FROM test;

And here's the results I get from that test

DROP
CREATE
INSERT 1894322 1
INSERT 1894323 1
CREATE
INSERT 1894350 1
INSERT 1894351 1
INSERT 1894352 1
INSERT 1894353 1
UPDATE 2
INSERT 0 2
 code | description
------+-------------
    1 | NEW VALUE 1
    2 | NEW VALUE 1
    3 | NEW ENTRY 1
    4 | NEW ENTRY 2
(4 rows)

And as far as I understand it I should be getting the following

 code | description
------+-------------
    1 | NEW VALUE 1
    2 | NEW VALUE 2
    3 | NEW ENTRY 1
    4 | NEW ENTRY 2

For some reason the update is updating ALL the old records to be
'NEW VALUE 1'.

So (A) is this a bug or (B) do I totally misunderstand that UPDATE statement
above.  I appears that it should update each row in table 'test' with the
description corresponding to the code in test_tmp.  If this is not the
proper way to do this, what is?

Re: What's the fastest way to do this?

From

Stephan Szabo

Date:

12 November 2001, 15:59:30

On Mon, 12 Nov 2001, Orion wrote:

> UPDATE test
>         SET description = x.description
>         FROM test AS rt, test_tmp AS x
>         WHERE rt.code = x.code;

> So (A) is this a bug or (B) do I totally misunderstand that UPDATE statement
> above.  I appears that it should update each row in table 'test' with the
> description corresponding to the code in test_tmp.  If this is not the
> proper way to do this, what is?

The above update isn't doing what you're expecting.  You're updating
test, you don't need the test as rt in the from unless you are attempting
a self-join.

Try:
update test
 set description = x.description
 from test_tmp as x
 where test.code=x.code;

Re: What's the fastest way to do this?

From

Tom Lane

Date:

12 November 2001, 16:29:04

Orion <o2@trustcommerce.com> writes:
> UPDATE test
>         SET description = x.description
>         FROM test AS rt, test_tmp AS x
>         WHERE rt.code = x.code;

This command is a three-way join between test AS test (the target),
test AS rt, and test_tmp AS x.  Almost certainly not what you want,
especially considering that the target table is not constrained at
all by the WHERE clause.

            regards, tom lane

Re: What's the fastest way to do this?

From

Carl van Tast

Date:

13 November 2001, 05:59:33

On Mon, 12 Nov 2001 11:58:27 -0800, Orion <o2@trustcommerce.com>
wrote:
>[...]
>UPDATE test
>        SET description = x.description
>        FROM test AS rt, test_tmp AS x
>        WHERE rt.code = x.code;
This seems to work:
UPDATE test SET description = x.description
        FROM test_tmp x
        WHERE test.code = x.code;

Kind regards
 Carl van Tast