Thread: COPY equivalent for updates

COPY equivalent for updates

From
Ivan Sergio Borgonovo
Date:
Is there a COPY equivalent for updates?

eg I've

create table t1 ( id int primary key, col1 int, col2 int, col3 varchar(32)
);

and a CSV file
10,4,5,"abc"
13,7,3,"def"
18,12,77,"ghi"

I'd like to

UPDATE t1 (col1, col2, col3) from file with @1 as primary key;
or
UPDATE t1 (col1, col2, col3) from file where @1=id;

sort of...

Otherwise what is the fastest approach?


I can think of 2 approaches:
1)load a temp table with COPY

update t1 set col1=temp_t1.col1, col2=temp_t1.col2 where t1.id=temp_t1.id;

2) use awk to generate update statements.

Supposing I could neglect the awk execution time, will COPY + UPDATE
be faster than executing a list of UPDATE?

Considering I've to deal with a where clauses anyway... when (and
if) should I create an index on the id of temp_t1?
t1 will contain 700-1M records while I may update a maximum of 20K a
time.

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it



Re: COPY equivalent for updates

From
Decibel!
Date:
On Jul 15, 2008, at 1:10 AM, Ivan Sergio Borgonovo wrote:
> I'd like to
>
> UPDATE t1 (col1, col2, col3) from file with @1 as primary key;
> or
> UPDATE t1 (col1, col2, col3) from file where @1=id;
>
> sort of...

Sorry, there's nothing like COPY for UPDATE.

> Otherwise what is the fastest approach?
>
>
> I can think of 2 approaches:
> 1)load a temp table with COPY
>
> update t1 set col1=temp_t1.col1, col2=temp_t1.col2
>   where t1.id=temp_t1.id;
>
> 2) use awk to generate update statements.
>
> Supposing I could neglect the awk execution time, will COPY + UPDATE
> be faster than executing a list of UPDATE?

Almost certainly... databases like dealing with sets of data; what  
your proposing with AWK turns it into a ton of single-row statements.

Depending on what you're doing, it might well be fastest to...

BEGIN;
COPY temp_table FROM 'file';
DELETE FROM real_table WHERE real_table_id IN (SELECT real_table_id  
FROM temp_table);
INSERT INTO real_table SELECT * FROM temp_table;
COMMIT;

> Considering I've to deal with a where clauses anyway... when (and
> if) should I create an index on the id of temp_t1?
> t1 will contain 700-1M records while I may update a maximum of 20K a
> time.

-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



integrity check and visibility was: COPY equivalent for updates

From
Ivan Sergio Borgonovo
Date:
On Tue, 15 Jul 2008 22:13:42 -0500
Decibel! <decibel@decibel.org> wrote:

> Depending on what you're doing, it might well be fastest to...

> BEGIN;
> COPY temp_table FROM 'file';
> DELETE FROM real_table WHERE real_table_id IN (SELECT
> real_table_id FROM temp_table);
> INSERT INTO real_table SELECT * FROM temp_table;
> COMMIT;

Thank you for making me reconsider this approach.
I gave up since I have a lot of ON DELETE CASCADE... but you made me
remind that there should be a way to delay the checks at the end of
the transaction (I still couldn't find the syntax on the manual).

Actually from my understanding an UPDATE in PostgreSQL is a DELETE +
INSERT. But wouldn't the subquery
DELETE FROM real_table WHERE real_table_id IN (SELECT real_table_id
FROM temp_table);

take longer to execute than the UPDATE version:
update t1 set col1=temp_t1.col1, col2=temp_t1.col2 where t1.id=temp_t1.id;
mutatis mutandis?

I'd say that pg should be enough smart to take advantage of a unique
key on real_table.real_table_id and stop searching as soon as it
deleted the first occurrence of an ID found in real_table from
temp_table. Is it?
Would adding an unique index on temp_table make the process faster
or slower? I know that temp_table.real_table_id will be unique so
I'm not interested in the integrity check (that anyway should take
place later). At first glance the unique index on
real_table.real_table_id may be useful, the one on temp_table should
not and will actually make the process slower adding index creation
time.

So the "ON DELETE CASCADE/SET NULL" problem should be solved once I
find the syntax to delay checks
Can it be solved this way? I'm going to check as soon as I find the
syntax on the manual.

What about visibility of the "going to be deleted rows"?

If another transaction start between the DELETE and the INSERT
statement are executed... what is it going to see?
What if I don't have just a single INSERT statement since some
columns will be moved to another table?

Does this make necessary the UPDATE approach in spite of the
DELETE/INSERT approach?
If I had temp_table split into 2 tables eg.

create temp_table( id int, col11 varchar(32), col21 varchar(32)
);
create table table1 ( id int primary key, col11 varchar(32)
);
create table table2 ( id int references table1(id) on delete cascade, col21 varchar(32)
);

I wouldn't be concerned that data in table1 and table2 contains data
updated in different times and "not coherent" rather I'd be
concerned there is a row in table1 but there isn't in table2.
Furthermore... if I delay checks and ON DELETE triggers rows in
table2 won't be deleted when I delete rows in table1 unless I do it
"manually".

At this point I'm more concerned of maintainability over speed.
Actually if I follow the UPDATE approach and temp_table is still
split across several tables I'll have to write an update for each
tableN anyway... so there shouldn't be too much difference between
the UPDATE and the DELETE/INSERT approach.

What about the effect of a longer transaction compared to a list of
updates on speed?

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it