Re: delete vs insert vs update due to primary key dups -> which is better - Mailing list pgsql-general

From Richard Huxton
Subject Re: delete vs insert vs update due to primary key dups -> which is better
Date
Msg-id 46D3CC75.5060502@archonet.com
Whole thread Raw
In response to delete vs insert vs update due to primary key dups -> which is better  (Ow Mun Heng <Ow.Mun.Heng@wdc.com>)
Responses Re: delete vs insert vs update due to primary key dups -> which is better  (Ow Mun Heng <Ow.Mun.Heng@wdc.com>)
List pgsql-general
Ow Mun Heng wrote:
> Continuining with my efforts to get similar functionality as mysql's
> mysqlimport --replace I want to ask for the list's opinion on which is
> better

I would suggest #3

> 3. \copy to temp_table, do
> --> update main_table set value=t2.value from temp_table t2 where
> main_table.pkey = t2.pkey
> (alternative yet better way per depezs is append "and (main_table.value
> is distinct from t2.value)" which looks only for distinct values
> --> insert into main_table select * from temp_table t2 where not exists
> (select * from main_table where main_table.pkey = t2.pkey)

Don't forget to run a DELETE phase too, for rows that are no longer in
the import but are in your database.

> btw, pkey was actually a multiple column composite key which I've
> concatenated together as varchar :
>
> eg: select a+cast(b as varchar)+cast(c as
> varchar)+cast(datediff(s,'19700101',timestamp) as pkey from
> mssql_server_table
>
> to make it simpler for the insertion/update/deletion

Not sure I understand this bit. What's gets complicated with a
multi-column key?

> Option 3 seems to be better because it will leave less dead tuples due
> to MVCC

It's also better because it's what you actually want to do (in a logical
sense, you want to update existing rows). That means if you ever attach
triggers to the table they will behave as you expect.

 > but the update procedure will end up to be a bit of a
> maintenance nightmare each time a new column is added. (and it is added
> quite frequently!! alas!)

If you're trying to keep two databases in sync then you need a process
to manage that. Don't try and make your import handle it - deal with it
properly. Scan the table definitions from both servers and compare them
column-by-column, adding and removing as required. THEN copy the data over.

> Option 4 is simpler, delete all the duplicate p_keys in the main_table
> before inserting it with the newest data (temp_table will _always_ have
> the newest unique data for that table)
>
> Only thing I'm worried about option 4 is, MVCC and MVCC (i guess) and
> the dead-tuples and the additional space.
>
> Vacuum often???
>
> Please, I would like your comments on this. Thanks in Advance.
>
> On average, daily will have between 5K to 250K of unique rows

Yes, you should make sure you are vacuuming at least once per import
cycle. However, as long as you have enough space in your free-space-map,
any dead-tuple-space will be re-used.


The other option would be #5:
1. Copy whole table from MSSQL to PG, but under new name (t1_tmp)
2. Rename t1 => t1_old, t1_tmp => t1
3. Drop t1_old
If you don't have any foreign keys referring to the table, or views
depending on it then that is probably simplest, although least efficient.

--
   Richard Huxton
   Archonet Ltd

pgsql-general by date:

Previous
From: Kamil Srot
Date:
Subject: Re: Tables dissapearing
Next
From: Richard Huxton
Date:
Subject: Re: Crashed DB and "base" folder files