Re: Natural key woe - Mailing list pgsql-general

From Robin
Subject Re: Natural key woe
Msg-id BLU436-SMTP170054D0C7C1930D3611C0FE2340@phx.gbl
In response to Natural key woe  (Oliver Kohll - Mailing Lists <oliver.lists@agilebase.co.uk>)
Responses Re: Natural key woe  (Eelke Klein <eelke@bolt.nl>)
List pgsql-general
Oliver,

I've read your email with interest. I haven't had to deal with this
sort of problem in PostgreSQL, but I have frequently dealt with it in a
Sybase environment, first encountered about 25 years ago.

I am most curious to know why you didn't use the same sequence for both
tables; I must be missing something. If there is a gotcha, I'd like to
know about it, as I can see this being an issue in a load-sharing
environment.

Many years ago, before auto sequencing was common, we set up explicit
functions to generate sequence numbers. Whilst this had some performance
costs in multi-user systems, it did have benefits in terms of making it
easier to restrict the need for row locking to the underlying data table.
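
For what it's worth, here is a minimal sketch of what I had in mind - two
tables drawing their keys from one shared sequence (the table and column
names are just illustrative):

    CREATE SEQUENCE issue_id_seq;

    CREATE TABLE issues_a (
        id      bigint PRIMARY KEY DEFAULT nextval('issue_id_seq'),
        created timestamptz NOT NULL DEFAULT now(),
        subject text
    );

    CREATE TABLE issues_b (
        id      bigint PRIMARY KEY DEFAULT nextval('issue_id_seq'),
        created timestamptz NOT NULL DEFAULT now(),
        subject text
    );

Since both defaults call nextval() on the same sequence, an id generated
in one table can never collide with an id generated in the other, so the
keys could be carried across verbatim.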

Robin St.Clair
On 13/05/2014 10:44, Oliver Kohll - Mailing Lists wrote:
> I'm sure no one else on this list has done anything like this, but here's a cautionary tale.
>
> I wanted to synchronise data in two tables (issue lists) - i.e. whenever a record is added into one, add a similar
> record into the other. The two tables are similar in format but not exactly the same, so only a subset of fields are
> copied. Both tables have synthetic primary keys; these can't be used to match data as they are auto-incrementing
> sequences that might interfere. What I could have done perhaps is get both tables to use the same sequence, but what I
> actually did is:
>
> * join both tables based on a natural key
> * use that to copy any missing items from table1 to table2
> * truncate table1 and copy all of table2's rows to table1
> * run this routine once an hour
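
If I've read that right, the hourly routine boils down to something like
this (table1/table2 and the key columns are my guesses at your names):

    -- Steps 1 & 2: insert rows from table1 that have no
    -- natural-key match in table2
    INSERT INTO table2 (created, subject)
    SELECT t1.created, t1.subject
    FROM table1 t1
    WHERE NOT EXISTS (
        SELECT 1 FROM table2 t2
        WHERE t2.created = t1.created
          AND t2.subject = t1.subject  -- never true when subject is null
    );

    -- Step 3: replace table1's contents with table2's
    TRUNCATE table1;
    INSERT INTO table1 (created, subject)
    SELECT created, subject FROM table2;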
>
> The natural key was based on the creation timestamp (stored on insert) and one of the text fields, called
> 'subject'.
>
> The problem came when someone entered a record with no subject, leaving it null. When this was copied over and
> present in both tables, the *next* time the join was done, a duplicate was created because the join didn't see them as
> matching (null != null).
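
That's the classic null trap: in SQL, null = null evaluates to null
rather than true, so an equality join quietly treats two null subjects
as non-matching. PostgreSQL's null-safe comparison would have avoided it:

    SELECT null = null;                     -- yields null, not true
    SELECT null IS NOT DISTINCT FROM null;  -- yields true

    -- in the matching condition, this treats two nulls as equal:
    ... AND t2.subject IS NOT DISTINCT FROM t1.subject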
>
> So after 1 hour there were two records; after 2, four; after 3, eight, and so on.
>
> When I logged in after 25 hrs and noticed table access was a little slow, there were 2^25, around 33 million records.
>
> That's a learning experience for me at least. It's lucky I did check it at the end of that day rather than leaving it
> overnight, otherwise I think our server would have ground to a halt.
>
> One other wrinkle to note. After clearing out these rows, running 'VACUUM table2', 'ANALYZE table2' and 'REINDEX
> TABLE table2', some queries with simple sequence scans were taking a few seconds to run even though there are only a
> thousand rows in the table. I finally found that running CLUSTER on the table sorted that out, even though we're on an
> SSD so I would have thought seeking all over the place for a seq. scan wouldn't have made that much difference. It
> obviously does still make some.
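
That fits with how vacuuming works: a plain VACUUM only marks the dead
row versions as reusable, so after deleting some 33 million rows your
thousand live ones were still scattered through a very large file, and
every sequential scan had to read all of it. CLUSTER rewrites the table
into a fresh, compact file, which is why it sorted things out; VACUUM
FULL would have done the same without imposing an index order. For
example (the index name here is assumed):

    -- rewrite the table in index order, discarding all the dead space
    CLUSTER table2 USING table2_pkey;
    ANALYZE table2;

    -- or rewrite it without ordering by an index
    VACUUM FULL table2;

Both take an exclusive lock on the table while they run, so they're best
done in a quiet moment.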
>
> Oliver Kohll
> www.agilebase.co.uk


