Natural key woe - Mailing list pgsql-general

From Oliver Kohll - Mailing Lists
Subject Natural key woe
Msg-id 713571A8-7DDF-4A06-B3EE-22D5B45D2A8F@agilebase.co.uk
Responses Re: Natural key woe  (Robin <robinstc@live.co.uk>)
Re: Natural key woe  (Yeb Havinga <yebhavinga@gmail.com>)
List pgsql-general
I'm sure no one else on this list has done anything like this, but here's a cautionary tale.

I wanted to synchronise data in two tables (issue lists), i.e. whenever a record is added to one, add a similar
record to the other. The two tables are similar in format but not exactly the same, so only a subset of fields is
copied. Both tables have synthetic primary keys, but these can't be used to match data as they are auto-incrementing
sequences that might interfere. What I could perhaps have done is get both tables to use the same sequence, but what I
actually did is:

* join both tables based on a natural key
* use that to copy any missing items from table1 to table2
* truncate table1 and copy all of table2's rows to table1
* run this routine once an hour
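
The steps above can be sketched roughly as follows. This is only an illustration, not the routine actually used: it runs against an in-memory SQLite database as a stand-in for PostgreSQL, the column name `created`, the helpers `make_db`, `sync` and `count`, and the sample rows are all made up, and `DELETE` stands in for `TRUNCATE`, which SQLite lacks.

```python
import sqlite3

def make_db():
    # Two similar tables, each with its own auto-incrementing synthetic key.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                             created TEXT, subject TEXT);
        CREATE TABLE table2 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                             created TEXT, subject TEXT);
    """)
    return conn

def sync(conn):
    # Steps 1 and 2: join on the natural key (created, subject) and copy
    # any table1 rows that have no match in table2.
    conn.execute("""
        INSERT INTO table2 (created, subject)
        SELECT t1.created, t1.subject
        FROM table1 t1
        LEFT JOIN table2 t2
          ON t1.created = t2.created AND t1.subject = t2.subject
        WHERE t2.id IS NULL
    """)
    # Step 3: empty table1 and refill it from table2.
    conn.execute("DELETE FROM table1")
    conn.execute("""
        INSERT INTO table1 (created, subject)
        SELECT created, subject FROM table2
    """)
    conn.commit()

def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

conn = make_db()
# Hypothetical sample data: one issue in each table, both with a subject.
conn.execute("INSERT INTO table1 (created, subject) "
             "VALUES ('2000-01-01 10:00:00', 'Printer broken')")
conn.execute("INSERT INTO table2 (created, subject) "
             "VALUES ('2000-01-01 09:00:00', 'Login fails')")

# Step 4: the real routine ran hourly; here it is simply called three times.
for _ in range(3):
    sync(conn)

print(count(conn, "table1"), count(conn, "table2"))  # 2 2
```

With non-null subjects the routine is stable: repeated runs leave two rows in each table.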

The natural key was based on the creation timestamp (stored on insert) and one of the text fields, called
'subject'.

The problem came when someone entered a record without a subject, leaving the field null. Once this row had been
copied over and was present in both tables, the *next* time the join was done a duplicate was created, because the
join didn't see the two rows as matching (in SQL, null = null is never true).
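
A small sketch of why the join misses the row, and of one possible fix. It again uses SQLite through Python as a stand-in, with an assumed `created` column and a made-up timestamp. SQLite's `IS` operator is a null-safe comparison; in PostgreSQL the equivalent is `IS NOT DISTINCT FROM`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (created TEXT, subject TEXT);
    CREATE TABLE table2 (created TEXT, subject TEXT);
    -- The same record in both tables, with a null subject.
    INSERT INTO table1 VALUES ('2000-01-01 10:00:00', NULL);
    INSERT INTO table2 VALUES ('2000-01-01 10:00:00', NULL);
""")

# NULL = NULL evaluates to NULL (None in Python), not to true.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# Plain equality on the natural key: the null subjects never match.
plain_matches = conn.execute("""
    SELECT COUNT(*) FROM table1 t1
    JOIN table2 t2
      ON t1.created = t2.created AND t1.subject = t2.subject
""").fetchone()[0]

# Null-safe comparison: two nulls are treated as equal.
# (PostgreSQL: t1.subject IS NOT DISTINCT FROM t2.subject)
null_safe_matches = conn.execute("""
    SELECT COUNT(*) FROM table1 t1
    JOIN table2 t2
      ON t1.created = t2.created AND t1.subject IS t2.subject
""").fetchone()[0]

print(null_equals_null, plain_matches, null_safe_matches)  # None 0 1
```

Declaring the subject column NOT NULL, or coalescing it to an empty string in the join, would be other ways to avoid the mismatch.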

So after 1 hour there were 2 records. After 2 hours there were 4, after 3 there were 8, and so on.

When I logged in after 25 hrs and noticed table access was a little slow, there were 2^25, or about 33 million, records.
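
The doubling arithmetic can be checked in a couple of lines. The helper name `rows_after` is made up for illustration, and the sketch assumes a single offending row that doubles on every hourly run.

```python
def rows_after(doublings, start=1):
    # Each run copies every unmatched row again, so the count doubles.
    return start * 2 ** doublings

for n in (1, 2, 3, 25):
    print(n, rows_after(n))
# 1 2
# 2 4
# 3 8
# 25 33554432
```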

That's a learning experience for me at least. It's lucky I checked at the end of that day rather than leaving it
overnight, otherwise I think our server would have ground to a halt.

One other wrinkle to note. After clearing out these rows and running 'VACUUM table2', 'ANALYZE table2' and
'REINDEX TABLE table2', some queries with simple sequential scans were taking a few seconds to run, even though there
are only a thousand rows in the table. I finally found that running CLUSTER on the table sorted that out, even though
we're on an SSD, so I would have thought seeking all over the place for a seq. scan wouldn't have made that much
difference. It obviously does still make some.

Oliver Kohll
www.agilebase.co.uk





