
Natural key woe

From: Oliver Kohll - Mailing Lists
I'm sure no one else on this list has done anything like this, but here's a cautionary tale.

I wanted to synchronise data in two tables (issue lists) - i.e. whenever a record is added to one, add a similar record to the other. The two tables are similar in format but not exactly the same, so only a subset of fields is copied. Both tables have synthetic primary keys, but these can't be used to match data, as they are populated by independent auto-incrementing sequences, so the same key value can refer to different records in each table. What I could have done, perhaps, is get both tables to use the same sequence, but what I actually did is:

* join both tables based on a natural key
* use that to copy any missing items from table1 to table2
* truncate table1 and copy all of table2's rows to table1
* run this routine once an hour

The natural key was based on the creation timestamp (stored on insert) and one of the text fields, called 'subject'.
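
In SQL terms the routine was roughly the following (a simplified sketch - 'details' stands in for the real set of copied fields):

  -- step 1: copy rows present in table1 but missing from table2,
  -- matching on the natural key (created, subject)
  INSERT INTO table2 (created, subject, details)
  SELECT t1.created, t1.subject, t1.details
  FROM table1 t1
  LEFT JOIN table2 t2
         ON t2.created = t1.created
        AND t2.subject = t1.subject
  WHERE t2.created IS NULL;

  -- step 2: replace table1's contents with table2's
  TRUNCATE table1;
  INSERT INTO table1 (created, subject, details)
  SELECT created, subject, details FROM table2;

The LEFT JOIN with WHERE t2.created IS NULL is the usual anti-join pattern for "rows in table1 with no match in table2".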

The problem came when someone entered a record with no subject, leaving the field null. When this was copied over and present in both tables, the *next* time the join was done, a duplicate was created because the join didn't see them as matching (null != null).
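
You can see the root cause at a psql prompt - any equality comparison involving null yields null (unknown), never true:

  SELECT NULL = NULL;   -- NULL, not true
  SELECT 'a' = NULL;    -- NULL as well; the join's ON clause treats
                        -- anything other than true as a non-match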

So after one hour there were two records, after two hours four, after three eight, and so on.

When I logged in after 25 hours and noticed table access was a little slow, there were 2^25 = 33,554,432 records - around 33 million.

That's a learning experience for me at least. It's lucky I did check it at the end of that day rather than leaving it overnight; otherwise I think our server would have ground to a halt.

One other wrinkle to note. After clearing out these rows and running 'VACUUM table2', 'ANALYZE table2' and 'REINDEX TABLE table2', some queries doing simple sequential scans were taking a few seconds to run, even though there are only a thousand rows in the table. I finally found that running CLUSTER on the table sorted that out - even though we're on an SSD, so I would have thought seeking all over the place for a seq. scan wouldn't have made that much difference. It obviously does still make some.
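
For the record, the maintenance commands were along these lines (CLUSTER needs an index to sort by; 'table2_pkey' is just the default name a primary key's index would get):

  VACUUM table2;                    -- marks dead rows as reusable
  ANALYZE table2;                   -- refreshes planner statistics
  REINDEX TABLE table2;             -- rebuilds the table's indexes
  CLUSTER table2 USING table2_pkey; -- rewrites the whole table in index order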

Oliver Kohll
www.agilebase.co.uk

Re: Natural key woe

From: Robin
Oliver

I've read your email with interest. I haven't had to deal with this
sort of problem in PostgreSQL, but I have frequently dealt with it in a
Sybase environment, first encountered about 25 years ago.

I am most curious to know why you didn't use the same sequence for both
tables; I must be missing something. If there is a gotcha, I'd like to
know about it, as I can see this being an issue in a load-sharing
environment.

Many years ago, before auto-sequencing was common, we set up explicit
functions to generate sequence numbers. Whilst this had some performance
costs in multi-user systems, it did have benefits in terms of making it
easier to restrict the need for row locking to the underlying data table.
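
Sharing a sequence ought to be as simple as this (a sketch with made-up
definitions - I'd be glad to hear if there's a catch):

  CREATE SEQUENCE issue_id_seq;

  CREATE TABLE table1 (
      id      bigint PRIMARY KEY DEFAULT nextval('issue_id_seq'),
      created timestamptz NOT NULL DEFAULT now(),
      subject text
  );

  CREATE TABLE table2 (
      id      bigint PRIMARY KEY DEFAULT nextval('issue_id_seq'),
      created timestamptz NOT NULL DEFAULT now(),
      subject text
  );

  -- ids are drawn from the single shared sequence, so a value is never
  -- issued twice across the pair and can serve as the match key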

Robin St.Clair
On 13/05/2014 10:44, Oliver Kohll - Mailing Lists wrote:
> Both tables have synthetic primary keys, but these can't be used to
> match data, as they are populated by independent auto-incrementing
> sequences, so the same key value can refer to different records in each
> table. What I could have done, perhaps, is get both tables to use the
> same sequence, but what I actually did is: [...]



Re: Natural key woe

From: Eelke Klein


> After clearing out these rows and running 'VACUUM table2', 'ANALYZE table2' and 'REINDEX TABLE table2', some queries doing simple sequential scans were taking a few seconds to run, even though there are only a thousand rows in the table. I finally found that running CLUSTER on the table sorted that out [...]

A VACUUM FULL table2 would have made the CLUSTER unnecessary. A normal VACUUM only marks dead rows as free but does not shrink the table; a VACUUM FULL removes all the free space from the table and returns it to the OS.
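
That is:

  VACUUM FULL table2;  -- rewrites the table into a new, compact file and
                       -- returns the freed space to the OS; note it takes
                       -- an exclusive lock on the table while it runs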

Eelke Klein

 

Re: Natural key woe

From: Yeb Havinga
On 13/05/14 11:44, Oliver Kohll - Mailing Lists wrote:
> The problem came when someone entered a record with no subject, leaving
> the field null. When this was copied over and present in both tables,
> the *next* time the join was done, a duplicate was created because the
> join didn't see them as matching (null != null).
Maybe you can use x IS NOT DISTINCT FROM y?
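
For example, with the column names from the sketch earlier in the thread,
the copy step's join becomes:

  SELECT t1.*
  FROM table1 t1
  LEFT JOIN table2 t2
         ON t2.created = t1.created
        AND t2.subject IS NOT DISTINCT FROM t1.subject
  WHERE t2.created IS NULL;

IS NOT DISTINCT FROM treats two nulls as equal, so the null-subject row
matches its copy instead of being duplicated every hour.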

regards,
Yeb