Thread: Adding foreign key constraint holds exclusive lock for too long (on production database)

Hi folks,

We're adding a foreign key constraint to a 20-million row table on our
production database, and it's taking about 7 minutes. Because it's an
ALTER TABLE, Postgres acquires an ACCESS EXCLUSIVE lock that prevents
any reads/writes (though this particular table is very write-heavy, so
even a read lock wouldn't help here).

For context: we do this whenever we deploy our site, because our
database is split across two schemas ("live" and "content"); we dump
the "content" schema from our office database and restore it into our
production database. To do this, we restore it as "contentnew" into
the production db, then rename the "content" schema to "contentold"
and the "contentnew" schema to "content".

That completes the deployment itself; however, our live-to-content
foreign keys now point at "contentold", so the final step is to go
through and drop all the live-to-content foreign keys and recreate
them against the new content schema. Most of the tables are small
and re-adding the constraint is quick, except for this one table,
which is 20M rows and basically pauses our live website for 7 minutes.

A couple of questions about the ADD CONSTRAINT. The foreign key column
on the local table is indexed, and there are only ~50 unique values,
so the db *could* come up with the unique values pretty quickly and
then check them. And even if it needs a full scan of the 20M-row
table ("ratesrequests") joined with the referenced table ("provider")
on the foreign key column -- which I think is the most it should have
to do to check the foreign key -- the following query only takes
~20s, not 7 minutes:

select p.name
from ratesrequests r
join provider p on r.providerid = p.providerid

I'm guessing the ADD CONSTRAINT logic bypasses some of the query
optimization used for SELECT queries. So I suppose my questions are:

1) Are there ways to speed up adding the constraint? Just speeding it
up a little bit won't really help -- for this purpose it'll need to be
an order of magnitude or so. I'm aware of a couple of possibilities:

a) Upgrade to Postgres 9.1 and use ADD CONSTRAINT NOT VALID. However,
this doesn't really help, as you need to run VALIDATE CONSTRAINT at
some later stage, which still grabs the exclusive lock.
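
(For reference, the 9.1-style sequence would be something like the
following -- constraint name is illustrative:)

```sql
-- Fast: takes the lock only briefly, doesn't scan existing rows.
ALTER TABLE ratesrequests
    ADD CONSTRAINT ratesrequests_providerid_fkey
    FOREIGN KEY (providerid) REFERENCES provider (providerid)
    NOT VALID;

-- Slow: scans all existing rows to check them. In 9.1 this still
-- takes the exclusive lock, which is exactly our problem.
ALTER TABLE ratesrequests
    VALIDATE CONSTRAINT ratesrequests_providerid_fkey;
```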

b) Delete old rows from the table so it's not so big. Feels a bit
hacky just to fix this issue.

c) Get rid of this foreign key constraint entirely and just check it
in code when we insert. Pragmatic solution, but not ideal.

2) Is there a better way to do the "content" schema dump/restore that
avoids dropping and recreating the inter-schema foreign keys?

Other notes and research:

* We're running "PostgreSQL 9.0.2, compiled by Visual C++ build 1500,
64-bit" on 64-bit Windows Server 2008 SP1 (6.0.6001)
* The "ratesrequests" table has two text columns, one of which often
contains a few hundred to a couple of KB of data in the field. It is
added to rapidly. We regularly VACCUM ANALYZE it.
* As expected, the ADD CONSTRAINT has gotten slower over time as this
table grew. However -- I'm not 100% sure of this, but it seems to have
jumped recently (from 3-4 minutes to 7 minutes).
* http://www.postgresql.org/message-id/20030323112241.W14634-100000@megazone23.bigpanda.com
-- indicates that ADD CONSTRAINT isn't optimized as well as it could
be
* http://www.postgresql.org/message-id/51A11C97.90209@iol.ie --
indicates that the db ignores the index when adding constraints

Thanks,
Ben.


Ben Hoyt wrote
> * http://www.postgresql.org/message-id/51A11C97.90209@iol.ie --
> indicates that the db ignores the index when adding constraints

As noted in the referenced thread (and never contradicted), the current
algorithm is "for each record, does the value in the FK column exist in the
PK table?", not "do all of the values currently found in the FK table exist
in the PK table?".  The latter question is seemingly much faster to answer
during a bulk ALTER TABLE (if table statistics imply a small-ish number of
distinct values and there is an index on the column), but the former is the
more common question -- when simply adding a single row.

You need to figure out some way to avoid repeatedly evaluating the FK
constraint over all 20M rows -- most of which were already confirmed
previously.  Most commonly, people perform an incremental update of a
live table, applying insert/update/delete only to the records that are
changing, instead of replacing an entire table with a new one.  If you are
generally happy with your current procedure, I would continue with your
"live" and "content" schemas but move this table into a "bulk_content"
schema, and within that have a "live" table and a "staging" table.  You
can drop/replace the staging table from your office database and then
write a routine to incrementally update the live table.  The FK references
in live and content would then persistently reference the "live" table,
and only the subset of changes introduced would need to be checked.
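
A sketch of that incremental-update routine, assuming a simple integer
primary key (the column names here are hypothetical):

```sql
-- After restoring the fresh copy into bulk_content.staging:

-- 1. Remove rows that disappeared from the source.
DELETE FROM bulk_content.live l
WHERE NOT EXISTS
    (SELECT 1 FROM bulk_content.staging s WHERE s.id = l.id);

-- 2. Update rows that changed.
UPDATE bulk_content.live l
SET    name = s.name, rate = s.rate
FROM   bulk_content.staging s
WHERE  s.id = l.id
  AND  (l.name, l.rate) IS DISTINCT FROM (s.name, s.rate);

-- 3. Insert new rows; only these trigger FK checks elsewhere.
INSERT INTO bulk_content.live (id, name, rate)
SELECT s.id, s.name, s.rate
FROM   bulk_content.staging s
WHERE  NOT EXISTS
    (SELECT 1 FROM bulk_content.live l WHERE l.id = s.id);
```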

David J.






David Johnston <polobo@yahoo.com> writes:
> As noted in the referenced thread (and never contradicted) the current
> algorithm is "for each record does the value in the FK column exist in the
> PK table?" not "do all of the values currently found on the FK table exist
> in the PK table?".

Well, apparently nobody who knows the code was paying attention, because
that hasn't been true for some time.  ALTER TABLE ADD FOREIGN KEY will
actually validate the constraint using a query constructed like this
(cf RI_Initial_Check() in ri_triggers.c):

     *    SELECT fk.keycols FROM ONLY relname fk
     *     LEFT OUTER JOIN ONLY pkrelname pk
     *     ON (pk.pkkeycol1=fk.keycol1 [AND ...])
     *     WHERE pk.pkkeycol1 IS NULL AND
     * For MATCH SIMPLE:
     *     (fk.keycol1 IS NOT NULL [AND ...])
     * For MATCH FULL:
     *     (fk.keycol1 IS NOT NULL [OR ...])
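
Instantiated for the tables in this thread (MATCH SIMPLE, single-column
key), that template comes out roughly as:

```sql
SELECT fk.providerid
FROM ONLY ratesrequests fk
LEFT OUTER JOIN ONLY provider pk
    ON pk.providerid = fk.providerid
WHERE pk.providerid IS NULL
  AND fk.providerid IS NOT NULL;
```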

It appears the possible explanations for Ben's problem are:

1. For some reason this query is a lot slower than the one he came up
with;

2. The code isn't using this query but is falling back to a row-at-a-time
check.

Case 2 would apply if the user attempting to do the ALTER TABLE doesn't
have read permission on both tables ... though that seems rather unlikely.

            regards, tom lane


Ben,

> A couple of questions about the ADD CONSTRAINT. The foreign key column
> on the local table is indexed, and there are only ~50 unique values,
> so the db *could* come up with the unique values pretty quickly and
> then check them.

This would indeed be a nice optimization, especially now that we have
index-only scans; you could do a VACUUM FREEZE on the tables and then
add the constraint.

> b) Delete old rows from the table so it's not so big. Feels a bit
> hacky just to fix this issue.
>
> c) Get rid of this foreign key constraint entirely and just check it
> in code when we insert. Pragmatic solution, but not ideal.

d) add a trigger instead of an actual FK.  Slower to execute on
subsequent updates/inserts, but existing rows don't need to be checked
when you create it.
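
A rough sketch of option (d) -- the trigger and function names are made
up, and note that unlike a real FK, nothing stops a referenced
"provider" row from being deleted out from under you:

```sql
CREATE OR REPLACE FUNCTION check_providerid() RETURNS trigger AS $$
BEGIN
    -- Reject rows whose providerid has no match in provider.
    IF NOT EXISTS
        (SELECT 1 FROM provider p WHERE p.providerid = NEW.providerid)
    THEN
        RAISE EXCEPTION 'providerid % not present in provider',
            NEW.providerid;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ratesrequests_providerid_check
    BEFORE INSERT OR UPDATE OF providerid ON ratesrequests
    FOR EACH ROW EXECUTE PROCEDURE check_providerid();
```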

e) do something (slony, scripts, whatever) so that you're incrementally
updating this table instead of recreating it from scratch each time.

> * We're running "PostgreSQL 9.0.2, compiled by Visual C++ build 1500,
> 64-bit" on 64-bit Windows Server 2008 SP1 (6.0.6001)

I will point out that you are missing a whole ton of bug fixes,
including two critical security patches.

> * The "ratesrequests" table has two text columns, one of which often
> contains a few hundred to a couple of KB of data in the field. It is
> added to rapidly. We regularly VACCUM ANALYZE it.
> * As expected, the ADD CONSTRAINT has gotten slower over time as this
> table grew. However -- I'm not 100% sure of this, but it seems to have
> jumped recently (from 3-4 minutes to 7 minutes).
> * http://www.postgresql.org/message-id/20030323112241.W14634-100000@megazone23.bigpanda.com

Probably the table just got larger than RAM.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Thanks, Tom (and David and Josh).

> Well, apparently nobody who knows the code was paying attention, because
> that hasn't been true for some time.  ALTER TABLE ADD FOREIGN KEY will
> actually validate the constraint using a query constructed like this
> (cf RI_Initial_Check() in ri_triggers.c):

This was a very helpful pointer, and interesting to me, because I did
a quick look for the source that handled that but didn't find it (not
knowing the Postgres codebase at all). It was kinda weird to me at
first that the way it implements this is by building an SQL string and
then executing that -- at first I would have thought it'd call the
internal functions to do the job. But on second thoughts, this makes
total sense, as that way it gets all the advantages of the query
planner/optimizer for this too.

> It appears the possible explanations for Ben's problem are:
>
> 1. For some reason this query is a lot slower than the one he came up
> with;
>
> 2. The code isn't using this query but is falling back to a row-at-a-time
> check.

Anyway, it's definitely #1 that's happening, as I built the
RI_Initial_Check() query by hand, and it takes just as long as the ADD
CONSTRAINT.

I'll probably hack around it -- in fact, for now I've just dropped the
constraint entirely, as it's not really necessary on this table.

So I guess this is really a side effect of the quirky way we're
dumping and restoring only one schema, and dropping/re-adding
constraints on deployment because of this. Is this a really strange
thing to do -- deploying only one schema (the "static" data) and
dropping/re-adding constraints -- or are there better practices here?

Relatedly, what about best practices regarding inter-schema foreign keys?

-Ben


Ben Hoyt <benhoyt@gmail.com> writes:
>> It appears the possible explanations for Ben's problem are:
>> 1. For some reason this query is a lot slower than the one he came up
>> with;

> Anyway, it's definitely #1 that's happening, as I built the
> RI_Initial_Check() query by hand, and it takes just as long as the ADD
> CONSTRAINT.

Huh.  Maybe an optimizer failing?  Could we see the full text of both
queries and EXPLAIN ANALYZE results for them?

> So I guess this is really a side effect of the quirky way we're
> dumping and restoring only one schema, and dropping/re-adding
> constraints on deployment because of this. Is this a really strange
> thing to do -- deploying only one schema (the "static" data) and
> dropping/re-adding constraints -- or are there better practices here?

Doesn't seem unreasonable.  One thought is that maybe you need to insert a
manual ANALYZE after reloading the data?

            regards, tom lane


Hmm, weird -- now the RI_Initial_Check() query is much quicker (20s). We
do ANALYZE the data every few nights, so maybe that's what changed it.
I'll keep that in mind.

-Ben

