Thread: Deadlock detected after pg_repack receives SIGINT
I'm running pg_repack from a bash script with a timeout of 10 minutes, like so (simplified):

timeout -s SIGINT 10m pg_repack --table=frequently_updated_table

After the 10-minute timeout the OS sends SIGINT to the pg_repack process, so pg_repack calls:

SELECT repack.repack_drop($1, $2)

and this deadlocks with another process that is INSERTing into frequently_updated_table, which carries pg_repack's trigger logging every change into repack.log_[OID of frequently_updated_table].

Is this a bug in pg_repack, or am I doing something wrong? Should I add some more detailed info?

Thank you for any help,
Jiri
PS: I already tried the pg_repack mailing list (http://lists.pgfoundry.org/pipermail/reorg-genera) and GitHub without any response. If there is a better place to ask, please tell me.
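As an aside, the wrapper's decision logic can be sketched standalone: GNU timeout exits with status 124 when it had to signal the command, so a script can tell an interrupted repack from a completed one. This is a minimal sketch, with `sleep` standing in for the real pg_repack invocation:

```shell
#!/usr/bin/env bash
# Sketch of the wrapper's decision logic; `sleep 5` stands in for pg_repack.
# GNU timeout exits with status 124 when it had to signal the command,
# so the caller can distinguish "timed out" from "finished".
set -u

timeout -s INT 1 sleep 5    # stand-in for: timeout -s SIGINT 10m pg_repack ...
status=$?

if [ "$status" -eq 124 ]; then
  echo "pg_repack was interrupted by the timeout (cleanup ran)"
else
  echo "pg_repack finished on its own with status $status"
fi
```

Note that adding `--preserve-status` to timeout would change this: the exit code would then reflect the signal rather than 124.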
On Tue, Nov 3, 2015 at 9:51 PM, Jiří Hlinka <jiri.hlinka@gmail.com> wrote:
> After the 10 min timeout, the OS sends SIGINT to pg_repack process so the
> pg_repack calls:
> SELECT repack.repack_drop($1, $2)
> and it causes a deadlock with other process which is INSERTing into
> frequently_updated_table that has a pg_repack's trigger to log every change
> to a repack.log_[OID of frequently_updated_table].

Recalling memories of this stuff: that's part of a cleanup rollback, right? A repack can be a lengthy operation on relations that are heavily updated, as pg_repack keeps a trace of each DML event that happens on the source table and then replays them one by one.

> Is it a bug of pg_repack or am I doing something wrong? Should I add some
> detailed info?

I doubt there is anything involving Postgres here. It seems that some process is still holding a lock on a relation that is being dropped, caused by a race condition in the pg_repack code.

> PS: I was trying a mailing list of pg_repack
> (http://lists.pgfoundry.org/pipermail/reorg-genera) and github without any
> notice. If there is better place to ask, tell me, please.

I guess you should give those folks time to answer. If I were you, I'd begin by giving the repack operation more time to complete.

Regards,
--
Michael
On 11/3/15 7:44 AM, Michael Paquier wrote:
> I doubt there is anything involving Postgres here. It seems that some
> process is still holding a lock on a relation that is being dropped,
> caused by a race condition in pg_repack code.
>
>> PS: I was trying a mailing list of pg_repack
>> (http://lists.pgfoundry.org/pipermail/reorg-genera) and github without any
>> notice. If there is better place to ask, tell me, please.
>
> I guess you should let those folks time to answer. If I were you I'd
> begin first by letting more time for the repack operation to complete.

Make sure you're on the latest version too. There were some recent commits aimed at fixing some race conditions.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
Thanks Jim and Michael for the comments.
Michael: your memories are fresh and clear :-), yes, it is part of a cleanup rollback. The problem is that repack.repack_drop calls this statement:
DROP TABLE IF EXISTS repack.log_%u CASCADE
...and it collides with the trigger that inserts new tuples from [frequently_updated_table] into repack.log_[OID of frequently_updated_table]. The trigger is created by this routine:
SELECT 'CREATE TRIGGER z_repack_trigger' ||
       ' BEFORE INSERT OR DELETE OR UPDATE ON ' || repack.oid2text($1) ||
       ' FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger(' ||
       '''INSERT INTO repack.log_' || $1 || '(pk, row) VALUES(' ||
       ' CASE WHEN $1 IS NULL THEN NULL ELSE (ROW($1.' ||
       repack.get_index_columns($2, ', $1.') || ')::repack.pk_' ||
       $1 || ') END, $2)'')';
As these two actions are both run by pg_repack, it looks like a bug to me: pg_repack should not be able to deadlock itself. But I'm not 100% sure about this part...
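The suspected failure has the classic lock-ordering shape: one session holds the table lock and wants the log-table lock, while the DROP holds the log-table lock and wants the table's. A standalone illustration of that shape, using flock(1) on two files (no PostgreSQL involved; the lock names are purely illustrative):

```shell
#!/usr/bin/env bash
# Standalone illustration (NOT pg_repack code) of a lock-ordering deadlock:
# each "session" takes its first lock, then tries the other's lock. flock's
# -w timeout stands in for PostgreSQL's deadlock detector so the demo ends.
set -u

deadlock_demo() {
  local d; d=$(mktemp -d)

  # Session 1: lock A (the user table), then try B (repack.log_<oid>).
  (
    exec 8>"$d/A"; flock 8
    sleep 1
    exec 9>"$d/B"
    flock -w 2 9 || echo "session 1: stuck waiting for B"
  ) &

  # Session 2: lock B, then try A -- the opposite order.
  (
    exec 8>"$d/B"; flock 8
    sleep 1
    exec 9>"$d/A"
    flock -w 2 9 || echo "session 2: stuck waiting for A"
  ) &

  wait
  rm -rf "$d"
}

result=$(deadlock_demo)
printf '%s\n' "$result"
```

PostgreSQL resolves this by aborting one of the two transactions after deadlock_timeout; the question in this thread is whether both "sessions" really belong to pg_repack itself.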
On Wed, Nov 4, 2015 at 3:48 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 11/3/15 7:44 AM, Michael Paquier wrote:
>> I doubt there is anything involving Postgres here. It seems that some
>> process is still holding a lock on a relation that is being dropped,
>> caused by a race condition in pg_repack code.
>
> Make sure you're on the latest version too. There were some recent
> commits aimed at fixing some race conditions.
Bc. Jiří Hlinka
Tel.: 725 315 263
On Wed, Nov 4, 2015 at 10:16 PM, Jiří Hlinka <jiri.hlinka@gmail.com> wrote:
> I'm on pg_repack 1.3.2 (latest stable, no devel version available to
> check if it is already fixed). The problem is that repack_drop calls
> this statement:
>
> DROP TABLE IF EXISTS repack.log_%u CASCADE
>
> ...and it collides with the trigger inserting new tuples into
> repack.log_[OID of frequently_updated_table]. As these two actions are
> both run by pg_repack, it seems like a bug to me, as pg_repack should
> not be able to deadlock itself, but I'm not 100% sure about this part...
Is the trigger activity moving on, or is it waiting for a lock as well? It sounds like pg_repack is waiting for the transaction running this trigger to finish before it can safely drop the trigger and the relation. I guess you are running large UPDATE queries, right? It really looks like you should give pg_repack more room to do its stuff.
--
Michael
I'll check the logs to see whether the trigger activity was moving on or waiting for a lock.
I'm doing simple UPDATEs, INSERTs and DELETEs on this table, but the frequency of these DMLs is _very_ high. It is a queue table used for one-way, selective (only part of the data is replicated) replication between two instances of the database, say from the main DB to a data warehouse DB: part of the DML on the main DB's tables is "copied" by a trigger into this queue table, which is then used as the source for replicating changes into the data warehouse DB. This is the reason the table needs a VACUUM FULL at least twice a day, or better, a pg_repack run at least twice a day.
My opinion is that pg_repack should guarantee consistent, deadlock-free behaviour via a proper locking policy, whether the table is frequently updated or not; the "worst" case that should be possible is pg_repack waiting for a very long time (which is the reason for the timeout in the script: to stop pg_repack if it cannot obtain the lock it is waiting for). I know this is hard, and I'm not sure whether such "clean" behaviour is achievable at all, but if, for example, a deadlock happens, it can at least be filed as a bug to make the tool even better than it is now.
Maybe I could lower the impact of the problem by checking in the script, once the timeout expires, whether pg_repack is still waiting in "Waiting for N transactions to finish" (in which case it is safe to SIGINT pg_repack), or whether it is already repacking the table, and in that case not stop it. But there still remains the fact that there may be a bug, which I could and should report to the developers.
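That check could be sketched roughly like this. The exact progress message is an assumption modeled on pg_repack's output and not verified; adjust the pattern to whatever your pg_repack version actually logs:

```shell
#!/usr/bin/env bash
# Sketch: decide whether it is safe to SIGINT pg_repack by inspecting the
# last line of its captured output. The message text is an ASSUMPTION.
set -u

safe_to_interrupt() {
  # Safe only while pg_repack is still waiting for old transactions,
  # i.e. before it has started the actual table rewrite.
  tail -n 1 "$1" | grep -q "Waiting for .* transactions"
}

# Simulated pg_repack log for the demo:
log=$(mktemp)
echo "NOTICE: Waiting for 3 transactions to finish" > "$log"

if safe_to_interrupt "$log"; then
  echo "still in the wait phase: sending SIGINT is safe"
else
  echo "already repacking: let it run"
fi
rm -f "$log"
```

In the real script you would run pg_repack with its output redirected to such a log file and consult it before deciding to signal the process.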
Thanks Michael,
Jiri
On Thu, Nov 5, 2015 at 2:57 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Is the trigger activity moving on or is this one waiting as well for a
> lock? It sounds like pg_repack is waiting for the end of the transaction
> running this trigger to finish before being able to drop the trigger and
> this relation safely. I guess that you are running large UPDATE queries,
> right? It really looks like you should let more room to pg_repack to do
> its stuff.
> --
> Michael
On Thursday, November 5, 2015 12:16 AM, Jiří Hlinka <jiri.hlinka@gmail.com> wrote:
> My opinion is, that pg_repack should guarantee a consistent,
> deadlock-free behaviour via proper locking policy

I would be very interested in seeing a description of what locking policy would guarantee deadlock-free behavior when run concurrently with unknown software. If you have a link to a paper on the topic, that would serve as well as a description here.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Kevin,
my point was that pg_repack deadlocked itself. I think it should be possible to guarantee deadlock-free behavior for pg_repack's own operations, at least via an advisory lock (I understand it is not possible to guarantee this across multiple applications). If that is not true, I'd be glad to hear I'm wrong (really!).
Thanks,
Jiri
On Thu, Nov 5, 2015 at 5:43 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> On Thursday, November 5, 2015 12:16 AM, Jiří Hlinka
> <jiri.hlinka@gmail.com> wrote:
>> My opinion is, that pg_repack should guarantee a consistent,
>> deadlock-free behaviour via proper locking policy
>
> I would be very interested in seeing a description of what locking
> policy would guarantee deadlock-free behavior when run concurrently
> with unknown software. If you have a link to a paper on the topic,
> that would serve as well as a description here.
On Fri, Nov 6, 2015 at 4:08 PM, Jiří Hlinka <jiri.hlinka@gmail.com> wrote:
> my point was, that pg_repack deadlocked itself - I think it should be
> possible to guarantee deadlock-free behavior at least via advisory lock for
> operations of pg_repack itself (I understand it is not possible to guarantee
> this across more apps). If it is not true, I'd be glad to hear I'm wrong
> (really!).

Er, well. Based on the information given, there is actually no evidence that pg_repack is deadlocked. The code path calling the cleanup callback after receiving SIGINT is waiting for the transaction working on the trigger being dropped to finish. We could only speak of a deadlock if the transaction inserting data into repack.log_% were actually holding a lock that conflicts with the trigger being dropped.
--
Michael
On 11/5/15 12:14 AM, Jiří Hlinka wrote:
> I'm doing simple UPDATES, INSERTs and DELETEs on this table, but
> frequency of these DMLs is _very_ high [...] - this
> is the reason why the table needs VACUUM FULL at least twice a day, or
> better - running pg_repack on it at least twice a day.

That's a very problematic workload. It's why PgQ (used by londiste) swaps queue tables around and does truncates; Slony now does the same. You'll probably be much happier either using PgQ, using some kind of partitioning on the queue so that you're only truncating, or switching to a different queuing solution altogether.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com