Funny hang on PostgreSQL 10 during parallel index scan on slave - Mailing list pgsql-hackers

From Chris Travers
Subject Funny hang on PostgreSQL 10 during parallel index scan on slave
Msg-id CAN-RpxBV0-EZhHSEMrZ3eTZGWH-tK40ZFEm4f3oiGavCEoX3nw@mail.gmail.com
Responses Re: Funny hang on PostgreSQL 10 during parallel index scan on slave  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
Hi all;

For the last few months we have been facing a funny problem on a slave where queries go to 100% CPU usage and never finish, causing the recovery process to hang and the replica to fall behind. Over time we ruled out a lot of causes and were banging our heads against this one. Today we got a break when we attached a debugger to various processes, even without debugging symbols. Not only did we get useful stack traces from the hung query, but attaching a debugger to the startup process also caused the query to finish. This has shown up in 10.2 and 10.5.

Based on the stack traces we have concluded the following seems to happen:

1.  The query is in a parallel index scan or similar
2.  A process is executing a parallel plan and allocating a significant chunk of memory (2MB for example) in dynamic shared memory.
3.  The startup process goes into a loop where it sends a SIGUSR1, sleeps 5ms, sends another SIGUSR1, and so on.
4.  The SIGUSR1 aborts the system call, which is then retried.
5.  Because the system call takes more than 5ms, it is interrupted again on every retry and we end up in an endless loop.
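
To make the loop in steps 3 to 5 concrete, here is a toy model in Python. It is only a sketch of the timing argument, not PostgreSQL code: the function name, the `max_attempts` cutoff, and the assumption that an interrupted call restarts from scratch are all mine.

```python
from typing import Optional

SIGNAL_INTERVAL_MS = 5  # sleep between SIGUSR1 signals, per step 3


def attempts_until_done(call_ms: float,
                        interval_ms: float = SIGNAL_INTERVAL_MS,
                        max_attempts: int = 1000) -> Optional[int]:
    """Return the attempt on which the call completes, or None if it never does.

    Assumption: every retry starts the call over, and the next signal
    arrives interval_ms after the retry begins.
    """
    for attempt in range(1, max_attempts + 1):
        if call_ms <= interval_ms:
            return attempt  # finished before the next signal arrived
        # Otherwise the signal wins and the call is retried from the start.
    return None  # still being interrupted after max_attempts tries


print(attempts_until_done(2))  # a short call gets through on the first try
print(attempts_until_done(8))  # a call longer than the interval never finishes
```

With a fixed interval the outcome never changes between attempts, which matches what we see: the query only finished once the startup process stopped signalling (when the debugger was attached).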

I see two possible solutions here.
1.  Exponential backoff in sending signals, up to maybe 1s max.
2.  Check for signals before retrying the system call, if there is a way to do so (I am not 100% sure yet where the retry happens, but I am on my way to finding it).
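
As a rough illustration of option 1, the sketch below doubles the sleep between signals, starting at 5ms and capping at 1s. It is a Python model of the idea only. The names, the doubling factor, and the assumption that each retry restarts the call are hypothetical, not taken from the PostgreSQL source.

```python
from itertools import islice
from typing import Iterator, Optional


def backoff_intervals(initial_ms: float = 5,
                      cap_ms: float = 1000) -> Iterator[float]:
    """Yield signal intervals that double each time, up to cap_ms."""
    interval = initial_ms
    while True:
        yield interval
        interval = min(interval * 2, cap_ms)


def attempts_with_backoff(call_ms: float,
                          max_attempts: int = 1000) -> Optional[int]:
    """Return the attempt on which the call completes, or None if it never does."""
    intervals = backoff_intervals()
    for attempt in range(1, max_attempts + 1):
        # Each retry gets a longer quiet window before the next signal.
        if call_ms <= next(intervals):
            return attempt
    return None


print(list(islice(backoff_intervals(), 10)))
print(attempts_with_backoff(8))     # completes once the interval reaches 10ms
print(attempts_with_backoff(1500))  # longer than the 1s cap: still never finishes
```

In this model an 8ms call gets through on the second attempt, but anything that needs longer than the cap is still starved, so the cap would have to be chosen with the slowest expected system call in mind.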

Any feedback or thoughts before we look at implementing a patch?
--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin
