Thread: [GENERAL] startup process stuck in recovery

[GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

09 October 2017, 22:08:13

Hi,

We're dealing with a 9.5.5 database with the symptom that, after a certain amount of time after restart, the startup
processreaches a certain WAL segment, and stops.  The startup process runs at 100% CPU, with no output from strace.
Thereare no queries running on the secondary, so it's not a delay for queries.  The WAL receiver continues to operate
normally,and we can see sent_location / write_location / flush_location continue to move ahead in parallel, with
replay_locationstuck in that WAL segment. 

Suggestions on further diagnosis?

--
-- Christophe Pettus  xof@thebuild.com



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Tom Lane

Date:

09 October 2017, 22:10:43

Christophe Pettus <xof@thebuild.com> writes:
> We're dealing with a 9.5.5 database with the symptom that, after a certain amount of time after restart, the startup
processreaches a certain WAL segment, and stops.  The startup process runs at 100% CPU, with no output from strace.
Thereare no queries running on the secondary, so it's not a delay for queries.  The WAL receiver continues to operate
normally,and we can see sent_location / write_location / flush_location continue to move ahead in parallel, with
replay_locationstuck in that WAL segment. 

> Suggestions on further diagnosis?

Attach to startup process with gdb, and get a stack trace?
        regards, tom lane


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

09 October 2017, 22:18:21

> On Oct 9, 2017, at 12:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> Attach to startup process with gdb, and get a stack trace?

#0  0x0000558812f4f1da in ?? ()
#1  0x0000558812f4f8cb in StandbyReleaseLockTree ()
#2  0x0000558812d718ee in ?? ()
#3  0x0000558812d75520 in xact_redo ()
#4  0x0000558812d7f713 in StartupXLOG ()
#5  0x0000558812f0e262 in StartupProcessMain ()
#6  0x0000558812d8d4ea in AuxiliaryProcessMain ()
#7  0x0000558812f0b2e9 in ?? ()
#8  0x0000558812f0dae7 in PostmasterMain ()
#9  0x0000558812d0c402 in main ()


--
-- Christophe Pettus  xof@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

09 October 2017, 22:21:34

On Oct 9, 2017, at 12:18, Christophe Pettus <xof@thebuild.com> wrote:
>
> #0  0x0000558812f4f1da in ?? ()
> #1  0x0000558812f4f8cb in StandbyReleaseLockTree ()
> #2  0x0000558812d718ee in ?? ()
> #3  0x0000558812d75520 in xact_redo ()
> #4  0x0000558812d7f713 in StartupXLOG ()
> #5  0x0000558812f0e262 in StartupProcessMain ()
> #6  0x0000558812d8d4ea in AuxiliaryProcessMain ()
> #7  0x0000558812f0b2e9 in ?? ()
> #8  0x0000558812f0dae7 in PostmasterMain ()
> #9  0x0000558812d0c402 in main ()

The other observation is that the startup process is holding a *lot* of locks:

(pid 5882 is the startup process)

# select mode, count(*) from pg_locks where pid=5882 group by mode;       mode         | count
---------------------+-------ExclusiveLock       |     1AccessExclusiveLock |  7133
(2 rows)

--
-- Christophe Pettus  xof@thebuild.com



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Tom Lane

Date:

09 October 2017, 23:01:14

Christophe Pettus <xof@thebuild.com> writes:
> The other observation is that the startup process is holding a *lot* of locks:

Hmm.  Is it possible that the process is replaying the abort of a
transaction with a lot of subtransactions?  It seems like maybe
you could be getting into an O(N^2) situation, though for N in
the low thousands it's hard to see that loop taking so long that
you'd think things were stuck.

> # select mode, count(*) from pg_locks where pid=5882 group by mode;
>  AccessExclusiveLock |  7133

Is that number changing at all?
        regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

09 October 2017, 23:10:11

> On Oct 9, 2017, at 13:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm.  Is it possible that the process is replaying the abort of a
> transaction with a lot of subtransactions?

That's possible, although we're now talking about an hours-long delay at this point.

> Is that number changing at all?

Increasing:

# select mode, count(*) from pg_locks where pid=5882 group by mode;       mode         | count
---------------------+-------ExclusiveLock       |     1AccessExclusiveLock |  8810
(2 rows)

--
-- Christophe Pettus  xof@thebuild.com



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Tom Lane

Date:

09 October 2017, 23:26:46

Christophe Pettus <xof@thebuild.com> writes:
>> On Oct 9, 2017, at 13:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Is that number changing at all?

> Increasing:
>  AccessExclusiveLock |  8810

Oh, that's really interesting.  So it's not *just* releasing locks but
also acquiring them, which says that it is making progress of some sort.

My bet is that the source server did something that's provoking O(N^2)
behavior in the standby server's lock management.  It's hard to say
exactly what, but I'm wondering about something like a plpgsql function
taking an AccessExclusiveLock inside a loop that repeatedly traps an
exception.  Can you correlate where the standby is stuck with what
was happening on the source?
        regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

10 October 2017, 00:01:06

> On Oct 9, 2017, at 13:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Oh, that's really interesting.  So it's not *just* releasing locks but
> also acquiring them, which says that it is making progress of some sort.

It seems to have leveled out now, and is still grinding away.

> Can you correlate where the standby is stuck with what
> was happening on the source?

There are definitely some mysterious functions being called, and I'll dig into those.

--
-- Christophe Pettus  xof@thebuild.com



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

10 October 2017, 00:24:00

> On Oct 9, 2017, at 13:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My bet is that the source server did something that's provoking O(N^2)
> behavior in the standby server's lock management.  It's hard to say
> exactly what, but I'm wondering about something like a plpgsql function
> taking an AccessExclusiveLock inside a loop that repeatedly traps an
> exception.  Can you correlate where the standby is stuck with what
> was happening on the source?

Interestingly, the OIDs for the relations on which the locks on the secondary are held aren't present in pg_class, and
they'reclustered together.  Could a large number of temporary table creations that are being undone by an abort cause
this?

--
-- Christophe Pettus  xof@thebuild.com



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Tom Lane

Date:

10 October 2017, 00:29:38

Christophe Pettus <xof@thebuild.com> writes:
>> On Oct 9, 2017, at 13:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> My bet is that the source server did something that's provoking O(N^2)
>> behavior in the standby server's lock management.  It's hard to say
>> exactly what, but I'm wondering about something like a plpgsql function
>> taking an AccessExclusiveLock inside a loop that repeatedly traps an
>> exception.  Can you correlate where the standby is stuck with what
>> was happening on the source?

> Interestingly, the OIDs for the relations on which the locks on the secondary are held aren't present in pg_class,
andthey're clustered together.  Could a large number of temporary table creations that are being undone by an abort
causethis? 

Hmm.  Creating or dropping a temp table does take AccessExclusiveLock,
just as it does for a non-temp table.  In principle we'd not have to
transmit those locks to standbys, but I doubt that the WAL code has
enough knowledge to filter them out.  So a lot of temp tables and
a lot of separate subtransactions could be a nasty combination.
        regards, tom lane


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

10 October 2017, 02:12:45

> On Oct 9, 2017, at 14:29, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm.  Creating or dropping a temp table does take AccessExclusiveLock,
> just as it does for a non-temp table.  In principle we'd not have to
> transmit those locks to standbys, but I doubt that the WAL code has
> enough knowledge to filter them out.  So a lot of temp tables and
> a lot of separate subtransactions could be a nasty combination.

The problem indeed appear to be a very large number of subtransactions, each one creating a temp table, inside a single
transaction. It's made worse by one of those transactions finally getting replayed on the secondary, only to have
anotherone come in right behind it... 
--
-- Christophe Pettus  xof@thebuild.com



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Merlin Moncure

Date:

10 October 2017, 02:25:38

On Mon, Oct 9, 2017 at 6:12 PM, Christophe Pettus <xof@thebuild.com> wrote:
>
>> On Oct 9, 2017, at 14:29, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm.  Creating or dropping a temp table does take AccessExclusiveLock,
>> just as it does for a non-temp table.  In principle we'd not have to
>> transmit those locks to standbys, but I doubt that the WAL code has
>> enough knowledge to filter them out.  So a lot of temp tables and
>> a lot of separate subtransactions could be a nasty combination.
>
> The problem indeed appear to be a very large number of subtransactions, each one creating a temp table, inside a
singletransaction.  It's made worse by one of those transactions finally getting replayed on the secondary, only to
haveanother one come in right behind it... 

FYI, this is pathological coding; it can cause other nastiness such as
exhausting shared memory.  Consider refactoring the code to using a
permanent table (if access is concurrent, you can do tricks like using
'pid' in the key to subdivide the data).

merlin


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Tom Lane

Date:

10 October 2017, 03:30:32

Christophe Pettus <xof@thebuild.com> writes:
> The problem indeed appear to be a very large number of subtransactions, each one creating a temp table, inside a
singletransaction.  It's made worse by one of those transactions finally getting replayed on the secondary, only to
haveanother one come in right behind it... 

Hmm, I tried to reproduce this and could not.  I experimented with
various permutations of this:

create or replace function goslow(n int) returns void language plpgsql as
$$
begin
for i in 1..n loop begin   execute 'create temp table myfoo' || i || '(f1 int)';   perform i/0; exception when
division_by_zerothen null; end; 
end loop;
perform n/0;
end;
$$;

and did not see any untoward behavior, at least not till I got to enough
temp tables to overrun the master's shared lock table, and even then it
cleaned up fine.  At no point was the standby process consuming anywhere
near as much CPU as the master's backend.

What am I missing to reproduce the problem?
        regards, tom lane


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Peter Geoghegan

Date:

10 October 2017, 04:02:20

On Mon, Oct 9, 2017 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> and did not see any untoward behavior, at least not till I got to enough
> temp tables to overrun the master's shared lock table, and even then it
> cleaned up fine.  At no point was the standby process consuming anywhere
> near as much CPU as the master's backend.
>
> What am I missing to reproduce the problem?

Just a guess, but do you disable autovacuum on your dev machine? (I know I do.)

It's possible that this is relevant:
https://postgr.es/m/CAB-EU3RawZx8-OzMfvswFf6z+Y7GOZf03TZ=bez+PbQX+A4M7Q@mail.gmail.com

-- 
Peter Geoghegan


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Tom Lane

Date:

10 October 2017, 04:05:18

Peter Geoghegan <pg@bowt.ie> writes:
> Just a guess, but do you disable autovacuum on your dev machine? (I know I do.)

Nope.
        regards, tom lane


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Peter Geoghegan

Date:

10 October 2017, 04:21:09

On Mon, Oct 9, 2017 at 12:08 PM, Christophe Pettus <xof@thebuild.com> wrote:
> Suggestions on further diagnosis?

What's the hot_standy_feedback setting? How about
max_standby_archive_delay/max_standby_streaming_delay?


-- 
Peter Geoghegan


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

10 October 2017, 05:25:03

> On Oct 9, 2017, at 18:21, Peter Geoghegan <pg@bowt.ie> wrote:
> What's the hot_standy_feedback setting? How about
> max_standby_archive_delay/max_standby_streaming_delay?

On, 5m, 5m.

--
-- Christophe Pettus  xof@thebuild.com



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

10 October 2017, 05:27:35

> On Oct 9, 2017, at 17:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> What am I missing to reproduce the problem?

Not sure.  The actual client behavior here is a bit cryptic (not our code, incompletely logs).  They might be creating
asavepoint before each temp table creation, without a release, but I would expect that to exhaust something else first. 

--
-- Christophe Pettus  xof@thebuild.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Alvaro Herrera

Date:

10 October 2017, 17:32:41

Tom Lane wrote:
> Christophe Pettus <xof@thebuild.com> writes:
> > The problem indeed appear to be a very large number of subtransactions, each one creating a temp table, inside a
singletransaction.  It's made worse by one of those transactions finally getting replayed on the secondary, only to
haveanother one come in right behind it...
 
> 
> Hmm, I tried to reproduce this and could not.  I experimented with
> various permutations of this:

This problem is probably related to commit 9b013dc238c, which AFAICS is
only in pg10, not 9.5.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Tom Lane

Date:

10 October 2017, 18:05:32

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Hmm, I tried to reproduce this and could not.  I experimented with
>> various permutations of this:

> This problem is probably related to commit 9b013dc238c, which AFAICS is
> only in pg10, not 9.5.

You're right, I was testing on HEAD, so that patch might've obscured
the problem.  But the code looks like it could still be O(N^2) in
some cases.  Will look again later.
        regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

10 October 2017, 21:37:15

> On Oct 10, 2017, at 08:05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> You're right, I was testing on HEAD, so that patch might've obscured
> the problem.  But the code looks like it could still be O(N^2) in
> some cases.  Will look again later.

I was able to reproduce this on 9.5.9 with the following:

DO $$
DECLARE  i int := 1;
BEGIN  FOR i IN 1..12000 LOOP     BEGIN         PERFORM f();         i := i / 0;     EXCEPTION        WHEN
division_by_zeroTHEN     END;  END LOOP; 
END;
$$ language plpgsql;

where f() is:

CREATE OR REPLACE FUNCTION f() RETURNS VOID AS $$
BEGIN CREATE TEMPORARY TABLE test_table ON COMMIT DROP AS SELECT i FROM generate_series(1, 100) i;
END:
$$ language plpgsql;

A couple of observations:

-- In this version, I couldn't do a select * from pg_locks() on the secondary without getting an out-of-shared-memory
error.
-- If I increased max_locks_per_transaction to 15000, the problem didn't occur, even if I bumped up the number of
iterationsin the first to 20000. 

--
-- Christophe Pettus  xof@thebuild.com



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Tom Lane

Date:

10 October 2017, 23:23:52

Christophe Pettus <xof@thebuild.com> writes:
> I was able to reproduce this on 9.5.9 with the following:

Hmm ... so I still can't reproduce the specific symptoms Christophe
reports.

What I see is that, given this particular test case, the backend
process on the master never holds more than a few locks at a time.
Each time we abort a subtransaction, the AE lock it was holding
on the temp table it created gets dropped. However ... on the
standby server, pre v10, the replay process attempts to take all
12000 of those AE locks at once. This is not a great plan.

On 9.5, for me, as soon as we're out of shared memory
ResolveRecoveryConflictWithLock will go into an infinite loop.
And AFAICS it *is* infinite; it doesn't look to me like it's
making any progress. This is pretty easy to diagnose though
because it spews "out of shared memory" WARNING messages to the
postmaster log at an astonishing rate.

9.6 hits the OOM condition as well, but manages to get out of it
somehow. I'm not very clear how, and the log trace doesn't look
like it's real clean: after a bunch of these

WARNING: out of shared memory
CONTEXT: xlog redo at 0/C1098AC0 for Standby/LOCK: xid 134024 db 423347 rel 524106
WARNING: out of shared memory
CONTEXT: xlog redo at 0/C10A97E0 for Standby/LOCK: xid 134024 db 423347 rel 524151
WARNING: out of shared memory
CONTEXT: xlog redo at 0/C10B36B0 for Standby/LOCK: xid 134024 db 423347 rel 524181
WARNING: out of shared memory
CONTEXT: xlog redo at 0/C10BD780 for Standby/LOCK: xid 134024 db 423347 rel 524211

you get a bunch of these

WARNING: you don't own a lock of type AccessExclusiveLock
CONTEXT: xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 15:05:56.615721-04
LOG: RecoveryLockList contains entry for lock no longer recorded by lock manager: xid 134024 database 423347 relation
526185
CONTEXT: xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 15:05:56.615721-04
WARNING: you don't own a lock of type AccessExclusiveLock
CONTEXT: xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 15:05:56.615721-04
LOG: RecoveryLockList contains entry for lock no longer recorded by lock manager: xid 134024 database 423347 relation
526188
CONTEXT: xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 15:05:56.615721-04
WARNING: you don't own a lock of type AccessExclusiveLock
CONTEXT: xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 15:05:56.615721-04
LOG: RecoveryLockList contains entry for lock no longer recorded by lock manager: xid 134024 database 423347 relation
526191
CONTEXT: xlog redo at 0/C13A79B0 for Transaction/COMMIT: 2017-10-10 15:05:56.615721-04

The important point though is that "a bunch" is a finite number,
whereas 9.5 seems to be just stuck. I'm not sure how Christophe's
server managed to continue to make progress.

It looks like the 9.6-era patch 37c54863c must have been responsible
for that behavioral change. There's no indication in the commit message
or the comments that anyone had specifically considered the OOM
scenario, so I think it's just accidental that it's better.

v10 and HEAD avoid the problem because the standby server doesn't
take locks (any at all, AFAICS). I suppose this must be a
consequence of commit 9b013dc238c, though I'm not sure exactly how.

Anyway, it's pretty scary that it's so easy to run the replay process
out of shared memory pre-v10. I wonder if we should consider
backpatching that fix. Any situation where the replay process takes
more locks concurrently than were ever held on the master is surely
very bad news.

A marginally lesser concern is that the replay process does need to have
robust behavior in the face of locktable OOM. AFAICS whatever it is doing
now is just accidental, and I'm not sure it's correct. "Doesn't get into
an infinite loop" is not a sufficiently high bar.

And I'm still wondering exactly what Christophe actually saw ...
regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Simon Riggs

Date:

11 October 2017, 09:54:32

On 10 October 2017 at 21:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> What I see is that, given this particular test case, the backend
> process on the master never holds more than a few locks at a time.
> Each time we abort a subtransaction, the AE lock it was holding
> on the temp table it created gets dropped.  However ... on the
> standby server, pre v10, the replay process attempts to take all
> 12000 of those AE locks at once.  This is not a great plan.

Standby doesn't take locks "at once", they are added just as they
arrive. The locks are held by topxid, so not released at subxid abort,
by design, so they are held concurrently.

> v10 and HEAD avoid the problem because the standby server doesn't
> take locks (any at all, AFAICS).  I suppose this must be a
> consequence of commit 9b013dc238c, though I'm not sure exactly how.

Locks are still taken, but in 9b013dc238c we just avoid trying to
release locks when transactions don't have any.

> Anyway, it's pretty scary that it's so easy to run the replay process
> out of shared memory pre-v10.  I wonder if we should consider
> backpatching that fix.  Any situation where the replay process takes
> more locks concurrently than were ever held on the master is surely
> very bad news.

v10 improves on this specific point because we perform lock release at
subxid abort.

Various cases have been reported over time and this has been improving
steadily in each release.

It isn't "easy" to run the replay process out of memory because
clearly that doesn't happen much, but yes there are some pessimal use
cases that don't work well. The use case described seems incredibly
unreal and certainly amenable to being rewritten.

Backpatching some of those fixes is quite risky, IMHO.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Christophe Pettus

Date:

11 October 2017, 10:09:23

> On Oct 10, 2017, at 23:54, Simon Riggs <simon@2ndQuadrant.com> wrote:
>
> The use case described seems incredibly
> unreal and certainly amenable to being rewritten.

While it's certainly true that this was an extreme case, it was a real-life production situation.  The concern here is
thatin the actual production situation, the only symptom was that the startup process just stopped.  There were no log
messagesor any other indication of what was going wrong. 
--
-- Christophe Pettus  xof@thebuild.com



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Simon Riggs

Date:

11 October 2017, 11:27:17

On 11 October 2017 at 08:09, Christophe Pettus <xof@thebuild.com> wrote:
>
>> On Oct 10, 2017, at 23:54, Simon Riggs <simon@2ndQuadrant.com> wrote:
>>
>> The use case described seems incredibly
>> unreal and certainly amenable to being rewritten.
>
> While it's certainly true that this was an extreme case, it was a real-life production situation.  The concern here
isthat in the actual production situation, the only symptom was that the startup process just stopped.  There were no
logmessages or any other indication of what was going wrong. 

Which indicates it was making progress, just slowly.

Tom says "This is pretty easy to diagnose though
because it spews "out of shared memory" WARNING messages to the
postmaster log at an astonishing rate"

These don't seem to match.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] startup process stuck in recovery

From

Tom Lane

Date:

11 October 2017, 17:00:12

Simon Riggs <simon@2ndquadrant.com> writes:
> On 11 October 2017 at 08:09, Christophe Pettus <xof@thebuild.com> wrote:
>> While it's certainly true that this was an extreme case, it was a real-life production situation.  The concern here
isthat in the actual production situation, the only symptom was that the startup process just stopped.  There were no
logmessages or any other indication of what was going wrong. 

> Which indicates it was making progress, just slowly.

> Tom says "This is pretty easy to diagnose though
> because it spews "out of shared memory" WARNING messages to the
> postmaster log at an astonishing rate"

> These don't seem to match.

Yeah.  I'm still suspicious that Christophe saw some other misbehavior
than the one I found.  We know his server was dealing with < 10K locks,
which doesn't seem like enough to cause any obvious problem from a mere
O(N^2) behavior.
        regards, tom lane


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general