Re: dsa_allocate() faliure - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: dsa_allocate() faliure
Msg-id 20190211000215.GU31721@telsasoft.com
In response to Re: dsa_allocate() faliure  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: dsa_allocate() faliure  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On Mon, Feb 11, 2019 at 09:45:07AM +1100, Thomas Munro wrote:
> Ouch.  Yeah, that'd do it and matches the evidence.  With this change,
> I couldn't reproduce the problem after 90 minutes with a test case
> that otherwise hits it within a couple of minutes.
...
> Note that this patch addresses the error "dsa_allocate could not find
> %zu free pages".  (The error "dsa_area could not attach to segment" is
> something else and apparently rarer.)

"could not attach" is the error reported early this morning while
stress-testing this patch with queued_alters queries in loops, so that's
consistent with your understanding.  And I guess it preceded getting stuck on
a lock.  I don't know how long passed between the first and the second, but
I'd guess not long, perhaps immediately, since the rest of the processes were
all stuck as in bug #15585 rather than ERRORing once every few minutes.

I mentioned that "could not attach to segment" occurs in either the leader or a
parallel worker.  Most of the time it causes only an ERROR and doesn't wedge
all future parallel workers.  Maybe the bug #15585 "wedged" state only occurs
after some particular pattern of leader+worker failures (?)  I've just triggered
bug #15585 again, but if there's a pattern, I don't see it.

Please let me know whether you're able to reproduce the "not attach" bug using
simultaneous loops around the queued_alters query; it reproduces easily here.
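For anyone trying to reproduce this, here is a rough sketch of the "simultaneous loops" setup as a small Python harness. It runs one query function in several concurrent loops and tallies the error messages seen, which can help in looking for a pattern between leader and worker failures. The queued_alters query text isn't in this message, so `run_query` is a hypothetical placeholder. With a real driver (psycopg2, say), it would execute that query, and each thread should use its own connection. The names `stress`, `run_query` and `fake_query` are made up for illustration. The error string in the stand-in is only a label borrowed from the message quoted above.

```python
import itertools
import threading
from collections import Counter
from typing import Callable


def stress(run_query: Callable[[], None], n_loops: int, iterations: int) -> Counter:
    """Run run_query() in n_loops concurrent threads, each looping
    `iterations` times, and count how often each error message occurs."""
    errors: Counter = Counter()
    lock = threading.Lock()  # protects the shared Counter

    def loop() -> None:
        for _ in range(iterations):
            try:
                run_query()  # hypothetical: would execute the queued_alters query
            except Exception as exc:  # a real driver raises its own error class
                with lock:
                    errors[str(exc)] += 1

    threads = [threading.Thread(target=loop) for _ in range(n_loops)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


# Usage with a stand-in for the real query: every third call, across all
# threads, raises an error (calls are numbered under a lock).
_calls = itertools.count(1)
_calls_lock = threading.Lock()


def fake_query() -> None:
    with _calls_lock:
        n = next(_calls)
    if n % 3 == 0:
        raise RuntimeError("dsa_area could not attach to segment")


result = stress(fake_query, n_loops=4, iterations=6)
print(result)
```

With 4 loops of 6 iterations, the stand-in is called 24 times and fails on 8 of them. Against a real server, the interesting output is which messages show up, and whether the processes stop making progress instead of returning errors.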

Justin

