Re: BUG #17791: Assert on procarray.c - Mailing list pgsql-bugs

From Andres Freund
Subject Re: BUG #17791: Assert on procarray.c
Date
Msg-id 20230215050612.po5rjq6zd7oq7cu6@awork3.anarazel.de
In response to Re: BUG #17791: Assert on procarray.c  (Robins Tharakan <tharakan@gmail.com>)
Responses Re: BUG #17791: Assert on procarray.c  (Robins Tharakan <tharakan@gmail.com>)
List pgsql-bugs
Hi,

On 2023-02-15 14:46:13 +1030, Robins Tharakan wrote:
> Thanks for taking a look and possibly you're correct with your
> assumption. I mean I see a ton of FATALs but let me know if I am
> mistaken in assuming them to be harmless (since they just convey that
> the client's gone away)?

Those are indeed not very interesting - although it'd be interesting to know
what caused the clients to go away.


> Nonetheless, I have provided error logs going back till Oct 22 just in
> case the engine can recover from some of those scenarios. Two things
> about the test scenario that may be relevant:
> 
> 1. Since performance was the least of my worries, the postgres server
> and the client workload are on the same box. Add dblink / FDW to this
> mix, and it is easy to end up with a ton of loopback connections
> (think SELECT dblink_connect() FROM pg_catalog.pg_class) - IMO
> noteworthy, since there are a ton of "Broken pipe"s and one instance
> of 'too many file descriptors'.

I think the "too many file descriptors" bit might be the interesting part.

I suspect the reason you're not seeing this on newer versions is that 13+ has

commit 3d475515a15f70a4a3f36fbbba93db6877ff8346
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2020-02-24 17:28:33 -0500

    Account explicitly for long-lived FDs that are allocated outside fd.c.


But I can't yet explain precisely why that causes the assertion failures. A
vague guess is that we fail to write 2PC state files due to the lack of FD
accounting, throw an error because of that, and then hit that assert while
handling the error.


It might be worth trying to reproduce the issue with a much lower
ulimit -S -n, to reach the problematic state more quickly. A reproducer would
be very useful.
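
As a rough illustration of the lower-limit idea, here is a minimal sketch of starting a process under a reduced soft file descriptor limit, which has the same effect as running `ulimit -S -n` in the launching shell. It assumes a Unix-like system; the helper name `run_with_fd_limit` and any limit value passed to it are hypothetical and not taken from this thread.

```python
import resource
import subprocess


def run_with_fd_limit(cmd, soft_limit):
    """Run cmd with a lowered soft RLIMIT_NOFILE (like `ulimit -S -n`).

    Hypothetical helper for illustration only; not part of PostgreSQL.
    """
    # Read the current hard limit so it can be left unchanged.
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY and soft_limit > hard:
        raise ValueError(
            "soft limit %d exceeds hard limit %d" % (soft_limit, hard)
        )

    def lower_limit():
        # Runs in the child between fork() and exec(), so only the
        # launched command (and its descendants) sees the lower limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))

    return subprocess.run(
        cmd, preexec_fn=lower_limit, capture_output=True, text=True
    )
```

The same helper could in principle wrap whatever command starts the server under test, so that the server and its backends inherit the reduced limit; which command that is, and how low the limit can go before the server refuses to start, depends on the local setup.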


> 2. All versions are subjected to similar workload and it is possible
> that v13+ has generally improved in this area, and thus this possibly
> crashes less? Unsure.

What range of versions / commits are you testing this workload on?

Are you testing 11 as well? Because I don't see why we'd have the issue on 12,
but not 11.

Greetings,

Andres Freund
