Thread: sidewinder has one failure

sidewinder has one failure

From: Amit Kapila
After my recent commit d207038053837ae9365df2776371632387f6f655,
sidewinder is failing with the error "insufficient file descriptors .."
in test 006_logical_decoding.pl [1].  The failed run's log shows the
following messages:

006_logical_decoding_master.log
2020-01-02 19:51:05.567 CET [26174:3] 006_logical_decoding.pl LOG:
statement: ALTER SYSTEM SET max_files_per_process = 26;
2020-01-02 19:51:05.570 CET [2777:4] LOG:  received fast shutdown request
2020-01-02 19:51:05.570 CET [26174:4] 006_logical_decoding.pl LOG:
disconnection: session time: 0:00:00.005 user=pgbf database=postgres
host=[local]
2020-01-02 19:51:05.571 CET [2777:5] LOG:  aborting any active transactions
2020-01-02 19:51:05.572 CET [2777:6] LOG:  background worker "logical
replication launcher" (PID 23736) exited with exit code 1
2020-01-02 19:51:05.572 CET [15764:1] LOG:  shutting down
2020-01-02 19:51:05.575 CET [2777:7] LOG:  database system is shut down
2020-01-02 19:51:05.685 CET [24138:1] LOG:  starting PostgreSQL 12.1
on x86_64-unknown-netbsd7.0, compiled by gcc (nb2 20150115) 4.8.4,
64-bit
2020-01-02 19:51:05.686 CET [24138:2] LOG:  listening on Unix socket
"/tmp/sxAcn7SAzt/.s.PGSQL.56110"
2020-01-02 19:51:05.687 CET [24138:3] FATAL:  insufficient file
descriptors available to start server process
2020-01-02 19:51:05.687 CET [24138:4] DETAIL:  System allows 19, we
need at least 20.
2020-01-02 19:51:05.687 CET [24138:5] LOG:  database system is shut down

Here, I think it is clear that the failure happens because we are
setting max_files_per_process to 26, which is too low for this
machine.  It seems to me that the reason it is failing is that, before
reaching set_max_safe_fds, it already has seven open files.  Now, on
my CentOS system, I see that the already_open count is 3, 6, and 6
for HEAD, v12, and v10 respectively.  We can easily see the number of
already-open files by changing the error level from DEBUG2 to LOG for
the elog message in set_max_safe_fds.  It is not clear to me how many
files we can expect to be kept open during startup.  Can the number
vary across different setups?

One possible fix is to change the test to set max_files_per_process
to a slightly higher number, say 35, but I am not sure what a safe
value would be.  Alternatively, we could remove the test entirely, but
it seems useful for exercising corner cases, which is why we added it
in the first place.

I am planning to investigate this further by seeing which files are
kept open and why.  I will share my findings, but in the meantime, if
anyone has any thoughts on this matter, please feel free to share
them.


[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-01-02%2018%3A45%3A25

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: sidewinder has one failure

From: Mikael Kjellström
On 2020-01-03 13:01, Amit Kapila wrote:

> 2020-01-02 19:51:05.687 CET [24138:3] FATAL:  insufficient file
> descriptors available to start server process
> 2020-01-02 19:51:05.687 CET [24138:4] DETAIL:  System allows 19, we
> need at least 20.
> 2020-01-02 19:51:05.687 CET [24138:5] LOG:  database system is shut down
> 
> Here, I think it is clear that the failure happens because we are
> setting the value of max_files_per_process as 26 which is low for this
> machine.  It seems to me that the reason it is failing is that before
> reaching set_max_safe_fds, it has already seven open files.  Now, I
> see on my CentOS system, the value of already_open files is 3, 6 and 6
> respectively for versions HEAD, 12 and 10.  We can easily see the
> number of already opened files by changing the error level from DEBUG2
> to LOG for elog message in set_max_safe_fds.  It is not very clear to
> me how many files we can expect to be kept open during startup?  Can
> the number vary on different setups?

Hm, where does it get the limit from?  Is it something we set?

Why is this machine different from everybody else when it comes to this 
limit?

ulimit -a says:

$ ulimit -a
time(cpu-seconds)    unlimited
file(blocks)         unlimited
coredump(blocks)     unlimited
data(kbytes)         262144
stack(kbytes)        4096
lockedmem(kbytes)    672036
memory(kbytes)       2016108
nofiles(descriptors) 1024
processes            1024
threads              1024
vmemory(kbytes)      unlimited
sbsize(bytes)        unlimited

Is there any configuration setting I could do on the machine to increase 
this limit?
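(For reference, the limit in this failure is PostgreSQL's own
max_files_per_process GUC, which the test pins at 26, not the kernel's
nofiles limit of 1024 shown above.  Raising the OS-level ceilings on
NetBSD would look roughly like the following; the sysctl name is from
memory, so treat it as an assumption to verify:)

```shell
# Per-process soft limit on descriptors (the "nofiles" line above)
ulimit -n                            # show the current soft limit
ulimit -n 4096 2>/dev/null || true   # raise it, if the hard limit allows

# System-wide ceiling on open files (NetBSD sysctl; name from memory)
sysctl kern.maxfiles 2>/dev/null || echo "kern.maxfiles not available here"
# As root, something like: sysctl -w kern.maxfiles=20000
```

Since the server here is capped at max_files_per_process = 26 long before
nofiles = 1024 matters, raising the OS limits would not by itself make
this test pass.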

/Mikael



Re: sidewinder has one failure

From: Amit Kapila
On Fri, Jan 3, 2020 at 6:34 PM Mikael Kjellström
<mikael.kjellstrom@mksoft.nu> wrote:
>
>
> On 2020-01-03 13:01, Amit Kapila wrote:
>
> > 2020-01-02 19:51:05.687 CET [24138:3] FATAL:  insufficient file
> > descriptors available to start server process
> > 2020-01-02 19:51:05.687 CET [24138:4] DETAIL:  System allows 19, we
> > need at least 20.
> > 2020-01-02 19:51:05.687 CET [24138:5] LOG:  database system is shut down
> >
> > Here, I think it is clear that the failure happens because we are
> > setting the value of max_files_per_process as 26 which is low for this
> > machine.  It seems to me that the reason it is failing is that before
> > reaching set_max_safe_fds, it has already seven open files.  Now, I
> > see on my CentOS system, the value of already_open files is 3, 6 and 6
> > respectively for versions HEAD, 12 and 10.  We can easily see the
> > number of already opened files by changing the error level from DEBUG2
> > to LOG for elog message in set_max_safe_fds.  It is not very clear to
> > me how many files we can expect to be kept open during startup?  Can
> > the number vary on different setups?
>
> Hm, where does it get the limit from?  Is it something we set?
>
> Why is this machine different from everybody else when it comes to this
> limit?
>

The problem we are seeing on this machine, I think, is that seven
files are already open before we reach set_max_safe_fds during
startup.  It is not clear to me why it opens extra file(s) during
startup compared to other machines.  This kind of problem could occur
if shared_preload_libraries is set and a library opens a file that it
never closes, or if some other configuration causes an extra file to
be opened.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: sidewinder has one failure

From: Amit Kapila
On Fri, Jan 3, 2020 at 7:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 3, 2020 at 6:34 PM Mikael Kjellström
> <mikael.kjellstrom@mksoft.nu> wrote:
> >
> >
> > On 2020-01-03 13:01, Amit Kapila wrote:
> >
> > > 2020-01-02 19:51:05.687 CET [24138:3] FATAL:  insufficient file
> > > descriptors available to start server process
> > > 2020-01-02 19:51:05.687 CET [24138:4] DETAIL:  System allows 19, we
> > > need at least 20.
> > > 2020-01-02 19:51:05.687 CET [24138:5] LOG:  database system is shut down
> > >
> > > Here, I think it is clear that the failure happens because we are
> > > setting the value of max_files_per_process as 26 which is low for this
> > > machine.  It seems to me that the reason it is failing is that before
> > > reaching set_max_safe_fds, it has already seven open files.  Now, I
> > > see on my CentOS system, the value of already_open files is 3, 6 and 6
> > > respectively for versions HEAD, 12 and 10.

I debugged on HEAD and found that we close all the files (like
postgresql.conf, postgresql.auto.conf, etc.) that get opened before
set_max_safe_fds.  I think on HEAD the 3 already-open files are
basically stdin, stdout, and stderr.  It is still not clear why some
other versions show a different number of already-open files.

> > >  We can easily see the
> > > number of already opened files by changing the error level from DEBUG2
> > > to LOG for elog message in set_max_safe_fds.  It is not very clear to
> > > me how many files we can expect to be kept open during startup?  Can
> > > the number vary on different setups?
> >
> > Hm, where does it get the limit from?  Is it something we set?
> >
> > Why is this machine different from everybody else when it comes to this
> > limit?
> >

Mikael, is it possible for you to set log_min_messages to DEBUG2 on
your machine and start the server?  You should see a line like
"max_safe_fds = 984, usable_fds = 1000, already_open = 6".  Could you
share that information?  This is just to confirm whether the
already_open number is 7 on your machine.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: sidewinder has one failure

From: Tom Lane
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Fri, Jan 3, 2020 at 6:34 PM Mikael Kjellström
> <mikael.kjellstrom@mksoft.nu> wrote:
>> Why is this machine different from everybody else when it comes to this
>> limit?

> The problem we are seeing on this machine is that I think we have
> seven files opened before we reach function set_max_safe_fds during
> startup.  Now, it is not clear to me why it is opening extra file(s)
> during start-up as compare to other machines.

Maybe it uses one of the semaphore implementations that consume a
file descriptor per semaphore?

I think that d20703805 was insanely optimistic to think that a
tiny value of max_files_per_process would work the same everywhere.
I'd actually recommend just dropping that test, as I do not think
it's possible to make it portable and reliable.  Even if it could
be fixed, I doubt it would ever find any actual bug to justify
the sweat it would take to maintain it.

            regards, tom lane



Re: sidewinder has one failure

From: Tom Lane
I wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> The problem we are seeing on this machine is that I think we have
>> seven files opened before we reach function set_max_safe_fds during
>> startup.  Now, it is not clear to me why it is opening extra file(s)
>> during start-up as compare to other machines.

> Maybe it uses one of the semaphore implementations that consume a
> file descriptor per semaphore?

Hm, no, sidewinder reports that it's using SysV semaphores:

checking which semaphore API to use... System V

However, I tried building an installation that uses named POSIX
semaphores, by applying the attached hack on a macOS system.
And sure enough, this test crashes and burns:

2020-01-03 11:36:21.571 EST [91597] FATAL:  insufficient file descriptors available to start server process
2020-01-03 11:36:21.571 EST [91597] DETAIL:  System allows -8, we need at least 20.
2020-01-03 11:36:21.571 EST [91597] LOG:  database system is shut down

Looking at "lsof" output for a postmaster with max_connections=10,
max_wal_senders=5 (the parameters set up by PostgresNode.pm), I see
that it's got 31 "PSXSEM" file descriptors, so the number shown here
is about what you'd expect.  We might be able to constrain that down
a little further, but still, this test has no chance of working in
anything like its present form on a machine that needs file
descriptors for semaphores.  That's a supported configuration, even
if not a recommended one, so I don't think it's okay for the test
to fall over.

(Hmm ... apparently, we have no buildfarm members that use such
semaphores and are running the TAP tests, else we'd have additional
complaints.  Perhaps that's a bad omission.)

Anyway, it remains unclear exactly why sidewinder is failing, but
I'm guessing it has a few more open files than you expected.  My
macOS build has a few more than I can account for in my caffeine-
deprived state, too.  One of them might be for bonjour ... not sure
about some of the rest.  Bottom line here is that it's hard to
predict with any accuracy how many pre-opened files there will be.

            regards, tom lane


diff --git a/src/template/darwin b/src/template/darwin
index f4d4e9d7cf..98331be22d 100644
--- a/src/template/darwin
+++ b/src/template/darwin
@@ -23,11 +23,4 @@ CFLAGS_SL=""
 # support System V semaphores; before that we have to use named POSIX
 # semaphores, which are less good for our purposes because they eat a
 # file descriptor per backend per max_connection slot.
-case $host_os in
-  darwin[015].*)
     USE_NAMED_POSIX_SEMAPHORES=1
-    ;;
-  *)
-    USE_SYSV_SEMAPHORES=1
-    ;;
-esac



Re: sidewinder has one failure

From: Mikael Kjellström
On 2020-01-03 15:48, Amit Kapila wrote:
> On Fri, Jan 3, 2020 at 7:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> I debugged on HEAD and found that we are closing all the files (like
> postgresql.conf, postgresql.auto.conf, etc.) that got opened before
> set_max_safe_fds.  I think on HEAD the 3 already opened files are
> basically stdin, stdout, stderr.   It is still not clear why on some
> other versions it shows different number of already opened files.

I think Tom Lane found the "problem".  It has to do with the
semaphores taking up FDs.


>>>>   We can easily see the
>>>> number of already opened files by changing the error level from DEBUG2
>>>> to LOG for elog message in set_max_safe_fds.  It is not very clear to
>>>> me how many files we can expect to be kept open during startup?  Can
>>>> the number vary on different setups?
>>>
>>> Hm, where does it get the limit from?  Is it something we set?
>>>
>>> Why is this machine different from everybody else when it comes to this
>>> limit?
>>>
> 
> Mikael, is it possible for you to set log_min_messages to DEBUG2 on
> your machine and start the server.  You must see a line like:
> "max_safe_fds = 984, usable_fds = 1000, already_open = 6".  Is it
> possible to share that information?  This is just to confirm if the
> already_open number is 7 on your machine.

Sure.  I compiled pgsql 12 and this is the complete logfile after 
starting up the server the first time with log_min_messages=debug2:


2020-01-04 01:03:14.484 CET [14906] DEBUG:  registering background 
worker "logical replication launcher"
2020-01-04 01:03:14.484 CET [14906] LOG:  starting PostgreSQL 12.1 on 
x86_64-unknown-netbsd7.0, compiled by gcc (nb2 20150115) 4.8.4, 64-bit
2020-01-04 01:03:14.484 CET [14906] LOG:  listening on IPv6 address 
"::1", port 5432
2020-01-04 01:03:14.484 CET [14906] LOG:  listening on IPv4 address 
"127.0.0.1", port 5432
2020-01-04 01:03:14.485 CET [14906] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5432"
2020-01-04 01:03:14.491 CET [14906] DEBUG:  SlruScanDirectory invoking 
callback on pg_notify/0000
2020-01-04 01:03:14.491 CET [14906] DEBUG:  removing file "pg_notify/0000"
2020-01-04 01:03:14.491 CET [14906] DEBUG:  dynamic shared memory system 
will support 308 segments
2020-01-04 01:03:14.491 CET [14906] DEBUG:  created dynamic shared 
memory control segment 2134641633 (7408 bytes)
2020-01-04 01:03:14.492 CET [14906] DEBUG:  max_safe_fds = 984, 
usable_fds = 1000, already_open = 6
2020-01-04 01:03:14.493 CET [426] LOG:  database system was shut down at 
2020-01-04 01:00:15 CET
2020-01-04 01:03:14.493 CET [426] DEBUG:  checkpoint record is at 0/15F15B8
2020-01-04 01:03:14.493 CET [426] DEBUG:  redo record is at 0/15F15B8; 
shutdown true
2020-01-04 01:03:14.493 CET [426] DEBUG:  next transaction ID: 486; next 
OID: 12974
2020-01-04 01:03:14.493 CET [426] DEBUG:  next MultiXactId: 1; next 
MultiXactOffset: 0
2020-01-04 01:03:14.493 CET [426] DEBUG:  oldest unfrozen transaction 
ID: 479, in database 1
2020-01-04 01:03:14.493 CET [426] DEBUG:  oldest MultiXactId: 1, in 
database 1
2020-01-04 01:03:14.493 CET [426] DEBUG:  commit timestamp Xid 
oldest/newest: 0/0
2020-01-04 01:03:14.493 CET [426] DEBUG:  transaction ID wrap limit is 
2147484126, limited by database with OID 1
2020-01-04 01:03:14.493 CET [426] DEBUG:  MultiXactId wrap limit is 
2147483648, limited by database with OID 1
2020-01-04 01:03:14.493 CET [426] DEBUG:  starting up replication slots
2020-01-04 01:03:14.493 CET [426] DEBUG:  starting up replication origin 
progress state
2020-01-04 01:03:14.493 CET [426] DEBUG:  MultiXactId wrap limit is 
2147483648, limited by database with OID 1
2020-01-04 01:03:14.493 CET [426] DEBUG:  MultiXact member stop limit is 
now 4294914944 based on MultiXact 1
2020-01-04 01:03:14.494 CET [14906] DEBUG:  starting background worker 
process "logical replication launcher"
2020-01-04 01:03:14.494 CET [14906] LOG:  database system is ready to 
accept connections
2020-01-04 01:03:14.495 CET [9809] DEBUG:  autovacuum launcher started
2020-01-04 01:03:14.496 CET [11463] DEBUG:  received inquiry for database 0
2020-01-04 01:03:14.496 CET [11463] DEBUG:  writing stats file 
"pg_stat_tmp/global.stat"
2020-01-04 01:03:14.497 CET [7890] DEBUG:  logical replication launcher 
started
2020-01-04 01:03:14.498 CET [28096] DEBUG:  checkpointer updated shared 
memory configuration values

/Mikael



Re: sidewinder has one failure

From: Tom Lane
Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
> I think Tom Lane found the "problem".  It has to do with the semaphores 
> taking up FD's.

Hm, no, because:

> Sure.  I compiled pgsql 12 and this is the complete logfile after 
> starting up the server the first time with log_min_messages=debug2:
> 2020-01-04 01:03:14.492 CET [14906] DEBUG:  max_safe_fds = 984, 
> usable_fds = 1000, already_open = 6

That's pretty much the same thing we see on most other platforms.
Plus your configure log shows that SysV semaphores were selected,
and those don't eat FDs.

Apparently, in the environment of that TAP test, the server has more
open FDs at this point than it does when running "normally".  I have
no idea what the additional FDs might be.

            regards, tom lane



Re: sidewinder has one failure

From: Mikael Kjellström
On 2020-01-04 01:15, Tom Lane wrote:
> Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
>> I think Tom Lane found the "problem".  It has to do with the semaphores
>> taking up FD's.
> 
> Hm, no, because:

Yes, saw that after I posted my answer.


>> Sure.  I compiled pgsql 12 and this is the complete logfile after
>> starting up the server the first time with log_min_messages=debug2:
>> 2020-01-04 01:03:14.492 CET [14906] DEBUG:  max_safe_fds = 984,
>> usable_fds = 1000, already_open = 6
> 
> That's pretty much the same thing we see on most other platforms.
> Plus your configure log shows that SysV semaphores were selected,
> and those don't eat FDs.

Yes, it looks "normal".


> Apparently, in the environment of that TAP test, the server has more
> open FDs at this point than it does when running "normally".  I have
> no idea what the additional FDs might be.

Well, it's running under cron, if that makes a difference.  And what
is the TAP test using?  Perl?

/Mikael



Re: sidewinder has one failure

From: Mikael Kjellström
On 2020-01-04 01:21, Mikael Kjellström wrote:

>> Apparently, in the environment of that TAP test, the server has more
>> open FDs at this point than it does when running "normally".  I have
>> no idea what the additional FDs might be.
> 
> Well it's running under cron if that makes a difference and what is the 
> TAP-test using?  perl?

I tried starting it from cron and then I got:

  max_safe_fds = 981, usable_fds = 1000, already_open = 9

/Mikael



Re: sidewinder has one failure

From: Tom Lane
Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
> On 2020-01-04 01:15, Tom Lane wrote:
>> Apparently, in the environment of that TAP test, the server has more
>> open FDs at this point than it does when running "normally".  I have
>> no idea what the additional FDs might be.

> Well it's running under cron if that makes a difference and what is the 
> TAP-test using?  perl?

Not sure.  There are a few things you could do to investigate:

* Run the recovery TAP tests.  Do you reproduce the buildfarm failure
in your hand build?  If not, we need to ask what's different.

* If you do reproduce it, run those tests at debug2, just to confirm
the theory that already_open is higher than normal.  (The easy way
to make that happen is to add another line to what PostgresNode.pm's
init function is adding to postgresql.conf.)

* Also, try putting a pg_usleep call just before the error in fd.c,
to give yourself enough time to manually point "lsof" at the
postmaster and see what all its FDs are.
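The lsof step can be sketched like this (paths are illustrative; on Linux
/proc can substitute for lsof, and on NetBSD fstat(1) plays the same
role):

```shell
# The first line of postmaster.pid is the postmaster's PID; fall back to
# the current shell's PID just so the commands below are demonstrable.
PID=$(head -1 "$PGDATA/postmaster.pid" 2>/dev/null || echo $$)

# Linux: each open descriptor appears as a symlink under /proc/<pid>/fd
ls -l /proc/"$PID"/fd 2>/dev/null || true

# Portable alternatives
lsof -p "$PID" 2>/dev/null || fstat -p "$PID" 2>/dev/null || true
```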

            regards, tom lane



Re: sidewinder has one failure

From: Tom Lane
Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
> I tried starting it from cron and then I got:
>   max_safe_fds = 981, usable_fds = 1000, already_open = 9

Oh!  There we have it then.  I wonder if that's a cron bug (neglecting
to close its own FDs before forking children) or intentional (maybe
it uses those FDs to keep tabs on the children?).

            regards, tom lane



Re: sidewinder has one failure

From: Amit Kapila
On Sat, Jan 4, 2020 at 6:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
> > I tried starting it from cron and then I got:
> >   max_safe_fds = 981, usable_fds = 1000, already_open = 9
>
> Oh!  There we have it then.
>

Right.

>  I wonder if that's a cron bug (neglecting
> to close its own FDs before forking children) or intentional (maybe
> it uses those FDs to keep tabs on the children?).
>

So, where do we go from here?  Shall we try to identify why cron is
keeping extra FDs, or shall we assume that we can't predict how many
pre-opened files there will be?  In the latter case, we would want to
either (a) tweak the test to raise the value of max_files_per_process,
or (b) remove the test entirely.  You seem inclined towards (b), but I
have a few things to say about that.  We have another strange failure
due to this test on one of Noah's machines; see my email [1].  I have
asked Noah for the stack trace [2].  It is not clear to me whether the
committed code has a problem or whether the test has discovered a
different problem in v10 specific to that platform.  The same test has
passed for v11, v12, and HEAD on the same platform.

[1] - https://www.postgresql.org/message-id/CAA4eK1LMDx6vK8Kdw8WUeW1MjToN2xVffL2kvtHvZg17%3DY6QQg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1LJqMuXoCLuxkTr1HidbR8DkgRrVC7jHWDyXT%3DFD2gt6Q%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: sidewinder has one failure

From: Noah Misch
On Sat, Jan 04, 2020 at 06:56:48AM +0530, Amit Kapila wrote:
> On Sat, Jan 4, 2020 at 6:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
> > > I tried starting it from cron and then I got:
> > >   max_safe_fds = 981, usable_fds = 1000, already_open = 9
> >
> > Oh!  There we have it then.
> >
> 
> Right.
> 
> >  I wonder if that's a cron bug (neglecting
> > to close its own FDs before forking children) or intentional (maybe
> > it uses those FDs to keep tabs on the children?).
> >
> 
> So, where do we go from here?  Shall we try to identify why cron is
> keeping extra FDs or we assume that we can't predict how many
> pre-opened files there will be?

The latter.  If it helps, you could add a regress.c function
leak_fd_until_max_fd_is(integer) so the main part of the test starts from a
known FD consumption state.

> In the latter case, we either want to
> (a) tweak the test to raise the value of max_files_per_process, (b)
> remove the test entirely.

I generally favor keeping the test, but feel free to decide it's too hard.



Re: sidewinder has one failure

From: Tom Lane
Noah Misch <noah@leadboat.com> writes:
> On Sat, Jan 04, 2020 at 06:56:48AM +0530, Amit Kapila wrote:
>> So, where do we go from here?  Shall we try to identify why cron is
>> keeping extra FDs or we assume that we can't predict how many
>> pre-opened files there will be?

> The latter.  If it helps, you could add a regress.c function
> leak_fd_until_max_fd_is(integer) so the main part of the test starts from a
> known FD consumption state.

Hmm ... that's an idea, but I'm not sure that even that would get the
job done.  By the time we reach any code in regress.c, there would
have been a bunch of catalog accesses, and so a bunch of the open FDs
would be from VFDs that fd.c could close on demand.  So you still
wouldn't have a clear idea of how much stress would be needed to get
to an out-of-FDs situation.

Perhaps, on top of this hypothetical regress.c function, you could
add some function in fd.c to force all VFDs closed, and then have
regress.c call that before it leaks a pile of FDs.  But now we're
getting mighty far into the weeds, and away from testing anything
that remotely resembles actual production behavior.

> I generally favor keeping the test, but feel free to decide it's too hard.

I remain dubious that it's worth the trouble, or indeed that the test
would prove anything of interest.

            regards, tom lane



Re: sidewinder has one failure

From: Amit Kapila
On Sun, Jan 5, 2020 at 8:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Noah Misch <noah@leadboat.com> writes:
>
> > I generally favor keeping the test, but feel free to decide it's too hard.
>
> I remain dubious that it's worth the trouble, or indeed that the test
> would prove anything of interest.
>

Leaving aside the part of the test that exercises the max open
descriptors limit, I don't think we have any other test that operates
on this many spill files.  We do have some tests related to spill
files in contrib/test_decoding/sql/spill.sql, but I don't see any that
work with this many open spill files.  Now, maybe it is not important
to test that, but I think we should wait until we find out why this
test failed on 'tern', and that too only in v10.  It might turn out
that it has revealed an actual code issue (either in what got
committed or in the base code).  In either case, it might turn out to
be useful.  So, we might decide to remove the setting of
max_files_per_process but leave the rest of the test as it is.  I am
also not sure what the right thing to do here is, but it is clear that
if we remove this test we won't be able to figure out what went wrong
on 'tern'.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: sidewinder has one failure

From: Amit Kapila
On Sun, Jan 5, 2020 at 8:00 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Sat, Jan 04, 2020 at 06:56:48AM +0530, Amit Kapila wrote:
> > In the latter case, we either want to
> > (a) tweak the test to raise the value of max_files_per_process, (b)
> > remove the test entirely.
>
> I generally favor keeping the test, but feel free to decide it's too hard.
>

I am thinking that for now we should raise the value of
max_files_per_process in the test to something like 35 or 40, so that
sidewinder passes and other people who might be blocked by this are
unblocked; for example, I think one such case is reported here
(https://www.postgresql.org/message-id/20200106105608.GB18560%40msg.df7cb.de,
see Ubuntu bionic ..).  I feel that even with this change we will
still be able to catch the problem we are facing on 'tern' and
'mandrill'.

Do you have any opinion on this?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: sidewinder has one failure

From: Tom Lane
Amit Kapila <amit.kapila16@gmail.com> writes:
> I am thinking that for now, we should raise the limit of
> max_files_per_process in the test to something like 35 or 40, so that
> sidewinder passes and unblocks other people who might get blocked due
> to this

That will not fix the problem for FD-per-semaphore platforms.

            regards, tom lane