Home > mailing lists

pgstat.c has brittle response to transient problems - Mailing list pgsql-hackers

From	Tom Lane
Subject	pgstat.c has brittle response to transient problems
Date	October 30, 2019 19:58:34
Msg-id	20721.1572465514@sss.pgh.pa.us Whole thread Raw
List	pgsql-hackers

Tree view

While fooling with the NetBSD-vs-libpython issue noted in a nearby
thread, I observed that the core regression tests sometimes hang up
in the "stats" test on this platform (NetBSD 8.1/amd64).  Investigation
found that the stats collector process was sometimes exiting like
this:

2019-10-29 19:38:14.563 EDT [7018] FATAL:  could not read statistics message: No buffer space available
2019-10-29 19:38:14.563 EDT [7932] LOG:  statistics collector process (PID 7018) exited with exit code 1

The postmaster then restarts the collector, but possibly with a time
delay (see the PGSTAT_RESTART_INTERVAL respawn throttling logic).
This seems to interact badly with the wait-for-stats-collector logic
in backend_read_statsfile, so that each cycle of the wait_for_stats()
test function takes a long time ... and we will do 300 of those
unconditionally.  (Possibly wait_for_stats ought to be modified so
that it pays attention to elapsed wall-clock time rather than
iterating for a fixed number of times?)

NetBSD's recv() man page glosses ENOBUFS as "A message was not
delivered because it would have overflowed the buffer", but I don't
believe that's actually what's happening.  (Just to be sure,
I added an Assert on the sending side that no message exceeds
sizeof(PgStat_Msg).  I wonder why we didn't have one already.)
Trawling the NetBSD kernel code, it seems like ENOBUFS could get
returned as a result of transient shortages of kernel working
memory --- most of the uses of that error code seem to be on the
sending side, but I found some that seem to be in receiving code.

In short: it's evidently possible to get ENOBUFS as a transient
failure condition on this platform, and having the stats collector
die seems like an overreaction.  I'm inclined to have it log the
error and press on, instead.  Looking at the POSIX spec for
recv() suggests that ENOMEM is another plausible transient failure,
so maybe we should do the same with that.  Roughly:

            if (len < 0)
            {
+               /* silently ignore these cases (no data available) */
                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
                    break;        /* out of inner loop */
+               /* noisily ignore these cases (soft errors) */
+               if (errno == ENOBUFS || errno == ENOMEM)
+               {
+                   ereport(LOG,
+                           (errcode_for_socket_access(),
+                            errmsg("could not read statistics message: %m")));
+                   break;        /* out of inner loop */
+               }
+               /* hard failure, but maybe hara-kiri will fix it */
                ereport(ERROR,
                        (errcode_for_socket_access(),
                         errmsg("could not read statistics message: %m")));
            }

A variant idea is to treat all unexpected errnos as LOG-and-push-on,
but maybe we ought to have a limit on how many times we'll do that
before erroring out.

Thoughts?

            regards, tom lane

pgsql-hackers by date:

From: Tomas Vondra
Date: 30 October 2019, 19:42:37
Subject: Re: MarkBufferDirtyHint() and LSN update

From: Andres Freund
Date: 30 October 2019, 20:25:14
Subject: Re: Thoughts on nbtree with logical/varwidth table identifiers, v12on-disk representation

pgstat.c has brittle response to transient problems - Mailing list pgsql-hackers

Previous

Next