So I put in the patch I'd proposed to reduce sleep delays in the stats
regression test, and I see that frogmouth has now failed that test twice,
with symptoms suggesting that it's dropping the last stats report ---
but not all of the stats reports --- from the test's first session.
I considered reverting the test patch, but on closer inspection that seems
like it would be shooting the messenger, because this is indicating a real
and reproducible loss of stats data.
I put in some debug elog's to see how much data is getting shoved at the
stats collector during this test, and it seems that it can be as much as
about 12K, between what the first session can send at exit and what the
second session will send immediately after startup. (I think this value
should be pretty platform-independent, but be it noted that I'm measuring
on a 64-bit Linux system while frogmouth is 32-bit Windows.)
Now, what's significant about that is that frogmouth is a pretty old
Windows version, and what I read on the net is that Windows versions
before 2012 have only 8KB socket receive buffer size by default.
So this behavior is plausibly explained by the theory that the stats
collector's receive buffer is overflowing, causing loss of stats messages.
This could well explain the stats-test failures we've seen in the past
too, which if memory serves were mostly on Windows.
Also, it's clear that a session could easily shove much more than 8KB at
a time out to the stats collector, because what we're doing in the stats
test does not involve touching any very large number of tables. So I
think this is not just a test failure but is telling us about a plausible
mechanism for real-world statistics drops.
I observe a default receive buffer size around 124K on my Linux box,
which seems like it'd be far less prone to overflow than 8K.
I propose that it'd be a good idea to try to set the stats socket's
receive buffer size to be a minimum of, say, 100K on all platforms.
Code for this would be analogous to what we already have in pqcomm.c
(circa line 760) for forcing up the send buffer size, but SO_RCVBUF
not SO_SNDBUF.
A further idea is that maybe backends should be tweaked to avoid
blasting large amounts of data at the stats collector in one go.
That would require more thought to design, though.
Thoughts?
regards, tom lane