Thread: [HACKERS] Reducing runtime of stats regression test

[HACKERS] Reducing runtime of stats regression test

From
Tom Lane
Date:
On a reasonably fast development machine, one of the biggest time sinks
while running the core regression tests is the long "sleep" calls in the
stats.sql regression test.  I took a closer look at these, and I think
we could basically get rid of them.

First up is this bit at the beginning of that test script:

-- wait to let any prior tests finish dumping out stats;
-- else our messages might get lost due to contention
SELECT pg_sleep_for('2 seconds');

The stated concern isn't really all that plausible, since even if we
launch a batch of test scripts at once, they don't all finish at once,
so there's unlikely to be a big pileup of traffic to the stats collector.
But we don't have to take that on faith: in assert-enabled builds,
pgstat_send() logs any failure to send a stats message.

I have grepped the buildfarm logs for "could not send to statistics
collector" log messages during the "make check" stage (a total of 754313
runs dating back ten years).  What I find is that members mereswine and
gull occasionally report "Network is down", and a few times currawong and
thrips have complained "Invalid argument", and there are exactly no other
such messages.  In particular there are no EAGAIN or EWOULDBLOCK failures
that would suggest congestion on the stats collector's input.  So this
is basically not something that happens at all in the regression tests,
let alone during startup of the stats test in particular.

Now, another failure mechanism that could conceivably be ameliorated
by this initial wait is if one of the immediately preceding runs has
performed a scan on tenk2 but doesn't manage to transmit that info before
stats.sql creates its initial copy of the stats counts.  Then when that
count does get sent, it could look like one triggered by stats.sql itself,
fooling the test.  But that seems rather unlikely, because tenk2 isn't
touched by very many test scripts.  And even if it did happen that would
not cause an observed test failure; at worst it would obscure a failure
that we should have detected.  I'm doubtful that this is worth losing any
sleep over.

In short, I think we could just drop the above wait altogether,
and be no worse off.  The only useful thing it's doing for us is
exercising pg_sleep_for(), which is otherwise untested ... but we
can transfer that responsibility into wait_for_stats().

The other significant delay in stats.sql is

-- force the rate-limiting logic in pgstat_report_stat() to time out
-- and send a message
SELECT pg_sleep(1.0);

Now, we do seem to need a delay there, because the rate-limiting logic
is unlikely to have permitted the count from the immediately preceding
statement to have gotten sent right then, and the count won't get
sent at all while we're inside wait_for_stats (since backends only
send stats just before going idle).  But there's more than one way
to skin this cat.  We can just start a new connection with \c, and
let wait_for_stats wait for the old one to send its stats before quitting.
Even on my oldest and slowest buildfarm machines, starting a new session
takes well under one second.

In short then, I propose the attached patch, which reduces the
runtime of stats.sql by a shade under 3 seconds.  Given that the
runtime of "make installcheck-parallel" is circa 15-17 seconds on
typical current hardware, that's a nice gain.

I'm not sure about backpatching.  I think developers mostly only
care about regression tests on HEAD, and the savings is relatively
less exciting on the buildfarm, since it doesn't get any bigger on
slower machines.  Thoughts?

            regards, tom lane

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index a811265..91c061b 100644
*** a/src/test/regress/expected/stats.out
--- b/src/test/regress/expected/stats.out
*************** SET enable_seqscan TO on;
*** 16,31 ****
  SET enable_indexscan TO on;
  -- for the moment, we don't want index-only scans here
  SET enable_indexonlyscan TO off;
- -- wait to let any prior tests finish dumping out stats;
- -- else our messages might get lost due to contention
- SELECT pg_sleep_for('2 seconds');
-  pg_sleep_for
- --------------
-
- (1 row)
-
  -- save counters
! CREATE TEMP TABLE prevstats AS
  SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
         (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
         (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
--- 16,23 ----
  SET enable_indexscan TO on;
  -- for the moment, we don't want index-only scans here
  SET enable_indexonlyscan TO off;
  -- save counters
! CREATE TABLE prevstats AS
  SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
         (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
         (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
*************** begin
*** 66,72 ****
      exit when updated1 and updated2 and updated3;

      -- wait a little
!     perform pg_sleep(0.1);

      -- reset stats snapshot so we can test again
      perform pg_stat_clear_snapshot();
--- 58,64 ----
      exit when updated1 and updated2 and updated3;

      -- wait a little
!     perform pg_sleep_for('100 milliseconds');

      -- reset stats snapshot so we can test again
      perform pg_stat_clear_snapshot();
*************** SELECT count(*) FROM tenk2 WHERE unique1
*** 143,156 ****
       1
  (1 row)

! -- force the rate-limiting logic in pgstat_report_stat() to time out
! -- and send a message
! SELECT pg_sleep(1.0);
!  pg_sleep
! ----------
!
! (1 row)
!
  -- wait for stats collector to update
  SELECT wait_for_stats();
   wait_for_stats
--- 135,147 ----
       1
  (1 row)

! -- We can't just call wait_for_stats() at this point, because we only
! -- transmit stats when the session goes idle, and we probably didn't
! -- transmit the last couple of counts yet thanks to the rate-limiting logic
! -- in pgstat_report_stat().  But instead of waiting for the rate limiter's
! -- timeout to elapse, let's just start a new session.  The old one will
! -- then send its stats before dying.
! \c -
  -- wait for stats collector to update
  SELECT wait_for_stats();
   wait_for_stats
*************** FROM prevstats AS pr;
*** 199,202 ****
--- 190,194 ----
  (1 row)

  DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
+ DROP TABLE prevstats;
  -- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b3e2efa..1c21f38 100644
*** a/src/test/regress/sql/stats.sql
--- b/src/test/regress/sql/stats.sql
*************** SET enable_indexscan TO on;
*** 14,25 ****
  -- for the moment, we don't want index-only scans here
  SET enable_indexonlyscan TO off;

- -- wait to let any prior tests finish dumping out stats;
- -- else our messages might get lost due to contention
- SELECT pg_sleep_for('2 seconds');
-
  -- save counters
! CREATE TEMP TABLE prevstats AS
  SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
         (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
         (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
--- 14,21 ----
  -- for the moment, we don't want index-only scans here
  SET enable_indexonlyscan TO off;

  -- save counters
! CREATE TABLE prevstats AS
  SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
         (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
         (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
*************** begin
*** 61,67 ****
      exit when updated1 and updated2 and updated3;

      -- wait a little
!     perform pg_sleep(0.1);

      -- reset stats snapshot so we can test again
      perform pg_stat_clear_snapshot();
--- 57,63 ----
      exit when updated1 and updated2 and updated3;

      -- wait a little
!     perform pg_sleep_for('100 milliseconds');

      -- reset stats snapshot so we can test again
      perform pg_stat_clear_snapshot();
*************** SELECT count(*) FROM tenk2;
*** 137,145 ****
  -- do an indexscan
  SELECT count(*) FROM tenk2 WHERE unique1 = 1;

! -- force the rate-limiting logic in pgstat_report_stat() to time out
! -- and send a message
! SELECT pg_sleep(1.0);

  -- wait for stats collector to update
  SELECT wait_for_stats();
--- 133,145 ----
  -- do an indexscan
  SELECT count(*) FROM tenk2 WHERE unique1 = 1;

! -- We can't just call wait_for_stats() at this point, because we only
! -- transmit stats when the session goes idle, and we probably didn't
! -- transmit the last couple of counts yet thanks to the rate-limiting logic
! -- in pgstat_report_stat().  But instead of waiting for the rate limiter's
! -- timeout to elapse, let's just start a new session.  The old one will
! -- then send its stats before dying.
! \c -

  -- wait for stats collector to update
  SELECT wait_for_stats();
*************** SELECT pr.snap_ts < pg_stat_get_snapshot
*** 165,168 ****
--- 165,169 ----
  FROM prevstats AS pr;

  DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
+ DROP TABLE prevstats;
  -- End of Stats Test

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Reducing runtime of stats regression test

From
Alvaro Herrera
Date:
Tom Lane wrote:

> The other significant delay in stats.sql is
> 
> -- force the rate-limiting logic in pgstat_report_stat() to time out
> -- and send a message
> SELECT pg_sleep(1.0);
> 
> Now, we do seem to need a delay there, because the rate-limiting logic
> is unlikely to have permitted the count from the immediately preceding
> statement to have gotten sent right then, and the count won't get
> sent at all while we're inside wait_for_stats (since backends only
> send stats just before going idle).  But there's more than one way
> to skin this cat.  We can just start a new connection with \c, and
> let wait_for_stats wait for the old one to send its stats before quitting.
> Even on my oldest and slowest buildfarm machines, starting a new session
> takes well under one second.

So you changed table prevstats from temp to permanent; perhaps make it
unlogged?

I wonder if it'd be useful to have a "pg_stat_flush" or something, which
sends out whatever is queued in this session.  Then you wouldn't need
the reconnection.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Reducing runtime of stats regression test

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> We can just start a new connection with \c, and
>> let wait_for_stats wait for the old one to send its stats before quitting.
>> Even on my oldest and slowest buildfarm machines, starting a new session
>> takes well under one second.

> So you changed table prevstats from temp to permanent;

Right, so it would survive into the new session.

> perhaps make it unlogged?

Hardly seems worth it; there's not much data in it.  And we're not
generally in the habit of marking short-lived tables as unlogged
elsewhere in the regression tests.  (Maybe we should make some
effort to have some of them be so marked, but that seems like
material for its own patch.)

> I wonder if it'd be useful to have a "pg_stat_flush" or something, which
> sends out whatever is queued in this session.  Then you wouldn't need
> the reconnection.

Yes, but that would be getting into the realm of new features, not
post-feature-freeze test adjustments.  It certainly couldn't be
a candidate for back-patching.
        regards, tom lane



Re: [HACKERS] Reducing runtime of stats regression test

From
Robert Haas
Date:
On Thu, May 4, 2017 at 10:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yes, but that would be getting into the realm of new features, not
> post-feature-freeze test adjustments.  It certainly couldn't be
> a candidate for back-patching.

I'm not sure there's some bright line between adding a new
SQL-callable function to cut down the test time and any other
tinkering we might do to reduce the regression test time.  I think
there's a pretty good argument that all of the recent changes you made
in this area constitute strictly optional tinkering.  I'm haven't been
objecting because they don't seem likely to destabilize anything, but
I don't see that they're really helping us get ready for beta either,
which is presumably what we ought to be focusing on at this point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Reducing runtime of stats regression test

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, May 4, 2017 at 10:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yes, but that would be getting into the realm of new features, not
>> post-feature-freeze test adjustments.  It certainly couldn't be
>> a candidate for back-patching.

> I'm not sure there's some bright line between adding a new
> SQL-callable function to cut down the test time and any other
> tinkering we might do to reduce the regression test time.  I think
> there's a pretty good argument that all of the recent changes you made
> in this area constitute strictly optional tinkering.  I'm haven't been
> objecting because they don't seem likely to destabilize anything, but
> I don't see that they're really helping us get ready for beta either,
> which is presumably what we ought to be focusing on at this point.

Well, to my mind, making the regression tests faster is something that
could be quite helpful during beta, because a lot of people will be
running them.  (Or so we hope, at least.)  If the fact that e.g.
the recovery tests take a lot of time discourages people from running
them, that can't be a good thing for beta purposes.  So I respectfully
reject your opinion about what I should be working on.
        regards, tom lane



Re: [HACKERS] Reducing runtime of stats regression test

From
Noah Misch
Date:
On Wed, May 03, 2017 at 11:43:43PM -0400, Tom Lane wrote:
> On a reasonably fast development machine, one of the biggest time sinks
> while running the core regression tests is the long "sleep" calls in the
> stats.sql regression test.  I took a closer look at these, and I think
> we could basically get rid of them.
> 
> First up is this bit at the beginning of that test script:
> 
> -- wait to let any prior tests finish dumping out stats;
> -- else our messages might get lost due to contention
> SELECT pg_sleep_for('2 seconds');
> 
> The stated concern isn't really all that plausible, since even if we
> launch a batch of test scripts at once, they don't all finish at once,
> so there's unlikely to be a big pileup of traffic to the stats collector.
> But we don't have to take that on faith: in assert-enabled builds,
> pgstat_send() logs any failure to send a stats message.
> 
> I have grepped the buildfarm logs for "could not send to statistics
> collector" log messages during the "make check" stage (a total of 754313
> runs dating back ten years).  What I find is that members mereswine and
> gull occasionally report "Network is down", and a few times currawong and
> thrips have complained "Invalid argument", and there are exactly no other
> such messages.  In particular there are no EAGAIN or EWOULDBLOCK failures
> that would suggest congestion on the stats collector's input.  So this
> is basically not something that happens at all in the regression tests,
> let alone during startup of the stats test in particular.

Linux and AIX, at least, do not make send() return an error when dropping a
UDP message for lack of queue space.  Thus, the lack of buildfarm logs
reporting send() failure is inconclusive.  Nonetheless, the lack of stats.sql
failures since you finished modifying this test in mid-May suggests this was a
good change.  Besides, if a new member were slow enough to experience a stats
pileup, I wouldn't have firm hope that 2s would suffice.