[HACKERS] Reducing runtime of stats regression test - Mailing list pgsql-hackers

From Tom Lane
Subject [HACKERS] Reducing runtime of stats regression test
Date
Msg-id 17795.1493869423@sss.pgh.pa.us
Whole thread Raw
Responses Re: [HACKERS] Reducing runtime of stats regression test  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: [HACKERS] Reducing runtime of stats regression test  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
On a reasonably fast development machine, one of the biggest time sinks
while running the core regression tests is the long "sleep" calls in the
stats.sql regression test.  I took a closer look at these, and I think
we could basically get rid of them.

First up is this bit at the beginning of that test script:

-- wait to let any prior tests finish dumping out stats;
-- else our messages might get lost due to contention
SELECT pg_sleep_for('2 seconds');

The stated concern isn't really all that plausible, since even if we
launch a batch of test scripts at once, they don't all finish at once,
so there's unlikely to be a big pileup of traffic to the stats collector.
But we don't have to take that on faith: in assert-enabled builds,
pgstat_send() logs any failure to send a stats message.

I have grepped the buildfarm logs for "could not send to statistics
collector" log messages during the "make check" stage (a total of 754313
runs dating back ten years).  What I find is that members mereswine and
gull occasionally report "Network is down", and a few times currawong and
thrips have complained "Invalid argument", and there are exactly no other
such messages.  In particular there are no EAGAIN or EWOULDBLOCK failures
that would suggest congestion on the stats collector's input.  So this
is basically not something that happens at all in the regression tests,
let alone during startup of the stats test in particular.

Now, another failure mechanism that could conceivably be ameliorated
by this initial wait is if one of the immediately preceding runs has
performed a scan on tenk2 but doesn't manage to transmit that info before
stats.sql creates its initial copy of the stats counts.  Then when that
count does get sent, it could look like one triggered by stats.sql itself,
fooling the test.  But that seems rather unlikely, because tenk2 isn't
touched by very many test scripts.  And even if it did happen that would
not cause an observed test failure; at worst it would obscure a failure
that we should have detected.  I'm doubtful that this is worth losing any
sleep over.

In short, I think we could just drop the above wait altogether,
and be no worse off.  The only useful thing it's doing for us is
exercising pg_sleep_for(), which is otherwise untested ... but we
can transfer that responsibility into wait_for_stats().

The other significant delay in stats.sql is

-- force the rate-limiting logic in pgstat_report_stat() to time out
-- and send a message
SELECT pg_sleep(1.0);

Now, we do seem to need a delay there, because the rate-limiting logic
is unlikely to have permitted the count from the immediately preceding
statement to have gotten sent right then, and the count won't get
sent at all while we're inside wait_for_stats (since backends only
send stats just before going idle).  But there's more than one way
to skin this cat.  We can just start a new connection with \c, and
let wait_for_stats wait for the old one to send its stats before quitting.
Even on my oldest and slowest buildfarm machines, starting a new session
takes well under one second.

In short then, I propose the attached patch, which reduces the
runtime of stats.sql by a shade under 3 seconds.  Given that the
runtime of "make installcheck-parallel" is circa 15-17 seconds on
typical current hardware, that's a nice gain.

I'm not sure about backpatching.  I think developers mostly only
care about regression tests on HEAD, and the savings is relatively
less exciting on the buildfarm, since it doesn't get any bigger on
slower machines.  Thoughts?

            regards, tom lane

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index a811265..91c061b 100644
*** a/src/test/regress/expected/stats.out
--- b/src/test/regress/expected/stats.out
*************** SET enable_seqscan TO on;
*** 16,31 ****
  SET enable_indexscan TO on;
  -- for the moment, we don't want index-only scans here
  SET enable_indexonlyscan TO off;
- -- wait to let any prior tests finish dumping out stats;
- -- else our messages might get lost due to contention
- SELECT pg_sleep_for('2 seconds');
-  pg_sleep_for
- --------------
-
- (1 row)
-
  -- save counters
! CREATE TEMP TABLE prevstats AS
  SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
         (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
         (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
--- 16,23 ----
  SET enable_indexscan TO on;
  -- for the moment, we don't want index-only scans here
  SET enable_indexonlyscan TO off;
  -- save counters
! CREATE TABLE prevstats AS
  SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
         (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
         (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
*************** begin
*** 66,72 ****
      exit when updated1 and updated2 and updated3;

      -- wait a little
!     perform pg_sleep(0.1);

      -- reset stats snapshot so we can test again
      perform pg_stat_clear_snapshot();
--- 58,64 ----
      exit when updated1 and updated2 and updated3;

      -- wait a little
!     perform pg_sleep_for('100 milliseconds');

      -- reset stats snapshot so we can test again
      perform pg_stat_clear_snapshot();
*************** SELECT count(*) FROM tenk2 WHERE unique1
*** 143,156 ****
       1
  (1 row)

! -- force the rate-limiting logic in pgstat_report_stat() to time out
! -- and send a message
! SELECT pg_sleep(1.0);
!  pg_sleep
! ----------
!
! (1 row)
!
  -- wait for stats collector to update
  SELECT wait_for_stats();
   wait_for_stats
--- 135,147 ----
       1
  (1 row)

! -- We can't just call wait_for_stats() at this point, because we only
! -- transmit stats when the session goes idle, and we probably didn't
! -- transmit the last couple of counts yet thanks to the rate-limiting logic
! -- in pgstat_report_stat().  But instead of waiting for the rate limiter's
! -- timeout to elapse, let's just start a new session.  The old one will
! -- then send its stats before dying.
! \c -
  -- wait for stats collector to update
  SELECT wait_for_stats();
   wait_for_stats
*************** FROM prevstats AS pr;
*** 199,202 ****
--- 190,194 ----
  (1 row)

  DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
+ DROP TABLE prevstats;
  -- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b3e2efa..1c21f38 100644
*** a/src/test/regress/sql/stats.sql
--- b/src/test/regress/sql/stats.sql
*************** SET enable_indexscan TO on;
*** 14,25 ****
  -- for the moment, we don't want index-only scans here
  SET enable_indexonlyscan TO off;

- -- wait to let any prior tests finish dumping out stats;
- -- else our messages might get lost due to contention
- SELECT pg_sleep_for('2 seconds');
-
  -- save counters
! CREATE TEMP TABLE prevstats AS
  SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
         (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
         (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
--- 14,21 ----
  -- for the moment, we don't want index-only scans here
  SET enable_indexonlyscan TO off;

  -- save counters
! CREATE TABLE prevstats AS
  SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
         (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
         (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
*************** begin
*** 61,67 ****
      exit when updated1 and updated2 and updated3;

      -- wait a little
!     perform pg_sleep(0.1);

      -- reset stats snapshot so we can test again
      perform pg_stat_clear_snapshot();
--- 57,63 ----
      exit when updated1 and updated2 and updated3;

      -- wait a little
!     perform pg_sleep_for('100 milliseconds');

      -- reset stats snapshot so we can test again
      perform pg_stat_clear_snapshot();
*************** SELECT count(*) FROM tenk2;
*** 137,145 ****
  -- do an indexscan
  SELECT count(*) FROM tenk2 WHERE unique1 = 1;

! -- force the rate-limiting logic in pgstat_report_stat() to time out
! -- and send a message
! SELECT pg_sleep(1.0);

  -- wait for stats collector to update
  SELECT wait_for_stats();
--- 133,145 ----
  -- do an indexscan
  SELECT count(*) FROM tenk2 WHERE unique1 = 1;

! -- We can't just call wait_for_stats() at this point, because we only
! -- transmit stats when the session goes idle, and we probably didn't
! -- transmit the last couple of counts yet thanks to the rate-limiting logic
! -- in pgstat_report_stat().  But instead of waiting for the rate limiter's
! -- timeout to elapse, let's just start a new session.  The old one will
! -- then send its stats before dying.
! \c -

  -- wait for stats collector to update
  SELECT wait_for_stats();
*************** SELECT pr.snap_ts < pg_stat_get_snapshot
*** 165,168 ****
--- 165,169 ----
  FROM prevstats AS pr;

  DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
+ DROP TABLE prevstats;
  -- End of Stats Test

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: [HACKERS] [BUGS] Concurrent ALTER SEQUENCE RESTART Regression
Next
From: Amit Langote
Date:
Subject: Re: [HACKERS] pg_dump emits ALTER TABLE ONLY partitioned_table