Re: [HACKERS] Increase Vacuum ring buffer. - Mailing list pgsql-hackers

From Sokolov Yura
Subject Re: [HACKERS] Increase Vacuum ring buffer.
Msg-id 20170815180038.1f1953b2@falcon-work
In response to Re: [HACKERS] Increase Vacuum ring buffer.  (Sokolov Yura <funny.falcon@postgrespro.ru>)
List pgsql-hackers
On Mon, 31 Jul 2017 20:11:25 +0300
Sokolov Yura <funny.falcon@postgrespro.ru> wrote:

> On 2017-07-27 11:53, Sokolov Yura wrote:
> > On 2017-07-26 20:28, Sokolov Yura wrote:
> >> On 2017-07-26 19:46, Claudio Freire wrote:
> >>> On Wed, Jul 26, 2017 at 1:39 PM, Sokolov Yura
> >>> <funny.falcon@postgrespro.ru> wrote:
> >>>> On 2017-07-24 12:41, Sokolov Yura wrote:
> >>>> test_master_1/pretty.log
> >>> ...
> >>>> time   activity      tps  latency   stddev      min      max
> >>>> 11130     av+ch      198    198ms    374ms      7ms   1956ms
> >>>> 11160     av+ch      248    163ms    401ms      7ms   2601ms
> >>>> 11190     av+ch      321    125ms    363ms      7ms   2722ms
> >>>> 11220     av+ch     1155     35ms    123ms      7ms   2668ms
> >>>> 11250     av+ch     1390     29ms     79ms      7ms   1422ms
> >>>
> >>> vs
> >>>
> >>>> test_master_ring16_1/pretty.log
> >>>> time   activity      tps  latency   stddev      min      max
> >>>> 11130     av+ch       26   1575ms    635ms    101ms   2536ms
> >>>> 11160     av+ch       25   1552ms    648ms     58ms   2376ms
> >>>> 11190     av+ch       32   1275ms    726ms     16ms   2493ms
> >>>> 11220     av+ch       23   1584ms    674ms     48ms   2454ms
> >>>> 11250     av+ch       35   1235ms    777ms     22ms   3627ms
> >>>
> >>> That's a very huge change in latency for the worse
> >>>
> >>> Are you sure that's the ring buffer's doing and not some
> >>> methodology snafu?
> >>
> >> Well, I tuned postgresql.conf so that there is no such
> >> catastrophic slows down on master branch. (with default
> >> settings such slowdown happens quite frequently).
> >> bgwriter_lru_maxpages = 10 (instead of default 200) were one
> >> of such tuning.
> >>
> >> Probably there were some magic "border" that triggers this
> >> behavior. Tuning postgresql.conf shifted master branch on
> >> "good side" of this border, and faster autovacuum crossed it
> >> to "bad side" again.
> >>
> >> Probably, backend_flush_after = 2MB (instead of default 0) is
> >> also part of this border. I didn't try to bench without this
> >> option yet.
> >>
> >> Any way, given checkpoint and autovacuum interference could be
> >> such noticeable, checkpoint clearly should affect autovacuum
> >> cost mechanism, imho.
> >>
> >> With regards,
> >
> > I'll run two times with default postgresql.conf (except
> > shared_buffers and maintenance_work_mem) to find out behavior on
> > default setting.
> >
> I've accidentally lost results of this run, so I will rerun it.
>
> This I remembered:
> - even with default settings, autovacuum runs 3 times faster:
> 9000s on master, 3000s with increased ring buffer.
> So xlog-fsync really slows down autovacuum.
> - but concurrent transactions slows down (not so extremely as in
> previous test, but still significantly).
> I could not draw pretty table now, cause I lost results. I'll do
> it after re-run completes.
>
> With regards,

Sorry for the long delay.

I did a run with the default postgresql.conf.

First: the query was a bit different. Instead of updating 5 close but
random points using `aid in (:aid1, :aid2, :aid3, :aid4, :aid5)`, the
condition was `aid between :aid1 and :aid1+9`. TPS is much lower
(330 tps on master vs 540 tps for the previous version), but it is hard
to tell whether that is due to the query difference or the config change.
I'm sorry for this inconvenience :-( I will not repeat this mistake
in the future.

Overview:
master : 339 tps, average autovacuum 6000sec.
ring16 : 275 tps, average autovacuum 1100sec, first autovacuum 2500sec.
ring16 + `vacuum_cost_page_dirty = 40` :
         293 tps, average autovacuum 2100sec, first autovacuum 4226sec.
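
To put the overview in relative terms, here is a small Python sketch that computes the tps change and the autovacuum speedup against master. The numbers are copied from the overview; the names `RUNS` and `compare` are only illustrative.

```python
# Figures copied from the overview: (tps, average autovacuum duration in seconds).
RUNS = {
    "master": (339, 6000),
    "ring16": (275, 1100),
    "ring16 + vacuum_cost_page_dirty=40": (293, 2100),
}


def compare(runs, baseline="master"):
    """Return {name: (tps change in percent, autovacuum speedup factor)}."""
    base_tps, base_vac = runs[baseline]
    result = {}
    for name, (tps, vac_seconds) in runs.items():
        tps_change_pct = 100.0 * (tps - base_tps) / base_tps
        vac_speedup = base_vac / vac_seconds
        result[name] = (round(tps_change_pct, 1), round(vac_speedup, 2))
    return result


if __name__ == "__main__":
    for name, (tps_pct, speedup) in compare(RUNS).items():
        print(f"{name:36s} tps {tps_pct:+6.1f}%  autovacuum {speedup:.2f}x faster")
```

By this arithmetic ring16 costs about 19% of tps for a roughly 5.5x faster average autovacuum, and ring16 with `vacuum_cost_page_dirty = 40` costs about 14% of tps for a roughly 2.9x faster one.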

Running with the default postgresql.conf doesn't show the catastrophic tps
decline when a checkpoint starts during autovacuum (seen in the previous test
runs with `autovacuum_vacuum_cost_delay = 2ms`). Overall average tps over the
8-hour test is also much closer to "master". Still, the increased ring
buffer significantly improves autovacuum performance, which means fsync
consumes a lot of time, comparable with autovacuum_vacuum_cost_delay.

Runs with ring16 have occasional jumps in minimum and average
response latency:
test_master_ring16_2/pretty.log
time   activity      tps  latency   stddev      min      max
8475      av+ch       15   2689ms    659ms   1867ms   3575ms
27420        av       15   2674ms    170ms   2393ms   2926ms
Usually it happens close to the end of an autovacuum.
What could it be? It is clearly bad behavior that is hidden by the current
small ring buffer.
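
For anyone who wants to look for such jumps in their own logs, below is a minimal Python sketch that scans rows in the pretty.log layout (time, activity, tps, latency, stddev, min, max) and reports those whose minimum latency is unusually high. The function name `latency_jumps` and the 1000ms threshold are assumptions made for illustration, not part of the benchmark scripts.

```python
import re

# One data row: time, activity, tps, then four millisecond columns.
ROW = re.compile(
    r"^\s*(?P<time>\d+)\s+(?P<activity>\S+)\s+(?P<tps>\d+)"
    r"\s+(?P<latency>\d+)ms\s+(?P<stddev>\d+)ms"
    r"\s+(?P<min>\d+)ms\s+(?P<max>\d+)ms\s*$"
)


def latency_jumps(lines, min_threshold_ms=1000):
    """Return parsed rows whose minimum latency reaches the threshold.

    The threshold is an arbitrary illustrative value; rows that do not
    match the expected layout (headers, blank lines) are skipped.
    """
    jumps = []
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue
        row = {
            key: (value if key == "activity" else int(value))
            for key, value in match.groupdict().items()
        }
        if row["min"] >= min_threshold_ms:
            jumps.append(row)
    return jumps


if __name__ == "__main__":
    sample = [
        "time   activity      tps  latency   stddev      min      max",
        "8475      av+ch       15   2689ms    659ms   1867ms   3575ms",
        "27420        av       15   2674ms    170ms   2393ms   2926ms",
    ]
    for row in latency_jumps(sample):
        print(row["time"], row["activity"], row["min"])
```

Filtering on the minimum rather than the maximum is a deliberate choice here: a high minimum means every transaction in the interval was slow, which is what distinguishes these rows from ordinary intervals with a few outliers.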

Runs with ring16 + `vacuum_cost_page_dirty = 40` are much more stable in terms
of the performance of concurrent transactions. Only the first autovacuum has
such a "latency jump"; later ones run smoothly.
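
A rough way to see why doubling `vacuum_cost_page_dirty` calms things down is to estimate the upper bound on how fast a cost-limited vacuum can dirty pages. The Python sketch below is a simplified model, not PostgreSQL's actual accounting: it ignores hit and miss costs and the time spent doing real work between sleeps. The defaults used (vacuum_cost_limit = 200, vacuum_cost_page_dirty = 20, autovacuum_vacuum_cost_delay = 20ms, 8192-byte blocks) are what I believe applies to this PostgreSQL version; treat them as assumptions.

```python
BLOCK_SIZE = 8192  # bytes per page; assumed default BLCKSZ


def max_dirty_pages_per_sec(cost_limit=200, page_dirty_cost=20, cost_delay_ms=20):
    """Upper bound on pages dirtied per second under cost-based delay.

    Simplified model: vacuum only pays page_dirty_cost, and sleeps
    cost_delay_ms each time the accumulated cost reaches cost_limit.
    """
    pages_per_round = cost_limit // page_dirty_cost
    rounds_per_sec = 1000.0 / cost_delay_ms
    return pages_per_round * rounds_per_sec


def to_mib_per_sec(pages_per_sec, block_size=BLOCK_SIZE):
    """Convert a page rate into MiB per second."""
    return pages_per_sec * block_size / (1024 * 1024)


if __name__ == "__main__":
    for dirty_cost in (20, 40):
        rate = max_dirty_pages_per_sec(page_dirty_cost=dirty_cost)
        print(f"vacuum_cost_page_dirty={dirty_cost}: "
              f"{rate:.0f} pages/s, {to_mib_per_sec(rate):.2f} MiB/s")
```

Under these assumptions the bound drops from 500 to 250 dirtied pages per second when the dirty cost goes from 20 to 40, which is in line with the autovacuum taking about twice as long in the ring16 + `vacuum_cost_page_dirty = 40` run.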

So, increasing the ring buffer certainly improves autovacuum performance.
Its negative effects can be compensated for with configuration. It also
exposes some bad behavior in the current implementation that should be
investigated more closely.

--
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
