Re: what to revert - Mailing list pgsql-hackers

From: Kevin Grittner
Subject: Re: what to revert
Msg-id: CACjxUsPgmm+LLG1+3d56EhCD8yEKP_b14zHGFOUpJp0Qx-J2pw@mail.gmail.com
In response to: Re: what to revert (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: what to revert
           Re: what to revert
List: pgsql-hackers
<div dir="ltr">On Mon, May 9, 2016 at 9:01 PM, Tomas Vondra <<a
href="mailto:tomas.vondra@2ndquadrant.com">tomas.vondra@2ndquadrant.com</a>>wrote:<br /><br />> Over the past few
daysI've been running benchmarks on a fairly<br />> large NUMA box (4 sockets, 32 cores / 64 with HR, 256GB of
RAM)<br/>> to see the impact of the 'snapshot too old' - both when disabled<br />> and enabled with various
valuesin the old_snapshot_threshold<br />> GUC.<br /><br />Thanks!<br /><br />> The benchmark is a simple
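
For anyone trying to reproduce this, the configurations being
compared correspond to postgresql.conf settings along these lines
(the value is in minutes, and changing this GUC requires a server
restart; the lines below are only illustrative):

    old_snapshot_threshold = -1    # feature disabled (the default)
    old_snapshot_threshold = 0     # the "immediate" case (mainly for testing)
    old_snapshot_threshold = 10    # the 10-minute runs
    old_snapshot_threshold = 60    # the 60-minute runs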
> The benchmark is a simple read-only pgbench with prepared
> statements, i.e. doing something like this:
>
>    pgbench -S -M prepared -j N -c N

Do you have any plans to benchmark cases where the patch can have a
benefit?  (Clearly, nobody would be interested in using the feature
with a read-only load; so while that makes a good "worst case"
scenario and is very valuable for testing the "off" versus
"reverted" comparison, it's not an intended use or one that's
likely to happen in production.)
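
The simplest write-heavy counterpart would be the default
(TPC-B-like) pgbench script rather than -S, along these lines --
the scale, client counts, duration, and database name here are
just placeholders, not something I have actually run:

    pgbench -i -s 100 bench
    pgbench -M prepared -j N -c N -T 300 bench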
> master-10-new - 91fd1df4 + old_snapshot_threshold=10
> master-10-new-2 - 91fd1df4 + old_snapshot_threshold=10 (rerun)

So, these runs were with identical software on the same data?  Any
differences are just noise?

> * The results are a bit noisy, but I think in general this shows
> that for certain cases there's a clearly measurable difference
> (up to 5%) between the "disabled" and "reverted" cases. This is
> particularly visible on the smallest data set.

In some cases, the differences are in favor of disabled over
reverted.

> * What's fairly strange is that on the largest dataset (scale
> 10000), the "disabled" case is actually consistently faster than
> "reverted" - that seems a bit suspicious, I think. It's possible
> that I did the revert wrong, though - the revert.patch is
> included in the tgz. This is why I also tested 689f9a05, but
> that's also slower than "disabled".

Since there is not a consistent win of disabled or reverted over
the other, and what difference there is is often far less than the
difference between the two runs with identical software, is there
any reasonable interpretation of this except that the difference is
"in the noise"?

> * The performance impact with the feature enabled seems rather
> significant, especially once you exceed the number of physical
> cores (32 in this case). Then the drop is pretty clear - often
> ~50% or more.
>
> * 7e3da1c4 claims to bring the performance within 5% of the
> disabled case, but that seems not to be the case.

The commit comment says "At least in the tested case this brings
performance within 5% of when the feature is off, compared to
several times slower without this patch."  The tested case was a
read-write load, so your read-only tests do nothing to determine
whether this was the case in general for this type of load.
Partly, the patch decreases chasing through HOT chains and
increases the number of HOT updates, so there are compensating
benefits of performing early vacuum in a read-write load.
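
If anyone wants to see that effect directly during a read-write
run, the HOT counters in pg_stat_user_tables are one rough way to
watch it -- this is just a monitoring query I'm sketching here, not
part of any benchmark that has been run, and "bench" is a
placeholder database name:

    psql bench -c "SELECT relname, n_tup_upd, n_tup_hot_upd
                     FROM pg_stat_user_tables
                    ORDER BY n_tup_upd DESC LIMIT 10;"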
> What it however does is bring the 'non-immediate' cases close
> to the immediate ones (before, the performance drop came much
> sooner in these cases - at 16 clients).

Right.  This is, of course, just the first optimization that we
were able to get in "under the wire" before beta, but the other
optimizations under consideration would only tend to bring the
"enabled" cases closer together in performance, not make an enabled
case perform the same as when the feature was off -- especially for
a read-only workload.

> * It also seems to me the feature greatly amplifies the
> variability of the results, somehow. It's not uncommon to see
> results like this:
>
>  master-10-new-2    235516     331976    133316    155563   133396
>
> where after the first runs (already fairly variable) the
> performance tanks to ~50%. This happens particularly with higher
> client counts; otherwise the max-min is within ~5% of the max.
> There are a few cases where this happens without the feature
> (i.e. old master, reverted or disabled), but it's usually much
> smaller than with it enabled (immediate, 10 or 60). See the
> 'summary' sheet in the ODS spreadsheet.
>
> I don't know what the problem is here - at first I thought that
> maybe something else was running on the machine, or that
> anti-wraparound autovacuum kicked in, but that seems not to be
> the case. There's nothing like that in the postgres log (also
> included in the .tgz).

I'm inclined to suspect NUMA effects.  It would be interesting to
try with the NUMA patch and cpuset I submitted a while back, or with
fixes in place for the Linux scheduler bugs which were reported
last month.  Which kernel version was this?
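
For what it's worth, a few quick checks along these lines would
help rule NUMA in or out (the pinned re-run is only a sketch; the
node number and data directory are placeholders):

    uname -r                    # kernel version
    numactl --hardware          # socket / memory-node layout of the box
    # as an experiment, rerun one of the worst cases with the server
    # confined to a single node:
    numactl --cpunodebind=0 --membind=0 pg_ctl -D $PGDATA start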
href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br/>The Enterprise PostgreSQL Company<br /><br
/></div>
