Re: Oh, this is embarrassing: init file logic is still broken - Mailing list pgsql-hackers

From: Josh Berkus
Subject: Re: Oh, this is embarrassing: init file logic is still broken
Msg-id: 558B26B0.1070704@agliodbs.com
In response to: Oh, this is embarrassing: init file logic is still broken  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Oh, this is embarrassing: init file logic is still broken  (Tatsuo Ishii <ishii@postgresql.org>)
           Re: Oh, this is embarrassing: init file logic is still broken  (Peter Geoghegan <pg@heroku.com>)
           Re: Oh, this is embarrassing: init file logic is still broken  (Tatsuo Ishii <ishii@postgresql.org>)
List: pgsql-hackers
On 06/23/2015 04:44 PM, Tom Lane wrote:
> Chasing a problem identified by my Salesforce colleagues led me to the
> conclusion that my commit f3b5565dd ("Use a safer method for determining
> whether relcache init file is stale") is rather borked.  It causes
> pg_trigger_tgrelid_tgname_index to be omitted from the relcache init file,
> because that index is not used by any syscache.  I had been aware of that
> actually, but considered it a minor issue.  It's not so minor though,
> because RelationCacheInitializePhase3 marks that index as nailed for
> performance reasons, and includes it in NUM_CRITICAL_LOCAL_INDEXES.
> That means that load_relcache_init_file *always* decides that the init
> file is busted and silently(!) ignores it.  So we're taking a nontrivial
> hit in backend startup speed as of the last set of minor releases.
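
For anyone trying to follow along in the source, the check that trips is
the cross-check at the end of load_relcache_init_file() in
src/backend/utils/cache/relcache.c.  Paraphrasing the logic (this is a
sketch, not the literal code):

    /*
     * While reading the init file we count the nailed relations and
     * nailed indexes we restore.  At the end, the counts must match the
     * expected constants exactly, or the whole file is discarded and the
     * relcache is rebuilt from scratch -- silently.
     */
    if (shared)
    {
        if (nailed_rels != NUM_CRITICAL_SHARED_RELS ||
            nailed_indexes != NUM_CRITICAL_SHARED_INDEXES)
            goto read_failed;   /* ignore init file, rebuild caches */
    }
    else
    {
        /*
         * pg_trigger_tgrelid_tgname_index is nailed in
         * RelationCacheInitializePhase3 and counted in
         * NUM_CRITICAL_LOCAL_INDEXES, but since f3b5565dd it is never
         * written to the init file, so nailed_indexes always comes up
         * one short here and every backend takes the slow path.
         */
        if (nailed_rels != NUM_CRITICAL_LOCAL_RELS ||
            nailed_indexes != NUM_CRITICAL_LOCAL_INDEXES)
            goto read_failed;
    }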

OK, this is pretty bad in its real-world performance impact.  On a
workload dominated by new connection creation, we've lost about 17% of
our throughput.

To test it, I ran pgbench -s 100 -j 2 -c 6 -r -C -S -T 1200 against a
database which fits in shared_buffers on two different m3.large
instances on AWS (across the network, not on unix sockets).  A typical
run on 9.3.6 looks like this:

scaling factor: 100
query mode: simple
number of clients: 6
number of threads: 2
duration: 1200 s
number of transactions actually processed: 252322
tps = 210.267219 (including connections establishing)
tps = 31958.233736 (excluding connections establishing)
statement latencies in milliseconds:
        0.002515        \set naccounts 100000 * :scale
        0.000963        \setrandom aid 1 :naccounts
        19.042859       SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

Whereas a typical run on 9.3.9 looks like this:

scaling factor: 100
query mode: simple
number of clients: 6
number of threads: 2
duration: 1200 s
number of transactions actually processed: 208180
tps = 173.482259 (including connections establishing)
tps = 31092.866153 (excluding connections establishing)
statement latencies in milliseconds:
        0.002518        \set naccounts 100000 * :scale
        0.000988        \setrandom aid 1 :naccounts
        23.076961       SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
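
For anyone who wants to reproduce this, the full invocation was along
these lines (host and database name below are placeholders; the key flag
is -C, so every transaction pays the connection-establishment cost):

    pgbench -h test-host -p 5432 -c 6 -j 2 -C -S -r -s 100 -T 1200 bench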

Numbers are pretty consistent across four runs each on the two instances
(+/- 4%), so I don't think this is just Amazon variability we're seeing.
Going by the connection-inclusive numbers (210.3 tps on 9.3.6 vs. 173.5
tps on 9.3.9), rebuilding the relcache init file on every connection
really is costing us about 17%.  :-(

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


