Hi!
One of our customers stumble onto a significant performance degradation while running multiple OLAP-like queries on a replica.
After some investigation, it became clear that the problem is in accessing old_snapshot_threshold parameter.
Accessing old_snapshot_threshold parameter is guarded by mutex_threshold. This is not a problem on primary
server, since we rarely call GetOldSnapshotThresholdTimestamp:
5028 void
5029 TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
5030 {
5031 ····if (RelationAllowsEarlyPruning(relation)
5032 ········&& (snapshot)->whenTaken < GetOldSnapshotThresholdTimestamp())
5033 ········ereport(ERROR,
5034 ················(errcode(ERRCODE_SNAPSHOT_TOO_OLD),
5035 ················ errmsg("snapshot too old")));
But in case of a replica, we have to call GetOldSnapshotThresholdTimestamp much often. So, this become a
bottleneck. The customer solve this issue by setting old_snapshot_threshold to 0. But, I think, we can
do something about it.
Some more investigation:
-- On primary --
$ ./bin/psql postgres -c "create database benchmark"
CREATE DATABASE
$ ./bin/pgbench -i -Uorlov -s300 benchmark
dropping old tables...
NOTICE: table "pgbench_accounts" does not exist, skipping
...
creating tables...
generating data (client-side)...
30000000 of 30000000 tuples (100%) done (elapsed 142.37 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 177.67 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 144.45 s, vacuum 0.59 s, primary keys 32.61 s).
-- On secondary --
$ touch 1.sql
$ vim 1.sql
$ cat 1.sql
\set bid random(1, 300)
BEGIN;
SELECT sum(aid) FROM pgbench_accounts where bid = :bid GROUP BY bid;
END;
$ ./bin/pgbench -f 1.sql -p5433 -Uorlov -j10 -c100 -T720 -P1 -n benchmark
pgbench (16devel)
progress: 1.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
...
progress: 20.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
$ perf record -F 99 -a -g --call-graph=dwarf sleep 5
$ perf script --header --fields comm,pid,tid,time,event,ip,sym,dso > file
$ grep s_lock file | wc -l
My proposal is to use atomic for threshold_timestamp and threshold_xid. PFA 0001 patch.
With patch 0001 we got:
$ grep s_lock file2 | wc -l
8
Maybe, we shall go farther and remove mutex_threshold here? This will lead to inconsistency of
threshold_timestamp and threshold_xid, but is this really a problem?
Thoughts?
--
Best regards,
Maxim Orlov.