
old_snapshot_threshold bottleneck on replica

From: Maxim Orlov
Hi!

One of our customers stumbled onto a significant performance degradation while running multiple OLAP-like queries on a replica.
After some investigation, it became clear that the problem is in accessing the old_snapshot_threshold parameter.

Access to the old_snapshot_threshold state is guarded by the mutex_threshold spinlock. This is not a problem on the primary
server, since we rarely call GetOldSnapshotThresholdTimestamp:


void
TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
{
    if (RelationAllowsEarlyPruning(relation)
        && (snapshot)->whenTaken < GetOldSnapshotThresholdTimestamp())
        ereport(ERROR,
                (errcode(ERRCODE_SNAPSHOT_TOO_OLD),
                 errmsg("snapshot too old")));

But in the case of a replica, we have to call GetOldSnapshotThresholdTimestamp much more often, so this becomes a
bottleneck. The customer worked around the issue by setting old_snapshot_threshold to 0, but I think we can
do something about it.
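
For context, the read path in snapmgr.c is roughly the following (a simplified sketch, not the verbatim source). Every reader takes the same shared-memory spinlock just to load one 64-bit value, so concurrent scans on the standby all pile up on mutex_threshold, which is why s_lock dominates the profile below.

TimestampTz
GetOldSnapshotThresholdTimestamp(void)
{
    TimestampTz threshold_timestamp;

    /* every reader serializes on this spinlock */
    SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
    threshold_timestamp = oldSnapshotControl->threshold_timestamp;
    SpinLockRelease(&oldSnapshotControl->mutex_threshold);

    return threshold_timestamp;
}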

Some more investigation:

-- On primary --
$ ./bin/psql postgres -c "create database benchmark"
CREATE DATABASE
$ ./bin/pgbench -i -Uorlov -s300 benchmark
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
...
creating tables...
generating data (client-side)...
30000000 of 30000000 tuples (100%) done (elapsed 142.37 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 177.67 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 144.45 s, vacuum 0.59 s, primary keys 32.61 s).

-- On secondary --
$ touch 1.sql
$ vim 1.sql
$ cat 1.sql
\set bid random(1, 300)
BEGIN;
SELECT sum(aid) FROM pgbench_accounts where bid = :bid GROUP BY bid;
END;
$ ./bin/pgbench -f 1.sql -p5433 -Uorlov -j10 -c100 -T720 -P1 -n benchmark
pgbench (16devel)
progress: 1.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
...
progress: 20.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed

$ perf record -F 99 -a -g --call-graph=dwarf sleep 5
$ perf script --header --fields comm,pid,tid,time,event,ip,sym,dso > file
$ grep s_lock file | wc -l
3486

My proposal is to use atomics for threshold_timestamp and threshold_xid. PFA the 0001 patch.
With patch 0001 we got:
$ grep s_lock file2 | wc -l 
8
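
Roughly, the idea is the following (just a sketch of the direction, not the literal 0001 patch; it assumes threshold_timestamp becomes a pg_atomic_u64 field in OldSnapshotControlData):

/* hot read path: a single lock-free atomic load instead of the spinlock */
TimestampTz
GetOldSnapshotThresholdTimestamp(void)
{
    return (TimestampTz) pg_atomic_read_u64(&oldSnapshotControl->threshold_timestamp);
}

/* writers store with pg_atomic_write_u64() instead of assigning under mutex_threshold */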

Maybe we should go further and remove mutex_threshold here entirely? That would allow threshold_timestamp and
threshold_xid to be read inconsistently with respect to each other, but is this really a problem?

Thoughts?

--
Best regards,
Maxim Orlov.
Attachment

Re: old_snapshot_threshold bottleneck on replica

From: Pavel Borisov
Hi, Maxim!

On Mon, 23 Jan 2023 at 18:40, Maxim Orlov <orlovmg@gmail.com> wrote:
>
> Hi!
>
> One of our customers stumbled onto a significant performance degradation while running multiple OLAP-like queries on a replica.
> After some investigation, it became clear that the problem is in accessing the old_snapshot_threshold parameter.
>
> Access to the old_snapshot_threshold state is guarded by the mutex_threshold spinlock. This is not a problem on the primary
> server, since we rarely call GetOldSnapshotThresholdTimestamp:
>
> void
> TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
> {
>     if (RelationAllowsEarlyPruning(relation)
>         && (snapshot)->whenTaken < GetOldSnapshotThresholdTimestamp())
>         ereport(ERROR,
>                 (errcode(ERRCODE_SNAPSHOT_TOO_OLD),
>                  errmsg("snapshot too old")));
>
> But in the case of a replica, we have to call GetOldSnapshotThresholdTimestamp much more often, so this becomes a
> bottleneck. The customer worked around the issue by setting old_snapshot_threshold to 0, but I think we can
> do something about it.
>
> Some more investigation:
>
> -- On primary --
> $ ./bin/psql postgres -c "create database benchmark"
> CREATE DATABASE
> $ ./bin/pgbench -i -Uorlov -s300 benchmark
> dropping old tables...
> NOTICE:  table "pgbench_accounts" does not exist, skipping
> ...
> creating tables...
> generating data (client-side)...
> 30000000 of 30000000 tuples (100%) done (elapsed 142.37 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 177.67 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 144.45 s, vacuum 0.59 s, primary keys 32.61 s).
>
> -- On secondary --
> $ touch 1.sql
> $ vim 1.sql
> $ cat 1.sql
> \set bid random(1, 300)
> BEGIN;
> SELECT sum(aid) FROM pgbench_accounts where bid = :bid GROUP BY bid;
> END;
> $ ./bin/pgbench -f 1.sql -p5433 -Uorlov -j10 -c100 -T720 -P1 -n benchmark
> pgbench (16devel)
> progress: 1.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> ...
> progress: 20.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
>
> $ perf record -F 99 -a -g --call-graph=dwarf sleep 5
> $ perf script --header --fields comm,pid,tid,time,event,ip,sym,dso > file
> $ grep s_lock file | wc -l
>
> 3486
>
>
> My proposal is to use atomics for threshold_timestamp and threshold_xid. PFA the 0001 patch.
> With patch 0001 we got:
>
> $ grep s_lock file2 | wc -l
> 8
>
>
> Maybe we should go further and remove mutex_threshold here entirely? That would allow threshold_timestamp and
> threshold_xid to be read inconsistently with respect to each other, but is this really a problem?
>
> Thoughts?

I think optimizing locking and switching to atomics wherever it
improves performance is a good direction. If the performance improvement
could be demonstrated in a more direct way, that would be a good argument
for committing it. Personally, I like TPS plots like those in [1].

[1] https://www.postgresql.org/message-id/CALT9ZEHSX1Hpz5xjDA62yHAHtpinkA6hg8Zt-odyxqppmKbQFA%40mail.gmail.com

Kind regards,
Pavel Borisov,
Supabase



Re: old_snapshot_threshold bottleneck on replica

From: Robert Haas
On Mon, Jan 23, 2023 at 9:40 AM Maxim Orlov <orlovmg@gmail.com> wrote:
> One of our customers stumbled onto a significant performance degradation while running multiple OLAP-like queries on a replica.
> After some investigation, it became clear that the problem is in accessing the old_snapshot_threshold parameter.

It has been suggested that we remove that feature entirely.

> My proposal is to use atomics for threshold_timestamp and threshold_xid. PFA the 0001 patch.

This patch changes threshold_timestamp and threshold_xid to use
atomics, but it does not remove mutex_threshold which, according to a
quick glance at the comments, protects exactly those two fields. So,
either:

(1) that mutex also protects something else and the existing comment
is wrong, or

(2) the mutex should have been removed but the patch neglected to do so, or

(3) the mutex is still needed for some reason, in which case either
(3a) the patch isn't actually safe or (3b) the patch needs comments to
explain what the new synchronization model is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: old_snapshot_threshold bottleneck on replica

From: Maxim Orlov


On Tue, 24 Jan 2023 at 18:46, Robert Haas <robertmhaas@gmail.com> wrote:

> (1) that mutex also protects something else and the existing comment
> is wrong, or
>
> (2) the mutex should have been removed but the patch neglected to do so, or
>
> (3) the mutex is still needed for some reason, in which case either
> (3a) the patch isn't actually safe or (3b) the patch needs comments to
> explain what the new synchronization model is.

Yes, you're absolutely right. My first intention was to remove this mutex completely.
But in TransactionIdLimitedForOldSnapshots these variables are used conjointly, so I'm not
sure it is completely safe to remove the mutex. Actually, removing the mutex and switching to atomics
was my first choice. I've run all the tests and no problems were found, but at that time I chose
to be more conservative. Anyway, here is the new variant.
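
To illustrate my doubt (a hypothetical interleaving, not code from the patch): if TransactionIdLimitedForOldSnapshots read both fields lock-free, an update could land between the two loads.

/*
 * Hypothetical lock-free reader, for illustration only:
 *
 *     ts  = pg_atomic_read_u64(&oldSnapshotControl->threshold_timestamp);
 *     ... a writer updates threshold_timestamp AND threshold_xid here ...
 *     xid = pg_atomic_read_u64(&oldSnapshotControl->threshold_xid);
 *
 * ts and xid now come from two different updates, so an xlimit derived
 * from the pair may describe a state that never actually existed.
 */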

--
Best regards,
Maxim Orlov.
Attachment

Re: old_snapshot_threshold bottleneck on replica

From: Robert Haas
On Wed, Jan 25, 2023 at 3:52 AM Maxim Orlov <orlovmg@gmail.com> wrote:
> But in TransactionIdLimitedForOldSnapshots these variables are used conjointly, so I'm not
> sure it is completely safe to remove the mutex.

Well, that's something we - and ideally you, as the patch author -
need to analyze and figure out. We can't just take a shot and hope for
the best.

> Actually, removing the mutex and switching to atomics
> was my first choice. I've run all the tests and no problems were found

Unfortunately, that kind of testing is not very likely to find a
subtle synchronization problem. That's why a theoretical analysis is
so important.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: old_snapshot_threshold bottleneck on replica

From: Maxim Orlov

On Wed, 25 Jan 2023 at 16:52, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 25, 2023 at 3:52 AM Maxim Orlov <orlovmg@gmail.com> wrote:
>
> Well, that's something we - and ideally you, as the patch author -
> need to analyze and figure out. We can't just take a shot and hope for
> the best.

Thank you for your advice. I've dived deeper into the problem and I think the v2 patch is wrong.
Accessing threshold_timestamp and threshold_xid in TransactionIdLimitedForOldSnapshots
without the lock would lead to an improper xlimit calculation.

So, my choice would be (3b). My goal is to optimize access to threshold_timestamp to avoid
repeated spinlock acquisition on read. At the same time, simultaneous access to both variables
(threshold_timestamp and threshold_xid) should still be protected by the spinlock.

I removed the atomic for threshold_xid and added comments on mutex_threshold. PFA v3.
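
Roughly, the v3 synchronization model is this (a sketch, assuming threshold_timestamp is a pg_atomic_u64 and threshold_xid stays a plain field): lone readers of the timestamp use the atomic load shown earlier, while updates of the pair stay under mutex_threshold so paired readers remain consistent.

/* sketch of the v3 write path, not the literal patch */
void
SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
{
    SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
    /* timestamp is stored atomically, so lock-free readers never see a torn value */
    pg_atomic_write_u64(&oldSnapshotControl->threshold_timestamp, (uint64) ts);
    oldSnapshotControl->threshold_xid = xlimit;
    SpinLockRelease(&oldSnapshotControl->mutex_threshold);
}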

--
Best regards,
Maxim Orlov.
Attachment

Re: old_snapshot_threshold bottleneck on replica

From: Robert Haas
On Fri, Jan 27, 2023 at 9:30 AM Maxim Orlov <orlovmg@gmail.com> wrote:
> Thank you for your advice. I've dived deeper into the problem and I think the v2 patch is wrong.

Cool!

> Accessing threshold_timestamp and threshold_xid in TransactionIdLimitedForOldSnapshots
> without the lock would lead to an improper xlimit calculation.

That would be a bummer.

> So, my choice would be (3b). My goal is to optimize access to threshold_timestamp to avoid
> repeated spinlock acquisition on read. At the same time, simultaneous access to both variables
> (threshold_timestamp and threshold_xid) should still be protected by the spinlock.
>
> I removed the atomic for threshold_xid and added comments on mutex_threshold. PFA v3.

Interesting, but it's still not entirely clear to me from reading the
comments why we should think that this is safe.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: old_snapshot_threshold bottleneck on replica

From: Maxim Orlov

On Fri, 27 Jan 2023 at 18:18, Robert Haas <robertmhaas@gmail.com> wrote:
> Interesting, but it's still not entirely clear to me from reading the
> comments why we should think that this is safe.

Overall, I think this is safe, because we do not change the algorithm here. More specifically, threshold_timestamp is only used in a few places:
1) To get the latest value by calling GetOldSnapshotThresholdTimestamp. This will work, since we only change the synchronization here from the spinlock to an atomic.
2) In TransactionIdLimitedForOldSnapshots, where no change in behaviour is made; the synchronization model stays the same as before the patch.
3) In SnapshotTooOldMagicForTest, which is a stub to make old_snapshot_threshold tests appear "working". There is no coupling with threshold_xid here.

So, we have two use cases for threshold_timestamp (a small sketch follows below):
a) When threshold_timestamp is used in conjunction with threshold_xid, we must use the spinlock to synchronize.
b) When threshold_timestamp is read on its own, we use the atomic value.
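
A simplified illustration of case a), assuming the v3 layout (not the literal TransactionIdLimitedForOldSnapshots code):

TimestampTz   threshold_ts;
TransactionId threshold_xid;

/* the pair is read under the spinlock, so both values come from the same update */
SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
threshold_ts  = (TimestampTz) pg_atomic_read_u64(&oldSnapshotControl->threshold_timestamp);
threshold_xid = oldSnapshotControl->threshold_xid;
SpinLockRelease(&oldSnapshotControl->mutex_threshold);

For case b), a lone GetOldSnapshotThresholdTimestamp() call needs no spinlock at all.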

--
Best regards,
Maxim Orlov.

Re: old_snapshot_threshold bottleneck on replica

From: Andres Freund
Hi,

On 2023-01-24 10:46:28 -0500, Robert Haas wrote:
> On Mon, Jan 23, 2023 at 9:40 AM Maxim Orlov <orlovmg@gmail.com> wrote:
> > One of our customers stumbled onto a significant performance degradation while running multiple OLAP-like queries on a replica.
> > After some investigation, it became clear that the problem is in accessing the old_snapshot_threshold parameter.
>
> It has been suggested that we remove that feature entirely.

Indeed. There's a lot of things wrong with it. We have reproducers for
creating wrong query results. Nobody has shown interest in fixing the
problems, for several years by now. It costs users that *do not* use the
feature performance (*).

I think we're doing our users a disservice by claiming to have this feature.

I don't think a lot of the existing code would survive if we were to create a
newer, more maintainable / reliable version of the feature.

Greetings,

Andres Freund

(*) E.g. TestForOldSnapshot() is called in a good number of places, and emits
    quite a bit of code. It's not executed, but the emitted code is large
    enough to lead to worse code being generated.



Re: old_snapshot_threshold bottleneck on replica

From: Thomas Munro
On Tue, Feb 14, 2023 at 9:45 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-24 10:46:28 -0500, Robert Haas wrote:
> > On Mon, Jan 23, 2023 at 9:40 AM Maxim Orlov <orlovmg@gmail.com> wrote:
> > > One of our customers stumbled onto a significant performance degradation while running multiple OLAP-like queries on a replica.
> > > After some investigation, it became clear that the problem is in accessing the old_snapshot_threshold parameter.
> >
> > It has been suggested that we remove that feature entirely.
>
> Indeed. There's a lot of things wrong with it. We have reproducers for
> creating wrong query results. Nobody has shown interest in fixing the
> problems, for several years by now. It costs users that *do not* use the
> feature performance (*).
>
> I think we're doing our users a disservice by claiming to have this feature.
>
> I don't think a lot of the existing code would survive if we were to create a
> newer, more maintainable / reliable version of the feature.

I raised this at the recent developer meeting and the assembled
hackers agreed.  Does anyone think we *shouldn't* drop the feature?  I
volunteered to write a removal patch for v17, so here's a first run
through to find all the traces of this feature.  In this first go I
removed everything I could think of, but we might want to keep some
vestiges.  I guess we might want to keep the registered error
class/code?  Should we invent a place where we keep stuff like #define
TestForOldSnapshot(...) expanding to nothing for some amount of time,
for extensions?  I dunno, I bet extensions doing stuff that
sophisticated already have a bunch of version tests anyway.  I suppose
keeping the GUC wouldn't really be helpful (if you're using it, you
probably want to know that it isn't available anymore and think about
the implications for your application).
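
For what it's worth, such a vestige could be as small as this (purely hypothetical, not part of the attached patch):

/* hypothetical compatibility shim for extensions; expands to nothing */
#ifndef TestForOldSnapshot
#define TestForOldSnapshot(snapshot, relation, page) ((void) 0)
#endif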

Attachment

Re: old_snapshot_threshold bottleneck on replica

From: Thomas Munro
On Wed, Jun 14, 2023 at 3:56 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Feb 14, 2023 at 9:45 AM Andres Freund <andres@anarazel.de> wrote:
> > Indeed. There's a lot of things wrong with it. We have reproducers for
> > creating wrong query results. Nobody has shown interest in fixing the
> > problems, for several years by now. It costs users that *do not* use the
> > feature performance (*).
> >
> > I think we're doing our users a disservice by claiming to have this feature.
> >
> > I don't think a lot of the existing code would survive if we were to create a
> > newer, more maintainable / reliable version of the feature.
>
> I raised this at the recent developer meeting and the assembled
> hackers agreed.  Does anyone think we *shouldn't* drop the feature?  I
> volunteered to write a removal patch for v17, so here's a first run
> through to find all the traces of this feature.  In this first go I
> removed everything I could think of, but we might want to keep some
> vestiges.  I guess we might want to keep the registered error
> class/code?  Should we invent a place where we keep stuff like #define
> TestForOldSnapshot(...) expanding to nothing for some amount of time,
> for extensions?  I dunno, I bet extensions doing stuff that
> sophisticated already have a bunch of version tests anyway.  I suppose
> keeping the GUC wouldn't really be helpful (if you're using it, you
> probably want to know that it isn't available anymore and think about
> the implications for your application).

Done.

I hope we get "snapshot too old" back one day.



Re: old_snapshot_threshold bottleneck on replica

From
Peter Geoghegan
Date:
On Tue, Sep 5, 2023 at 12:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I hope we get "snapshot too old" back one day.

Thanks for working on this. Though I wonder why you didn't do
something closer to a straight revert of the feature. Why is nbtree
still passing around snapshots needlessly?

Also, why are there still many comments referencing the feature?
There's the one above should_attempt_truncation(), for example.
Another appears above init_toast_snapshot(). Are these just
oversights, or was it deliberate? You said something about retaining
vestiges.

--
Peter Geoghegan



Re: old_snapshot_threshold bottleneck on replica

From: Thomas Munro
On Fri, Sep 8, 2023 at 1:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Sep 5, 2023 at 12:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I hope we get "snapshot too old" back one day.
>
> Thanks for working on this. Though I wonder why you didn't do
> something closer to a straight revert of the feature. Why is nbtree
> still passing around snapshots needlessly?
>
> Also, why are there still many comments referencing the feature?
> There's the one above should_attempt_truncation(), for example.
> Another appears above init_toast_snapshot(). Are these just
> oversights, or was it deliberate? You said something about retaining
> vestiges.

Oh.  Not intentional.  Looking now...



Re: old_snapshot_threshold bottleneck on replica

From: Thomas Munro
On Fri, Sep 8, 2023 at 2:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Sep 8, 2023 at 1:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Thanks for working on this. Though I wonder why you didn't do
> > something closer to a straight revert of the feature. Why is nbtree
> > still passing around snapshots needlessly?

The code moved around quite a few times over several commits and quite
a lot since then, which is why I didn't go for a straight revert, but
clearly the manual approach risked missing things.  I think the
attached removes all unused 'snapshot' arguments from AM-internal
functions.  Checked by compiling with clang's -Wunused-parameter, and
then searching for 'snapshot', and excluding the expected cases.

> > Also, why are there still many comments referencing the feature?
> > There's the one above should_attempt_truncation(), for example.
> > Another appears above init_toast_snapshot(). Are these just
> > oversights, or was it deliberate? You said something about retaining
> > vestiges.

Stray comments removed.

Attachment

Re: old_snapshot_threshold bottleneck on replica

From: Peter Geoghegan
On Thu, Sep 7, 2023 at 8:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> The code moved around quite a few times over several commits and quite
> a lot since then, which is why I didn't go for a straight revert, but
> clearly the manual approach risked missing things.

It's not a big deal, obviously.

> I think the
> attached removes all unused 'snapshot' arguments from AM-internal
> functions.  Checked by compiling with clang's -Wunused-parameter, and
> then searching for 'snapshot', and excluding the expected cases.

This looks right to me.

Thanks
--
Peter Geoghegan



Re: old_snapshot_threshold bottleneck on replica

From: Thomas Munro
On Fri, Sep 8, 2023 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This looks right to me.

Thanks, pushed.