Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept - Mailing list pgsql-hackers

From: Ivan Kartyshov
Subject: Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept
Msg-id: 4ae82b354af9a20624e1e1d05932087e@postgrespro.ru
In response to: Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept (Alexander Korotkov <a.korotkov@postgrespro.ru>)
List: pgsql-hackers
Hello, thank you for your review.

Alexander Korotkov wrote on 2018-06-20 20:54:
> Thinking about that more I found that adding a vacuum mark as an extra
> argument to LockAcquireExtended is also wrong.  It would still be hard
> to determine if we should log the lock in LogStandbySnapshot().  We'll
> have to add that flag to the shared memory table of locks.  And that
> looks like a kludge.
> 
> Therefore, I'd like to propose another approach: introduce a new lock
> level.  So, AccessExclusiveLock will be split into two:
> AccessExclusiveLocalLock and AccessExclusiveLock.  Unlike
> AccessExclusiveLock, AccessExclusiveLocalLock will not be logged to the
> standby, and will be used for heap truncation.
> 
> I expect some resistance to my proposal, because fixing this "small
> bug" doesn't deserve a new lock level.  But the current behavior of
> hot_standby_feedback is buggy.  From the user's perspective, the
> hot_standby_feedback option simply doesn't work: it causes the master
> to bloat without protecting standby queries from cancellation.  We need
> to fix that, but I don't see any other proper way to do it except
> adding a new lock level...

Your proposal is very interesting; it made the patch smaller and cleaner.
So I updated the patch and made the changes in accordance with the
proposed concept of a special AccessExclusiveLocalLock.
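
In short, the concept looks roughly like this (a simplified sketch, not
the exact patch; one possible lock-mode numbering is shown, and the
conflict table, MAX_LOCKMODES and lock-mode name updates are omitted):

/* storage/lockdefs.h: add a new lock mode next to AccessExclusiveLock */
#define AccessExclusiveLock       8   /* WAL-logged, may cancel standby queries */
#define AccessExclusiveLocalLock  9   /* same conflicts, but never WAL-logged */

/* storage/lmgr/lock.c, LockAcquireExtended(): log only the "real"
 * AccessExclusiveLock, so the local variant stays invisible to the standby */
if (lockmode == AccessExclusiveLock &&      /* was: lockmode >= AccessExclusiveLock */
    locktag->locktag_type == LOCKTAG_RELATION &&
    !RecoveryInProgress() &&
    XLogStandbyInfoActive())
{
    LogAccessExclusiveLockPrepare();
    log_lock = true;
}

/* commands/vacuumlazy.c, lazy_truncate_heap(): heap truncation now takes the
 * local lock, so no AccessExclusiveLock record is replayed on the standby */
if (!ConditionalLockRelation(onerel, AccessExclusiveLocalLock))
    return;     /* simplified: the real code retries with a timeout */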

> I don't yet understand what the problem is here and why this patch
> should solve it.  As I understand the idea of this patch,
> GetOldestXmin() will initialize the walsender's xmin with the lowest
> xmin of the replication slots.  But the xmin of replication slots will
> be taken into account anyway by GetSnapshotData() called by vacuum.
> 
>> How to test:
>> Set up an async replica, turn on feedback, reduce
>> max_standby_streaming_delay, and make autovacuum aggressive.
> 
> You forgot to mention that one should set up the replication using a
> replication slot.  Naturally, if a replication slot doesn't exist, then
> the master shouldn't keep dead tuples for a disconnected standby,
> because the master doesn't know whether the standby will reconnect or
> is gone forever.

I tried to make hot_standby_feedback work without replication slots, and
found a small window in which vacuum on the master can truncate heap
pages that the standby still needs for a just-started transaction.

How to test:
Set up an async replica, turn on hot_standby_feedback, and reduce
max_standby_streaming_delay.
Make autovacuum more aggressive:
autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0



Test1:
Here we put load on the master and simulate a long transaction with
repeated one-second seqscans on the replica (a SELECT calling pg_sleep
for 1 second, run every 6 seconds).
REPLICA (settings):
    hot_standby = on
    max_standby_streaming_delay = 1s
    hot_standby_feedback = on

MASTER:
    start
    CREATE TABLE test AS (SELECT id, 1 AS value
      FROM generate_series(1,1) id);
    pgbench -T600 -P2 -n --file=master.sql postgres
    (master.sql: update test set value = value;)

REPLICA:
    start
    BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SELECT pg_sleep(value) FROM test;
    \watch 6



---Autovacuum truncates pages at the end.
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.



With the patched version, lazy vacuum truncation passed without fatal errors.



Only occasionally does an error still occur, because this test is too
synthetic:
ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
It is caused by ResolveRecoveryConflictWithSnapshot being raised while
redoing some visibility flags; to avoid this conflict we can run Test2
instead or increase max_standby_streaming_delay.



Test2:
Here we put load on the master and simulate a long transaction on the
replica (by taking a LOCK on the table).
REPLICA (settings):
    hot_standby = on
    max_standby_streaming_delay = 1s
    hot_standby_feedback = on

MASTER:
    start
    CREATE TABLE test AS (SELECT id, 1 AS value
      FROM generate_series(1,1) id);
    pgbench -T600 -P2 -n --file=master.sql postgres
    (master.sql: update test set value = value;)

REPLICA:
    start
    BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
    LOCK TABLE test IN ACCESS SHARE MODE;
    select * from test;
    \watch 6



---Autovacuum truncates pages at the end.
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

With the patched version, lazy vacuum truncation passed without fatal errors.

I would like to hear your opinion on this implementation.

-- 
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
