Thread: [HACKERS] WIP: long transactions on hot standby feedback replica /proof of concept

[HACKERS] WIP: long transactions on hot standby feedback replica /proof of concept

From
i.kartyshov@postgrespro.ru
Date:
Our clients complain about this issue and therefore I want to raise the 
discussion and suggest several solutions to this problem:

I. Why does PG use FATAL when ERROR is enough to release the lock that 
causes the lock conflict?
"If (RecoveryConflictPending && DoingCommandRead)"

II. Do we really need to truncate the table on the hot standby at 
exactly the same time the truncate on the master occurs?

In my case the conflict happens when autovacuum truncates table tbl1 on 
the master while a backend on the replica is performing a long 
transaction involving the same table tbl1. This happens because truncate 
takes an AccessExclusiveLock. To tackle this issue we have several options:

1. We can postpone the truncate on the master until all the replicas 
have finished their transactions (in this case, feedback requests to the 
master should be sent frequently)
Patch 1
vacuum_lazy_truncate.patch

2. Maybe there is some way not to send the AccessExclusiveLock and not 
to truncate the table on the replica right away. We could wait a little 
and truncate tbl1 on the replica later.

Here is a patch that makes the replica skip a truncate WAL record if 
some transaction using the same table is already running on the replica. 
It also forces the master to send a truncate WAL record with the actual 
table length to the replica every time autovacuum runs.
Patch 2
standby_truncate_skip_v1.patch

In this case the transaction which has been running for several hours 
won’t be interrupted because of the truncate. Even if for some reason we 
haven’t truncated this table tbl1 right away, nothing terrible will 
happen. The next truncate WAL record will reduce the file size to the 
actual length (if some inserts or updates have been performed on the master).

If you have any ideas or concerns about the suggested solutions I’ll be 
glad to hear them.
Thanks!

-- 
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment
On Mon, Sep 4, 2017 at 4:34 PM,  <i.kartyshov@postgrespro.ru> wrote:
> Our clients complain about this issue and therefore I want to raise the
> discussion and suggest several solutions to this problem:
>
> I. Why does PG use Fatal when Error is enough to release lock that rises
> lock conflict?
> "If (RecoveryConflictPending && DoingCommandRead)"
>
> II. Do we really need to truncate the table on hot standby exactly at the
> same time when truncate on master occurs?
>
> In my case conflict happens when the autovacuum truncates table tbl1 on
> master while backend on replica is performing a long transaction involving
> the same table tbl1. This happens because truncate takes an
> AccessExclusiveLock. To tackle this issue we have several options:
>
> 1. We can postpone the truncate on the master until all the replicas have
> finished their transactions (in this case, feedback requests to the master
> should be sent frequently)
> Patch 1
> vacuum_lazy_truncate.patch
>
> 2. Maybe there is an option somehow not to send AccessExclusiveLock and not
> to truncate table on the replica right away. We could try to wait a little
> and truncate tbl1 on replica again.
>

Can max_standby_streaming_delay help in this situation (point 2)?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Amit Kapila
Sent: Monday, September 4, 2017 3:32 PM
To: i.kartyshov@postgrespro.ru
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

On Mon, Sep 4, 2017 at 4:34 PM,  <i.kartyshov@postgrespro.ru> wrote:
> Our clients complain about this issue and therefore I want to raise
> the discussion and suggest several solutions to this problem:
>
> I. Why does PG use Fatal when Error is enough to release lock that
> rises lock conflict?
> "If (RecoveryConflictPending && DoingCommandRead)"
>
> II. Do we really need to truncate the table on hot standby exactly at
> the same time when truncate on master occurs?
>
> In my case conflict happens when the autovacuum truncates table tbl1
> on master while backend on replica is performing a long transaction
> involving the same table tbl1. This happens because truncate takes an
> AccessExclusiveLock. To tackle this issue we have several options:
>
> 1. We can postpone the truncate on the master until all the replicas
> have finished their transactions (in this case, feedback requests to
> the master should be sent frequently) Patch 1
> vacuum_lazy_truncate.patch
>
> 2. Maybe there is an option somehow not to send AccessExclusiveLock
> and not to truncate table on the replica right away. We could try to
> wait a little and truncate tbl1 on replica again.
>

Can max_standby_streaming_delay help in this situation (point number - 2)?


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Hello!
In this situation this parameter (max_standby_streaming_delay) won't help,
because if you have subsequent statements on the standby (the following is
from the documentation and from our experience): "Thus, if one query has
resulted in significant delay, subsequent conflicting queries will have much
less grace time until the standby server has caught up again." And you never
know how to set this parameter, except to -1, which means an infinitely
delayed standby.

In our experience it was only autovacuum on the master taking an
AccessExclusiveLock that raised this FATAL message on the standby. Once that
AccessExclusiveLock reaches the standby and max_standby_streaming_delay > -1,
you will definitely, sooner or later, get this FATAL during recovery.
With this patch we try to get rid of the AccessExclusiveLock applied on the
standby while we have an active statement on it.



--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Mon, Sep 4, 2017 at 4:51 PM, Alex Ignatov <a.ignatov@postgrespro.ru> wrote:
In this situation this parameter (max_standby_streaming_delay) won't help, because if you have subsequent statements on the standby (the following is from the documentation and from our experience): "Thus, if one query has resulted in significant delay, subsequent conflicting queries will have much less grace time until the standby server has caught up again." And you never know how to set this parameter, except to -1, which means an infinitely delayed standby.

In our experience it was only autovacuum on the master taking an AccessExclusiveLock that raised this FATAL message on the standby. Once that AccessExclusiveLock reaches the standby and max_standby_streaming_delay > -1, you will definitely, sooner or later, get this FATAL during recovery.
With this patch we try to get rid of the AccessExclusiveLock applied on the standby while we have an active statement on it.

+1,
Aborting a read-only query on the standby because of vacuum on the master seems to be rather unexpected behaviour for the user when hot_standby_feedback is on.
I think we should work on this problem for v11.  Backpatching documentation for previous releases would be good too.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Mon, Sep 4, 2017 at 2:04 PM, <i.kartyshov@postgrespro.ru> wrote:
Our clients complain about this issue and therefore I want to raise the discussion and suggest several solutions to this problem:

I. Why does PG use Fatal when Error is enough to release lock that rises lock conflict?
"If (RecoveryConflictPending && DoingCommandRead)"

II. Do we really need to truncate the table on hot standby exactly at the same time when truncate on master occurs?

In my case conflict happens when the autovacuum truncates table tbl1 on master while backend on replica is performing a long transaction involving the same table tbl1. This happens because truncate takes an AccessExclusiveLock. To tackle this issue we have several options:

1. We can postpone the truncate on the master until all the replicas have finished their transactions (in this case, feedback requests to the master should be sent frequently)
Patch 1
vacuum_lazy_truncate.patch
 
I have the following comments on this patch:
1) You shouldn't use ">=" to compare xids.  You should use the TransactionIdFollowsOrEquals() function, which handles transaction id wraparound correctly (see the sketch after these comments).
2) As I understand it, this patch makes heap truncation happen only when no concurrent transactions are running on either the master or replicas with hot_standby_feedback enabled.  For a busy system, that would literally mean "never truncate the heap after vacuum".
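
For reference, the wraparound-safe comparison works on the circular difference of the two xids rather than on their raw values.  A paraphrased sketch along the lines of TransactionIdFollowsOrEquals() in src/backend/access/transam/transam.c:

bool
TransactionIdFollowsOrEquals(TransactionId id1, TransactionId id2)
{
    int32   diff;

    /* Special xids (Invalid, Bootstrap, Frozen) compare numerically. */
    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return (id1 >= id2);

    /* Signed 32-bit difference handles wraparound of normal xids. */
    diff = (int32) (id1 - id2);
    return (diff >= 0);
}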

2. Maybe there is an option somehow not to send AccessExclusiveLock and not to truncate table on the replica right away. We could try to wait a little and truncate tbl1 on replica again.

Here is a patch that makes replica skip truncate WAL record if some transaction using the same table is already running on replica. And it also forces master to send truncate_wal to replica with actual table length every time autovacuum starts.
Patch 2
standby_truncate_skip_v1.patch

In this case the transaction which is running for several hours won’t be interrupted because of truncate. Even if for some reason we haven’t truncated this table tbl1 right away, nothing terrible will happen. The next truncate wal record will reduce the file size by the actual length (if some inserts or updates have been performed on master).

Since you wrote this patch at my request, let me clarify its idea a little more.

Currently, lazy_truncate_heap() is very careful about getting the AccessExclusiveLock needed to truncate the heap.  It wants neither to block other transactions nor to wait too long for this lock.  If the lock isn't acquired after some number of tries, lazy_truncate_heap() gives up.  However, once the AccessExclusiveLock is acquired on the master, it is replayed on the replicas, where it will conflict with read-only transactions, if any.  That leads to unexpected behaviour when hot_standby_feedback is on.

The idea of the fix is to apply the same logic for getting the AccessExclusiveLock on the standby as on the master: give up if somebody has been holding a conflicting lock for long enough.  That allows the standby to have more free pages at the end of the heap than the master has.  That shouldn't lead to errors, since those pages are empty, but it does let the standby waste some extra space.  In order to mitigate this deficiency, we generate XLOG_SMGR_TRUNCATE records more frequently: at the end of every vacuum.  Therefore, even if the standby accumulates some extra empty pages, that would be corrected during further vacuum cycles.
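
For context, the "careful" behaviour referred to above looks roughly like this (paraphrased from lazy_truncate_heap() in src/backend/commands/vacuumlazy.c; constants and error details simplified):

/*
 * Try to take AccessExclusiveLock without blocking other sessions;
 * give up on the truncation entirely if we can't get it in time.
 */
lock_retry = 0;
while (true)
{
    if (ConditionalLockRelation(onerel, AccessExclusiveLock))
        break;

    if (++lock_retry > VACUUM_TRUNCATE_LOCK_TIMEOUT /
                       VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL)
    {
        /* Somebody else wants the table: skip truncation this time. */
        ereport(elevel,
                (errmsg("\"%s\": stopping truncate due to conflicting lock request",
                        RelationGetRelationName(onerel))));
        return;
    }

    pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L);
}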

@@ -397,7 +425,6 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
  appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
  read_rate, write_rate);
  appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
-
  ereport(LOG,
  (errmsg_internal("%s", buf.data)));
  pfree(buf.data);
@@ -1820,7 +1847,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
  vacrelstats->pages_removed += old_rel_pages - new_rel_pages;
  vacrelstats->rel_pages = new_rel_pages;
 
- ereport(elevel,
+ ereport(WARNING,
  (errmsg("\"%s\": truncated %u to %u pages",
  RelationGetRelationName(onerel),
  old_rel_pages, new_rel_pages),

Ivan, what are these changes supposed to do?

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 2b26173..f04888e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -816,7 +816,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
  XLogStandbyInfoActive())
  {
  LogAccessExclusiveLockPrepare();
- log_lock = true;
+// log_lock = true;
  }
 
  /*

It's certainly a bad idea to disable replaying AccessExclusiveLock *every time*; we should skip taking the AccessExclusiveLock only while writing the XLOG_SMGR_TRUNCATE record.  Besides that, this code violates our coding style.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
On Tue, Sep 5, 2017 at 3:03 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Mon, Sep 4, 2017 at 2:04 PM, <i.kartyshov@postgrespro.ru> wrote:
>>
>> Our clients complain about this issue and therefore I want to raise the
>> discussion and suggest several solutions to this problem:
>>
>> I. Why does PG use Fatal when Error is enough to release lock that rises
>> lock conflict?
>> "If (RecoveryConflictPending && DoingCommandRead)"
>>
>> II. Do we really need to truncate the table on hot standby exactly at the
>> same time when truncate on master occurs?
>>
>> In my case conflict happens when the autovacuum truncates table tbl1 on
>> master while backend on replica is performing a long transaction involving
>> the same table tbl1. This happens because truncate takes an
>> AccessExclusiveLock. To tackle this issue we have several options:
>>
>> 1. We can postpone the truncate on the master until all the replicas have
>> finished their transactions (in this case, feedback requests to the master
>> should be sent frequently)
>> Patch 1
>> vacuum_lazy_truncate.patch
>
>
> I've following comments on this patch:
> 1) You shouldn't use ">=" to compare xids.  You should use
> TransactionIdFollowsOrEquals() function which handles transaction id
> wraparound correctly.
> 2) As I understood, this patch makes heap truncate only when no concurrent
> transactions are running on both master and replicas with
> hot_standby_feedback enabled.  For busy system, it would be literally "never
> do heap truncate after vacuum".
>
>> 2. Maybe there is an option somehow not to send AccessExclusiveLock and
>> not to truncate table on the replica right away. We could try to wait a
>> little and truncate tbl1 on replica again.
>>
>> Here is a patch that makes replica skip truncate WAL record if some
>> transaction using the same table is already running on replica. And it also
>> forces master to send truncate_wal to replica with actual table length every
>> time autovacuum starts.
>> Patch 2
>> standby_truncate_skip_v1.patch
>>
>> In this case the transaction which is running for several hours won’t be
>> interrupted because of truncate. Even if for some reason we haven’t
>> truncated this table tbl1 right away, nothing terrible will happen. The next
>> truncate wal record will reduce the file size by the actual length (if some
>> inserts or updates have been performed on master).
>
>
> Since you wrote this patch under on my request, let me clarify its idea
> little more.
>
> Currently, lazy_truncate_heap() is very careful on getting
> AccessExclusiveLock to truncate heap.  It doesn't want either block other
> transaction or wait for this lock too long.  If lock isn't acquired after
> some number of tries lazy_truncate_heap() gives up.  However, once
> lazy_truncate_heap() acquires AccessExclusiveLock is acquired on master, it
> would be replayed on replicas where it will conflict with read-only
> transactions if any.  That leads to unexpected behaviour when
> hot_standby_feedback is on.
>
> Idea of fixing that is to apply same logic of getting AccessExclusiveLock on
> standby as on master: give up if somebody is holding conflicting lock for
> long enough.  That allows standby to have more free pages at the end of heap
> than master have.  That shouldn't lead to errors since those pages are
> empty, but allows standby to waste some extra space.  In order to mitigate
> this deficiency, we're generating XLOG_SMGR_TRUNCATE records more frequent:
> on finish of every vacuum.  Therefore, if even standby gets some extra space
> of empty pages, it would be corrected during further vacuum cycles.
>

I think one deficiency of this solution is that it will keep on
generating extra WAL even if the standby doesn't need it (because the
standby has successfully truncated the relation).  I don't know if we
can just get away with saying that an additional WAL record per vacuum
cycle is harmless, especially when it doesn't serve any purpose (as in
the cases where the standby is always able to truncate the heap on the
first WAL record).  Am I missing some part of the solution which avoids it?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On 4 September 2017 at 09:06, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

> Aborting read-only query on standby because of vacuum on master seems to be
> rather unexpected behaviour for user when hot standby feedback is on.
> I think we should work on this problem for v11.

Happy to help. My suggestion would be to discuss a potential theory of
operation and then code a patch.

As Alexander says, simply skipping truncation if standby is busy isn't
a great plan.

If we defer an action on standby replay, when and by whom will it be
applied? What happens if the standby is shut down or crashes while an
action is pending?

Perhaps altering the way truncation requires an AccessExclusiveLock
would help workloads on both master and standby? If a Relation knew it
had a pending truncation then scans wouldn't need to go past newsize.
Once older lockers have gone we could simply truncate without the
lock. It would need a few pushups but seems less scary than trying to
make pending standby actions work well enough to commit.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
i.kartyshov@postgrespro.ru
Date:
Alexander Korotkov wrote on 2017-09-05 00:33:
> I've following comments on this patch:
> 1) You shouldn't use ">=" to compare xids.  You should use
> TransactionIdFollowsOrEquals() function which handles transaction id
> wraparound correctly.
I fixed it, and additionally I added a GUC parameter that can turn this 
behavior off. This parameter can be set in the postgresql.conf 
configuration file or with the ALTER SYSTEM command. For the changes to 
take effect, call the pg_reload_conf() function or reload the database 
server.
The changes are in vacuum_lazy_truncate_v3.patch.

> 2) As I understood, this patch makes heap truncate only when no
> concurrent transactions are running on both master and replicas with
> hot_standby_feedback enabled.  For busy system, it would be literally
> "never do heap truncate after vacuum".
Yes, the difficulty is that if we have a lot of replicas and the 
requests to them overlap each other, then truncation will never occur 
on a loaded system.

> Perhaps, it's certainly bad idea to disable replaying
> AccessExclusiveLock *every time*, we should skip taking
> AccessExclusiveLock only while writing XLOG_SMGR_TRUNCATE record.
> Besides that, this code violates out coding style.
The patch (standby_truncate_skip_v2.patch) fixes this behaviour: it now 
skips writing XLOG_STANDBY_LOCK records (with AccessExclusiveLock) only 
while autovacuum truncates the table.

Amit Kapila wrote on 2017-09-05 12:52:
> I think one deficiency of this solution is that it will keep on
> generating extra WAL even if standby doesn't need it (as standby has
> successfully truncated the relation).  I don't know if we can just get
> away by saying that an additional WAL record per vacuum cycle is
> harmless especially when that doesn't serve any purpose (like for the
> cases when standby is always able to truncate the heap on first WAL
> record).  Am I missing some part of solution which avoids it?
You are right: this implementation generates extra WAL records, and 
there are cases where they serve no purpose. If you have any ideas on 
how to avoid that, I'll be glad to hear them.

Simon Riggs wrote on 2017-09-05 14:44:
> If we defer an action on standby replay, when and who will we apply
> it? What happens if the standby is shutdown or crashes while an action
> is pending.
> 
> Perhaps altering the way truncation requires an AccessExclusiveLock
> would help workloads on both master and standby? If a Relation knew it
> had a pending truncation then scans wouldn't need to go past newsize.
> Once older lockers have gone we could simply truncate without the
> lock. Would need a few pushups but seems less scary then trying to
> make pending standby actions work well enough to commit.
Yes, that is the first idea that came to my mind, and the most 
difficult one. I think that in a few days I'll finish a patch 
implementing this idea.


------
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment
Hello. I made some bugfixes and rewrote the patch.

Simon Riggs wrote on 2017-09-05 14:44:
> As Alexander says, simply skipping truncation if standby is busy isn't
> a great plan.
> 
> If we defer an action on standby replay, when and who will we apply
> it? What happens if the standby is shutdown or crashes while an action
> is pending.

After a crash the standby server will go back to the nearest checkpoint 
and replay all of its WAL to reach a consistent state, and only then 
open for read-only load. (All collected WAL will be replayed in the 
right order; I mean both the WAL records that truncate the table and 
the WAL records that make the table grow.)

-- 
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Tue, Oct 24, 2017 at 10:56 AM, Ivan Kartyshov <i.kartyshov@postgrespro.ru> wrote:
Hello. I made some bugfixes and rewrite the patch.

Simon Riggs wrote on 2017-09-05 14:44:
As Alexander says, simply skipping truncation if standby is busy isn't
a great plan.

If we defer an action on standby replay, when and who will we apply
it? What happens if the standby is shutdown or crashes while an action
is pending.

After a crash the standby server will go back to the nearest checkpoint and replay
all of its WAL to reach a consistent state, and only then open for read-only load.
(All collected WAL will be replayed in the right order; I mean both the WAL records
that truncate the table and the WAL records that make the table grow.)

--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -49,6 +49,8 @@
 #include "utils/ps_status.h"
 #include "utils/resowner_private.h"
 
+#include <execinfo.h>
+#include <dlfcn.h>
 
 /* This configuration variable is used to set the lock table size */
 int max_locks_per_xact; /* set by guc.c */

Why do you need these includes?

+ FreeFakeRelcacheEntry(rel);
+ } else
+ {
+ XLogFlush(lsn);
+ }

This code violates our coding style.  According to it, "else" shouldn't share its line with braces.

Also, patch definitely needs some general comment explaining what's going on.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:
> Hello. I made some bugfixes and rewrite the patch.

I don't think it's a good idea to deliberately leave the state of the
standby different from the state of the  master on the theory that it
won't matter.  I feel like that's something that's likely to come back
to bite us.

Giving LockAcquireExtended() an argument that causes some, but not all,
AccessExclusiveLocks to be logged also seems pretty ugly, especially
because there are no comments whatsoever explaining the rationale.
rationale.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Oct 30, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
> <i.kartyshov@postgrespro.ru> wrote:
>> Hello. I made some bugfixes and rewrite the patch.
>
> I don't think it's a good idea to deliberately leave the state of the
> standby different from the state of the  master on the theory that it
> won't matter.  I feel like that's something that's likely to come back
> to bite us.

I agree with Robert. What happens if we intentionally don't apply the
truncation WAL and then switch over? If we insert a tuple on the new
master server into a block that had been truncated on the old master,
will the WAL apply on the new standby fail? I guess there are such
corner cases causing failures of WAL replay after switch-over.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Tue, Oct 31, 2017 at 5:16 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Oct 30, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
> <i.kartyshov@postgrespro.ru> wrote:
>> Hello. I made some bugfixes and rewrite the patch.
>
> I don't think it's a good idea to deliberately leave the state of the
> standby different from the state of the  master on the theory that it
> won't matter.  I feel like that's something that's likely to come back
> to bite us.

I agree with Robert. What happen if we intentionally don't apply the
truncation WAL and switched over? If we insert a tuple on the new
master server to a block that has been truncated on the old master,
the WAL apply on the new standby will fail? I guess there are such
corner cases causing failures of WAL replay after switch-over.

Yes, that looks dangerous.  One approach to cope with that could be teaching the heap redo functions to handle such situations.  But I expect this approach to be criticized for bad design, and I understand the fairness of that criticism.

However, from the user's perspective, the current behavior of hot_standby_feedback is just broken, because it both increases bloat and doesn't guarantee that a read-only query on the standby won't be cancelled because of vacuum.  Therefore, we should be looking for a solution: if one approach isn't good enough, then we should look for another approach.

I can propose the following alternative approach: teach read-only queries on hot standby to tolerate concurrent relation truncation.  Then, when a non-existent heap page is accessed on hot standby, we know that it was removed by a concurrent truncation and it can be assumed to be empty.  Any thoughts?
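
To make that concrete, here is a minimal sketch of the idea (the function name is hypothetical; this is not existing PostgreSQL code): during recovery, reading a block at or beyond the current relation size would be treated as reading an empty page instead of raising an error.

/* Hypothetical sketch of the proposal above. */
Buffer
ReadBufferTolerant(Relation rel, BlockNumber blkno)
{
    if (RecoveryInProgress() &&
        blkno >= RelationGetNumberOfBlocks(rel))
    {
        /*
         * The page was removed by a replayed truncation of already-dead
         * tuples; the caller treats InvalidBuffer as "page is empty".
         */
        return InvalidBuffer;
    }

    return ReadBuffer(rel, blkno);
}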

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
On Tue, Oct 31, 2017 at 6:17 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Tue, Oct 31, 2017 at 5:16 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Mon, Oct 30, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>> > On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
>> > <i.kartyshov@postgrespro.ru> wrote:
>> >> Hello. I made some bugfixes and rewrite the patch.
>> >
>> > I don't think it's a good idea to deliberately leave the state of the
>> > standby different from the state of the  master on the theory that it
>> > won't matter.  I feel like that's something that's likely to come back
>> > to bite us.
>>
>> I agree with Robert. What happen if we intentionally don't apply the
>> truncation WAL and switched over? If we insert a tuple on the new
>> master server to a block that has been truncated on the old master,
>> the WAL apply on the new standby will fail? I guess there are such
>> corner cases causing failures of WAL replay after switch-over.
>
>
> Yes, that looks dangerous.  One approach to cope that could be teaching heap
> redo function to handle such these situations.  But I expect this approach
> to be criticized for bad design.  And I understand fairness of this
> criticism.
>
> However, from user prospective of view, current behavior of
> hot_standby_feedback is just broken, because it both increases bloat and
> doesn't guarantee that read-only query on standby wouldn't be cancelled
> because of vacuum.  Therefore, we should be looking for solution: if one
> approach isn't good enough, then we should look for another approach.
>
> I can propose following alternative approach: teach read-only queries on hot
> standby to tolerate concurrent relation truncation.  Therefore, when
> non-existent heap page is accessed on hot standby, we can know that it was
> deleted by concurrent truncation and should be assumed to be empty.  Any
> thoughts?
>

Did you also mean that applying the WAL for AccessExclusiveLock would always
be skipped on standby servers, to allow scans to access the relation?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



On Tue, Oct 31, 2017 at 2:47 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> However, from user prospective of view, current behavior of
> hot_standby_feedback is just broken, because it both increases bloat and
> doesn't guarantee that read-only query on standby wouldn't be cancelled
> because of vacuum.  Therefore, we should be looking for solution: if one
> approach isn't good enough, then we should look for another approach.
>
> I can propose following alternative approach: teach read-only queries on hot
> standby to tolerate concurrent relation truncation.  Therefore, when
> non-existent heap page is accessed on hot standby, we can know that it was
> deleted by concurrent truncation and should be assumed to be empty.  Any
> thoughts?

Sounds like it might break MVCC?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Thu, Nov 2, 2017 at 5:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Oct 31, 2017 at 2:47 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> However, from user prospective of view, current behavior of
> hot_standby_feedback is just broken, because it both increases bloat and
> doesn't guarantee that read-only query on standby wouldn't be cancelled
> because of vacuum.  Therefore, we should be looking for solution: if one
> approach isn't good enough, then we should look for another approach.
>
> I can propose following alternative approach: teach read-only queries on hot
> standby to tolerate concurrent relation truncation.  Therefore, when
> non-existent heap page is accessed on hot standby, we can know that it was
> deleted by concurrent truncation and should be assumed to be empty.  Any
> thoughts?

Sounds like it might break MVCC?

I don't see why it might be broken.  VACUUM truncates the heap only when the tail to be truncated is already empty.  By the time the truncate WAL record is applied, the previous WAL records deleting all the tuples in that tail have already been applied.  Thus, even if MVCC were broken and we missed some tuples after the heap truncation, they were gone before the heap truncation anyway.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Wed, Nov 1, 2017 at 5:55 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Oct 31, 2017 at 6:17 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Tue, Oct 31, 2017 at 5:16 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Mon, Oct 30, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>> > On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
>> > <i.kartyshov@postgrespro.ru> wrote:
>> >> Hello. I made some bugfixes and rewrite the patch.
>> >
>> > I don't think it's a good idea to deliberately leave the state of the
>> > standby different from the state of the  master on the theory that it
>> > won't matter.  I feel like that's something that's likely to come back
>> > to bite us.
>>
>> I agree with Robert. What happen if we intentionally don't apply the
>> truncation WAL and switched over? If we insert a tuple on the new
>> master server to a block that has been truncated on the old master,
>> the WAL apply on the new standby will fail? I guess there are such
>> corner cases causing failures of WAL replay after switch-over.
>
>
> Yes, that looks dangerous.  One approach to cope that could be teaching heap
> redo function to handle such these situations.  But I expect this approach
> to be criticized for bad design.  And I understand fairness of this
> criticism.
>
> However, from user prospective of view, current behavior of
> hot_standby_feedback is just broken, because it both increases bloat and
> doesn't guarantee that read-only query on standby wouldn't be cancelled
> because of vacuum.  Therefore, we should be looking for solution: if one
> approach isn't good enough, then we should look for another approach.
>
> I can propose following alternative approach: teach read-only queries on hot
> standby to tolerate concurrent relation truncation.  Therefore, when
> non-existent heap page is accessed on hot standby, we can know that it was
> deleted by concurrent truncation and should be assumed to be empty.  Any
> thoughts?
>

You also meant that the applying WAL for AccessExclusiveLock is always
skipped on standby servers to allow scans to access the relation?

Definitely not every AccessExclusiveLock WAL record should be skipped, only those that were emitted during heap truncation.  There are other cases when AccessExclusiveLock WAL records are emitted, for instance, during DDL operations.  But I'd like to focus on the AccessExclusiveLock WAL records caused by VACUUM for now.  It's kind of understandable for users that DDL might cancel a read-only query on the standby.  So, if you're running a long report query, you should hold off on your DDL.  But VACUUM is a different story.  It runs automatically when you do normal DML queries.

The AccessExclusiveLock WAL records produced by VACUUM could either not be emitted at all, or be somehow distinguished and skipped on the standby.  I haven't thought that through at this level of detail yet.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
On Fri, Nov 3, 2017 at 5:57 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Thu, Nov 2, 2017 at 5:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > I can propose following alternative approach: teach read-only queries on
>> > hot
>> > standby to tolerate concurrent relation truncation.  Therefore, when
>> > non-existent heap page is accessed on hot standby, we can know that it
>> > was
>> > deleted by concurrent truncation and should be assumed to be empty.  Any
>> > thoughts?
>>
>> Sounds like it might break MVCC?
>
> I don't know why it might be broken.  VACUUM truncates heap only when tail
> to be truncated is already empty.  When applying truncate WAL record,
> previous WAL records deleting all those tuples in the tail are already
> applied.  Thus, if even MVCC is broken and we will miss some tuples after
> heap truncation, they were anyway were gone before heap truncation.

Ah - I was thinking of the TRUNCATE command, rather than truncation by
VACUUM.  Your argument makes sense, although the case where the
relation is truncated and later re-extended might need some thought.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Sat, Nov 4, 2017 at 7:04 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Nov 1, 2017 at 5:55 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Tue, Oct 31, 2017 at 6:17 PM, Alexander Korotkov
>> <a.korotkov@postgrespro.ru> wrote:
>> > On Tue, Oct 31, 2017 at 5:16 AM, Masahiko Sawada <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> On Mon, Oct 30, 2017 at 10:16 PM, Robert Haas <robertmhaas@gmail.com>
>> >> wrote:
>> >> > On Tue, Oct 24, 2017 at 1:26 PM, Ivan Kartyshov
>> >> > <i.kartyshov@postgrespro.ru> wrote:
>> >> >> Hello. I made some bugfixes and rewrite the patch.
>> >> >
>> >> > I don't think it's a good idea to deliberately leave the state of the
>> >> > standby different from the state of the  master on the theory that it
>> >> > won't matter.  I feel like that's something that's likely to come
>> >> > back
>> >> > to bite us.
>> >>
>> >> I agree with Robert. What happen if we intentionally don't apply the
>> >> truncation WAL and switched over? If we insert a tuple on the new
>> >> master server to a block that has been truncated on the old master,
>> >> the WAL apply on the new standby will fail? I guess there are such
>> >> corner cases causing failures of WAL replay after switch-over.
>> >
>> >
>> > Yes, that looks dangerous.  One approach to cope that could be teaching
>> > heap
>> > redo function to handle such these situations.  But I expect this
>> > approach
>> > to be criticized for bad design.  And I understand fairness of this
>> > criticism.
>> >
>> > However, from user prospective of view, current behavior of
>> > hot_standby_feedback is just broken, because it both increases bloat and
>> > doesn't guarantee that read-only query on standby wouldn't be cancelled
>> > because of vacuum.  Therefore, we should be looking for solution: if one
>> > approach isn't good enough, then we should look for another approach.
>> >
>> > I can propose following alternative approach: teach read-only queries on
>> > hot
>> > standby to tolerate concurrent relation truncation.  Therefore, when
>> > non-existent heap page is accessed on hot standby, we can know that it
>> > was
>> > deleted by concurrent truncation and should be assumed to be empty.  Any
>> > thoughts?
>> >
>>
>> You also meant that the applying WAL for AccessExclusiveLock is always
>> skipped on standby servers to allow scans to access the relation?
>
>
> Definitely not every AccessExclusiveLock WAL records should be skipped, but
> only whose were emitted during heap truncation.  There are other cases when
> AccessExclusiveLock WAL records are emitted, for instance, during DDL
> operations.  But, I'd like to focus on AccessExclusiveLock WAL records
> caused by VACUUM for now.  It's kind of understandable for users that DDL
> might cancel read-only query on standby.  So, if you're running long report
> query then you should wait with your DDL.  But VACUUM is a different story.
> It runs automatically when you do normal DML queries.
>
> AccessExclusiveLock WAL records by VACUUM could be either not emitted, or
> somehow distinguished and skipped on standby.  I haven't thought out that
> level of detail for now.
>

I understand. My concern is that, so far, we cannot distinguish whether
an AccessExclusiveLock WAL record came from the vacuum truncation or
from some other operation that required an AccessExclusiveLock. So I
agree that we need to invent a way to do that.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Thank you for your valuable comments. I've made a few adjustments.

The main goal of my changes is to let long read-only transactions run on 
the replica when hot_standby_feedback is turned on.


Patch1 - hsfeedback_av_truncate.patch stops 
ResolveRecoveryConflictWithLock from occurring on the replica after 
autovacuum lazily truncates the heap on the master, cutting some pages 
at the end. When hot_standby_feedback is on, we know that the autovacuum 
does not remove anything superfluous that could still be needed on the 
standby, so there is no need to raise any ResolveRecoveryConflict*.

1) Add an isautovacuum flag to xl_standby_locks and xl_smgr_truncate, 
which tells us that autovacuum generated them (see the sketch after this 
list).

2) When autovacuum decides to trim the table (using lazy_truncate_heap), 
it takes an AccessExclusiveLock and sends this lock to the replica, but 
the replica should ignore the AccessExclusiveLock if 
hot_standby_feedback=on.

3) When the autovacuum truncate WAL message is replayed on a replica, it 
takes an ExclusiveLock on the table, so as not to interfere with 
read-only requests.
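
A rough sketch of the WAL record change in point 1 (the isautovacuum 
field is the proposed addition; the existing fields are shown from 
memory and may differ slightly between versions):

#include "storage/block.h"
#include "storage/relfilenode.h"

typedef struct xl_smgr_truncate
{
    BlockNumber blkno;          /* new relation length in blocks */
    RelFileNode rnode;          /* relation being truncated */
    int         flags;          /* which forks to truncate */
    bool        isautovacuum;   /* proposed: record was generated by autovacuum */
} xl_smgr_truncate;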

We have two cases for resolving ResolveRecoveryConflictWithLock if the 
timers (max_standby_streaming_delay and max_standby_archive_delay) have 
run out:
- the backend is idle in transaction (waiting for input) - in this case 
the backend will be sent SIGTERM;
- the backend is running a query - in this case the running transaction 
will be aborted.

How to test:
Make async replica, turn on feedback and reduce 
max_standby_streaming_delay.
Make autovacuum more aggressive.
autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0

Test1:
Here we put load on the master and simulate a long transaction with 
repeated 1-second SEQSCANs on the replica (by calling pg_sleep with a 
1-second duration every 6 seconds).
MASTER        REPLICA
     hot_standby = on
     max_standby_streaming_delay = 1s
     hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 1 AS value
FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
     start
     BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
     SELECT pg_sleep(value) FROM test;
     \watch 6

---Autovacuum truncate pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version, the lazy vacuum truncation passed without fatal 
errors.

Only sometimes an ERROR occurs, because this test is too synthetic:
ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
It is caused by ResolveRecoveryConflictWithSnapshot raised while 
replaying some visibility flags; to avoid this conflict we can run test2 
or increase max_standby_streaming_delay.

Test2:
Here we put load on the master and simulate a long transaction on the 
replica (by taking a LOCK on the table).
MASTER        REPLICA
     hot_standby = on
     max_standby_streaming_delay = 1s
     hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 1 AS value FROM generate_series(1,1) 
id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
     start
     BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
     LOCK TABLE test IN ACCESS SHARE MODE;
     select * from test;
     \watch 6

---Autovacuum truncate pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version, the lazy vacuum truncation passed without fatal errors.

Test3:
Here we put load on the master and simulate a long transaction with 
repeated 1-second SEQSCANs on the replica (by calling pg_sleep with a 
1-second duration every 6 seconds).
MASTER        REPLICA
     hot_standby = on
     max_standby_streaming_delay = 4s
     hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 200 AS value
FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
     start
     BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
     SELECT pg_sleep(value) FROM test;

---Autovacuum truncate pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version, the lazy vacuum truncation passed without fatal errors.

This way we can run transactions with SEQSCAN, INDEXSCAN, or BITMAPSCAN.


Patch2 - hsfeedback_noninvalide_xmin.patch
When the walsender is initialized, its xmin in the procarray is set to 
GetOldestXmin() in order to prevent autovacuum running on the master 
from truncating the relation and removing some pages that are required 
by the replica. This might happen if the master's autovacuum and the 
replica's query started simultaneously, and the replica has not yet 
reported its xmin value.
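
A minimal sketch of that idea follows (the function name is 
hypothetical, and GetOldestXmin()'s exact signature differs between 
releases):

/*
 * Hypothetical sketch: pin the walsender's advertised xmin at startup so
 * that vacuum cannot remove tuples the standby may still need before its
 * first hot_standby_feedback message arrives.
 */
static void
WalSndInitXmin(void)
{
    TransactionId oldest = GetOldestXmin(NULL, true);

    MyPgXact->xmin = oldest;
}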

How to test:
Make async replica, turn on feedback, reduce max_standby_streaming_delay 
and aggressive autovacuum.
autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0

Test:
Here we start the replica and begin a repeatable read transaction on 
the table, then we stop the replica's postmaster to prevent the 
walreceiver worker from starting (on master startup) and sending the 
master its transaction xmin in a hot_standby_feedback message.
MASTER        REPLICA
start
CREATE TABLE test AS (SELECT id, 1 AS value FROM 
generate_series(1,10000000) id);
stop
     start
     BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
     SELECT * FROM test;
     stop postmaster with gdb
start
DELETE FROM test WHERE id > 0;
wait till autovacuum deletes the tuples and the xmin has changed
             release postmaster with gdb
--- Result on replica
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be 
removed.

There is one feature of the standby's behavior which lets us allow the 
autovacuum to cut off pages of the table (at the end of the relation) 
that no one else needs (because they contain only dead and removed 
tuples). So if a SEQSCAN or another *SCAN on the standby mdread()s a 
page that is damaged or has been deleted, it will receive a zero page 
and will not abort the query with an ERROR.

Could you give me your thoughts on these patches?

-- 
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Hi!

Thank you for your work on this issue.

On Wed, Feb 28, 2018 at 5:25 PM Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:
> Thank you for your valuable comments. I've made a few adjustments.
>
> The main goal of my changes is to let long read-only transactions run on
> replica if hot_standby_feedback is turned on.
>
>
> Patch1 - hsfeedback_av_truncate.patch is made to stop
> ResolveRecoveryConflictWithLock occurs on replica, after autovacuum lazy
> truncates heap on master cutting some pages at the end. When
> hot_standby_feedback is on, we know that the autovacuum does not remove
> anything superfluous, which could be needed on standby, so there is no
> need to rise any ResolveRecoveryConflict*.
>
> 1) Add to xl_standby_locks and xl_smgr_truncate isautovacuum flag, which
> tells us that autovacuum generates them.
>
> 2) When autovacuum decides to trim the table (using lazy_truncate_heap),
> it takes AccessExclusiveLock and sends this lock to the replica, but
> replica should ignore  AccessExclusiveLock if hot_standby_feedback=on.
>
> 3) When autovacuum truncate wal message is replayed on a replica, it
> takes ExclusiveLock on a table, so as not to interfere with read-only
> requests.

I see that you use the IsAutoVacuumWorkerProcess() function to determine
whether this access exclusive lock is caused by vacuum.  I see at least
two issues with that.

1) That wouldn't work for manually run vacuums.  I understand that query
cancel caused by autovacuum is probably the most annoying case, but an
unnecessarily partial bug fix is undesirable.
2) Access exclusive locks are logged not only immediately by the process
that takes them; all such locks are also logged in LogStandbySnapshot().
LogStandbySnapshot() is called by the checkpointer, bgwriter and others.
So, IsAutoVacuumWorkerProcess() wouldn't help in such cases.  I
understand that heap truncation is typically fast, and the probability
that some process dumps all the access exclusive locks while a
truncation is in progress is low.  But nevertheless we need to handle
that properly.

Based on this, I think we should give up on using
IsAutoVacuumWorkerProcess() to detect locks caused by vacuum.

Thinking about that more, I found that adding a vacuum mark as an extra
argument to LockAcquireExtended() is also wrong.  It would still be hard
to determine whether we should log the lock in LogStandbySnapshot().
We'd have to add that flag to the shared memory table of locks, and that
looks like a kludge.

Therefore, I'd like to propose another approach: introduce a new lock
level.  AccessExclusiveLock would be split into two levels:
AccessExclusiveLocalLock and AccessExclusiveLock.  Unlike
AccessExclusiveLock, AccessExclusiveLocalLock would not be logged to the
standby, and it would be used for heap truncation (see the sketch below).
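
A sketch of how that could look in LockAcquireExtended()
(AccessExclusiveLocalLock is the proposed, not yet existing, lock level;
the surrounding condition is paraphrased from the existing code quoted
earlier in this thread):

if (lockmode >= AccessExclusiveLock &&
    lockmode != AccessExclusiveLocalLock &&   /* proposed: never ship to standby */
    locktag->locktag_type == LOCKTAG_RELATION &&
    !RecoveryInProgress() &&
    XLogStandbyInfoActive())
{
    LogAccessExclusiveLockPrepare();
    log_lock = true;
}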

I expect some resistance to my proposal, because fixing this "small
bug" may not seem to deserve a new lock level.  But the current behavior
of hot_standby_feedback is buggy.  From the user's perspective, the
hot_standby_feedback option simply doesn't work: it causes the master to
bloat without protecting standby queries from cancellation.  We need to
fix that, but I don't see another proper way to do it except adding a
new lock level...

> Patch2 - hsfeedback_noninvalide_xmin.patch
> When walsender is initialized, its xmin in PROCARRAY is set to
> GetOldestXmin() in order to prevent autovacuum running on master from
> truncating relation and removing some pages that are required by
> replica. This might happen if master's autovacuum and replica's query
> started simultaneously. And the replica has not yet reported its xmin
> value.

I don't yet understand what the problem is here and why this patch
should solve it.  As I understand the idea of this patch, GetOldestXmin()
will initialize the walsender's xmin with the lowest xmin of the
replication slots.  But the xmin of replication slots will be taken into
account anyway by GetSnapshotData() called by vacuum.

> How to test:
> Make async replica, turn on feedback, reduce max_standby_streaming_delay
> and aggressive autovacuum.

You forgot to mention that one should set up the replication using a
replication slot.  Naturally, if a replication slot doesn't exist, then
the master shouldn't keep dead tuples for a disconnected standby,
because the master doesn't know whether the standby will reconnect or is
gone forever.

> autovacuum = on
> autovacuum_max_workers = 1
> autovacuum_naptime = 1s
> autovacuum_vacuum_threshold = 1
> autovacuum_vacuum_cost_delay = 0
>
> Test:
> Here we will start replica and begi repeatable read transaction on
> table, then we stop replicas postmaster to prevent starting walreceiver
> worker (on master startup) and sending master it`s transaction xmin over
> hot_standby_feedback message.
> MASTER        REPLICA
> start
> CREATE TABLE test AS (SELECT id, 1 AS value FROM
> generate_series(1,10000000) id);
> stop
>      start

I guess you meant the start of a standby session here, not the
postmaster.  Otherwise, I don't understand how table test would reach
the standby.

>      BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
>      SELECT * FROM test;
>      stop postmaster with gdb
> start
> DELETE FROM test WHERE id > 0;
> wait till autovacuum delete and changed xmin
>              release postmaster with gdb
> --- Result on replica
> FATAL: terminating connection due to conflict with recovery
> DETAIL: User query might have needed to see row versions that must be
> removed.

I've tried it with a replication slot, and with just
hsfeedback_av_truncate.patch everything works fine: the query on the
standby isn't cancelled.  I don't see how
hsfeedback_noninvalide_xmin.patch influences the situation.  Without a
replication slot, it always fails regardless of any attached patches,
and that's the right behavior.

BTW, I've registered this patch for the July commitfest.  It would be
nice to finally fix this long-standing bug.  Although this is a bug fix,
I doubt it should be backpatched, especially if we accept the new lock
level.  We should rather apply a patch to the documentation of supported
versions, stating that vacuum can still cancel queries on standbys
regardless of the hot_standby_feedback option.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Thu, Mar 1, 2018 at 3:24 AM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:
> Could you give me your ideas over these patches.

Hi Ivan,

Not sure if this is expected at this stage or not, but just in case
you didn't know... the tests under src/test/subscription seem to be
broken:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/409358734

-- 
Thomas Munro
http://www.enterprisedb.com


Hello, thank you for your review.

Alexander Korotkov wrote on 2018-06-20 20:54:
> Thinking about that more I found that adding vacuum mark as an extra
> argument to LockAcquireExtended is also wrong.  It would be still hard
> to determine if we should log the lock in LogStandbySnapshot().  We'll
> have to add that flag to shared memory table of locks.  And that looks
> like a kludge.
> 
> Therefore, I'd like to propose another approach: introduce new lock
> level.  So, AccessExclusiveLock will be split into two
> AccessExclusiveLocalLock and AccessExclusiveLock.  In spite of
> AccessExclusiveLock, AccessExclusiveLocalLock will be not logged to
> standby, and used for heap truncation.
> 
> I expect some resistance to my proposal, because fixing this "small
> bug" doesn't deserve new lock level.  But current behavior of
> hot_standby_feedback is buggy.  From user prospective,
> hot_standby_feedback option is just non-worker, which causes master to
> bloat without protection for standby queries from cancel.  We need to
> fix that, but I don't see other proper way to do that except adding
> new lock level...

Your proposal is very interesting; it made the patch smaller and 
cleaner. So I updated the patch and made changes in accordance with the 
proposed concept of a special AccessExclusiveLocalLock.

> I don't yet understand what is the problem here and why this patch
> should solve that.  As I get idea of this patch, GetOldestXmin() will
> initialize xmin of walsender with lowest xmin of replication slot.
> But xmin of replication slots will be anyway taken into account by
> GetSnapshotData() called by vacuum.
> 
>> How to test:
>> Make async replica, turn on feedback, reduce 
>> max_standby_streaming_delay
>> and aggressive autovacuum.
> 
> You forgot to mention, that one should setup the replication using
> replication slot.  Naturally, if replication slot isn't exist, then
> master shouldn't keep dead tuples for disconnected standby.  Because
> master doesn't know if standby will reconnect or is it gone forever.

I tried to make hot_standby_feedback work without replication slots,
and I found a little window in which vacuum on the master can truncate
heap pages that the standby needs for a just-started transaction.

How to test:
Make async replica, turn on feedback and reduce
max_standby_streaming_delay.
Make autovacuum more aggressive.
autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0



Test1:
Here we put load on the master and simulate a long transaction with
repeated 1-second SEQSCANs on the replica (by calling pg_sleep with a
1-second duration every 6 seconds).
MASTER        REPLICA
      hot_standby = on
      max_standby_streaming_delay = 1s
      hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 1 AS value
FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
      start
      BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
      SELECT pg_sleep(value) FROM test;
      \watch 6



When autovacuum truncates the tail pages, the result on the replica is:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

With the patched version, lazy vacuum truncation completes without FATAL
errors.



Only occasionally does an error occur, because this test is very synthetic:
ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
It is raised by ResolveRecoveryConflictWithSnapshot while redoing some
visibility flags; to avoid this conflict we can use Test 2 or increase
max_standby_streaming_delay.



Test 2:
Put load on the master and simulate a long transaction on the replica
(by taking a LOCK on the table).

Replica settings:
      hot_standby = on
      max_standby_streaming_delay = 1s
      hot_standby_feedback = on

On the master (after starting it):
      CREATE TABLE test AS (SELECT id, 1 AS value
        FROM generate_series(1,1) id);
      pgbench -T600 -P2 -n --file=master.sql postgres
      (master.sql contains: update test set value = value;)

On the replica (after starting it):
      BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
      LOCK TABLE test IN ACCESS SHARE MODE;
      select * from test;
      \watch 6



When autovacuum truncates the tail pages, the result on the replica is:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

With the patched version, lazy vacuum truncation completes without FATAL
errors.

I would like to hear your opinion on this implementation.

-- 
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
Hi!

On Thu, Aug 9, 2018 at 11:26 PM Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:
> Alexander Korotkov wrote on 2018-06-20 20:54:
> > Thinking about that more I found that adding vacuum mark as an extra
> > argument to LockAcquireExtended is also wrong.  It would be still hard
> > to determine if we should log the lock in LogStandbySnapshot().  We'll
> > have to add that flag to shared memory table of locks.  And that looks
> > like a kludge.
> >
> > Therefore, I'd like to propose another approach: introduce new lock
> > level.  So, AccessExclusiveLock will be split into two
> > AccessExclusiveLocalLock and AccessExclusiveLock.  In spite of
> > AccessExclusiveLock, AccessExclusiveLocalLock will be not logged to
> > standby, and used for heap truncation.
> >
> > I expect some resistance to my proposal, because fixing this "small
> > bug" doesn't deserve new lock level.  But current behavior of
> > hot_standby_feedback is buggy.  From user prospective,
> > hot_standby_feedback option is just non-worker, which causes master to
> > bloat without protection for standby queries from cancel.  We need to
> > fix that, but I don't see other proper way to do that except adding
> > new lock level...
>
> Your offer is very interesting, it made patch smaller and more
> beautiful.
> So I update patch and made changes accordance with the proposed concept
> of
> special AccessExclusiveLocalLock.

> I would like to here you opinion over this implementation.

In general this patch looks good to me.  It seems that the comments and
documentation need minor enhancements.  I'll make them and post the
revised patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
Hi!

On Fri, Aug 10, 2018 at 5:07 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Thu, Aug 9, 2018 at 11:26 PM Ivan Kartyshov
> <i.kartyshov@postgrespro.ru> wrote:
> > Alexander Korotkov wrote on 2018-06-20 20:54:
> > > Thinking about that more I found that adding vacuum mark as an extra
> > > argument to LockAcquireExtended is also wrong.  It would be still hard
> > > to determine if we should log the lock in LogStandbySnapshot().  We'll
> > > have to add that flag to shared memory table of locks.  And that looks
> > > like a kludge.
> > >
> > > Therefore, I'd like to propose another approach: introduce new lock
> > > level.  So, AccessExclusiveLock will be split into two
> > > AccessExclusiveLocalLock and AccessExclusiveLock.  In spite of
> > > AccessExclusiveLock, AccessExclusiveLocalLock will be not logged to
> > > standby, and used for heap truncation.
> > >
> > > I expect some resistance to my proposal, because fixing this "small
> > > bug" doesn't deserve new lock level.  But current behavior of
> > > hot_standby_feedback is buggy.  From user prospective,
> > > hot_standby_feedback option is just non-worker, which causes master to
> > > bloat without protection for standby queries from cancel.  We need to
> > > fix that, but I don't see other proper way to do that except adding
> > > new lock level...
> >
> > Your offer is very interesting, it made patch smaller and more
> > beautiful.
> > So I update patch and made changes accordance with the proposed concept
> > of
> > special AccessExclusiveLocalLock.
>
> > I would like to here you opinion over this implementation.
>
> In general this patch looks good for me.  It seems that comments and
> documentation need minor enhancements.  I'll make them and post the
> revised patch.

Find the revised patch attached.  It contains some enhancements to the
comments, formatting and documentation.  BTW, I decided that we should
enumerate the ACCESS EXCLUSIVE LOCAL lock before the ACCESS EXCLUSIVE lock,
because we enumerate lock modes in ascending order of strength.  Since
ACCESS EXCLUSIVE is WAL-logged, it is definitely the "stronger" of the two.
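
To make the ordering concrete, here is a rough sketch of how the lock-mode
numbering in src/include/storage/lockdefs.h could look with the new level
slotted in below ACCESS EXCLUSIVE.  The existing names and values match
current sources; the value of the new mode and the bumped MaxLockMode are
illustrative and may differ from the attached patch.

/* Sketch of lockdefs.h with the new mode; numbering is illustrative. */
#define NoLock                      0
#define AccessShareLock             1   /* SELECT */
#define RowShareLock                2   /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock            3   /* INSERT, UPDATE, DELETE */
#define ShareUpdateExclusiveLock    4   /* VACUUM (non-FULL), ANALYZE, ... */
#define ShareLock                   5   /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock       6   /* like EXCLUSIVE MODE, allows ROW EXCLUSIVE */
#define ExclusiveLock               7   /* blocks ROW SHARE/SELECT...FOR UPDATE */
#define AccessExclusiveLocalLock    8   /* new: as strong as ACCESS EXCLUSIVE
                                         * locally, but not WAL-logged to
                                         * standbys; used for heap truncation */
#define AccessExclusiveLock         9   /* ALTER TABLE, DROP TABLE, VACUUM FULL,
                                         * ...; WAL-logged to standbys */
#define MaxLockMode                 9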

I think that the introduction of a new lock level is a significant change
and can't be backpatched.  But I think it is worth backpatching a note to
the documentation clarifying why hot_standby_feedback might behave
unexpectedly.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment
On Wed, Feb 28, 2018 at 11:24 PM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:
> Thank you for your valuable comments. I've made a few adjustments.
>

Thank you for working on this!

> The main goal of my changes is to let long read-only transactions run on
> replica if hot_standby_feedback is turned on.

FWIW the idea proposed on thread [1], which allows us to disable heap
truncation by vacuum on a per-table basis, might help with this issue.
Since the original problem on that thread is a performance problem, an
alternative proposal might be better there, but I think that approach would
be helpful for this issue too, and it is simple.  A downside of that idea
is that it requires additional work from the user to configure the
reloption for each table.

[1] https://www.postgresql.org/message-id/CAHGQGwE5UqFqSq1%3DkV3QtTUtXphTdyHA-8rAj4A%3DY%2Be4kyp3BQ%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
Hi!

On Tue, Aug 14, 2018 at 12:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Feb 28, 2018 at 11:24 PM, Ivan Kartyshov
> <i.kartyshov@postgrespro.ru> wrote:
> > The main goal of my changes is to let long read-only transactions run on
> > replica if hot_standby_feedback is turned on.
>
> FWIW the idea proposed on the thread[1], which allows us to disable
> the heap truncation by vacuum per tables, might help this issue. Since
> the original problem on that thread is a performance problem an
> alternative proposal would be better, but I think the way would be
> helpful for this issue too and is simple. A downside of that idea is
> that there is additional work for user to configure the reloption to
> each tables.

Thank you for pointing out this thread.  It's possible to cope with this
downside by introducing a GUC that would serve as the default when the
reloption is not defined.  But the real downside is the inability to
truncate the heap.  So, options are good in that they provide a workaround
for users, but that shouldn't stop us from fixing the issues around heap
truncation.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Thu, Aug 16, 2018 at 2:16 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Tue, Aug 14, 2018 at 12:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Feb 28, 2018 at 11:24 PM, Ivan Kartyshov
> > <i.kartyshov@postgrespro.ru> wrote:
> > > The main goal of my changes is to let long read-only transactions run on
> > > replica if hot_standby_feedback is turned on.
> >
> > FWIW the idea proposed on the thread[1], which allows us to disable
> > the heap truncation by vacuum per tables, might help this issue. Since
> > the original problem on that thread is a performance problem an
> > alternative proposal would be better, but I think the way would be
> > helpful for this issue too and is simple. A downside of that idea is
> > that there is additional work for user to configure the reloption to
> > each tables.
>
> Thank you for pointing this thread.  It's possible to cope this
> downside to introduce GUC which would serve as default, when reloption
> is not defined.  But real downside is inability to truncate the heap.
> So, options are good if they provide workaround for users, but that
> shouldn't stop us from fixing issues around heap truncation.

So, do we have any objections to committing this?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Hi,

On 2018-08-17 18:00:20 +0300, Alexander Korotkov wrote:
> So, do we have any objections to committing this?

I think this needs more review by other senior hackers in the community.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2018-08-17 18:00:20 +0300, Alexander Korotkov wrote:
>> So, do we have any objections to committing this?

> I think this needs more review by other senior hackers in the community.

TBH it sounds like a horrible hack.  Disable vacuum truncation?
That can't be a good idea.  If it were something we were doing
behind the scenes as a temporary expedient, maybe we could hold
our noses and do it for awhile.  But once you introduce a reloption
based on this, you can't ever get rid of it.

I think we need to spend more effort looking for a less invasive,
less compromised solution.

            regards, tom lane


On 2018-08-17 11:35:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-08-17 18:00:20 +0300, Alexander Korotkov wrote:
> >> So, do we have any objections to committing this?
> 
> > I think this needs more review by other senior hackers in the community.
> 
> TBH it sounds like a horrible hack.  Disable vacuum truncation?

There's another patch, which I thought Alexander was referring to, that
does something a bit smarter.  On a super short skim it seems to
introduce a separate type of AEL lock that's not replicated, by my
reading?

Greetings,

Andres Freund


Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Fri, Aug 17, 2018 at 6:41 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-08-17 11:35:40 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2018-08-17 18:00:20 +0300, Alexander Korotkov wrote:
> > >> So, do we have any objections to committing this?
> >
> > > I think this needs more review by other senior hackers in the community.
> >
> > TBH it sounds like a horrible hack.  Disable vacuum truncation?
>
> There's another patch, which I thought Alexander was referring to, that
> does something a bit smarger.  On a super short skim it seems to
> introduce a separate type of AEL lock that's not replicated, by my
> reading?

Yes, that's correct.  On the standby, read-only queries can tolerate
concurrent heap truncation.  So there is no point in replicating the AEL
to the standby.  Previous versions of the patch tried to do that by
introducing a flag, but it appears that the correct way to do this is
simply a new, non-replicated type of AEL.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> On Fri, Aug 17, 2018 at 6:41 PM Andres Freund <andres@anarazel.de> wrote:
>> There's another patch, which I thought Alexander was referring to, that
>> does something a bit smarger.  On a super short skim it seems to
>> introduce a separate type of AEL lock that's not replicated, by my
>> reading?

> Yes, that's correct.  On standby read-only queries can tolerate
> concurrent heap truncation.

Uh, what???

            regards, tom lane


Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Fri, Aug 17, 2018 at 8:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> > On Fri, Aug 17, 2018 at 6:41 PM Andres Freund <andres@anarazel.de> wrote:
> >> There's another patch, which I thought Alexander was referring to, that
> >> does something a bit smarger.  On a super short skim it seems to
> >> introduce a separate type of AEL lock that's not replicated, by my
> >> reading?
>
> > Yes, that's correct.  On standby read-only queries can tolerate
> > concurrent heap truncation.
>
> Uh, what???

VACUUM truncates the heap relation only after deleting all the tuples from
the tail pages.  So, on the standby, the heap truncation record is replayed
only after the heap tuple deletion records have been replayed.  Therefore,
if a query on the standby needs to see some of those tuples, WAL replay
must pause (or the query must be canceled) before the corresponding tuples
are deleted, thanks to our recovery-conflict-with-snapshot logic.  By the
time we replay the heap truncation record, no query should be able to see
tuples in the tail pages.

And in md.c we already have logic to return zeroed pages when trying to
read past the end of the relation during recovery.

        /*
         * Short read: we are at or past EOF, or we read a partial block at
         * EOF.  Normally this is an error; upper levels should never try to
         * read a nonexistent block.  However, if zero_damaged_pages is ON or
         * we are InRecovery, we should instead return zeroes without
         * complaining.  This allows, for example, the case of trying to
         * update a block that was later truncated away.
         */
        if (zero_damaged_pages || InRecovery)
            MemSet(buffer, 0, BLCKSZ);
        else
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
                            blocknum, FilePathName(v->mdfd_vfd),
                            nbytes, BLCKSZ)));

But I'm concerned that FileSeek() could return an error, and also about
what _mdfd_getseg() would do on a truncated segment.  It seems that in
recovery it will automatically extend the relation, which is unacceptable
for this purpose.  So, sorry for the bother, this patch definitely needs to
be revised.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> On Fri, Aug 17, 2018 at 8:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
>>> Yes, that's correct.  On standby read-only queries can tolerate
>>> concurrent heap truncation.

>> Uh, what???

> VACUUM truncates heap relation only after deletion of all the tuples
> from tail pages.

Right; the problem I'm worried about is possible accesses to
already-removed pages by concurrent scans.

> And in md.c we already have logic to return zeroed pages, when trying
> to read past relation in recovery.

But then you are injecting bad pages into the shared buffer arena.
In any case, depending on that behavior seems like a bad idea, because
it's a pretty questionable kluge in itself.

Another point is that the truncation code attempts to remove all
to-be-truncated-away pages from the shared buffer arena, but that only
works if nobody else is loading such pages into shared buffers
concurrently.  In the presence of concurrent scans, we might be left
with valid-looking buffers for pages that have been truncated away
on-disk.  That could cause all sorts of fun later.  Yeah, the buffers
should contain only dead tuples ... but, for example, they might not
be hinted dead.  If somebody sets one of those hint bits and then
writes the buffer back out to disk, you've got real problems.

Perhaps there's some gold to be mined by treating truncation locks
differently from other AELs, but I don't think you can just ignore
them on the standby, any more than they can be ignored on the master.

            regards, tom lane


Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Fri, Aug 17, 2018 at 9:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> > On Fri, Aug 17, 2018 at 8:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> >>> Yes, that's correct.  On standby read-only queries can tolerate
> >>> concurrent heap truncation.
>
> >> Uh, what???
>
> > VACUUM truncates heap relation only after deletion of all the tuples
> > from tail pages.
>
> Right; the problem I'm worried about is possible accesses to
> already-removed pages by concurrent scans.
>
> > And in md.c we already have logic to return zeroed pages, when trying
> > to read past relation in recovery.
>
> But then you are injecting bad pages into the shared buffer arena.
> In any case, depending on that behavior seems like a bad idea, because
> it's a pretty questionable kluge in itself.
>
> Another point is that the truncation code attempts to remove all
> to-be-truncated-away pages from the shared buffer arena, but that only
> works if nobody else is loading such pages into shared buffers
> concurrently.  In the presence of concurrent scans, we might be left
> with valid-looking buffers for pages that have been truncated away
> on-disk.  That could cause all sorts of fun later.  Yeah, the buffers
> should contain only dead tuples ... but, for example, they might not
> be hinted dead.  If somebody sets one of those hint bits and then
> writes the buffer back out to disk, you've got real problems.
>
> Perhaps there's some gold to be mined by treating truncation locks
> differently from other AELs, but I don't think you can just ignore
> them on the standby, any more than they can be ignored on the master.

Thank you for the explanation.  I see that injecting past-EOF pages into
shared buffers doesn't look good.  I am starting to think about letting the
caller of ReadBuffer() (or a variant of it) handle the past-EOF situation.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> On Fri, Aug 17, 2018 at 9:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Another point is that the truncation code attempts to remove all
>> to-be-truncated-away pages from the shared buffer arena, but that only
>> works if nobody else is loading such pages into shared buffers
>> concurrently.  In the presence of concurrent scans, we might be left
>> with valid-looking buffers for pages that have been truncated away
>> on-disk.  That could cause all sorts of fun later.  Yeah, the buffers
>> should contain only dead tuples ... but, for example, they might not
>> be hinted dead.  If somebody sets one of those hint bits and then
>> writes the buffer back out to disk, you've got real problems.

> Thank you for the explanation.  I see that injecting past OEF pages
> into shared buffers doesn't look good.  I start thinking about letting
> caller of ReadBuffer() (or its variation) handle past OEF situation.

That'd still have the same race condition, though: between the time
we start to drop the doomed pages from shared buffers, and the time
we actually perform ftruncate, concurrent scans could re-load such
pages into shared buffers.

Could it work to ftruncate first and flush shared buffers after?
Probably not, I think the write-back-dirty-hint-bits scenario
breaks that one.

If this were easy, we'd have fixed it years ago :-(.  It'd sure
be awfully nice not to need AEL during autovacuum, even transiently;
but I'm not sure how we get there without adding an unpleasant amount
of substitute locking in table scans.

            regards, tom lane


Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Fri, Aug 17, 2018 at 9:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But then you are injecting bad pages into the shared buffer arena.
> In any case, depending on that behavior seems like a bad idea, because
> it's a pretty questionable kluge in itself.
>
> Another point is that the truncation code attempts to remove all
> to-be-truncated-away pages from the shared buffer arena, but that only
> works if nobody else is loading such pages into shared buffers
> concurrently.  In the presence of concurrent scans, we might be left
> with valid-looking buffers for pages that have been truncated away
> on-disk.  That could cause all sorts of fun later.  Yeah, the buffers
> should contain only dead tuples ... but, for example, they might not
> be hinted dead.  If somebody sets one of those hint bits and then
> writes the buffer back out to disk, you've got real problems.

I don't yet see how that's possible... count_nondeletable_pages() ensures
that no to-be-truncated-away page contains any used item.  From reading [1]
I gathered that we might even have live tuples if the truncation failed.
But if the truncation didn't fail, we shouldn't get this hint-bit problem.
Please correct me if I'm wrong.

After reading [1] and [2] I gathered that there are at least three
different issues with heap truncation:
1) Data corruption on a file truncation error (explained in [1]).
2) Expensive scanning of the whole of shared buffers before file truncation.
3) Cancellation of read-only queries on the standby even if
hot_standby_feedback is on, caused by replication of AccessExclusiveLock.

It seems that fixing any of those issues requires a redesign of heap
truncation.  So, ideally, a redesign of heap truncation should fix all of
the issues above, or at least it should be understood how the remaining
issues can be fixed later using the new design.

I would like to share some of my sketchy thoughts about a new heap
truncation design.  Let's imagine we introduced a dirty_barrier buffer
flag, which prevents a dirty buffer from being written out (and
correspondingly from being evicted).  Then the truncation algorithm could
look like this (a sketch follows the list):

1) Acquire ExclusiveLock on the relation.
2) Calculate the truncation point using count_nondeletable_pages(), while
simultaneously placing the dirty_barrier flag on dirty buffers and saving
their numbers to an array.  Assuming no writes are being performed
concurrently, no to-be-truncated-away pages should be written out from
this point on.
3) Truncate the data files.
4) Iterate over the past-truncation-point buffers and clear the dirty and
dirty_barrier flags from them (using the numbers we saved to the array in
step #2).
5) Release the relation lock.
*) If an exception happens after step #2, iterate over the
past-truncation-point buffers and clear the dirty_barrier flags from them
(using the numbers we saved to the array in step #2).
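
Here is a rough sketch of how the five steps might fit together.  The
helpers MarkDirtyBuffersWithBarrier(), ClearBarrierFlags() and
ClearBarrierAndDirtyFlags() are hypothetical names standing in for the
buffer-manager work described in steps 2, 4 and the exception case;
LockRelation(), RelationTruncate() and count_nondeletable_pages() are
existing routines, though their signatures are simplified here.

/* Sketch only: the proposed truncation flow. */
static void
truncate_heap_sketch(Relation rel)
{
    BlockNumber new_nblocks;
    Buffer     *barriers;       /* buffers flagged in step 2 */
    int         nbarriers;

    /* Step 1: ExclusiveLock only, so it is not replicated to standbys */
    LockRelation(rel, ExclusiveLock);

    /*
     * Step 2: find the truncation point and, while scanning, put the
     * dirty_barrier flag on dirty past-truncation-point buffers so that
     * they cannot be written out (and thus cannot re-extend the file).
     */
    new_nblocks = count_nondeletable_pages(rel);
    nbarriers = MarkDirtyBuffersWithBarrier(rel, new_nblocks, &barriers);

    PG_TRY();
    {
        /* Step 3: shrink the data files */
        RelationTruncate(rel, new_nblocks);
    }
    PG_CATCH();
    {
        /* Exception case: drop only the barrier flags, keep buffers dirty */
        ClearBarrierFlags(barriers, nbarriers);
        PG_RE_THROW();
    }
    PG_END_TRY();

    /* Step 4: the flagged buffers are now both clean and unbarred */
    ClearBarrierAndDirtyFlags(barriers, nbarriers);

    /* Step 5 */
    UnlockRelation(rel, ExclusiveLock);
}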

After heap truncation using this algorithm, shared buffers may contain
past-EOF buffers.  But those buffers are empty (no used items) and clean.
So, read-only queries shouldn't hint those buffers dirty, because there are
no used items.  Normally, these buffers will simply be evicted from the
shared buffer arena.  If relation extension happens shortly after heap
truncation, then some of those buffers could be encountered again after the
extension.  I think this situation could be handled.  For instance, we can
teach vacuum to mark a page as new once all of its tuples are gone.

We're taking only an ExclusiveLock here.  And assuming we teach our scans
to treat a page-past-EOF situation as no-visible-tuples-found, read-only
queries will run concurrently with the heap truncation.  Also, we don't
have to scan the whole of shared buffers: only past-truncation-point
buffers are scanned in step #2, and later the flags are cleared only from
the past-truncation-point dirty buffers.  Data corruption on a truncation
error shouldn't happen either, because we don't throw away any dirty
buffers before ensuring that the data files were successfully truncated.

The problem I see with this approach so far is that placing too many
dirty_barrier flags can affect concurrent activity.  To cope with that we
could, for instance, truncate the relation in multiple iterations when we
find too many past-truncation-point dirty buffers.

Any thoughts?

Links.
1. https://www.postgresql.org/message-id/flat/5BBC590AE8DF4ED1A170E4D48F1B53AC%40tunaPC
2.
https://www.postgresql.org/message-id/flat/CAHGQGwE5UqFqSq1%3DkV3QtTUtXphTdyHA-8rAj4A%3DY%2Be4kyp3BQ%40mail.gmail.com

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
On Tue, Aug 21, 2018 at 4:10 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> After reading [1] and [2] I got that there are at least 3 different
> issues with heap truncation:
> 1) Data corruption on file truncation error (explained in [1]).
> 2) Expensive scanning of the whole shared buffers before file truncation.
> 3) Cancel of read-only queries on standby even if hot_standby_feedback
> is on, caused by replication of AccessExclusiveLock.
>
> It seems that fixing any of those issues requires redesign of heap
> truncation.  So, ideally redesign of heap truncation should fix all
> the issues of above.  Or at least it should be understood how the rest
> of issues can be fixed later using the new design.
>
> I would like to share some my sketchy thoughts about new heap
> truncation design.  Let's imagine we introduced dirty_barrier buffer
> flag, which prevents dirty buffer from being written (and
> correspondingly evicted).  Then truncation algorithm could look like
> this:
>
> 1) Acquire ExclusiveLock on relation.
> 2) Calculate truncation point using count_nondeletable_pages(), while
> simultaneously placing dirty_barrier flag on dirty buffers and saving
> their numbers to array.  Assuming no writes are performing
> concurrently, no to-be-truncated-away pages should be written from
> this point.
> 3) Truncate data files.
> 4) Iterate past truncation point buffers and clean dirty and
> dirty_barrier flags from them (using numbers we saved to array on step
> #2).
> 5) Release relation lock.
> *) On exception happen after step #2, iterate past truncation point
> buffers and clean dirty_barrier flags from them (using numbers we
> saved to array on step #2)
>
> After heap truncation using this algorithm, shared buffers may contain
> past-OEF buffers.  But those buffers are empty (no used items) and
> clean.  So, real-only queries shouldn't hint those buffers dirty
> because there are no used items.  Normally, these buffers will be just
> evicted away from the shared buffer arena.  If relation extension will
> happen short after heap truncation then some of those buffers could be
> found after relation extension.  I think this situation could be
> handled.  For instance, we can teach vacuum to claim page as new once
> all the tuples were gone.
>
> We're taking only exclusive lock here.  And assuming we will teach our
> scans to treat page-past-OEF situation as no-visible-tuples-found,
> concurrent read-only queries will work concurrently with heap
> truncate.  Also we don't have to scan whole shared buffers, only past
> truncation point buffers are scanned at step #2.  Later flags are
> cleaned only from truncation point dirty buffers.  Data corruption on
> truncation error also shouldn't happen as well, because we don't
> forget to write any dirty buffers before insure that data files were
> successfully truncated.
>
> The problem I see with this approach so far is placing too many
> dirty_barrier flags can affect concurrent activity.  In order to cope
> that we may, for instance, truncate relation in multiple iterations
> when we find too many past truncation point dirty buffers.
>
> Any thoughts?

Given that I've had no feedback on this idea yet, I'll try to implement a
PoC patch for it.  It doesn't look difficult, and we'll see how it works.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Tue, Aug 21, 2018 at 9:10 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> After heap truncation using this algorithm, shared buffers may contain
> past-OEF buffers.  But those buffers are empty (no used items) and
> clean.  So, real-only queries shouldn't hint those buffers dirty
> because there are no used items.  Normally, these buffers will be just
> evicted away from the shared buffer arena.  If relation extension will
> happen short after heap truncation then some of those buffers could be
> found after relation extension.  I think this situation could be
> handled.  For instance, we can teach vacuum to claim page as new once
> all the tuples were gone.

I think this all sounds pretty dangerous and fragile, especially in
view of the pluggable storage work.  If we start to add new storage
formats, deductions based on the specifics of the current heap's
hint-bit behavior may turn out not to be valid.  Now maybe you could
speculate that it won't matter because perhaps truncation will work
differently in other storage formats too, but it doesn't sound to me
like we'd be wise to bet on it working out that way.

IIRC, Andres had some patches revising shared buffer management that
allowed the ending block number to be maintained in shared memory.
I'm waving my hands here, but with that kind of a system you can
imagine that maybe there could also be a flag bit indicating whether a
truncation is in progress.  So, reduce the number of pages and set the
bit; then zap all the pages above that value that are still present in
shared_buffers; then clear the bit.  Or maybe we don't even need the
bit, but I think we do need some kind of race-free mechanism to make
sure that we never try to read pages that either have been truncated
away or are in the process of being truncated away.
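
Purely to illustrate the hand-waving above, a per-relation shared-memory
entry for such a scheme might carry something like the following.  The
struct and field names are invented for this sketch and do not correspond
to any existing or proposed code.

/* Illustrative only: a per-relation size entry in shared memory. */
typedef struct RelSizeEntry
{
    RelFileNode rnode;          /* which relation this entry describes */
    BlockNumber nblocks;        /* authoritative ending block number */
    bool        truncating;     /* true while pages >= nblocks are being
                                 * zapped from shared_buffers and truncated
                                 * on disk */
} RelSizeEntry;

/*
 * The race-free rule would be: the truncating backend first lowers nblocks
 * and sets "truncating"; readers never load a block >= nblocks; once the
 * stale buffers are gone and the files are truncated on disk, "truncating"
 * is cleared again.
 */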

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] WIP: long transactions on hot standby feedback replica/ proof of concept

From
Alexander Korotkov
Date:
Hi!

Thank you for feedback.

On Sun, Aug 26, 2018 at 4:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 21, 2018 at 9:10 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > After heap truncation using this algorithm, shared buffers may contain
> > past-OEF buffers.  But those buffers are empty (no used items) and
> > clean.  So, real-only queries shouldn't hint those buffers dirty
> > because there are no used items.  Normally, these buffers will be just
> > evicted away from the shared buffer arena.  If relation extension will
> > happen short after heap truncation then some of those buffers could be
> > found after relation extension.  I think this situation could be
> > handled.  For instance, we can teach vacuum to claim page as new once
> > all the tuples were gone.
>
> I think this all sounds pretty dangerous and fragile, especially in
> view of the pluggable storage work.  If we start to add new storage
> formats, deductions based on the specifics of the current heap's
> hint-bit behavior may turn out not to be valid.  Now maybe you could
> speculate that it won't matter because perhaps truncation will work
> differently in other storage formats too, but it doesn't sound to me
> like we'd be wise to bet on it working out that way.

Hmm, I'm not especially concerned about pluggable storage here.  Pluggable
storage implementations decide for themselves how they manage vacuum,
including relation truncation if needed.  They might or might not reuse the
relation truncation function that we have for the heap.  What we need to
provide for that relation truncation function is an understandable and
predictable interface.  So, if the relation truncation function cuts off
trailing relation pages that were previously marked as new, that looks fair
enough to me.

The aspect I'm more concerned about here is whether we lose the ability to
detect some IO errors if we don't distinguish new pages from pages whose
tuples were removed by vacuum.

> IIRC, Andres had some patches revising shared buffer management that
> allowed the ending block number to be maintained in shared memory.
> I'm waving my hands here, but with that kind of a system you can
> imagine that maybe there could also be a flag bit indicating whether a
> truncation is in progress.  So, reduce the number of page and set the
> bit; then zap all the pages above that value that are still present in
> shared_buffers; then clear the bit.  Or maybe we don't even need the
> bit, but I think we do need some kind of race-free mechanism to make
> sure that we never try to read pages that either have been truncated
> away or in the process of being truncated away.

If we had some per-relation shared memory information, then we could make
it work even without a special write-barrier bit.  Instead we could place a
mark on the whole relation saying "please hold off on writes past the
following pending truncation point".  Also, having the ending block number
in shared memory could save us from trying to read blocks past EOF.  So I'm
sure that when Andres's work on revising shared buffer management is
complete, we will be able to solve these problems better.  But there is
also a question of time.  As I understand it, revising shared buffer
management could not realistically be committed to PostgreSQL 12, and we
have a pretty nasty set of problems here.  For me it would be nice to do
something about them during this release cycle.  But we should certainly
keep in mind how this solution should be revised once we have the new
shared buffer management.


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Mon, Aug 27, 2018 at 11:38 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> The aspect I'm more concerned here about is whether we miss ability
> for detecting some of IO errors, if we don't distinguish new pages
> from pages whose tuples were removed by vacuum.

My main concern is correctness.  If we ever have valid-looking buffers
in shared_buffers after the corresponding data has been truncated away
on disk, we've got to make sure that nobody ever confuses one of them
with an actually-valid buffer.  Reading over your algorithm, I can't
convince myself that you have that case nailed down tightly enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


> On Fri, Aug 24, 2018 at 5:53 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> Given I've no feedback on this idea yet, I'll try to implement a PoC
> patch for that.  It doesn't look to be difficult.  And we'll see how
> does it work.

Unfortunately, the current version of the patch doesn't pass the tests and
fails at initdb.  But maybe you already have the PoC you were talking
about, one that also incorporates the feedback from this thread?  For now
I'll move it to the next CF.


Hi,

On 2018-11-29 13:45:15 +0100, Dmitry Dolgov wrote:
> > On Fri, Aug 24, 2018 at 5:53 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> >
> > Given I've no feedback on this idea yet, I'll try to implement a PoC
> > patch for that.  It doesn't look to be difficult.  And we'll see how
> > does it work.
> 
> Unfortunately, current version of the patch doesn't pass the tests and fails on
> initdb. But maybe you already have this PoC you were talking about, that will
> also incorporate the feedback from this thread? For now I'll move it to the
> next CF.

Given that nothing has happened since, I'm going to close this CF entry as
returned with feedback.

Greetings,

Andres Freund


On Thu, Nov 29, 2018 at 3:44 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Aug 24, 2018 at 5:53 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> >
> > Given I've no feedback on this idea yet, I'll try to implement a PoC
> > patch for that.  It doesn't look to be difficult.  And we'll see how
> > does it work.
>
> Unfortunately, current version of the patch doesn't pass the tests and fails on
> initdb. But maybe you already have this PoC you were talking about, that will
> also incorporate the feedback from this thread? For now I'll move it to the
> next CF.

Finally, I managed to write a PoC.

Looking at the list of problems I enumerated in [1], this PoC is aimed at
problems 1 and 3.

> 1) Data corruption on file truncation error (explained in [1]).
> 2) Expensive scanning of the whole shared buffers before file truncation.
> 3) Cancel of read-only queries on standby even if hot_standby_feedback
> is on, caused by replication of AccessExclusiveLock.

Problem 2 is fairly independent and could be addressed later.

Basically, this patch does the following (a sketch of the key pieces is
given after the list):
1. Introduces a new flag, BM_DIRTY_BARRIER, which prevents a dirty buffer
from being written out.
2. Implements two-phase truncation of a relation's buffers.  The first
phase runs prior to file truncation and marks past-truncation-point dirty
buffers with BM_DIRTY_BARRIER.  The second phase runs after file truncation
and actually wipes out the past-truncation-point buffers.
3. If an exception happens during file truncation, the BM_DIRTY_BARRIER
flag is released from the buffers.  Thus, no data corruption should happen
here.  If the file truncation was partially complete, the file might be
re-extended by the write of a dirty buffer.  I'm not sure how likely that
is, but such an extension could lead to errors again; still, it shouldn't
cause data corruption.
4. Having too many buffers marked BM_DIRTY_BARRIER would paralyze the
buffer manager.  This is why we keep no more than NBuffers/2 marked as
BM_DIRTY_BARRIER.  If the limit is exceeded, dirty buffers are simply
written out during the first phase.
5. lazy_truncate_heap() now takes ExclusiveLock instead of
AccessExclusiveLock.  This part is not really complete.  At the very least,
we need to ensure that past-truncation-point reads, caused by read-only
queries running concurrently with the truncation, don't lead to real
errors.
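
As a rough illustration of points 1, 2 and 4 (not an excerpt from the
attached patch), the flag and its effect on the write path might look
roughly like this.  The bit position, the counter and the helper names are
assumptions made only for the sketch; in reality a free bit would have to
be found in the buffer state word in buf_internals.h.

/* Sketch only: placeholder bit; the real patch must carve out a free bit. */
#define BM_DIRTY_BARRIER        (1U << 31)

/* Number of buffers currently carrying the barrier flag (point 4). */
static int  n_barrier_buffers = 0;

/* Phase 1: called for each dirty past-truncation-point buffer found. */
static bool
mark_buffer_barrier(BufferDesc *buf)
{
    if (n_barrier_buffers >= NBuffers / 2)
        return false;       /* cap reached: the caller just writes it out */

    /* set BM_DIRTY_BARRIER in buf->state under the buffer header lock */
    n_barrier_buffers++;
    return true;
}

/* Write path: a buffer carrying the barrier flag must not be flushed. */
static bool
buffer_may_be_written(uint32 buf_state)
{
    return (buf_state & BM_DIRTY_BARRIER) == 0;
}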

Any thoughts?

1. https://www.postgresql.org/message-id/CAPpHfdtD3U2DpGZQJNe21s9s1s-Va7NRNcP1isvdCuJxzYypcg%40mail.gmail.com

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment