Thread: Non-reserved replication slots and slot advancing

Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
Hi all,

I have been chewing for the last couple of days on this email from
Horiguchi-san:
https://www.postgresql.org/message-id/20180622.163312.254556300.horiguchi.kyotaro@lab.ntt.co.jp

As summarized, it is actually strange to be able to advance a slot which
has a non-reserved restart_lsn.  For example, take that which can happen
on HEAD:
=# select pg_create_physical_replication_slot('toto');
 pg_create_physical_replication_slot
-------------------------------------
 (toto,)
(1 row)
=# select pg_replication_slot_advance('toto', '0/1');
 pg_replication_slot_advance
-----------------------------
 (toto,0/1)
(1 row)
=# select slot_name, restart_lsn from pg_replication_slots ;
 slot_name | restart_lsn
-----------+-------------
 toto      | 0/1
(1 row)

KeepLogSeg() is careful enough to not go past LSN positions which do not
exist anymore in the backend, and a client trying to consume a LSN to a
position which does not exist would get back an error to the backend
mentioning that the wanted segment does not exist anymore.  Hence it
seems to me that being to advance a slot for an unreserved slot should
result in an error.  At least, that's easier to restrict things this
way as it could still be possible to advance a slot up to a position
where a WAL segment is still available on the backend.  But I think that
allowing such a thing would be full of race conditions with WAL
recycling and I am not sure about the use case.

I think that documenting the fast that restart_lsn is NULL for a slot
which has not been reserved yet would be nice.  Attached is an idea of
patch, ideas are welcome.  An open item has been added as well.

Thanks,
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Andres Freund
Date:
Hi,

On 2018-06-26 16:13:05 +0900, Michael Paquier wrote:
> I have been chewing for the last couple of days on this email from
> Horiguchi-san:
> https://www.postgresql.org/message-id/20180622.163312.254556300.horiguchi.kyotaro@lab.ntt.co.jp
> 
> As summarized, it is actually strange to be able to advance a slot which
> has a non-reserved restart_lsn.  For example, take that which can happen
> on HEAD:
> =# select pg_create_physical_replication_slot('toto');
>  pg_create_physical_replication_slot
> -------------------------------------
>  (toto,)
> (1 row)
> =# select pg_replication_slot_advance('toto', '0/1');
>  pg_replication_slot_advance
> -----------------------------
>  (toto,0/1)
> (1 row)
> =# select slot_name, restart_lsn from pg_replication_slots ;
>  slot_name | restart_lsn
> -----------+-------------
>  toto      | 0/1
> (1 row)

I'm not clear to why this is a problem? Seems like either behaviour can
be argued for. I don't really have an opinion either way. I'd just
remove the item from the open items list, I don't think we need to hold
up the release for it?

Greetings,

Andres Freund


Re: Non-reserved replication slots and slot advancing

From
Alvaro Herrera
Date:
On 2018-Jul-03, Andres Freund wrote:

> I'm not clear to why this is a problem? Seems like either behaviour can
> be argued for. I don't really have an opinion either way. I'd just
> remove the item from the open items list, I don't think we need to hold
> up the release for it?

After reading this more carefully, isn't the problem that as soon as you
get a slot into the 0/1 restart_lsn state, WAL recycling/deletion no
longer happens?  That does sound like a bad thing to me.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Non-reserved replication slots and slot advancing

From
Andres Freund
Date:
On 2018-07-03 13:23:50 -0400, Alvaro Herrera wrote:
> On 2018-Jul-03, Andres Freund wrote:
> 
> > I'm not clear to why this is a problem? Seems like either behaviour can
> > be argued for. I don't really have an opinion either way. I'd just
> > remove the item from the open items list, I don't think we need to hold
> > up the release for it?
> 
> After reading this more carefully, isn't the problem that as soon as you
> get a slot into the 0/1 restart_lsn state, WAL recycling/deletion no
> longer happens?  That does sound like a bad thing to me.

Fair enough, but that's what a plain slot allows you as well, pretty
fundamentally, no? The precise point at which recycling will be blocked
will differer, sure.

Greetings,

Andres Freund


Re: Non-reserved replication slots and slot advancing

From
Alvaro Herrera
Date:
On 2018-Jul-03, Andres Freund wrote:

> On 2018-07-03 13:23:50 -0400, Alvaro Herrera wrote:
> > On 2018-Jul-03, Andres Freund wrote:
> > 
> > > I'm not clear to why this is a problem? Seems like either behaviour can
> > > be argued for. I don't really have an opinion either way. I'd just
> > > remove the item from the open items list, I don't think we need to hold
> > > up the release for it?
> > 
> > After reading this more carefully, isn't the problem that as soon as you
> > get a slot into the 0/1 restart_lsn state, WAL recycling/deletion no
> > longer happens?  That does sound like a bad thing to me.
> 
> Fair enough, but that's what a plain slot allows you as well, pretty
> fundamentally, no? The precise point at which recycling will be blocked
> will differer, sure.

Yeah, well, I suppose that other mechanisms to use slots are less of a
foot-gun -- by creating one their start_lsn is set to some reasonable
value.  With slot advancing, it seems easier to get into trouble.  Of
course, you can set the slot to a LSN that is valid now, and then not do
anything with it, in which case you're also screwed.

As I recall, this slot advancing business is new in pg11, and I think it
makes sense to provide a decent API that prevents you from doing
something extremely stupid.

Getting this fixed is +0.2 from me -- I'm not really on the side of this
being a severe bug as all that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Tue, Jul 03, 2018 at 01:51:48PM -0400, Alvaro Herrera wrote:
> Getting this fixed is +0.2 from me -- I'm not really on the side of this
> being a severe bug as all that.

This can also cause incorrect results for clients querying
pg_replication_slots when measuring bloat in pg_wal/, hence as this
thing is new I would think that restricting the API first, and then
perhaps relaxing it with future improvements makes the most sense.
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Tue, Jul 03, 2018 at 01:51:48PM -0400, Alvaro Herrera wrote:
> On 2018-Jul-03, Andres Freund wrote:
>> Fair enough, but that's what a plain slot allows you as well, pretty
>> fundamentally, no? The precise point at which recycling will be blocked
>> will differer, sure.

ReplicationSlotsComputeRequiredLSN() is careful enough to discard slots
which have their restart_lsn set to InvalidXLogRecPtr, so they are not
accounted within the minimum LSN calculated for segment retention.  Any
fake value added by a user advancing a non-reserved slot is.

At the end, are their any objections into fixing the issue and
tightening the advancing API?
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Alvaro Herrera
Date:
On 2018-Jul-04, Michael Paquier wrote:

> At the end, are their any objections into fixing the issue and
> tightening the advancing API?

None from me.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Wed, Jul 04, 2018 at 09:57:31AM -0400, Alvaro Herrera wrote:
> None from me.

Thanks Alvaro.  For now the patch uses the following error message:
+SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error
+ERROR:  cannot move slot with non-reserved restart_lsn

Mentioning directly the column name of pg_replication_slots is confusing
I think.  Here are some suggestions of perhaps better error messages:
1) cannot move unreserved slot.
2) cannot move slot which has never been reserved.

Any better ideas?
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Alvaro Herrera
Date:
On 2018-Jul-05, Michael Paquier wrote:

> On Wed, Jul 04, 2018 at 09:57:31AM -0400, Alvaro Herrera wrote:
> > None from me.
> 
> Thanks Alvaro.  For now the patch uses the following error message:
> +SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error
> +ERROR:  cannot move slot with non-reserved restart_lsn
> 
> Mentioning directly the column name of pg_replication_slots is confusing
> I think.  Here are some suggestions of perhaps better error messages:
> 1) cannot move unreserved slot.
> 2) cannot move slot which has never been reserved.

Yeah, I don't like it very much.  Let's avoid having an obscure column
name in there.

Do we use the term "reserved" anywhere else?  It just doesn't click for
me.  Other than "All rights reserved", that is ...

As for the patch itself: why is the problem that the slot is "not
reserved" in the first place?  I think what we should be actually
checking is that the target LSN is within valid limits, ie. the end
state of the slot after the operation, rather than the initial state of
the slot before the operation.

If we made this code check the end state, we could make the error
message be something like "target LSN is not within the allocated range"
or something like that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Thu, Jul 05, 2018 at 12:35:29PM -0400, Alvaro Herrera wrote:
> Do we use the term "reserved" anywhere else?  It just doesn't click for
> me.  Other than "All rights reserved", that is ...

This is used in a couple of places in the docs:
$ git grep -i slot | grep reserved
src/sgml/catalogs.sgml:
if the <literal>LSN</literal> of this slot has never been reserved.
src/sgml/func.sgml:
replication slot be reserved immediately; otherwise
src/sgml/protocol.sgml:
Drops a replication slot, freeing any reserved server-side resources

> As for the patch itself: why is the problem that the slot is "not
> reserved" in the first place?  I think what we should be actually
> checking is that the target LSN is within valid limits, ie. the end
> state of the slot after the operation, rather than the initial state of
> the slot before the operation.
>
> If we made this code check the end state, we could make the error
> message be something like "target LSN is not within the allocated range"
> or something like that.

The code relies on the LSN used within the slot to define if it can be
advanced or not.

/*
 * Check if the slot is not moving backwards.  Physical slots rely simply
 * on restart_lsn as a minimum point, while logical slots have confirmed
 * consumption up to confirmed_lsn, meaning that in both cases data older
 * than that is not available anymore.
 */
if (OidIsValid(MyReplicationSlot->data.database))
    minlsn =  MyReplicationSlot->data.confirmed_flush;
else
    minlsn = MyReplicationSlot->data.restart_lsn;

And then is issues an error only if the LSN to move to is lower than the
minimum found.
if (moveto < minlsn)
{
    ReplicationSlotRelease();
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("cannot move slot to %X/%X, minimum is %X/%X",
                    (uint32) (moveto >> 32), (uint32) moveto,
                    (uint32) (minlsn >> 32), (uint32) minlsn)));
}

For a non-reserved slot, there is nothing smaller than 0, so this error
would not trigger, and even if we enforce it to trigger with minlsn ==
InvalidXLogRecPtr then it would make no sense.  So the root of the
problem is that there is no lower-bound which can be used as a
comparison base.  Well, logically that would be the LSN of the oldest
segment still present in pg_wal, but if you begin to authorize that then
you have race conditions with checkpoints running in parallel which
could recycle the requested position.  So it seems to me that ERRORing
when there is no way to define correctly the bounds where the move can
be done is the safest solution.
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Kyotaro HORIGUCHI
Date:
Hello.

At Fri, 6 Jul 2018 08:16:08 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180705231608.GA2366@paquier.xyz>
> On Thu, Jul 05, 2018 at 12:35:29PM -0400, Alvaro Herrera wrote:
> > Do we use the term "reserved" anywhere else?  It just doesn't click for
> > me.  Other than "All rights reserved", that is ...
> 
> This is used in a couple of places in the docs:
> $ git grep -i slot | grep reserved
> src/sgml/catalogs.sgml:
> if the <literal>LSN</literal> of this slot has never been reserved.
> src/sgml/func.sgml:
> replication slot be reserved immediately; otherwise
> src/sgml/protocol.sgml:
> Drops a replication slot, freeing any reserved server-side resources

FWIW I saw the word first in the function definition.

> Table 9.83. Replication SQL Functions
> g_create_physical_replication_slot(slot_name name [, immediately_reserve boolean, temporary boolean])

> > As for the patch itself: why is the problem that the slot is "not
> > reserved" in the first place?  I think what we should be actually
> > checking is that the target LSN is within valid limits, ie. the end
> > state of the slot after the operation, rather than the initial state of
> > the slot before the operation.
> > 
> > If we made this code check the end state, we could make the error
> > message be something like "target LSN is not within the allocated range"
> > or something like that.
> 
> The code relies on the LSN used within the slot to define if it can be
> advanced or not.
> 
> /*
>  * Check if the slot is not moving backwards.  Physical slots rely simply
>  * on restart_lsn as a minimum point, while logical slots have confirmed
>  * consumption up to confirmed_lsn, meaning that in both cases data older
>  * than that is not available anymore.
>  */
> if (OidIsValid(MyReplicationSlot->data.database))
>     minlsn =  MyReplicationSlot->data.confirmed_flush;
> else
>     minlsn = MyReplicationSlot->data.restart_lsn;
> 
> And then is issues an error only if the LSN to move to is lower than the
> minimum found.
> if (moveto < minlsn)
> {
>     ReplicationSlotRelease();
>     ereport(ERROR,
>             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>              errmsg("cannot move slot to %X/%X, minimum is %X/%X",
>                     (uint32) (moveto >> 32), (uint32) moveto,
>                     (uint32) (minlsn >> 32), (uint32) minlsn)));
> }
> 
> For a non-reserved slot, there is nothing smaller than 0, so this error
> would not trigger, and even if we enforce it to trigger with minlsn ==
> InvalidXLogRecPtr then it would make no sense.  So the root of the
> problem is that there is no lower-bound which can be used as a
> comparison base.  Well, logically that would be the LSN of the oldest
> segment still present in pg_wal, but if you begin to authorize that then
> you have race conditions with checkpoints running in parallel which
> could recycle the requested position.  So it seems to me that ERRORing
> when there is no way to define correctly the bounds where the move can
> be done is the safest solution.

The only use case I can see of physical-slot advancing is
allowing the master to remove retained WAL files to vacate WAL
directory before reconnection when the "diconnected" standby can
recover using master's archive.  That has (almost) the same
effect with removing then recreating the slot with non-resreved.

When we want to preserve WAL files for a newly created standby,
we would create a reserved slot or use persistent-slot option on
taking a base backup. There's no point in advancing non-reserved
slots in the case.

In short, I don't see a point in advancing non-reserved physical
slots.

Addition to that, as for advancing to around the checkpoint redo
location or even to the flush location, it can leave a broken (or
desync) slot as Michael mentioned.

Form the resasons above, I'd like to vote +1 for just ERRORing
in the case.


Physical replication protocol seems to have a window to cause the
same problem but that needs a crash in a very narrow window,
which doesn't seem to occur in reality.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Fri, Jul 06, 2018 at 01:07:47PM +0900, Kyotaro HORIGUCHI wrote:
> Form the reasons above, I'd like to vote +1 for just ERRORing
> in the case.

Thanks for the lookup.  Do you have a good idea for the error message to
provide to users in this case?  I would tend to think that "cannot move
slot which has never been reserved" or "cannot move slot which has never
been used", particularly the first one, are fine enough, but I am not
stopped at one single idea in particular either.
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Kyotaro HORIGUCHI
Date:
At Fri, 6 Jul 2018 14:26:42 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180706052642.GB776@paquier.xyz>
> On Fri, Jul 06, 2018 at 01:07:47PM +0900, Kyotaro HORIGUCHI wrote:
> > Form the reasons above, I'd like to vote +1 for just ERRORing
> > in the case.
> 
> Thanks for the lookup.  Do you have a good idea for the error message to
> provide to users in this case?  I would tend to think that "cannot move
> slot which has never been reserved" or "cannot move slot which has never
> been used", particularly the first one, are fine enough, but I am not
> stopped at one single idea in particular either.

Mmm. I feel so much that it's a kind of too much for me, but
FWIW...

I'm not so in favor of the word "reserve" in first place since it
doesn't seem intuitive for me, but "active" is already used for
the state where the connection with the peer is made. (The word
"reserve" may be misused since in the source code "reserve" is
used as "to reserve WAL", but used as "reserve a slot" in
documentation.)

"use", as you mentioned, is neutral but somewhat vague and what
is worse a immediately-reserved slot can be advanced before the
first use.

Most straightforward wording would be:

"ERROR: cannot advance/move slots with invalid restart_lsn"
"HINT: It will have a valid value after the first connection."

Another candidate from me is "initialize". This requires related
documentation fix. (first attached).
# I noticed that the documentation forgets that
# pg_replication_slots now stores two LSNs.

"ERROR: cannot advance/move uninitialized slots"

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edc9be92a6..697146468c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19186,7 +19186,7 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
         <indexterm>
          <primary>pg_create_physical_replication_slot</primary>
         </indexterm>
-        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type>
<optional>,<parameter>immediately_reserve</parameter> <type>boolean</type>, <parameter>temporary</parameter>
<type>boolean</type></optional>)</function></literal>
+        <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type>
<optional>,<parameter>initialize</parameter> <type>boolean</type>, <parameter>temporary</parameter>
<type>boolean</type></optional>)</function></literal>
        </entry>
        <entry>
         (<parameter>slot_name</parameter> <type>name</type>, <parameter>lsn</parameter> <type>pg_lsn</type>)
@@ -19194,9 +19194,11 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
        <entry>
         Creates a new physical replication slot named
         <parameter>slot_name</parameter>. The optional second parameter,
-        when <literal>true</literal>, specifies that the <acronym>LSN</acronym> for this
-        replication slot be reserved immediately; otherwise
-        the <acronym>LSN</acronym> is reserved on first connection from a streaming
+        when <literal>true</literal>, specifies that the <literal>restart_lsn</literal>
+        for this replication slot be initialized immediately with REDO
+        <acronym>LSN</acronym> of the last checkpoint; otherwise it is
+        initialized on first connection from a streaming
+
         replication client. Streaming changes from a physical slot is only
         possible with the streaming-replication protocol —
         see <xref linkend="protocol-replication"/>. The optional third
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2806e1076c..703b61768b 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -43,7 +43,7 @@ Datum
 pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 {
     Name        name = PG_GETARG_NAME(0);
-    bool        immediately_reserve = PG_GETARG_BOOL(1);
+    bool        initialize = PG_GETARG_BOOL(1);
     bool        temporary = PG_GETARG_BOOL(2);
     Datum        values[2];
     bool        nulls[2];
@@ -67,7 +67,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
     values[0] = NameGetDatum(&MyReplicationSlot->data.name);
     nulls[0] = false;
 
-    if (immediately_reserve)
+    if (initialize)
     {
         /* Reserve WAL as the user asked for it */
         ReplicationSlotReserveWal();

Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Fri, Jul 06, 2018 at 03:37:57PM +0900, Kyotaro HORIGUCHI wrote:
> I'm not so in favor of the word "reserve" in first place since it
> doesn't seem intuitive for me, but "active" is already used for
> the state where the connection with the peer is made. (The word
> "reserve" may be misused since in the source code "reserve" is
> used as "to reserve WAL", but used as "reserve a slot" in
> documentation.)

That's the term used for now three releases, counting v11 in the pack,
so I would not change that now.  The concept of non-initialized slots is
fuzzy as well as it could be attached to some other meta-data.

So, chewing on all that, I would suggest the following error message as
the attached patch and just move on:
+ERROR:  cannot move slot not reserving WAL
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Kyotaro HORIGUCHI
Date:
Hello.

At Mon, 9 Jul 2018 14:18:51 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180709051851.GA30202@paquier.xyz>
> On Fri, Jul 06, 2018 at 03:37:57PM +0900, Kyotaro HORIGUCHI wrote:
> > I'm not so in favor of the word "reserve" in first place since it
> > doesn't seem intuitive for me, but "active" is already used for
> > the state where the connection with the peer is made. (The word
> > "reserve" may be misused since in the source code "reserve" is
> > used as "to reserve WAL", but used as "reserve a slot" in
> > documentation.)
> 
> That's the term used for now three releases, counting v11 in the pack,
> so I would not change that now.  The concept of non-initialized slots is
> fuzzy as well as it could be attached to some other meta-data.
> 
> So, chewing on all that, I would suggest the following error message as
> the attached patch and just move on:
> +ERROR:  cannot move slot not reserving WAL

I'm fine with this. Thank you.

Looking the attched patch, I noticed that both "WAL" and "wal"
are used in similar ERROR messages. Grepping the source tree
showed me that it is always in upper case letters except in the
case it is a part of other words like variable/column/function
names or "walsender". This is the same with the word "lsn".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Mon, Jul 09, 2018 at 03:13:04PM +0900, Kyotaro HORIGUCHI wrote:
> Looking the attached patch, I noticed that both "WAL" and "wal"
> are used in similar ERROR messages. Grepping the source tree
> showed me that it is always in upper case letters except in the
> case it is a part of other words like variable/column/function
> names or "walsender". This is the same with the word "lsn".

Thanks for the lookup.

I see.  Indeed, let's fix at the same time the error message close by.
xlog.c uses "WAL location (LSN)" for the same thing, so I am sticking
with that as per the attached.  I'll go commit that if there are no
objections.  If you see any others which you would like to fix, please
feel free to send a patch.
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Kyotaro HORIGUCHI
Date:
Hello.

At Mon, 9 Jul 2018 15:48:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180709064833.GB30202@paquier.xyz>
> On Mon, Jul 09, 2018 at 03:13:04PM +0900, Kyotaro HORIGUCHI wrote:
> > Looking the attached patch, I noticed that both "WAL" and "wal"
> > are used in similar ERROR messages. Grepping the source tree
> > showed me that it is always in upper case letters except in the
> > case it is a part of other words like variable/column/function
> > names or "walsender". This is the same with the word "lsn".
> 
> Thanks for the lookup.
> 
> I see.  Indeed, let's fix at the same time the error message close by.
> xlog.c uses "WAL location (LSN)" for the same thing, so I am sticking
> with that as per the attached.  I'll go commit that if there are no
> objections.  If you see any others which you would like to fix, please
> feel free to send a patch.

I like it than "WAL LSN"' so I'm fine with the new message. I
don't find anything to fix in xlog.c for now.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Non-reserved replication slots and slot advancing

From
Andres Freund
Date:
On 2018-07-09 15:48:33 +0900, Michael Paquier wrote:
> On Mon, Jul 09, 2018 at 03:13:04PM +0900, Kyotaro HORIGUCHI wrote:
> > Looking the attached patch, I noticed that both "WAL" and "wal"
> > are used in similar ERROR messages. Grepping the source tree
> > showed me that it is always in upper case letters except in the
> > case it is a part of other words like variable/column/function
> > names or "walsender". This is the same with the word "lsn".
> 
> Thanks for the lookup.
> 
> I see.  Indeed, let's fix at the same time the error message close by.
> xlog.c uses "WAL location (LSN)" for the same thing, so I am sticking
> with that as per the attached.  I'll go commit that if there are no
> objections.  If you see any others which you would like to fix, please
> feel free to send a patch.

I object to this wording change - it doesn't seem to add anything but
confusion.

> diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
> index 2806e1076c..b55dadfe2f 100644
> --- a/src/backend/replication/slotfuncs.c
> +++ b/src/backend/replication/slotfuncs.c
> @@ -465,7 +465,7 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
>  
>      if (XLogRecPtrIsInvalid(moveto))
>          ereport(ERROR,
> -                (errmsg("invalid target wal lsn")));
> +                (errmsg("invalid target WAL location (LSN)")));

and it seems to at least bend the rules of the style guide a bit.


>      /* Build a tuple descriptor for our result type */
>      if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
> @@ -483,6 +483,15 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
>      /* Acquire the slot so we "own" it */
>      ReplicationSlotAcquire(NameStr(*slotname), true);
>  
> +    /* A slot whose restart_lsn has never been reserved cannot be advanced */
> +    if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
> +    {
> +        ReplicationSlotRelease();
> +        ereport(ERROR,
> +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                 errmsg("cannot move slot not reserving WAL")));
> +    }
> +

I don't like this error message. It's unclear what refers to exactly
what.  Nor is "move" a terminology used otherwise.  How about:
"cannot advance replication slot that has not previously reserved WAL"
or something similar?

Also, ERRCODE_FEATURE_NOT_SUPPORTED doesn't quite seem right. How about
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE?

Greetings,

Andres Freund


Re: Non-reserved replication slots and slot advancing

From
Alvaro Herrera
Date:
On 2018-Jul-09, Andres Freund wrote:

> On 2018-07-09 15:48:33 +0900, Michael Paquier wrote:

> > +    if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
> > +    {
> > +        ReplicationSlotRelease();
> > +        ereport(ERROR,
> > +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > +                 errmsg("cannot move slot not reserving WAL")));
> > +    }
> > +
> 
> I don't like this error message. It's unclear what refers to exactly
> what.  Nor is "move" a terminology used otherwise.  How about:
> "cannot advance replication slot that has not previously reserved WAL"
> or something similar?
> 
> Also, ERRCODE_FEATURE_NOT_SUPPORTED doesn't quite seem right. How about
> ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE?

+1 to both of Andres' suggestions.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Mon, Jul 09, 2018 at 02:48:28PM -0400, Alvaro Herrera wrote:
> On 2018-Jul-09, Andres Freund wrote:
> > On 2018-07-09 15:48:33 +0900, Michael Paquier wrote:
> > > +    if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
> > > +    {
> > > +        ReplicationSlotRelease();
> > > +        ereport(ERROR,
> > > +                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > > +                 errmsg("cannot move slot not reserving WAL")));
> > > +    }
> > > +
> >
> > I don't like this error message. It's unclear what refers to exactly
> > what.  Nor is "move" a terminology used otherwise.  How about:
> > "cannot advance replication slot that has not previously reserved WAL"
> > or something similar?

"move" is used in another error message ten lines below.

> >
> > Also, ERRCODE_FEATURE_NOT_SUPPORTED doesn't quite seem right. How about
> > ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE?
>
> +1 to both of Andres' suggestions.

Those indeed sound better.  What do you think of the attached?
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Andres Freund
Date:
Hi,

On 2018-07-10 09:54:28 +0900, Michael Paquier wrote:
> > > Also, ERRCODE_FEATURE_NOT_SUPPORTED doesn't quite seem right. How about
> > > ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE?
> > 
> > +1 to both of Andres' suggestions.
> 
> Those indeed sound better.  What do you think of the attached?

Looks generally good. One remark:

> +    /* A slot whose restart_lsn has never been reserved cannot be advanced */
> +    if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
> +    {
> +        ReplicationSlotRelease();
> +        ereport(ERROR,
> +                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                 errmsg("cannot advance replication slot that has not previously reserved WAL")));
> +    }

Why is the ReplicationSlotRelease() needed here? Souldn't the error
handling code do so automatically?

Greetings,

Andres Freund


Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Tue, Jul 10, 2018 at 12:26:30AM -0700, Andres Freund wrote:
> Why is the ReplicationSlotRelease() needed here? Souldn't the error
> handling code do so automatically?

Oh, indeed.  I didn't know that PostgresMain was doing some cleanup
here.  There is a second place where this can be removed, which comes
from the original commit which added the advance function.  An updated
version is attached.  Do you have other comments?
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Andres Freund
Date:
On 2018-07-10 16:59:07 +0900, Michael Paquier wrote:
> On Tue, Jul 10, 2018 at 12:26:30AM -0700, Andres Freund wrote:
> > Why is the ReplicationSlotRelease() needed here? Souldn't the error
> > handling code do so automatically?
> 
> Oh, indeed.  I didn't know that PostgresMain was doing some cleanup
> here.  There is a second place where this can be removed, which comes
> from the original commit which added the advance function.

Gna, it's almost like the original code wasn't properly reviewed.


> An updated version is attached.  Do you have other comments?

Looks sane, without having tested it.


>      if (moveto < minlsn)
> -    {
> -        ReplicationSlotRelease();
>          ereport(ERROR,
>                  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -                 errmsg("cannot move slot to %X/%X, minimum is %X/%X",
> +                 errmsg("cannot advance replication slot to %X/%X, minimum is %X/%X",
>                          (uint32) (moveto >> 32), (uint32) moveto,
>                          (uint32) (minlsn >> 32), (uint32) minlsn)));
> -    }

If you're touching this, I'd also change the errcode here.


Greetings,

Andres Freund


Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Tue, Jul 10, 2018 at 01:16:30AM -0700, Andres Freund wrote:
> On 2018-07-10 16:59:07 +0900, Michael Paquier wrote:
> >      if (moveto < minlsn)
> > -    {
> > -        ReplicationSlotRelease();
> >          ereport(ERROR,
> >                  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > -                 errmsg("cannot move slot to %X/%X, minimum is %X/%X",
> > +                 errmsg("cannot advance replication slot to %X/%X, minimum is %X/%X",
> >                          (uint32) (moveto >> 32), (uint32) moveto,
> >                          (uint32) (minlsn >> 32), (uint32) minlsn)));
> > -    }
>
> If you're touching this, I'd also change the errcode here.

Yep, let's change that as well.  If you want to look at that stuff more
deeply, please feel free.  Otherwise I could always push what I have
now.
--
Michael

Attachment

Re: Non-reserved replication slots and slot advancing

From
Alvaro Herrera
Date:
On 2018-Jul-10, Michael Paquier wrote:

> Yep, let's change that as well.  If you want to look at that stuff more
> deeply, please feel free.  Otherwise I could always push what I have
> now.

I say please push already.  We can push more fixes later if they are
needed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Non-reserved replication slots and slot advancing

From
Michael Paquier
Date:
On Tue, Jul 10, 2018 at 09:51:27AM -0400, Alvaro Herrera wrote:
> On 2018-Jul-10, Michael Paquier wrote:
> I say please push already.  We can push more fixes later if they are
> needed.

OK, I have pushed what I have now, with all the suggestions from
upthread included.
--
Michael

Attachment