Thread: Non-reserved replication slots and slot advancing
Hi all, I have been chewing for the last couple of days on this email from Horiguchi-san: https://www.postgresql.org/message-id/20180622.163312.254556300.horiguchi.kyotaro@lab.ntt.co.jp As summarized, it is actually strange to be able to advance a slot which has a non-reserved restart_lsn. For example, take that which can happen on HEAD: =# select pg_create_physical_replication_slot('toto'); pg_create_physical_replication_slot ------------------------------------- (toto,) (1 row) =# select pg_replication_slot_advance('toto', '0/1'); pg_replication_slot_advance ----------------------------- (toto,0/1) (1 row) =# select slot_name, restart_lsn from pg_replication_slots ; slot_name | restart_lsn -----------+------------- toto | 0/1 (1 row) KeepLogSeg() is careful enough to not go past LSN positions which do not exist anymore in the backend, and a client trying to consume a LSN to a position which does not exist would get back an error to the backend mentioning that the wanted segment does not exist anymore. Hence it seems to me that being to advance a slot for an unreserved slot should result in an error. At least, that's easier to restrict things this way as it could still be possible to advance a slot up to a position where a WAL segment is still available on the backend. But I think that allowing such a thing would be full of race conditions with WAL recycling and I am not sure about the use case. I think that documenting the fast that restart_lsn is NULL for a slot which has not been reserved yet would be nice. Attached is an idea of patch, ideas are welcome. An open item has been added as well. Thanks, -- Michael
Attachment
Hi, On 2018-06-26 16:13:05 +0900, Michael Paquier wrote: > I have been chewing for the last couple of days on this email from > Horiguchi-san: > https://www.postgresql.org/message-id/20180622.163312.254556300.horiguchi.kyotaro@lab.ntt.co.jp > > As summarized, it is actually strange to be able to advance a slot which > has a non-reserved restart_lsn. For example, take that which can happen > on HEAD: > =# select pg_create_physical_replication_slot('toto'); > pg_create_physical_replication_slot > ------------------------------------- > (toto,) > (1 row) > =# select pg_replication_slot_advance('toto', '0/1'); > pg_replication_slot_advance > ----------------------------- > (toto,0/1) > (1 row) > =# select slot_name, restart_lsn from pg_replication_slots ; > slot_name | restart_lsn > -----------+------------- > toto | 0/1 > (1 row) I'm not clear to why this is a problem? Seems like either behaviour can be argued for. I don't really have an opinion either way. I'd just remove the item from the open items list, I don't think we need to hold up the release for it? Greetings, Andres Freund
On 2018-Jul-03, Andres Freund wrote: > I'm not clear to why this is a problem? Seems like either behaviour can > be argued for. I don't really have an opinion either way. I'd just > remove the item from the open items list, I don't think we need to hold > up the release for it? After reading this more carefully, isn't the problem that as soon as you get a slot into the 0/1 restart_lsn state, WAL recycling/deletion no longer happens? That does sound like a bad thing to me. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-07-03 13:23:50 -0400, Alvaro Herrera wrote: > On 2018-Jul-03, Andres Freund wrote: > > > I'm not clear to why this is a problem? Seems like either behaviour can > > be argued for. I don't really have an opinion either way. I'd just > > remove the item from the open items list, I don't think we need to hold > > up the release for it? > > After reading this more carefully, isn't the problem that as soon as you > get a slot into the 0/1 restart_lsn state, WAL recycling/deletion no > longer happens? That does sound like a bad thing to me. Fair enough, but that's what a plain slot allows you as well, pretty fundamentally, no? The precise point at which recycling will be blocked will differer, sure. Greetings, Andres Freund
On 2018-Jul-03, Andres Freund wrote: > On 2018-07-03 13:23:50 -0400, Alvaro Herrera wrote: > > On 2018-Jul-03, Andres Freund wrote: > > > > > I'm not clear to why this is a problem? Seems like either behaviour can > > > be argued for. I don't really have an opinion either way. I'd just > > > remove the item from the open items list, I don't think we need to hold > > > up the release for it? > > > > After reading this more carefully, isn't the problem that as soon as you > > get a slot into the 0/1 restart_lsn state, WAL recycling/deletion no > > longer happens? That does sound like a bad thing to me. > > Fair enough, but that's what a plain slot allows you as well, pretty > fundamentally, no? The precise point at which recycling will be blocked > will differer, sure. Yeah, well, I suppose that other mechanisms to use slots are less of a foot-gun -- by creating one their start_lsn is set to some reasonable value. With slot advancing, it seems easier to get into trouble. Of course, you can set the slot to a LSN that is valid now, and then not do anything with it, in which case you're also screwed. As I recall, this slot advancing business is new in pg11, and I think it makes sense to provide a decent API that prevents you from doing something extremely stupid. Getting this fixed is +0.2 from me -- I'm not really on the side of this being a severe bug as all that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 03, 2018 at 01:51:48PM -0400, Alvaro Herrera wrote: > Getting this fixed is +0.2 from me -- I'm not really on the side of this > being a severe bug as all that. This can also cause incorrect results for clients querying pg_replication_slots when measuring bloat in pg_wal/, hence as this thing is new I would think that restricting the API first, and then perhaps relaxing it with future improvements makes the most sense. -- Michael
Attachment
On Tue, Jul 03, 2018 at 01:51:48PM -0400, Alvaro Herrera wrote: > On 2018-Jul-03, Andres Freund wrote: >> Fair enough, but that's what a plain slot allows you as well, pretty >> fundamentally, no? The precise point at which recycling will be blocked >> will differer, sure. ReplicationSlotsComputeRequiredLSN() is careful enough to discard slots which have their restart_lsn set to InvalidXLogRecPtr, so they are not accounted within the minimum LSN calculated for segment retention. Any fake value added by a user advancing a non-reserved slot is. At the end, are their any objections into fixing the issue and tightening the advancing API? -- Michael
Attachment
On 2018-Jul-04, Michael Paquier wrote: > At the end, are their any objections into fixing the issue and > tightening the advancing API? None from me. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 04, 2018 at 09:57:31AM -0400, Alvaro Herrera wrote: > None from me. Thanks Alvaro. For now the patch uses the following error message: +SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error +ERROR: cannot move slot with non-reserved restart_lsn Mentioning directly the column name of pg_replication_slots is confusing I think. Here are some suggestions of perhaps better error messages: 1) cannot move unreserved slot. 2) cannot move slot which has never been reserved. Any better ideas? -- Michael
Attachment
On 2018-Jul-05, Michael Paquier wrote: > On Wed, Jul 04, 2018 at 09:57:31AM -0400, Alvaro Herrera wrote: > > None from me. > > Thanks Alvaro. For now the patch uses the following error message: > +SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error > +ERROR: cannot move slot with non-reserved restart_lsn > > Mentioning directly the column name of pg_replication_slots is confusing > I think. Here are some suggestions of perhaps better error messages: > 1) cannot move unreserved slot. > 2) cannot move slot which has never been reserved. Yeah, I don't like it very much. Let's avoid having an obscure column name in there. Do we use the term "reserved" anywhere else? It just doesn't click for me. Other than "All rights reserved", that is ... As for the patch itself: why is the problem that the slot is "not reserved" in the first place? I think what we should be actually checking is that the target LSN is within valid limits, ie. the end state of the slot after the operation, rather than the initial state of the slot before the operation. If we made this code check the end state, we could make the error message be something like "target LSN is not within the allocated range" or something like that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 05, 2018 at 12:35:29PM -0400, Alvaro Herrera wrote: > Do we use the term "reserved" anywhere else? It just doesn't click for > me. Other than "All rights reserved", that is ... This is used in a couple of places in the docs: $ git grep -i slot | grep reserved src/sgml/catalogs.sgml: if the <literal>LSN</literal> of this slot has never been reserved. src/sgml/func.sgml: replication slot be reserved immediately; otherwise src/sgml/protocol.sgml: Drops a replication slot, freeing any reserved server-side resources > As for the patch itself: why is the problem that the slot is "not > reserved" in the first place? I think what we should be actually > checking is that the target LSN is within valid limits, ie. the end > state of the slot after the operation, rather than the initial state of > the slot before the operation. > > If we made this code check the end state, we could make the error > message be something like "target LSN is not within the allocated range" > or something like that. The code relies on the LSN used within the slot to define if it can be advanced or not. /* * Check if the slot is not moving backwards. Physical slots rely simply * on restart_lsn as a minimum point, while logical slots have confirmed * consumption up to confirmed_lsn, meaning that in both cases data older * than that is not available anymore. */ if (OidIsValid(MyReplicationSlot->data.database)) minlsn = MyReplicationSlot->data.confirmed_flush; else minlsn = MyReplicationSlot->data.restart_lsn; And then is issues an error only if the LSN to move to is lower than the minimum found. if (moveto < minlsn) { ReplicationSlotRelease(); ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot move slot to %X/%X, minimum is %X/%X", (uint32) (moveto >> 32), (uint32) moveto, (uint32) (minlsn >> 32), (uint32) minlsn))); } For a non-reserved slot, there is nothing smaller than 0, so this error would not trigger, and even if we enforce it to trigger with minlsn == InvalidXLogRecPtr then it would make no sense. So the root of the problem is that there is no lower-bound which can be used as a comparison base. Well, logically that would be the LSN of the oldest segment still present in pg_wal, but if you begin to authorize that then you have race conditions with checkpoints running in parallel which could recycle the requested position. So it seems to me that ERRORing when there is no way to define correctly the bounds where the move can be done is the safest solution. -- Michael
Attachment
Hello. At Fri, 6 Jul 2018 08:16:08 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180705231608.GA2366@paquier.xyz> > On Thu, Jul 05, 2018 at 12:35:29PM -0400, Alvaro Herrera wrote: > > Do we use the term "reserved" anywhere else? It just doesn't click for > > me. Other than "All rights reserved", that is ... > > This is used in a couple of places in the docs: > $ git grep -i slot | grep reserved > src/sgml/catalogs.sgml: > if the <literal>LSN</literal> of this slot has never been reserved. > src/sgml/func.sgml: > replication slot be reserved immediately; otherwise > src/sgml/protocol.sgml: > Drops a replication slot, freeing any reserved server-side resources FWIW I saw the word first in the function definition. > Table 9.83. Replication SQL Functions > g_create_physical_replication_slot(slot_name name [, immediately_reserve boolean, temporary boolean]) > > As for the patch itself: why is the problem that the slot is "not > > reserved" in the first place? I think what we should be actually > > checking is that the target LSN is within valid limits, ie. the end > > state of the slot after the operation, rather than the initial state of > > the slot before the operation. > > > > If we made this code check the end state, we could make the error > > message be something like "target LSN is not within the allocated range" > > or something like that. > > The code relies on the LSN used within the slot to define if it can be > advanced or not. > > /* > * Check if the slot is not moving backwards. Physical slots rely simply > * on restart_lsn as a minimum point, while logical slots have confirmed > * consumption up to confirmed_lsn, meaning that in both cases data older > * than that is not available anymore. > */ > if (OidIsValid(MyReplicationSlot->data.database)) > minlsn = MyReplicationSlot->data.confirmed_flush; > else > minlsn = MyReplicationSlot->data.restart_lsn; > > And then is issues an error only if the LSN to move to is lower than the > minimum found. > if (moveto < minlsn) > { > ReplicationSlotRelease(); > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("cannot move slot to %X/%X, minimum is %X/%X", > (uint32) (moveto >> 32), (uint32) moveto, > (uint32) (minlsn >> 32), (uint32) minlsn))); > } > > For a non-reserved slot, there is nothing smaller than 0, so this error > would not trigger, and even if we enforce it to trigger with minlsn == > InvalidXLogRecPtr then it would make no sense. So the root of the > problem is that there is no lower-bound which can be used as a > comparison base. Well, logically that would be the LSN of the oldest > segment still present in pg_wal, but if you begin to authorize that then > you have race conditions with checkpoints running in parallel which > could recycle the requested position. So it seems to me that ERRORing > when there is no way to define correctly the bounds where the move can > be done is the safest solution. The only use case I can see of physical-slot advancing is allowing the master to remove retained WAL files to vacate WAL directory before reconnection when the "diconnected" standby can recover using master's archive. That has (almost) the same effect with removing then recreating the slot with non-resreved. When we want to preserve WAL files for a newly created standby, we would create a reserved slot or use persistent-slot option on taking a base backup. There's no point in advancing non-reserved slots in the case. In short, I don't see a point in advancing non-reserved physical slots. Addition to that, as for advancing to around the checkpoint redo location or even to the flush location, it can leave a broken (or desync) slot as Michael mentioned. Form the resasons above, I'd like to vote +1 for just ERRORing in the case. Physical replication protocol seems to have a window to cause the same problem but that needs a crash in a very narrow window, which doesn't seem to occur in reality. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Jul 06, 2018 at 01:07:47PM +0900, Kyotaro HORIGUCHI wrote: > Form the reasons above, I'd like to vote +1 for just ERRORing > in the case. Thanks for the lookup. Do you have a good idea for the error message to provide to users in this case? I would tend to think that "cannot move slot which has never been reserved" or "cannot move slot which has never been used", particularly the first one, are fine enough, but I am not stopped at one single idea in particular either. -- Michael
Attachment
At Fri, 6 Jul 2018 14:26:42 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180706052642.GB776@paquier.xyz> > On Fri, Jul 06, 2018 at 01:07:47PM +0900, Kyotaro HORIGUCHI wrote: > > Form the reasons above, I'd like to vote +1 for just ERRORing > > in the case. > > Thanks for the lookup. Do you have a good idea for the error message to > provide to users in this case? I would tend to think that "cannot move > slot which has never been reserved" or "cannot move slot which has never > been used", particularly the first one, are fine enough, but I am not > stopped at one single idea in particular either. Mmm. I feel so much that it's a kind of too much for me, but FWIW... I'm not so in favor of the word "reserve" in first place since it doesn't seem intuitive for me, but "active" is already used for the state where the connection with the peer is made. (The word "reserve" may be misused since in the source code "reserve" is used as "to reserve WAL", but used as "reserve a slot" in documentation.) "use", as you mentioned, is neutral but somewhat vague and what is worse a immediately-reserved slot can be advanced before the first use. Most straightforward wording would be: "ERROR: cannot advance/move slots with invalid restart_lsn" "HINT: It will have a valid value after the first connection." Another candidate from me is "initialize". This requires related documentation fix. (first attached). # I noticed that the documentation forgets that # pg_replication_slots now stores two LSNs. "ERROR: cannot advance/move uninitialized slots" regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index edc9be92a6..697146468c 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19186,7 +19186,7 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup()); <indexterm> <primary>pg_create_physical_replication_slot</primary> </indexterm> - <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type> <optional>,<parameter>immediately_reserve</parameter> <type>boolean</type>, <parameter>temporary</parameter> <type>boolean</type></optional>)</function></literal> + <literal><function>pg_create_physical_replication_slot(<parameter>slot_name</parameter> <type>name</type> <optional>,<parameter>initialize</parameter> <type>boolean</type>, <parameter>temporary</parameter> <type>boolean</type></optional>)</function></literal> </entry> <entry> (<parameter>slot_name</parameter> <type>name</type>, <parameter>lsn</parameter> <type>pg_lsn</type>) @@ -19194,9 +19194,11 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup()); <entry> Creates a new physical replication slot named <parameter>slot_name</parameter>. The optional second parameter, - when <literal>true</literal>, specifies that the <acronym>LSN</acronym> for this - replication slot be reserved immediately; otherwise - the <acronym>LSN</acronym> is reserved on first connection from a streaming + when <literal>true</literal>, specifies that the <literal>restart_lsn</literal> + for this replication slot be initialized immediately with REDO + <acronym>LSN</acronym> of the last checkpoint; otherwise it is + initialized on first connection from a streaming + replication client. Streaming changes from a physical slot is only possible with the streaming-replication protocol — see <xref linkend="protocol-replication"/>. The optional third diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 2806e1076c..703b61768b 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -43,7 +43,7 @@ Datum pg_create_physical_replication_slot(PG_FUNCTION_ARGS) { Name name = PG_GETARG_NAME(0); - bool immediately_reserve = PG_GETARG_BOOL(1); + bool initialize = PG_GETARG_BOOL(1); bool temporary = PG_GETARG_BOOL(2); Datum values[2]; bool nulls[2]; @@ -67,7 +67,7 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS) values[0] = NameGetDatum(&MyReplicationSlot->data.name); nulls[0] = false; - if (immediately_reserve) + if (initialize) { /* Reserve WAL as the user asked for it */ ReplicationSlotReserveWal();
On Fri, Jul 06, 2018 at 03:37:57PM +0900, Kyotaro HORIGUCHI wrote: > I'm not so in favor of the word "reserve" in first place since it > doesn't seem intuitive for me, but "active" is already used for > the state where the connection with the peer is made. (The word > "reserve" may be misused since in the source code "reserve" is > used as "to reserve WAL", but used as "reserve a slot" in > documentation.) That's the term used for now three releases, counting v11 in the pack, so I would not change that now. The concept of non-initialized slots is fuzzy as well as it could be attached to some other meta-data. So, chewing on all that, I would suggest the following error message as the attached patch and just move on: +ERROR: cannot move slot not reserving WAL -- Michael
Attachment
Hello. At Mon, 9 Jul 2018 14:18:51 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180709051851.GA30202@paquier.xyz> > On Fri, Jul 06, 2018 at 03:37:57PM +0900, Kyotaro HORIGUCHI wrote: > > I'm not so in favor of the word "reserve" in first place since it > > doesn't seem intuitive for me, but "active" is already used for > > the state where the connection with the peer is made. (The word > > "reserve" may be misused since in the source code "reserve" is > > used as "to reserve WAL", but used as "reserve a slot" in > > documentation.) > > That's the term used for now three releases, counting v11 in the pack, > so I would not change that now. The concept of non-initialized slots is > fuzzy as well as it could be attached to some other meta-data. > > So, chewing on all that, I would suggest the following error message as > the attached patch and just move on: > +ERROR: cannot move slot not reserving WAL I'm fine with this. Thank you. Looking the attched patch, I noticed that both "WAL" and "wal" are used in similar ERROR messages. Grepping the source tree showed me that it is always in upper case letters except in the case it is a part of other words like variable/column/function names or "walsender". This is the same with the word "lsn". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Jul 09, 2018 at 03:13:04PM +0900, Kyotaro HORIGUCHI wrote: > Looking the attached patch, I noticed that both "WAL" and "wal" > are used in similar ERROR messages. Grepping the source tree > showed me that it is always in upper case letters except in the > case it is a part of other words like variable/column/function > names or "walsender". This is the same with the word "lsn". Thanks for the lookup. I see. Indeed, let's fix at the same time the error message close by. xlog.c uses "WAL location (LSN)" for the same thing, so I am sticking with that as per the attached. I'll go commit that if there are no objections. If you see any others which you would like to fix, please feel free to send a patch. -- Michael
Attachment
Hello. At Mon, 9 Jul 2018 15:48:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180709064833.GB30202@paquier.xyz> > On Mon, Jul 09, 2018 at 03:13:04PM +0900, Kyotaro HORIGUCHI wrote: > > Looking the attached patch, I noticed that both "WAL" and "wal" > > are used in similar ERROR messages. Grepping the source tree > > showed me that it is always in upper case letters except in the > > case it is a part of other words like variable/column/function > > names or "walsender". This is the same with the word "lsn". > > Thanks for the lookup. > > I see. Indeed, let's fix at the same time the error message close by. > xlog.c uses "WAL location (LSN)" for the same thing, so I am sticking > with that as per the attached. I'll go commit that if there are no > objections. If you see any others which you would like to fix, please > feel free to send a patch. I like it than "WAL LSN"' so I'm fine with the new message. I don't find anything to fix in xlog.c for now. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2018-07-09 15:48:33 +0900, Michael Paquier wrote: > On Mon, Jul 09, 2018 at 03:13:04PM +0900, Kyotaro HORIGUCHI wrote: > > Looking the attached patch, I noticed that both "WAL" and "wal" > > are used in similar ERROR messages. Grepping the source tree > > showed me that it is always in upper case letters except in the > > case it is a part of other words like variable/column/function > > names or "walsender". This is the same with the word "lsn". > > Thanks for the lookup. > > I see. Indeed, let's fix at the same time the error message close by. > xlog.c uses "WAL location (LSN)" for the same thing, so I am sticking > with that as per the attached. I'll go commit that if there are no > objections. If you see any others which you would like to fix, please > feel free to send a patch. I object to this wording change - it doesn't seem to add anything but confusion. > diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c > index 2806e1076c..b55dadfe2f 100644 > --- a/src/backend/replication/slotfuncs.c > +++ b/src/backend/replication/slotfuncs.c > @@ -465,7 +465,7 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) > > if (XLogRecPtrIsInvalid(moveto)) > ereport(ERROR, > - (errmsg("invalid target wal lsn"))); > + (errmsg("invalid target WAL location (LSN)"))); and it seems to at least bend the rules of the style guide a bit. > /* Build a tuple descriptor for our result type */ > if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) > @@ -483,6 +483,15 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) > /* Acquire the slot so we "own" it */ > ReplicationSlotAcquire(NameStr(*slotname), true); > > + /* A slot whose restart_lsn has never been reserved cannot be advanced */ > + if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) > + { > + ReplicationSlotRelease(); > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot move slot not reserving WAL"))); > + } > + I don't like this error message. It's unclear what refers to exactly what. Nor is "move" a terminology used otherwise. How about: "cannot advance replication slot that has not previously reserved WAL" or something similar? Also, ERRCODE_FEATURE_NOT_SUPPORTED doesn't quite seem right. How about ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE? Greetings, Andres Freund
On 2018-Jul-09, Andres Freund wrote: > On 2018-07-09 15:48:33 +0900, Michael Paquier wrote: > > + if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) > > + { > > + ReplicationSlotRelease(); > > + ereport(ERROR, > > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > + errmsg("cannot move slot not reserving WAL"))); > > + } > > + > > I don't like this error message. It's unclear what refers to exactly > what. Nor is "move" a terminology used otherwise. How about: > "cannot advance replication slot that has not previously reserved WAL" > or something similar? > > Also, ERRCODE_FEATURE_NOT_SUPPORTED doesn't quite seem right. How about > ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE? +1 to both of Andres' suggestions. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 09, 2018 at 02:48:28PM -0400, Alvaro Herrera wrote: > On 2018-Jul-09, Andres Freund wrote: > > On 2018-07-09 15:48:33 +0900, Michael Paquier wrote: > > > + if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) > > > + { > > > + ReplicationSlotRelease(); > > > + ereport(ERROR, > > > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > > + errmsg("cannot move slot not reserving WAL"))); > > > + } > > > + > > > > I don't like this error message. It's unclear what refers to exactly > > what. Nor is "move" a terminology used otherwise. How about: > > "cannot advance replication slot that has not previously reserved WAL" > > or something similar? "move" is used in another error message ten lines below. > > > > Also, ERRCODE_FEATURE_NOT_SUPPORTED doesn't quite seem right. How about > > ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE? > > +1 to both of Andres' suggestions. Those indeed sound better. What do you think of the attached? -- Michael
Attachment
Hi, On 2018-07-10 09:54:28 +0900, Michael Paquier wrote: > > > Also, ERRCODE_FEATURE_NOT_SUPPORTED doesn't quite seem right. How about > > > ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE? > > > > +1 to both of Andres' suggestions. > > Those indeed sound better. What do you think of the attached? Looks generally good. One remark: > + /* A slot whose restart_lsn has never been reserved cannot be advanced */ > + if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) > + { > + ReplicationSlotRelease(); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot advance replication slot that has not previously reserved WAL"))); > + } Why is the ReplicationSlotRelease() needed here? Souldn't the error handling code do so automatically? Greetings, Andres Freund
On Tue, Jul 10, 2018 at 12:26:30AM -0700, Andres Freund wrote: > Why is the ReplicationSlotRelease() needed here? Souldn't the error > handling code do so automatically? Oh, indeed. I didn't know that PostgresMain was doing some cleanup here. There is a second place where this can be removed, which comes from the original commit which added the advance function. An updated version is attached. Do you have other comments? -- Michael
Attachment
On 2018-07-10 16:59:07 +0900, Michael Paquier wrote: > On Tue, Jul 10, 2018 at 12:26:30AM -0700, Andres Freund wrote: > > Why is the ReplicationSlotRelease() needed here? Souldn't the error > > handling code do so automatically? > > Oh, indeed. I didn't know that PostgresMain was doing some cleanup > here. There is a second place where this can be removed, which comes > from the original commit which added the advance function. Gna, it's almost like the original code wasn't properly reviewed. > An updated version is attached. Do you have other comments? Looks sane, without having tested it. > if (moveto < minlsn) > - { > - ReplicationSlotRelease(); > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > - errmsg("cannot move slot to %X/%X, minimum is %X/%X", > + errmsg("cannot advance replication slot to %X/%X, minimum is %X/%X", > (uint32) (moveto >> 32), (uint32) moveto, > (uint32) (minlsn >> 32), (uint32) minlsn))); > - } If you're touching this, I'd also change the errcode here. Greetings, Andres Freund
On Tue, Jul 10, 2018 at 01:16:30AM -0700, Andres Freund wrote: > On 2018-07-10 16:59:07 +0900, Michael Paquier wrote: > > if (moveto < minlsn) > > - { > > - ReplicationSlotRelease(); > > ereport(ERROR, > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > - errmsg("cannot move slot to %X/%X, minimum is %X/%X", > > + errmsg("cannot advance replication slot to %X/%X, minimum is %X/%X", > > (uint32) (moveto >> 32), (uint32) moveto, > > (uint32) (minlsn >> 32), (uint32) minlsn))); > > - } > > If you're touching this, I'd also change the errcode here. Yep, let's change that as well. If you want to look at that stuff more deeply, please feel free. Otherwise I could always push what I have now. -- Michael
Attachment
On 2018-Jul-10, Michael Paquier wrote: > Yep, let's change that as well. If you want to look at that stuff more > deeply, please feel free. Otherwise I could always push what I have > now. I say please push already. We can push more fixes later if they are needed. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 10, 2018 at 09:51:27AM -0400, Alvaro Herrera wrote: > On 2018-Jul-10, Michael Paquier wrote: > I say please push already. We can push more fixes later if they are > needed. OK, I have pushed what I have now, with all the suggestions from upthread included. -- Michael