Thread: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

On 24.03.22 20:32, Robert Haas wrote:
> Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.

This patch changed the delayChkpt field of struct PGPROC from bool to 
int.  Back-porting this change could be considered an API breaking 
change for extensions using this field.

I'm not certain about padding behavior of compilers in general (or 
standards requirements around that), but at least on my machine, it 
seems sizeof(PGPROC) did not change, so padding led to subsequent fields 
still having the same offset.

Nonetheless, the meaning of the field itself changed.  And the 
additional assert now also triggers for the following pseudo-code of the 
extension I'm concerned about:

     /*
      * Prevent checkpoints being emitted in between additional
      * information in the logical message and the following
      * prepare record.
      */
     MyProc->delayChkpt = true;

     LogLogicalMessage(...);

     /* Note that this will also reset the delayChkpt flag. */
     PrepareTransaction(...);


Now, I'm well aware this is not an official API, it just happens to be 
accessible for extensions.  So I guess the underlying question is:  What 
can extension developers expect?  Which parts are okay to change even in 
stable branches and which can be relied upon to remain stable?

And for this specific case: Is it worth reverting this change and 
applying a fully backwards compatible fix, instead?

Regards

Markus Wanner



On Tue, Apr 5, 2022 at 9:02 AM Markus Wanner
<markus.wanner@enterprisedb.com> wrote:
> And for this specific case: Is it worth reverting this change and
> applying a fully backwards compatible fix, instead?

I think it's normally our policy to avoid changing definitions of
accessible structs in back branches, except that we allow ourselves
the indulgence of adding new members at the end or in padding space.
So what would probably be best is if, in the back-branches, we changed
"delayChkpt" back to a boolean, renamed it to delayChkptStart, and
added a separate Boolean called delayChkptEnd. Maybe that could be
added just after statusFlags, where I think it would fall into padding
space.

I think as the person who committed that patch I'm on the hook to fix
this if nobody else would like to do it, but let me ask whether
Kyotaro Horiguchi would like to propose a patch, since the original
patch did, and/or whether you would like to propose a patch, as the
person reporting the issue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Apr 5, 2022 at 9:02 AM Markus Wanner
> <markus.wanner@enterprisedb.com> wrote:
>> And for this specific case: Is it worth reverting this change and
>> applying a fully backwards compatible fix, instead?

> I think it's normally our policy to avoid changing definitions of
> accessible structs in back branches, except that we allow ourselves
> the indulgence of adding new members at the end or in padding space.
> So what would probably be best is if, in the back-branches, we changed
> "delayChkpt" back to a boolean, renamed it to delayChkptStart, and
> added a separate Boolean called delayChkptEnd. Maybe that could be
> added just after statusFlags, where I think it would fall into padding
> space.

Renaming it would constitute an API break, which is if anything worse
than an ABI break.

While we're complaining at you, let me point out that changing a field's
content and semantics while not changing its name is a time bomb waiting
to break any third-party code that looks at (or modifies...) the field.

What I think you need to do is:

1. In the back branches, revert delayChkpt to its previous type and
semantics.  Squeeze a separate delayChkptEnd bool in somewhere
(you can't change the struct size either ...).

2. In HEAD, rename the field to something like delayChkptFlags,
to ensure that any code touching it has to be inspected and updated.

In other words, this is already an API break in HEAD, and that's
fine, but it didn't break it hard enough to draw anyone's attention,
which is not fine.

            regards, tom lane



On Tue, Apr 5, 2022 at 10:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Renaming it would constitute an API break, which is if anything worse
> than an ABI break.

I don't think so, because an API break will cause a compilation
failure, which an extension author can easily fix.

> While we're complaining at you, let me point out that changing a field's
> content and semantics while not changing its name is a time bomb waiting
> to break any third-party code that looks at (or modifies...) the field.
>
> What I think you need to do is:
>
> 1. In the back branches, revert delayChkpt to its previous type and
> semantics.  Squeeze a separate delayChkptEnd bool in somewhere
> (you can't change the struct size either ...).
>
> 2. In HEAD, rename the field to something like delayChkptFlags,
> to ensure that any code touching it has to be inspected and updated.

Well, we can do it that way, I suppose.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Apr 5, 2022 at 10:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Renaming it would constitute an API break, which is if anything worse
>> than an ABI break.

> I don't think so, because an API break will cause a compilation
> failure, which an extension author can easily fix.

My point is that we want that to happen in HEAD, but it's not okay
for it to happen in a minor release of a stable branch.

            regards, tom lane



On Tue, Apr 5, 2022 at 10:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Apr 5, 2022 at 10:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Renaming it would constitute an API break, which is if anything worse
> >> than an ABI break.
>
> > I don't think so, because an API break will cause a compilation
> > failure, which an extension author can easily fix.
>
> My point is that we want that to happen in HEAD, but it's not okay
> for it to happen in a minor release of a stable branch.

I understand, but I am not sure that I agree. I think that if an
extension stops compiling against a back-branch, someone will notice
the next time they try to compile it and will fix it. Maybe that's not
amazing, but I don't think it's a huge deal either. On the other hand,
if existing builds that someone has already shipped stop working with
a new server release, that's a much larger issue. The extension
packager can't go back and retroactively add a dependency on the
server version to the already-shipped package. A new package can be
shipped and specify a minimum minor version with which it will work,
but an old package is what it is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Apr 5, 2022 at 10:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> My point is that we want that to happen in HEAD, but it's not okay
>> for it to happen in a minor release of a stable branch.

> I understand, but I am not sure that I agree. I think that if an
> extension stops compiling against a back-branch, someone will notice
> the next time they try to compile it and will fix it. Maybe that's not
> amazing, but I don't think it's a huge deal either.

Well, perhaps it's not the end of the world, but it's still a large
PITA for the maintainer of such an extension.  They can't "just fix it"
because some percentage of their userbase will still need to compile
against older minor releases.  Nor have you provided any way to handle
that requirement via conditional compilation.

            regards, tom lane



On Tue, Apr 05, 2022 at 03:16:20PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Apr 5, 2022 at 10:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> My point is that we want that to happen in HEAD, but it's not okay
>>> for it to happen in a minor release of a stable branch.
>
>> I understand, but I am not sure that I agree. I think that if an
>> extension stops compiling against a back-branch, someone will notice
>> the next time they try to compile it and will fix it. Maybe that's not
>> amazing, but I don't think it's a huge deal either.

I agree with Tom's argument.  The internals of this structure should
not have changed in a stable branch.

> Well, perhaps it's not the end of the world, but it's still a large
> PITA for the maintainer of such an extension.  They can't "just fix it"
> because some percentage of their userbase will still need to compile
> against older minor releases.  Nor have you provided any way to handle
> that requirement via conditional compilation.

For example, I recall that some external extensions make use of
sizeof(PGPROC) for their own business.  Isn't 412ad7a going to be a
problem to change this structure's internals for already-compiled code
on stable branches?
--
Michael

Attachment
On Thu, Apr 7, 2022 at 2:28 AM Michael Paquier <michael@paquier.xyz> wrote:
> > Well, perhaps it's not the end of the world, but it's still a large
> > PITA for the maintainer of such an extension.  They can't "just fix it"
> > because some percentage of their userbase will still need to compile
> > against older minor releases.  Nor have you provided any way to handle
> > that requirement via conditional compilation.
>
> For example, I recall that some external extensions make use of
> sizeof(PGPROC) for their own business.  Isn't 412ad7a going to be a
> problem to change this structure's internals for already-compiled code
> on stable branches?

I don't think that commit changed sizeof(PGPROC), but it did affect
the position of the delayChkpt and statusFlags members within the
struct, which is what we now need to fix. Since I don't hear anyone
else volunteering to take care of that, I'll go work on it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Tue, Apr 5, 2022 at 10:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What I think you need to do is:
>
> 1. In the back branches, revert delayChkpt to its previous type and
> semantics.  Squeeze a separate delayChkptEnd bool in somewhere
> (you can't change the struct size either ...).
>
> 2. In HEAD, rename the field to something like delayChkptFlags,
> to ensure that any code touching it has to be inspected and updated.

Here are patches for master and v14 to do things this way. Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment
Robert Haas <robertmhaas@gmail.com> writes:
> Here are patches for master and v14 to do things this way. Comments?

WFM.

            regards, tom lane



> On 7. 4. 2022, at 17:19, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Apr 5, 2022 at 10:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What I think you need to do is:
>>
>> 1. In the back branches, revert delayChkpt to its previous type and
>> semantics.  Squeeze a separate delayChkptEnd bool in somewhere
>> (you can't change the struct size either ...).
>>
>> 2. In HEAD, rename the field to something like delayChkptFlags,
>> to ensure that any code touching it has to be inspected and updated.
>
> Here are patches for master and v14 to do things this way. Comments?


Yeah I think this should do it (compilers should warn on master even without the rename, but who notices that right? :)
)

Thanks,
Petr




On Thu, Apr 07, 2022 at 11:19:15AM -0400, Robert Haas wrote:
> Here are patches for master and v14 to do things this way. Comments?

Thanks for the patches.  They look correct.  For ~14, I'd rather avoid
the code duplication done by GetVirtualXIDsDelayingChkptEnd() and
HaveVirtualXIDsDelayingChkpt() that could be avoided with an extra
bool argument to the existing routine.  The same kind of duplication
happens with GetVirtualXIDsDelayingChkpt() and
GetVirtualXIDsDelayingChkptEnd().
--
Michael

Attachment
Michael Paquier <michael@paquier.xyz> writes:
> On Thu, Apr 07, 2022 at 11:19:15AM -0400, Robert Haas wrote:
>> Here are patches for master and v14 to do things this way. Comments?

> Thanks for the patches.  They look correct.  For ~14, I'd rather avoid
> the code duplication done by GetVirtualXIDsDelayingChkptEnd() and
> HaveVirtualXIDsDelayingChkpt() that could be avoided with an extra
> bool argument to the existing routine.

Isn't adding another argument an API break?  (If there's any outside
code calling GetVirtualXIDsDelayingChkpt, which it seems like there
might be.)

            regards, tom lane



Re: API stability

From
Kyotaro Horiguchi
Date:
At Fri, 8 Apr 2022 08:47:42 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Thu, Apr 07, 2022 at 11:19:15AM -0400, Robert Haas wrote:
> > Here are patches for master and v14 to do things this way. Comments?
> 
> Thanks for the patches.  They look correct.  For ~14, I'd rather avoid
> the code duplication done by GetVirtualXIDsDelayingChkptEnd() and
> HaveVirtualXIDsDelayingChkpt() that could be avoided with an extra
> bool argument to the existing routine.  The same kind of duplication
> happens with GetVirtualXIDsDelayingChkpt() and
> GetVirtualXIDsDelayingChkptEnd().

FWIW, it is done in [1].

[1] https://www.postgresql.org/message-id/20220406.164521.17171257901083417.horikyota.ntt%40gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: API stability

From
Kyotaro Horiguchi
Date:
(Mmm. My mailer automatically teared off the [was: ..] part from the
subject..)

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Thu, Apr 7, 2022 at 7:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Thu, Apr 07, 2022 at 11:19:15AM -0400, Robert Haas wrote:
> >> Here are patches for master and v14 to do things this way. Comments?
>
> > Thanks for the patches.  They look correct.  For ~14, I'd rather avoid
> > the code duplication done by GetVirtualXIDsDelayingChkptEnd() and
> > HaveVirtualXIDsDelayingChkpt() that could be avoided with an extra
> > bool argument to the existing routine.
>
> Isn't adding another argument an API break?  (If there's any outside
> code calling GetVirtualXIDsDelayingChkpt, which it seems like there
> might be.)

Yeah, that's exactly why I didn't do what Michael proposes. If we're
going to go to this trouble to avoid changing the layout of a PGPROC,
we must be doing that on the theory that extension code cares about
delayChkpt. And if that is so, it seems reasonable to suppose that it
might also want to call the associated functions.

Honestly, I wouldn't have thought that this mattered, because I
wouldn't have guessed that any non-core code cared about delayChkpt.
But I would have been wrong.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Thu, Apr 07, 2022 at 10:19:35PM -0400, Robert Haas wrote:
> Yeah, that's exactly why I didn't do what Michael proposes. If we're
> going to go to this trouble to avoid changing the layout of a PGPROC,
> we must be doing that on the theory that extension code cares about
> delayChkpt. And if that is so, it seems reasonable to suppose that it
> might also want to call the associated functions.

Compatibility does not strike me as a problem with two static inline
functions used as wrappers of their common logic.

> Honestly, I wouldn't have thought that this mattered, because I
> wouldn't have guessed that any non-core code cared about delayChkpt.
> But I would have been wrong.

That's a minor point.  If you wish to keep this code as you are
proposing, that's fine as well by me.
--
Michael

Attachment
On Fri, 2022-04-08 at 08:47 +0900, Michael Paquier wrote:
> On Thu, Apr 07, 2022 at 11:19:15AM -0400, Robert Haas wrote:
> > Here are patches for master and v14 to do things this way.
> > Comments?
> 
> Thanks for the patches.  They look correct.

+1, looks good to me and addresses my specific original concern.

> For ~14, I'd rather avoid
> the code duplication done by GetVirtualXIDsDelayingChkptEnd() and
> HaveVirtualXIDsDelayingChkpt() that could be avoided with an extra
> bool argument to the existing routine.  The same kind of duplication
> happens with GetVirtualXIDsDelayingChkpt() and
> GetVirtualXIDsDelayingChkptEnd().

I agree with Michael, it would be nice to not duplicate the code, but
use a common underlying method.  A modified patch is attached.

Best Regards

Markus


Attachment
On Fri, Apr 8, 2022 at 4:47 AM Markus Wanner
<markus.wanner@enterprisedb.com> wrote:
> I agree with Michael, it would be nice to not duplicate the code, but
> use a common underlying method.  A modified patch is attached.

I don't think this is better, but I don't think it's worth arguing
about, either, so I'll do it this way if nobody objects.

Meanwhile, I've committed the patch for master to master.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Fri, Apr 8, 2022 at 11:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 8, 2022 at 4:47 AM Markus Wanner
> <markus.wanner@enterprisedb.com> wrote:
> > I agree with Michael, it would be nice to not duplicate the code, but
> > use a common underlying method.  A modified patch is attached.
>
> I don't think this is better, but I don't think it's worth arguing
> about, either, so I'll do it this way if nobody objects.
>
> Meanwhile, I've committed the patch for master to master.

Well, I've just realized that Kyotaro Horiguchi volunteered to fix
this on an email thread I did not see because of the way Gmail breaks
the thread if you change the subject line. And he developed a very
similar patch to what we have here. I'm going to use this one as the
basis for going forward because I've already studied it in detail and
it's less work for me to stick with what I know than to go study
something else. But, he also noticed something which we didn't notice
here, which is that before v13, the commit in question actually
changed the size of PGXACT, which is really quite bad -- it needs to
be 12 bytes for performance reasons. And there's no spare bytes
available, so I think we should follow one of the suggestions that he
had over in that email thread, and put delayChkptEnd in PGPROC even
though delayChkpt is in PGXACT.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, 2022-04-11 at 15:21 -0400, Robert Haas wrote:
> ... before v13, the commit in question actually
> changed the size of PGXACT, which is really quite bad -- it needs to
> be 12 bytes for performance reasons. And there's no spare bytes
> available, so I think we should follow one of the suggestions that he
> had over in that email thread, and put delayChkptEnd in PGPROC even
> though delayChkpt is in PGXACT.

This makes sense to me.  Kudos to Kyotaro for considering this.

At first read, this sounded like a trade-off between compatibility and
performance for PG 12 and older.  But I realize leaving delayChkpt in
PGXACT and adding just delayChkptEnd to PGPROC is compatible and leaves
PGXACT at a size of 12 bytes.  So this sounds like a good approach to
me.

Best Regards

Markus




(My mailer has been fixed.)

At Mon, 11 Apr 2022 21:45:59 +0200, Markus Wanner <markus.wanner@enterprisedb.com> wrote in 
> On Mon, 2022-04-11 at 15:21 -0400, Robert Haas wrote:
> > ... before v13, the commit in question actually
> > changed the size of PGXACT, which is really quite bad -- it needs to
> > be 12 bytes for performance reasons. And there's no spare bytes
> > available, so I think we should follow one of the suggestions that he
> > had over in that email thread, and put delayChkptEnd in PGPROC even
> > though delayChkpt is in PGXACT.
> 
> This makes sense to me.  Kudos to Kyotaro for considering this.
> 
> At first read, this sounded like a trade-off between compatibility and
> performance for PG 12 and older.  But I realize leaving delayChkpt in
> PGXACT and adding just delayChkptEnd to PGPROC is compatible and leaves
> PGXACT at a size of 12 bytes.  So this sounds like a good approach to
> me.

Thanks!

So, I created the patches for back-patching from 10 to 14.  (With
fixed a silly bug of the v1-pg14 that HaveVirtualXIDsDelayingChkpt and
HaveVirtualXIDsDelayingChkptEnd are inverted..)

They revert delayChkpt-related changes made by the patch and add
delayChkptEnd stuff.  I compared among every pair of the patches for
neighbouring versions, to make sure not making the same change in
different way and they have the same set of hunks.

This version takes the way sharing the common static function
(*ChkptGuts) between the functions *Chkpt() and *ChkptEnd().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment
On Tue, Apr 12, 2022 at 6:55 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> (My mailer has been fixed.)

Cool.

> So, I created the patches for back-patching from 10 to 14.  (With
> fixed a silly bug of the v1-pg14 that HaveVirtualXIDsDelayingChkpt and
> HaveVirtualXIDsDelayingChkptEnd are inverted..)

I am very sorry not to use these, but as I said in my previous post on
this thread, I prefer to commit what I wrote and Markus revised rather
than these versions. I have done that now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



At Thu, 14 Apr 2022 11:13:01 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Tue, Apr 12, 2022 at 6:55 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > (My mailer has been fixed.)
> 
> Cool.
> 
> > So, I created the patches for back-patching from 10 to 14.  (With
> > fixed a silly bug of the v1-pg14 that HaveVirtualXIDsDelayingChkpt and
> > HaveVirtualXIDsDelayingChkptEnd are inverted..)
> 
> I am very sorry not to use these, but as I said in my previous post on
> this thread, I prefer to commit what I wrote and Markus revised rather
> than these versions. I have done that now.

Not at all. It's just an unfortunate crossing.
Thanks for fixing this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center