Thread: Assertion failure with replication origins and PREPARE TRANSACTIOn
Hi all, (CCing some folks who worked on this area lately) The following sequence of commands generates an assertion failure, as of $subject: select pg_replication_origin_create('popo'); select pg_replication_origin_session_setup('popo'); begin; select txid_current(); prepare transaction 'popo'; -- assertion fails The problem originates from 1eb6d65, down to 11, where we finish by triggering this assertion before replorigin_session_origin_lsn is not valid: + if (replorigin) + { + Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr); + hdr->origin_lsn = replorigin_session_origin_lsn; + hdr->origin_timestamp = replorigin_session_origin_timestamp; + } As far as I understand this code and based on the docs, pg_replication_origin_xact_setup(), that would set of the session origin LSN and timestamp, is an optional choise. replorigin_session_advance() would be a no-op for remote_lsn, and local_lsn requires an update. Now please note that I am really fluent with this stuff, so feel free to correct me. The intention of the code also seems that XACT_XINFO_HAS_ORIGIN should still be set, but with no data. At the end, it seems to me that the assertion could just be dropped as per the attached, as we'd finish by setting up origin_lsn and origin_timestamp in the 2PC file header with some invalid data. The else block could just be removed, but then one would need to guess from origin.c that both replorigin_session_* may not be set. I am not completely sure that this is the right move, though, as pg_replication_origin_status has a *very* limited amount of tests. Adding a test to replorigin.sql with 2PC transactions able to trigger the problem is easy once we rely on a different slot to not influence the existing tests, as the expected contents of pg_replication_origin_status could be messed up in the follow-up tests. Thoughts? -- Michael
Attachment
On Mon, Dec 13, 2021 at 12:44 PM Michael Paquier <michael@paquier.xyz> wrote: > > Hi all, > (CCing some folks who worked on this area lately) > > The following sequence of commands generates an assertion failure, as > of $subject: > select pg_replication_origin_create('popo'); > select pg_replication_origin_session_setup('popo'); > begin; > select txid_current(); > prepare transaction 'popo'; -- assertion fails > > The problem originates from 1eb6d65, down to 11, where we finish by > triggering this assertion before replorigin_session_origin_lsn is not > valid: > + if (replorigin) > + { > + Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr); > + hdr->origin_lsn = replorigin_session_origin_lsn; > + hdr->origin_timestamp = replorigin_session_origin_timestamp; > + } > > As far as I understand this code and based on the docs, > pg_replication_origin_xact_setup(), that would set of the session > origin LSN and timestamp, is an optional choise. > replorigin_session_advance() would be a no-op for remote_lsn, and > local_lsn requires an update. Now please note that I am really fluent > with this stuff, so feel free to correct me. The intention of the > code also seems that XACT_XINFO_HAS_ORIGIN should still be set, but > with no data. > > At the end, it seems to me that the assertion could just be dropped as > per the attached, as we'd finish by setting up origin_lsn and > origin_timestamp in the 2PC file header with some invalid data. Why do we check if replorigin_session_origin_lsn is not invalid data only when PREPARE TRANSACTION? Looking at commit and rollback code, we don't have assertions or checks that check replorigin_session_origin_lsn/timestamp is valid data. So it looks like we accept also invalid data in those cases since replorigin_advance doesn’t move LSN backward while applying a commit or rollback record even if LSN is invalid. The same is true for PREPARE records, i.g., even if replication origin LSN is invalid, it doesn't go backward. If replication origin LSN and timestamp in commit record and rollback record must be valid data too, I think we should similar checks for commit and rollback code and I think the assertions will fail in the case I reported before[1]. Regards, [1] https://www.postgresql.org/message-id/CAD21AoAMuTrXezGqaV1Q5H-Hf%2BzKqGbU8XuCZk9iQMYueF3%2B8w%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Dec 13, 2021 at 04:30:36PM +0900, Masahiko Sawada wrote: > Why do we check if replorigin_session_origin_lsn is not invalid data > only when PREPARE TRANSACTION? Well, it does not matter for the case of PREPARE TRANSACTION, does it? we would include values for the the origin LSN and timestamp in any case as these are fixed in the 2PC file header. > Looking at commit and rollback code, we > don't have assertions or checks that check > replorigin_session_origin_lsn/timestamp is valid data. So it looks > like we accept also invalid data in those cases since > replorigin_advance doesn’t move LSN backward while applying a commit > or rollback record even if LSN is invalid. The same is true for > PREPARE records, i.g., even if replication origin LSN is invalid, it > doesn't go backward. If replication origin LSN and timestamp in commit > record and rollback record must be valid data too, I think we should > similar checks for commit and rollback code and I think the assertions > will fail in the case I reported before[1]. It seems to me that the origin LSN and timestamp are optional, so as one may choose to not call pg_replication_origin_xact_setup() (as said in my first message), and we would not require more sanity checks when advancing the replication origin in the commit and rollback code paths. Let's see what others think here. -- Michael
Attachment
On Mon, Dec 13, 2021 at 2:00 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Dec 13, 2021 at 04:30:36PM +0900, Masahiko Sawada wrote: > > Why do we check if replorigin_session_origin_lsn is not invalid data > > only when PREPARE TRANSACTION? > > Well, it does not matter for the case of PREPARE TRANSACTION, does it? > we would include values for the the origin LSN and timestamp in > any case as these are fixed in the 2PC file header. > > > Looking at commit and rollback code, we > > don't have assertions or checks that check > > replorigin_session_origin_lsn/timestamp is valid data. So it looks > > like we accept also invalid data in those cases since > > replorigin_advance doesn’t move LSN backward while applying a commit > > or rollback record even if LSN is invalid. The same is true for > > PREPARE records, i.g., even if replication origin LSN is invalid, it > > doesn't go backward. If replication origin LSN and timestamp in commit > > record and rollback record must be valid data too, I think we should > > similar checks for commit and rollback code and I think the assertions > > will fail in the case I reported before[1]. > > It seems to me that the origin LSN and timestamp are optional, so as > one may choose to not call pg_replication_origin_xact_setup() (as said > in my first message), and we would not require more sanity checks when > advancing the replication origin in the commit and rollback code > paths. > This is my understanding as well. I think here the point of Sawada-San is why to have additional for replorigin_session_origin_lsn in prepare code path? I think the way you have it in your patch is correct as well but it is probably better to keep the check only based on replorigin so as to keep this check consistent in all code paths. -- With Regards, Amit Kapila.
On Mon, Dec 13, 2021 at 03:46:55PM +0530, Amit Kapila wrote: > This is my understanding as well. I think here the point of Sawada-San > is why to have additional for replorigin_session_origin_lsn in prepare > code path? I think the way you have it in your patch is correct as > well but it is probably better to keep the check only based on > replorigin so as to keep this check consistent in all code paths. Well, I don't think that it is a big deal one way or the other, as we'd finish with InvalidXLogRecPtr for the LSN and 0 for the timestamp anyway. If both of you feel that just removing the assertion rather than adding an extra check is better, that's fine by me :) -- Michael
Attachment
On Mon, Dec 13, 2021 at 07:53:43PM +0900, Michael Paquier wrote: > Well, I don't think that it is a big deal one way or the other, as > we'd finish with InvalidXLogRecPtr for the LSN and 0 for the timestamp > anyway. If both of you feel that just removing the assertion rather > than adding an extra check is better, that's fine by me :) Looked at that today, and done this way. The tests have been extended a bit more with one ROLLBACK and one ROLLBACK PREPARED, while checking for the contents decoded. -- Michael