Thread: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
Hi!
PREAMBLE
For the last couple of months, I have been stumbling into a problem while running tests for my 64-bit XIDs patch set
on ARM (Debian, aarch64) and some more exotic platforms. The contrib/test_decoding test (catalog_change_snapshot)
rarely failed with the following message:

TRAP: FailedAssertion("TransactionIdIsNormal(InitialRunningXacts[0]) && TransactionIdIsNormal(builder->xmin)", File: "snapbuild.c"

I got plenty of failures on ARM, a couple on x86, and none (if memory serves) on x86-64.
At first, my thought was to blame my 64-bit XIDs patch, but this is not the case. This error persists from PG10 to PG15
without any patch applied, though it is hard to reproduce.
PROBLEM
After some investigation, I think the problem is in snapbuild.c (commit 272248a0c1b1, see [0]). We allocate the
InitialRunningXacts array in builder->context, but by the time we call SnapBuildPurgeOlderTxn this context may already
be freed. Thus, the InitialRunningXacts array becomes an array of 2139062143 (i.e. 0x7F7F7F7F) values. This does not
cause the buildfarm to fail, thanks to this code:
	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
									 builder->xmin))
		return;
Since the cluster is initialized with an XID far below 0x7F7F7F7F, we return here, but the problem still exists.
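For context: in assert-enabled builds, PostgreSQL wipes freed memory context chunks with 0x7F bytes when
CLOBBER_FREED_MEMORY is defined (wipe_mem(), if I read the memory-debugging code right), which is exactly where the
2139062143 values come from. A standalone illustration (my own sketch, not PostgreSQL code):

/* Why a freed TransactionId array reads as 2139062143 on assert builds. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	uint32_t	xid;

	/* mimic what the memory-context code does to a freed chunk */
	memset(&xid, 0x7F, sizeof(xid));
	printf("%u (0x%08X)\n", xid, xid);	/* prints 2139062143 (0x7F7F7F7F) */
	return 0;
}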
I've attached a patch based on branch REL_15_STABLE to reproduce the problem on x86-64.
In my 64-bit XIDs patch set, this problem manifests because I init the cluster with XIDs far beyond the 32-bit bound.
Alternatively, I tried to use my patch [1] to init a cluster with the first transaction ID equal to 2139062143
(i.e. 0x7F7F7F7F) and put a pg_sleep call just as in the attached patch:
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -968,6 +968,8 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
 	if (NInitialRunningXacts == 0)
 		return;
 
+	pg_usleep(1000000L * 2L);
+
 	/* bound check if there is at least one transaction to remove */
 	if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
 									 builder->xmin))
Then I ran installcheck-force many times for test_decoding's catalog_change_snapshot and got a segfault.
CONCLUSION
In snapbuild.c, the context-allocated array InitialRunningXacts may be freed while still referenced. This causes an
assertion failure at best, and may lead to more serious problems.
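To make the hazard concrete, here is a minimal sketch (my own illustration using the memory context API, not the
actual snapbuild.c code) of a static pointer outliving the context that owns its allocation:

/* Illustration only: a static pointer that outlives its memory context. */
#include "postgres.h"
#include "utils/memutils.h"

static TransactionId *saved_xacts = NULL;

static void
run_decoding_once(void)
{
	MemoryContext ctx = AllocSetContextCreate(TopMemoryContext,
											  "snapshot builder context",
											  ALLOCSET_DEFAULT_SIZES);

	saved_xacts = (TransactionId *)
		MemoryContextAlloc(ctx, 4 * sizeof(TransactionId));

	/* ... decoding runs and uses saved_xacts ... */

	MemoryContextDelete(ctx);

	/*
	 * BUG: saved_xacts now dangles. Nothing resets it to NULL, so a later
	 * run in the same backend reads wiped (0x7F-filled) memory.
	 */
}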
P.S.
A simple fix like:
@@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		 * changes. See SnapBuildXidSetCatalogChanges.
 		 */
 		NInitialRunningXacts = nxacts;
-		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+		InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, sz);
 		memcpy(InitialRunningXacts, running->xids, sz);
 		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
seems to solve the described problem, but I don't know the full context of [0] or why the array is allocated in builder->context.
Best regards,
Maxim Orlov.
Hi,

On 2022-11-21 15:47:12 +0300, Maxim Orlov wrote:
> After some investigation, I think the problem is in snapbuild.c
> (commit 272248a0c1b1, see [0]). We allocate the InitialRunningXacts
> array in builder->context, but by the time we call
> SnapBuildPurgeOlderTxn this context may already be freed. Thus, the
> InitialRunningXacts array becomes an array of 2139062143 (i.e.
> 0x7F7F7F7F) values. This does not cause the buildfarm to fail, thanks
> to this code:

Amit, that does indeed seem to be a problem...

Greetings,

Andres Freund
On Tue, Nov 22, 2022 at 2:22 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-11-21 15:47:12 +0300, Maxim Orlov wrote:
> > After some investigation, I think the problem is in snapbuild.c
> > (commit 272248a0c1b1, see [0]). We allocate the InitialRunningXacts
> > array in builder->context, but by the time we call
> > SnapBuildPurgeOlderTxn this context may already be freed. Thus, the
> > InitialRunningXacts array becomes an array of 2139062143 (i.e.
> > 0x7F7F7F7F) values. This does not cause the buildfarm to fail, thanks
> > to this code:
>
> Amit, that does indeed seem to be a problem...
>

I'll look into it.

--
With Regards,
Amit Kapila.
On Mon, Nov 21, 2022 at 6:17 PM Maxim Orlov <orlovmg@gmail.com> wrote:
>
> PROBLEM
>
> After some investigation, I think the problem is in snapbuild.c (commit 272248a0c1b1, see [0]). We allocate the
> InitialRunningXacts array in builder->context, but by the time we call SnapBuildPurgeOlderTxn this context may
> already be freed.
>

I think you are seeing it freed in SnapBuildPurgeOlderTxn when we
finish and restart decoding in the same session. After finishing the
first decoding, it frees the decoding context but we forgot to reset
NInitialRunningXacts and the InitialRunningXacts array. So, the next time
we start decoding in the same session, where we don't restore any
serialized snapshot, it can lead to the problem you are seeing because
NInitialRunningXacts (and the InitialRunningXacts array) won't have sane
values.

This can happen in the catalog_change_snapshot test as we have
multiple permutations and those use the same session across a restart
of decoding.

> A simple fix like:
> @@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>  		 * changes. See SnapBuildXidSetCatalogChanges.
>  		 */
>  		NInitialRunningXacts = nxacts;
> -		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
> +		InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, sz);
>  		memcpy(InitialRunningXacts, running->xids, sz);
>  		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
>
> seems to solve the described problem, but I don't know the full context of [0] or why the array is allocated in builder->context.
>

It will leak the memory for InitialRunningXacts. We need to reset
NInitialRunningXacts (and InitialRunningXacts) as mentioned above.

Thank you for the report and initial analysis. I have added Sawada-San
to get his views as he was the primary author of this work.

--
With Regards,
Amit Kapila.
Hi,

On Tue, Nov 22, 2022 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 21, 2022 at 6:17 PM Maxim Orlov <orlovmg@gmail.com> wrote:
> >
> > PROBLEM
> >
> > After some investigation, I think the problem is in snapbuild.c (commit 272248a0c1b1, see [0]). We allocate the
> > InitialRunningXacts array in builder->context, but by the time we call SnapBuildPurgeOlderTxn this context may
> > already be freed.
> >
>
> I think you are seeing it freed in SnapBuildPurgeOlderTxn when we
> finish and restart decoding in the same session. After finishing the
> first decoding, it frees the decoding context but we forgot to reset
> NInitialRunningXacts and the InitialRunningXacts array. So, the next time
> we start decoding in the same session, where we don't restore any
> serialized snapshot, it can lead to the problem you are seeing because
> NInitialRunningXacts (and the InitialRunningXacts array) won't have sane
> values.
>
> This can happen in the catalog_change_snapshot test as we have
> multiple permutations and those use the same session across a restart
> of decoding.

I have the same analysis. In order to restart the decoding from an LSN
where we don't restore any serialized snapshot, we somehow need to
advance the slot's restart_lsn. In this case, I think it happened since
the same session drops the replication slot at the end of the first
scenario and creates a slot with the same name at the beginning of the
second scenario in catalog_change_snapshot.spec.

> >
> > A simple fix like:
> > @@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
> >  		 * changes. See SnapBuildXidSetCatalogChanges.
> >  		 */
> >  		NInitialRunningXacts = nxacts;
> > -		InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
> > +		InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, sz);
> >  		memcpy(InitialRunningXacts, running->xids, sz);
> >  		qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
> >
> > seems to solve the described problem, but I don't know the full context of [0] or why the array is allocated in builder->context.
> >
>
> It will leak the memory for InitialRunningXacts. We need to reset
> NInitialRunningXacts (and InitialRunningXacts) as mentioned above.
>
> Thank you for the report and initial analysis. I have added Sawada-San
> to get his views as he was the primary author of this work.

Thanks!

I've attached a draft patch. To fix it, I think we can reset
InitialRunningXacts and NInitialRunningXacts at FreeSnapshotBuilder()
and add an assertion in AllocateSnapshotBuilder() to make sure both
are reset.

Regarding the tests, the patch includes a new scenario to reproduce
this issue. However, since the issue can be reproduced also by the
existing scenario (with low probability, though), I'm not sure it's
worth adding the new scenario.

I've not checked if the patch works for version 14 or older yet.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
> I've attached a draft patch. To fix it, I think we can reset
> InitialRunningXacts and NInitialRunningXacts at FreeSnapshotBuilder()
> and add an assertion in AllocateSnapshotBuilder() to make sure both
> are reset.
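If I read the draft right, the shape of the fix is roughly the following (my paraphrase, not the exact patch):

/* Sketch of the described fix, not the exact draft patch. */
void
FreeSnapshotBuilder(SnapBuild *builder)
{
	MemoryContext context = builder->context;

	/* free snapshot explicitly, that contains some error checking */
	if (builder->snapshot != NULL)
	{
		SnapBuildSnapDecRefcount(builder->snapshot);
		builder->snapshot = NULL;
	}

	/* other resources are deallocated via memory context reset */
	MemoryContextDelete(context);

	/* InitialRunningXacts was freed along with the context; reset it */
	NInitialRunningXacts = 0;
	InitialRunningXacts = NULL;
}

plus, in AllocateSnapshotBuilder():

	Assert(NInitialRunningXacts == 0 && InitialRunningXacts == NULL);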
Thanks for the patch. It works fine. I've tested this patch for versions 15 and 11 on x86_64 and ARM
and saw no failures. Note that the function pg_current_xact_id added by 4c04be9b05ad doesn't exist in PG11
(its pre-PG13 equivalent is txid_current()).
> Regarding the tests, the patch includes a new scenario to
> reproduce this issue. However, since the issue can be reproduced also
> by the existing scenario (with low probability, though), I'm not sure
> it's worth adding the new scenario.
AFAICS, the test added doesn't reproduce this issue 100% of the time, so maybe it is not worth it. But I do not have a strong opinion here.
Let's add the tests in a separate commit and let the actual committer decide what to do, shall we?
--
Best regards,
Maxim Orlov.
On Tue, Nov 22, 2022 at 10:33 PM Maxim Orlov <orlovmg@gmail.com> wrote:
>
> > Regarding the tests, the patch includes a new scenario to
> > reproduce this issue. However, since the issue can be reproduced also
> > by the existing scenario (with low probability, though), I'm not sure
> > it's worth adding the new scenario.
>
> AFAICS, the test added doesn't reproduce this issue 100% of the time, so maybe it is not worth it. But I do not have a strong opinion here.
> Let's add the tests in a separate commit and let the actual committer decide what to do, shall we?
>

+1 to not have a test for this as the scenario can already be tested
by the existing set of tests.

--
With Regards,
Amit Kapila.
On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 22, 2022 at 10:33 PM Maxim Orlov <orlovmg@gmail.com> wrote:
> >
> > > Regarding the tests, the patch includes a new scenario to
> > > reproduce this issue. However, since the issue can be reproduced also
> > > by the existing scenario (with low probability, though), I'm not sure
> > > it's worth adding the new scenario.
> >
> > AFAICS, the test added doesn't reproduce this issue 100% of the time, so maybe it is not worth it. But I do not have a strong opinion here.
> > Let's add the tests in a separate commit and let the actual committer decide what to do, shall we?
> >
>
> +1 to not have a test for this as the scenario can already be tested
> by the existing set of tests.

Agreed not to have a test case for this.

I've attached an updated patch. I've confirmed this patch works for
all supported branches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
> Agreed not to have a test case for this.
>
> I've attached an updated patch. I've confirmed this patch works for
> all supported branches.
It works for me as well. Thanks!
I've created a commitfest entry for this patch, see https://commitfest.postgresql.org/41/4024/
Hope it will be committed soon.
--
Best regards,
Maxim Orlov.
On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Agreed not to have a test case for this.
>
> I've attached an updated patch. I've confirmed this patch works for
> all supported branches.
>

I have slightly changed the checks used in the patch; otherwise it looks
good to me. I am planning to push (v11-v15) the attached tomorrow
unless there are more comments.

--
With Regards,
Amit Kapila.
On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Agreed not to have a test case for this.
> >
> > I've attached an updated patch. I've confirmed this patch works for
> > all supported branches.
> >
>
> I have slightly changed the checks used in the patch; otherwise it looks
> good to me. I am planning to push (v11-v15) the attached tomorrow
> unless there are more comments.
>

Pushed.

--
With Regards,
Amit Kapila.
On Fri, 25 Nov 2022 at 09:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Agreed not to have a test case for this.
> > >
> > > I've attached an updated patch. I've confirmed this patch works for
> > > all supported branches.
> > >
> >
> > I have slightly changed the checks used in the patch; otherwise it looks
> > good to me. I am planning to push (v11-v15) the attached tomorrow
> > unless there are more comments.
> >
>
> Pushed.
A big thanks to you! Could you also close the commitfest entry https://commitfest.postgresql.org/41/4024/, please?
--
Best regards,
Maxim Orlov.
On Fri, Nov 25, 2022 at 5:58 PM Maxim Orlov <orlovmg@gmail.com> wrote:
>
> On Fri, 25 Nov 2022 at 09:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> > >
>> > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > >
>> > > Agreed not to have a test case for this.
>> > >
>> > > I've attached an updated patch. I've confirmed this patch works for
>> > > all supported branches.
>> > >
>> >
>> > I have slightly changed the checks used in the patch; otherwise it looks
>> > good to me. I am planning to push (v11-v15) the attached tomorrow
>> > unless there are more comments.
>> >
>>
>> Pushed.
>
> A big thanks to you! Could you also close the commitfest entry https://commitfest.postgresql.org/41/4024/, please?

Closed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com