Re: Assertion failure when streaming logical changes - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: Assertion failure when streaming logical changes
Date
Msg-id CAMsr+YH_R-x3y8=843wdsnYGS9O-D3QxDbGbEvvfr2_t8E=-ug@mail.gmail.com
Whole thread Raw
In response to Re: Assertion failure when streaming logical changes  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Assertion failure when streaming logical changes
List pgsql-hackers


On 11 February 2015 at 08:51, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

The new xmin tracking code assumes that if a snapshots's regd_count > 0, it has already been pushed to the RegisteredSnapshots heap. That assumption doesn't hold because the logical decoding stuff creates its own snapshots and sets regd_count=1 to prevent snapmgr.c from freeing them, even though they haven't been registered with RegisterSnapshot.

The second paragraph in this comment in snapmgr.c needs fixing:

 * Likewise, any snapshots that have been exported by pg_export_snapshot
 * have regd_count = 1 and are counted in RegisteredSnapshots, but are not
 * tracked by any resource owner.
 *
 * The same is true for historic snapshots used during logical decoding,
 * their lifetime is managed separately (as they life longer as one xact.c
 * transaction).

Besides the spelling, that's incorrect: historic snapshots are *not* counted in RegisteredSnapshots. That was harmless before commit 94028691, but it was always wrong.


I think setting regd_count=1 outside snapmgr.c is a pretty ugly hack. snapbuild.c also abuses active_count as a reference counter, which is similarly ugly. I think regd_count and active_count should both be treated as private to snapmgr.c, and initialized to zero elsewhere.

As a minimal fix, we could change the logical decoding code to not use regd_count to prevent snapmgr.c from freeing its snapshots, but use active_count for that too. Setting regd_count to 1 in SnapBuildBuildSnapshot() seems unnecessary in the first place, as the callers also call SnapBuildSnapIncRefcount(). Patch attached.

It might be a good idea to apply this if nothing better is forthcoming. Logical decoding in WALsenders is broken at the moment.

Otherwise it needs to go on the 9.5 blockers list.

But could we get rid of those active_count manipulations too? Could you replace the SnapBuildSnap[Inc|Dec]Refcount calls with [Un]RegisterSnapshot()?

It would be interesting to know why it was done that way in the first place, rather than using the snapshot management infrastructure. I presume it needed to do something not directly offered by the snapshot manager but I haven't managed to grasp what, exactly.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

pgsql-hackers by date:

Previous
From: Shigeru Hanada
Date:
Subject: Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)
Next
From: Sawada Masahiko
Date:
Subject: Re: Proposal : REINDEX xxx VERBOSE