Thread: massive FPI_FOR_HINT load after promote

massive FPI_FOR_HINT load after promote

From
Alvaro Herrera
Date:
Last week, James reported to us that after promoting a replica, some
seqscan was taking a huge amount of time; on investigation he saw that
there was a high rate of FPI_FOR_HINT WAL records emitted by the seqscan.
Looking closely at the generated traffic, HEAP_XMIN_COMMITTED was being
set on some tuples.

Now this may seem obvious to some as a drawback of the current system,
but I was taken by surprise.  The problem was simply that when a page is
examined by a seqscan, we do HeapTupleSatisfiesVisibility of each tuple
in isolation; and for each tuple we call SetHintBits().  And the FPI is
emitted only the first time; by the time we get to the second tuple, the
page is already dirty, so there's no need to emit an FPI.  But the FPI
we sent only had the bit on the first tuple ... so the standby will not
have the bit set for any subsequent tuple.  And on promotion, the
standby will have to have the bits set for all those tuples, unless you
happened to dirty the page again later for other reasons.
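
To make the sequence above concrete, here is a minimal toy model in C.
It is not PostgreSQL code: ToyPage, toy_set_hint, toy_seqscan and
toy_count_hinted are made-up names, the "standby" is just a copy of the
hint flags taken when the page is first dirtied, and NTUPLES is an
arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NTUPLES 8               /* hypothetical number of tuples on a page */

/* Toy page: one "xmin committed" hint flag per tuple, plus a dirty flag. */
typedef struct ToyPage
{
    bool        hinted[NTUPLES];
    bool        dirty;
} ToyPage;

/*
 * Toy stand-in for setting one hint bit: set the flag, and if this is the
 * first change to the page, copy the page as it looks right now to the
 * standby (playing the role of the full-page image).
 */
static void
toy_set_hint(ToyPage *primary, ToyPage *standby, int tupno)
{
    assert(tupno >= 0 && tupno < NTUPLES);

    primary->hinted[tupno] = true;
    if (!primary->dirty)
    {
        primary->dirty = true;
        memcpy(standby->hinted, primary->hinted, sizeof(primary->hinted));
    }
}

/* Toy seqscan: examine each tuple in isolation, hinting as we go. */
static void
toy_seqscan(ToyPage *primary, ToyPage *standby)
{
    for (int i = 0; i < NTUPLES; i++)
        toy_set_hint(primary, standby, i);
}

/* Count how many tuples on a page carry the hint. */
static int
toy_count_hinted(const ToyPage *page)
{
    int         n = 0;

    for (int i = 0; i < NTUPLES; i++)
        if (page->hinted[i])
            n++;
    return n;
}
```

Running toy_seqscan() on two zeroed pages leaves every tuple hinted on
the primary copy, while the standby copy only has the first one.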

So if you have some table where tuples gain hint bits in bulk, and
rarely modify the pages afterwards, and promote before those pages are
frozen, then you may end up with a massive amount of pages that will
need hinting after the promote, which can become troublesome.

Attached is a TAP file that reproduces the problem.  It always fails,
but in the log file you can see the tuples in the primary are all hinted
committed, while on the standby only the first one is hinted committed.



One simple idea to try to forestall this problem would be to modify the
algorithm so that all tuples are scanned and hinted if the page is going
to be dirtied -- then send a single FPI setting bits for all tuples,
instead of just on the first tuple.
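
As a rough sketch of that idea, again as a toy model in C rather than
PostgreSQL code (ToyPage and toy_hint_whole_page are hypothetical names,
and the "committed" array is an assumed stand-in for the real visibility
checks), the page is hinted in full before the single image is taken.
Locking and concurrency are deliberately not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NTUPLES 8               /* hypothetical number of tuples on a page */

typedef struct ToyPage
{
    bool        committed[NTUPLES]; /* inserting xact known committed? */
    bool        hinted[NTUPLES];    /* stand-in for HEAP_XMIN_COMMITTED */
    bool        dirty;
} ToyPage;

/*
 * Hint every eligible tuple first, then emit at most one full-page image.
 * Returns the number of images emitted (0 or 1).
 */
static int
toy_hint_whole_page(ToyPage *primary, ToyPage *standby)
{
    bool        changed = false;

    assert(primary != NULL && standby != NULL);

    for (int i = 0; i < NTUPLES; i++)
    {
        if (primary->committed[i] && !primary->hinted[i])
        {
            primary->hinted[i] = true;
            changed = true;
        }
    }

    /* Nothing new, or the page was already dirty: no image is needed. */
    if (!changed || primary->dirty)
        return 0;

    primary->dirty = true;
    memcpy(standby->hinted, primary->hinted, sizeof(primary->hinted));
    return 1;
}
```

In this model the standby copy ends up with the same hints as the
primary after a single image, and a second pass over the same page emits
nothing.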

-- 
Álvaro Herrera


Re: massive FPI_FOR_HINT load after promote

From
Masahiko Sawada
Date:
On Tue, 11 Aug 2020 at 07:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Last week, James reported to us that after promoting a replica, some
> seqscan was taking a huge amount of time; on investigation he saw that
> there was a high rate of FPI_FOR_HINT WAL records emitted by the seqscan.
> Looking closely at the generated traffic, HEAP_XMIN_COMMITTED was being
> set on some tuples.
>
> Now this may seem obvious to some as a drawback of the current system,
> but I was taken by surprise.  The problem was simply that when a page is
> examined by a seqscan, we do HeapTupleSatisfiesVisibility of each tuple
> in isolation; and for each tuple we call SetHintBits().  And the FPI is
> emitted only the first time; by the time we get to the second tuple, the
> page is already dirty, so there's no need to emit an FPI.  But the FPI
> we sent only had the bit on the first tuple ... so the standby will not
> have the bit set for any subsequent tuple.  And on promotion, the
> standby will have to have the bits set for all those tuples, unless you
> happened to dirty the page again later for other reasons.
>
> So if you have some table where tuples gain hint bits in bulk, and
> rarely modify the pages afterwards, and promote before those pages are
> frozen, then you may end up with a massive amount of pages that will
> need hinting after the promote, which can become troublesome.

Did the case you observed not use hot standby? I thought the impact of
this issue could be somewhat alleviated in hot standby cases since
read queries on the hot standby can set hint bits.

>
> One simple idea to try to forestall this problem would be to modify the
> algorithm so that all tuples are scanned and hinted if the page is going
> to be dirtied -- then send a single FPI setting bits for all tuples,
> instead of just on the first tuple.
>

This idea seems good to me, but I'm a bit concerned that the
probability of concurrent processes writing an FPI for the same page
might get higher, since concurrent processes could set hint bits at the
same time. If that's true, I wonder if we can advertise that hint bits
are being updated to prevent concurrent FPI writes for the same page.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: massive FPI_FOR_HINT load after promote

From
James Coleman
Date:
On Tue, Aug 11, 2020 at 2:55 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 11 Aug 2020 at 07:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > Last week, James reported to us that after promoting a replica, some
> > seqscan was taking a huge amount of time; on investigation he saw that
> > there was a high rate of FPI_FOR_HINT WAL records emitted by the seqscan.
> > Looking closely at the generated traffic, HEAP_XMIN_COMMITTED was being
> > set on some tuples.
> >
> > Now this may seem obvious to some as a drawback of the current system,
> > but I was taken by surprise.  The problem was simply that when a page is
> > examined by a seqscan, we do HeapTupleSatisfiesVisibility of each tuple
> > in isolation; and for each tuple we call SetHintBits().  And the FPI is
> > emitted only the first time; by the time we get to the second tuple, the
> > page is already dirty, so there's no need to emit an FPI.  But the FPI
> > we sent only had the bit on the first tuple ... so the standby will not
> > have the bit set for any subsequent tuple.  And on promotion, the
> > standby will have to have the bits set for all those tuples, unless you
> > happened to dirty the page again later for other reasons.
> >
> > So if you have some table where tuples gain hint bits in bulk, and
> > rarely modify the pages afterwards, and promote before those pages are
> > frozen, then you may end up with a massive amount of pages that will
> > need hinting after the promote, which can become troublesome.
>
> Did the case you observed not use hot standby? I thought the impact of
> this issue could be somewhat alleviated in hot standby cases since
> read queries on the hot standby can set hint bits.

We do have hot standby enabled, and there are sometimes large queries
that may do seq scans that run against a replica, but there are
multiple replicas (and each one would have to have the bits set), and
a given replica that gets promoted in our topology isn't guaranteed to
be one that's seen those reads.

James



Re: massive FPI_FOR_HINT load after promote

From
Alvaro Herrera
Date:
On 2020-Aug-11, Masahiko Sawada wrote:

> On Tue, 11 Aug 2020 at 07:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > So if you have some table where tuples gain hint bits in bulk, and
> > rarely modify the pages afterwards, and promote before those pages are
> > frozen, then you may end up with a massive amount of pages that will
> > need hinting after the promote, which can become troublesome.
> 
> Did the case you observed not use hot standby? I thought the impact of
> this issue could be somewhat alleviated in hot standby cases since
> read queries on the hot standby can set hint bits.

Oh, interesting, I didn't know that.  However, it's not 100% true: the
standby can set the bit in shared buffers, but it does not mark the page
dirty.  So when the page is evicted, those bits that were set are lost.
That's not great.  See MarkBufferDirtyHint:

        /*
         * If we need to protect hint bit updates from torn writes, WAL-log a
         * full page image of the page. This full page image is only necessary
         * if the hint bit update is the first change to the page since the
         * last checkpoint.
         *
         * We don't check full_page_writes here because that logic is included
         * when we call XLogInsert() since the value changes dynamically.
         */
        if (XLogHintBitIsNeeded() &&
            (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
        {
            /*
             * If we must not write WAL, due to a relfilenode-specific
             * condition or being in recovery, don't dirty the page.  We can
             * set the hint, just not dirty the page as a result so the hint
             * is lost when we evict the page or shutdown.
             *
             * See src/backend/storage/page/README for longer discussion.
             */
            if (RecoveryInProgress() ||
                RelFileNodeSkippingWAL(bufHdr->tag.rnode))
                return;
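
The effect of that early return can be illustrated with a small toy
model in C.  This is only a sketch: ToyBuffer, toy_set_hint,
toy_evict_and_reload and toy_hint_survives_eviction are made-up names,
and it assumes the XLogHintBitIsNeeded() branch is taken (that is, hint
bit changes need WAL protection).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyBuffer
{
    bool        hint;           /* hint bit as seen in shared buffers */
    bool        dirty;          /* will the buffer be written on eviction? */
} ToyBuffer;

/*
 * Toy version of the decision quoted above: the hint is always set in
 * memory, but the buffer is only marked dirty when not in recovery.
 */
static void
toy_set_hint(ToyBuffer *buf, bool in_recovery)
{
    assert(buf != NULL);

    buf->hint = true;
    if (in_recovery)
        return;                 /* hint stays in memory only */
    buf->dirty = true;
}

/* Evict the buffer and read it back: only dirty buffers reach disk. */
static void
toy_evict_and_reload(ToyBuffer *buf, bool *disk_hint)
{
    assert(buf != NULL && disk_hint != NULL);

    if (buf->dirty)
        *disk_hint = buf->hint;
    buf->hint = *disk_hint;
    buf->dirty = false;
}

/* Does a hint set under the given mode survive one eviction cycle? */
static bool
toy_hint_survives_eviction(bool in_recovery)
{
    ToyBuffer   buf = {false, false};
    bool        disk_hint = false;

    toy_set_hint(&buf, in_recovery);
    toy_evict_and_reload(&buf, &disk_hint);
    return buf.hint;
}
```

In this model toy_hint_survives_eviction(true) is false and
toy_hint_survives_eviction(false) is true, which is the behavior
described above for a standby versus a primary.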


> > One simple idea to try to forestall this problem would be to modify the
> > algorithm so that all tuples are scanned and hinted if the page is going
> > to be dirtied -- then send a single FPI setting bits for all tuples,
> > instead of just on the first tuple.
> 
> This idea seems good to me, but I'm a bit concerned that the
> probability of concurrent processes writing an FPI for the same page
> might get higher, since concurrent processes could set hint bits at the
> same time. If that's true, I wonder if we can advertise that hint bits
> are being updated to prevent concurrent FPI writes for the same page.

Hmm, a very good point.  Sounds like we would need to obtain an
exclusive lock on the page .. but that would be very problematic.

I don't have a concrete proposal to solve this problem ATM, but it's
more and more looking like it's a serious problem.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: massive FPI_FOR_HINT load after promote

From
Masahiko Sawada
Date:
On Wed, 12 Aug 2020 at 02:42, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2020-Aug-11, Masahiko Sawada wrote:
>
> > On Tue, 11 Aug 2020 at 07:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> > > So if you have some table where tuples gain hint bits in bulk, and
> > > rarely modify the pages afterwards, and promote before those pages are
> > > frozen, then you may end up with a massive amount of pages that will
> > > need hinting after the promote, which can become troublesome.
> >
> > Did the case you observed not use hot standby? I thought the impact of
> > this issue could be somewhat alleviated in hot standby cases since
> > read queries on the hot standby can set hint bits.
>
> Oh, interesting, I didn't know that.  However, it's not 100% true: the
> standby can set the bit in shared buffers, but it does not mark the page
> dirty.  So when the page is evicted, those bits that were set are lost.
> That's not great.  See MarkBufferDirtyHint:

Yeah, you're right.

>
> > > One simple idea to try to forestall this problem would be to modify the
> > > algorithm so that all tuples are scanned and hinted if the page is going
> > > to be dirtied -- then send a single FPI setting bits for all tuples,
> > > instead of just on the first tuple.
> >
> > This idea seems good to me, but I'm a bit concerned that the
> > probability of concurrent processes writing an FPI for the same page
> > might get higher, since concurrent processes could set hint bits at the
> > same time. If that's true, I wonder if we can advertise that hint bits
> > are being updated to prevent concurrent FPI writes for the same page.
>
> Hmm, a very good point.  Sounds like we would need to obtain an
> exclusive lock on the page .. but that would be very problematic.
>

I think that when the page is going to be dirtied, only the hint bit
updates on the page and the FPI write need to be performed exclusively.
So perhaps we can add a flag, say BM_UPDATE_HINTBITS, to the buffer
descriptor indicating that the hint bits are being updated.
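
A very rough sketch of what such a flag could look like, using plain
C11 atomics instead of PostgreSQL's buffer-header machinery.
ToyBufferDesc, TOY_BM_UPDATE_HINTBITS, toy_try_begin_hint_update and
toy_end_hint_update are all hypothetical names, and nothing like this
exists in the tree today.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit; the real buffer state word has a different layout. */
#define TOY_BM_UPDATE_HINTBITS  (1U << 0)

typedef struct ToyBufferDesc
{
    _Atomic uint32_t state;
} ToyBufferDesc;

/*
 * Try to become the one process that hints the page and writes the FPI.
 * Returns true if we set the flag, false if somebody else already holds it.
 */
static bool
toy_try_begin_hint_update(ToyBufferDesc *buf)
{
    uint32_t    old = atomic_fetch_or(&buf->state, TOY_BM_UPDATE_HINTBITS);

    return (old & TOY_BM_UPDATE_HINTBITS) == 0;
}

/* Clear the flag once the hint bits are set and the FPI has been written. */
static void
toy_end_hint_update(ToyBufferDesc *buf)
{
    uint32_t    old = atomic_fetch_and(&buf->state, ~TOY_BM_UPDATE_HINTBITS);

    assert((old & TOY_BM_UPDATE_HINTBITS) != 0);
    (void) old;                 /* silence unused warning under NDEBUG */
}
```

A process that loses the race in toy_try_begin_hint_update() would skip
the FPI and leave the work to the winner.  What the loser should do with
its own hint bits in the meantime is left open in this sketch.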

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: massive FPI_FOR_HINT load after promote

From
Simon Riggs
Date:
On Mon, 10 Aug 2020 at 23:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> The problem was simply that when a page is
> examined by a seqscan, we do HeapTupleSatisfiesVisibility of each tuple
> in isolation; and for each tuple we call SetHintBits().  And the FPI is
> emitted only the first time; by the time we get to the second tuple, the
> page is already dirty, so there's no need to emit an FPI.  But the FPI
> we sent only had the bit on the first tuple ... so the standby will not
> have the bit set for any subsequent tuple.  And on promotion, the
> standby will have to have the bits set for all those tuples, unless you
> happened to dirty the page again later for other reasons.

Which probably means that pg_rewind is broken because it won't be able
to rewind correctly.

> One simple idea to try to forestall this problem would be to modify the
> algorithm so that all tuples are scanned and hinted if the page is going
> to be dirtied -- then send a single FPI setting bits for all tuples,
> instead of just on the first tuple.

This would make latency much worse for non-seqscan cases.

Certainly for seqscans it would make sense to emit a message that sets
all tuples at once, or possibly emit an FPI and then follow that with
a second message that sets all other hints on the page.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases