Thread: massive FPI_FOR_HINT load after promote
Last week, James reported to us that after promoting a replica, some seqscan was taking a huge amount of time; on investigation he saw that there was a high rate of FPI_FOR_HINT WAL records being generated by the seqscan. Looking closely at the generated traffic, HEAP_XMIN_COMMITTED was being set on some tuples.

Now this may seem obvious to some as a drawback of the current system, but it took me by surprise. The problem is simply that when a page is examined by a seqscan, we run HeapTupleSatisfiesVisibility on each tuple in isolation, and for each tuple we call SetHintBits(). The FPI happens only for the first tuple: by the time we get to the second tuple, the page is already dirty, so there's no need to emit an FPI. But the FPI we sent only had the bit set on the first tuple, so the standby will not have the bit set for any subsequent tuple. After promotion, the standby will have to set the bits for all those tuples itself, unless you happened to dirty the page again later for other reasons.

So if you have some table where tuples gain hint bits in bulk, and rarely modify the pages afterwards, and promote before those pages are frozen, then you may end up with a massive number of pages that will need hinting after the promote, which can become troublesome.

Attached is a TAP file that reproduces the problem. It always fails, but in the log file you can see that the tuples on the primary are all hinted committed, while on the standby only the first one is hinted committed.

One simple idea to try to forestall this problem would be to modify the algorithm so that all tuples are scanned and hinted if the page is going to be dirtied -- then send a single FPI setting bits for all tuples, instead of just on the first tuple.

-- 
Álvaro Herrera
On Tue, 11 Aug 2020 at 07:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> So if you have some table where tuples gain hint bits in bulk, and
> rarely modify the pages afterwards, and promote before those pages are
> frozen, then you may end up with a massive amount of pages that will
> need hinting after the promote, which can become troublesome.

Did the case you observed not use hot standby? I thought the impact of this issue could be somewhat alleviated in hot standby cases, since read queries on the hot standby can set hint bits.

> One simple idea to try to forestall this problem would be to modify the
> algorithm so that all tuples are scanned and hinted if the page is going
> to be dirtied -- then send a single FPI setting bits for all tuples,
> instead of just on the first tuple.

This idea seems good to me, but I'm a bit concerned that the probability of concurrent processes writing an FPI for the same page might get higher, since concurrent processes could set hint bits at the same time. If that's true, I wonder if we can advertise that hint bits are being updated, to prevent concurrent FPI writes for the same page.

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 11, 2020 at 2:55 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>
> Did the case you observed not use hot standby? I thought the impact of
> this issue could be somewhat alleviated in hot standby cases since
> read queries on the hot standby can set hint bits.

We do have hot standby enabled, and there are sometimes large queries that may do seq scans against a replica, but there are multiple replicas (and each one would have to have the bits set), and a given replica that gets promoted in our topology isn't guaranteed to be one that has seen those reads.

James
On 2020-Aug-11, Masahiko Sawada wrote:

> Did the case you observed not use hot standby? I thought the impact of
> this issue could be somewhat alleviated in hot standby cases since
> read queries on the hot standby can set hint bits.

Oh, interesting, I didn't know that. However, it's not 100% true: the standby can set the bit in shared buffers, but it does not mark the page dirty. So when the page is evicted, those bits that were set are lost. That's not great. See MarkBufferDirtyHint:

	/*
	 * If we need to protect hint bit updates from torn writes, WAL-log a
	 * full page image of the page. This full page image is only necessary
	 * if the hint bit update is the first change to the page since the
	 * last checkpoint.
	 *
	 * We don't check full_page_writes here because that logic is included
	 * when we call XLogInsert() since the value changes dynamically.
	 */
	if (XLogHintBitIsNeeded() &&
		(pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
	{
		/*
		 * If we must not write WAL, due to a relfilenode-specific
		 * condition or being in recovery, don't dirty the page. We can
		 * set the hint, just not dirty the page as a result so the hint
		 * is lost when we evict the page or shutdown.
		 *
		 * See src/backend/storage/page/README for longer discussion.
		 */
		if (RecoveryInProgress() ||
			RelFileNodeSkippingWAL(bufHdr->tag.rnode))
			return;

> This idea seems good to me but I'm concerned a bit that the
> probability of concurrent processes writing FPI for the same page
> might get higher since concurrent processes could set hint bits at the
> same time. If it's true, I wonder if we can advertise hint bits are
> being updated to prevent concurrent FPI writes for the same page.

Hmm, a very good point. Sounds like we would need to obtain an exclusive lock on the page ... but that would be very problematic. I don't have a concrete proposal to solve this problem ATM, but it's looking more and more like a serious problem.

-- 
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 12 Aug 2020 at 02:42, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Oh, interesting, I didn't know that. However, it's not 100% true: the
> standby can set the bit in shared buffers, but it does not mark the page
> dirty. So when the page is evicted, those bits that were set are lost.
> That's not great. See MarkBufferDirtyHint:

Yeah, you're right.

> > This idea seems good to me but I'm concerned a bit that the
> > probability of concurrent processes writing FPI for the same page
> > might get higher since concurrent processes could set hint bits at the
> > same time. If it's true, I wonder if we can advertise hint bits are
> > being updated to prevent concurrent FPI writes for the same page.
>
> Hmm, a very good point. Sounds like we would need to obtain an
> exclusive lock on the page .. but that would be very problematic.

I think that when the page is going to be dirtied, only updating the hint bits on the page and writing the FPI need to be performed exclusively. So perhaps we can add a flag, say BM_UPDATE_HINTBITS, to the buffer descriptor, indicating that the hint bits are being updated.

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 10 Aug 2020 at 23:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> But the FPI
> we sent only had the bit on the first tuple ... so the standby will not
> have the bit set for any subsequent tuple. And on promotion, the
> standby will have to have the bits set for all those tuples, unless you
> happened to dirty the page again later for other reasons.

Which probably means that pg_rewind is broken, because it won't be able to rewind correctly.

> One simple idea to try to forestall this problem would be to modify the
> algorithm so that all tuples are scanned and hinted if the page is going
> to be dirtied -- then send a single FPI setting bits for all tuples,
> instead of just on the first tuple.

This would make latency much worse for non-seqscan cases. Certainly for seqscans it would make sense to emit a message that sets all tuples at once, or possibly to emit an FPI and then follow it with a second message that sets all the other hints on the page.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
Mission Critical Databases