Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers

From: Melanie Plageman
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Msg-id: CAAKRu_bQ9a2dB42Tz-mrsR+2MuG-ojRmqbX-x-TC08Uo_RunNw@mail.gmail.com
In response to: Re: BitmapHeapScan streaming read user and prelim refactoring (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Thank you so much for this thorough review!!!!

On Wed, Feb 14, 2024 at 2:42 PM Andres Freund <andres@anarazel.de> wrote:
>
>
> On 2024-02-13 18:11:25 -0500, Melanie Plageman wrote:
>
> > From d6dd6eb21dcfbc41208f87d1d81ffe3960130889 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Mon, 12 Feb 2024 18:50:29 -0500
> > Subject: [PATCH v1 03/11] BitmapHeapScan begin scan after bitmap setup
> >
> > There is no reason for table_beginscan_bm() to begin the actual scan of
> > the underlying table in ExecInitBitmapHeapScan(). We can begin the
> > underlying table scan after the index scan has been completed and the
> > bitmap built.
> >
> > The one use of the scan descriptor during initialization was
> > ExecBitmapHeapInitializeWorker(), which set the scan descriptor snapshot
> > with one from an array in the parallel state. This overwrote the
> > snapshot set in table_beginscan_bm().
> >
> > By saving that worker snapshot as a member in the BitmapHeapScanState
> > during initialization, it can be restored in table_beginscan_bm() after
> > returning from the table AM specific begin scan function.
>
> I don't understand what the point of passing two different snapshots to
> table_beginscan_bm() is. What does that even mean? Why can't we just use the
> correct snapshot initially?

Indeed. Honestly, it was an unlabeled TODO for me. I wasn't quite sure
how to get the same behavior as in master. Fixed in attached v2.

Now the parallel worker still restores and registers that snapshot in
ExecBitmapHeapInitializeWorker() and then saves it in the
BitmapHeapScanState. We then pass SO_TEMP_SNAPSHOT as an extra flag
(to set rs_flags) to table_beginscan_bm() if there is a parallel
worker snapshot saved in the BitmapHeapScanState.
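
Roughly, the relevant bit in BitmapHeapNext() now looks like this (a
simplified sketch -- the member name I use for the saved worker snapshot is
just for illustration, not necessarily what v2 calls it):

uint32		extra_flags = 0;

/*
 * A parallel worker restored and registered its own snapshot in
 * ExecBitmapHeapInitializeWorker() and stashed it in the
 * BitmapHeapScanState, so it must be unregistered at scan end.
 */
if (node->worker_snapshot != NULL)
	extra_flags |= SO_TEMP_SNAPSHOT;

/* extra_flags is then ORed into rs_flags by table_beginscan_bm() */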

> > From a3f62e4299663d418531ae61bb16ea39f0836fac Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Mon, 12 Feb 2024 19:03:24 -0500
> > Subject: [PATCH v1 04/11] BitmapPrefetch use prefetch block recheck for skip
> >  fetch
> >
> > Previously BitmapPrefetch() used the recheck flag for the current block
> > to determine whether or not it could skip prefetching the proposed
> > prefetch block. It makes more sense for it to use the recheck flag from
> > the TBMIterateResult for the prefetch block instead.
>
> I'd mention the commit that introduced the current logic and link to the
> the thread that you started about this.

Done

> > From d56be7741765d93002649ef912ef4b8256a5b9af Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Mon, 12 Feb 2024 19:04:48 -0500
> > Subject: [PATCH v1 05/11] Update BitmapAdjustPrefetchIterator parameter type
> >  to BlockNumber
> >
> > BitmapAdjustPrefetchIterator() only used the blockno member of the
> > passed in TBMIterateResult to ensure that the prefetch iterator and
> > regular iterator stay in sync. Pass it the BlockNumber only. This will
> > allow us to move away from using the TBMIterateResult outside of table
> > AM specific code.
>
> Hm - I'm not convinced this is a good direction - doesn't that arguably
> *increase* TAM awareness? Perhaps it doesn't make much sense to use bitmap
> heap scans in a TAM without blocks, but still.

This is removed in later commits and is an intermediate state to try
and move the TBMIterateResult out of BitmapHeapNext(). I can find
another way to achieve this if it is important.

> > From 202b16d3a381210e8dbee69e68a8310be8ee11d2 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Mon, 12 Feb 2024 20:15:05 -0500
> > Subject: [PATCH v1 06/11] Push BitmapHeapScan skip fetch optimization into
> >  table AM
> >
> > This resolves the long-standing FIXME in BitmapHeapNext() which said that
> > the optmization to skip fetching blocks of the underlying table when
> > none of the column data was needed should be pushed into the table AM
> > specific code.
>
> Long-standing? Sure, it's old enough to walk, but we have FIXMEs that are old
> enough to drink, at least in some countries. :)

;) I've updated the commit message. Though it is longstanding in that
it predates Melanie + Postgres.

> > The table AM agnostic functions for prefetching still need to know if
> > skipping fetching is permitted for this scan. However, this dependency
> > will be removed when that prefetching code is removed in favor of the
> > upcoming streaming read API.
>
> > ---
> >  src/backend/access/heap/heapam.c          |  10 +++
> >  src/backend/access/heap/heapam_handler.c  |  29 +++++++
> >  src/backend/executor/nodeBitmapHeapscan.c | 100 ++++++----------------
> >  src/include/access/heapam.h               |   2 +
> >  src/include/access/tableam.h              |  17 ++--
> >  src/include/nodes/execnodes.h             |   6 --
> >  6 files changed, 74 insertions(+), 90 deletions(-)
> >
> > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> > index 707460a5364..7aae1ecf0a9 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -955,6 +955,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
> >       scan->rs_base.rs_flags = flags;
> >       scan->rs_base.rs_parallel = parallel_scan;
> >       scan->rs_strategy = NULL;       /* set in initscan */
> > +     scan->vmbuffer = InvalidBuffer;
> > +     scan->empty_tuples = 0;
>
> These don't follow the existing naming pattern for HeapScanDescData. While I
> explicitly dislike the practice of adding prefixes to struct members, I don't
> think mixing conventions within a single struct improves things.

I've updated the names. What does rs even stand for?

> I also think it'd be good to note in comments that the vm buffer currently is
> only used for bitmap heap scans, otherwise one might think they'd also be used
> for normal scans, where we don't need them, because of the page level flag.

Done.

> Also, perhaps worth renaming "empty_tuples" to something indicating that it's
> the number of empty tuples to be returned later? num_empty_tuples_pending or
> such? Or the current "return_empty_tuples".

Done.

> > @@ -1043,6 +1045,10 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
> >       if (BufferIsValid(scan->rs_cbuf))
> >               ReleaseBuffer(scan->rs_cbuf);
> >
> > +     if (BufferIsValid(scan->vmbuffer))
> > +             ReleaseBuffer(scan->vmbuffer);
> > +     scan->vmbuffer = InvalidBuffer;
>
> It does not matter one iota here, but personally I prefer moving the write
> inside the if, as dirtying the cacheline after we just figured out whe

I've now followed this convention throughout my patchset in the places
where I noticed it.
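
For example, with the renamed member it is now shaped like this (a sketch
of the convention, not the exact hunk):

if (BufferIsValid(scan->rs_vmbuffer))
{
	ReleaseBuffer(scan->rs_vmbuffer);
	scan->rs_vmbuffer = InvalidBuffer;
}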

> > diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> > index 9372b49bfaa..c0fb06c9688 100644
> > --- a/src/backend/executor/nodeBitmapHeapscan.c
> > +++ b/src/backend/executor/nodeBitmapHeapscan.c
> > @@ -108,6 +108,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
> >        */
> >       if (!node->initialized)
> >       {
> > +             bool can_skip_fetch;
> >               /*
> >                * We can potentially skip fetching heap pages if we do not need any
> >                * columns of the table, either for checking non-indexable quals or
>
> Pretty sure pgindent will move this around.

This is gone now, but I have pgindented all the commits so it
shouldn't be a problem again.

> > +++ b/src/include/access/tableam.h
> > @@ -62,6 +62,7 @@ typedef enum ScanOptions
> >
> >       /* unregister snapshot at scan end? */
> >       SO_TEMP_SNAPSHOT = 1 << 9,
> > +     SO_CAN_SKIP_FETCH = 1 << 10,
> >  }                    ScanOptions;
>
> Would be nice to add a comment explaining what this flag means.

Done.
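
It now reads roughly like this (wording approximate):

	/* unregister snapshot at scan end? */
	SO_TEMP_SNAPSHOT = 1 << 9,

	/*
	 * At the discretion of the table AM, bitmap table scans may skip
	 * fetching a block from the table when none of the table data is
	 * needed for the query.
	 */
	SO_CAN_SKIP_FETCH = 1 << 10,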

> > From 500c84019b982a1e6c8b8dd40240c8510d83c287 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Tue, 13 Feb 2024 10:05:04 -0500
> > Subject: [PATCH v1 07/11] BitmapHeapScan scan desc counts lossy and exact
> >  pages
> >
> > Future commits will remove the TBMIterateResult from BitmapHeapNext(),
> > pushing it into the table AM-specific code. So we will have to keep
> > track of the number of lossy and exact pages in the scan descriptor.
> > Doing this change to lossy/exact page counting in a separate commit just
> > simplifies the diff.
>
> > ---
> >  src/backend/access/heap/heapam.c          |  2 ++
> >  src/backend/access/heap/heapam_handler.c  |  9 +++++++++
> >  src/backend/executor/nodeBitmapHeapscan.c | 18 +++++++++++++-----
> >  src/include/access/relscan.h              |  4 ++++
> >  4 files changed, 28 insertions(+), 5 deletions(-)
> >
> > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> > index 7aae1ecf0a9..88b4aad5820 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -957,6 +957,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
> >       scan->rs_strategy = NULL;       /* set in initscan */
> >       scan->vmbuffer = InvalidBuffer;
> >       scan->empty_tuples = 0;
> > +     scan->rs_base.lossy_pages = 0;
> > +     scan->rs_base.exact_pages = 0;
> >
> >       /*
> >        * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
> > diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> > index baba09c87c0..6e85ef7a946 100644
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -2242,6 +2242,15 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
> >       Assert(ntup <= MaxHeapTuplesPerPage);
> >       hscan->rs_ntuples = ntup;
> >
> > +     /* Only count exact and lossy pages with visible tuples */
> > +     if (ntup > 0)
> > +     {
> > +             if (tbmres->ntuples >= 0)
> > +                     scan->exact_pages++;
> > +             else
> > +                     scan->lossy_pages++;
> > +     }
> > +
> >       return ntup > 0;
> >  }
> >
> > diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> > index c0fb06c9688..19d115de06f 100644
> > --- a/src/backend/executor/nodeBitmapHeapscan.c
> > +++ b/src/backend/executor/nodeBitmapHeapscan.c
> > @@ -53,6 +53,8 @@
> >  #include "utils/spccache.h"
> >
> >  static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
> > +static inline void BitmapAccumCounters(BitmapHeapScanState *node,
> > +                                                                        TableScanDesc scan);
> >  static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
> >  static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
> >                                                                                               BlockNumber blockno);
> > @@ -234,11 +236,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
> >                               continue;
> >                       }
> >
> > -                     if (tbmres->ntuples >= 0)
> > -                             node->exact_pages++;
> > -                     else
> > -                             node->lossy_pages++;
> > -
> >                       /* Adjust the prefetch target */
> >                       BitmapAdjustPrefetchTarget(node);
> >               }
> > @@ -315,9 +312,20 @@ BitmapHeapNext(BitmapHeapScanState *node)
> >       /*
> >        * if we get here it means we are at the end of the scan..
> >        */
> > +     BitmapAccumCounters(node, scan);
> >       return ExecClearTuple(slot);
> >  }
> >
> > +static inline void
> > +BitmapAccumCounters(BitmapHeapScanState *node,
> > +                                     TableScanDesc scan)
> > +{
> > +     node->exact_pages += scan->exact_pages;
> > +     scan->exact_pages = 0;
> > +     node->lossy_pages += scan->lossy_pages;
> > +     scan->lossy_pages = 0;
> > +}
> > +
>
> I don't think this is quite right - you're calling BitmapAccumCounters() only
> when the scan doesn't return anything anymore, but there's no guarantee
> that'll ever be reached. E.g. a bitmap heap scan below a limit node.  I think
> this needs to be in a) ExecEndBitmapHeapScan() b) ExecReScanBitmapHeapScan()

The scan descriptor isn't available in ExecEnd/ReScanBitmapHeapScan(),
so if we count in the scan descriptor we can't accumulate into the
BitmapHeapScanState there. The reason to count in the scan descriptor
in the first place is that the table AM is where we know whether a page
is lossy or exact -- and the table AM only has access to the scan
descriptor, not the BitmapHeapScanState.

I added a call to BitmapAccumCounters before the tuple is returned for
correctness in this version (not ideal, I realize). See below for
thoughts about what we could do instead.
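
Concretely, the extra call is just this, right before a tuple is returned
from BitmapHeapNext() (simplified):

/* flush the scan descriptor's counters into the node before returning */
BitmapAccumCounters(node, scan);
return slot;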

> >  /*
> >   *   BitmapDoneInitializingSharedState - Shared state is initialized
> >   *
> > diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
> > index 521043304ab..b74e08dd745 100644
> > --- a/src/include/access/relscan.h
> > +++ b/src/include/access/relscan.h
> > @@ -40,6 +40,10 @@ typedef struct TableScanDescData
> >       ItemPointerData rs_mintid;
> >       ItemPointerData rs_maxtid;
> >
> > +     /* Only used for Bitmap table scans */
> > +     long            exact_pages;
> > +     long            lossy_pages;
> > +
> >       /*
> >        * Information about type and behaviour of the scan, a bitmask of members
> >        * of the ScanOptions enum (see tableam.h).
>
> I wonder if this really is the best place for the data to be accumulated. This
> requires the accounting to be implemented in each AM, which doesn't obviously
> seem required.  Why can't the accounting continue to live in
> nodeBitmapHeapscan.c, to be done after each table_scan_bitmap_next_block()
> call?

Yes, I would really prefer not to do it in the table AM. But we only
count exact and lossy pages on which at least one tuple was visible
(change this and you'll see tests fail). So, we need to increment the
counters somewhere that has access to that information. In the case of
heap, that is really only once I have the value of ntup in
heapam_scan_bitmap_next_block(). To get that information back out to
BitmapHeapNext(), I considered adding another parameter to
heapam_scan_bitmap_next_block() -- maybe an enum like this:

/*
 * BitmapHeapScans's bitmaps can choose to store per page information in a
 * lossy or exact way. Exact pages in the bitmap have the individual tuple
 * offsets that need to be visited while lossy pages in the bitmap have only the
 * block number of the page.
 */
typedef enum BitmapBlockResolution
{
    BITMAP_BLOCK_NO_VISIBLE,
    BITMAP_BLOCK_LOSSY,
    BITMAP_BLOCK_EXACT,
} BitmapBlockResolution;

which we then use to increment the counter. But while I was writing
this code, I found myself narrating in the comment that the reason this
has to be set inside the table AM is that only the table AM knows
whether it wants to count the block as lossy, as exact, or not at all.
So that made me question whether the counting really should be in the
BitmapHeapScanState.
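
For reference, the caller side of that alternative would have looked
roughly like this, with resolution an output parameter of
heapam_scan_bitmap_next_block() (this is not what v2 does):

switch (resolution)
{
	case BITMAP_BLOCK_EXACT:
		node->exact_pages++;
		break;
	case BITMAP_BLOCK_LOSSY:
		node->lossy_pages++;
		break;
	case BITMAP_BLOCK_NO_VISIBLE:
		/* no visible tuples on this block; don't count it */
		break;
}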

I also explored passing the table scan descriptor to
show_tidbitmap_info() -- but that had its own problems.

> > From 555743e4bc885609d20768f7f2990c6ba69b13a9 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Tue, 13 Feb 2024 10:57:07 -0500
> > Subject: [PATCH v1 09/11] Make table_scan_bitmap_next_block() async friendly
> >
> > table_scan_bitmap_next_block() previously returned false if we did not
> > wish to call table_scan_bitmap_next_tuple() on the tuples on the page.
> > This could happen when there were no visible tuples on the page or, due
> > to concurrent activity on the table, the block returned by the iterator
> > is past the known end of the table.
>
> This sounds a bit like the block is actually past the end of the table,
> but in reality this happens if the block is past the end of the table as it
> was when the scan was started. Somehow that feels significant, but I don't
> really know why I think that.

I have tried to update the commit message to make it clearer. I was
actually wondering: now that we do table_beginscan_bm() in
BitmapHeapNext() instead of ExecInitBitmapHeapScan(), have we reduced
or eliminated the opportunity for this to be true? initscan() sets
rs_nblocks and that now happens much later.

> > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> > index 88b4aad5820..d8569373987 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -959,6 +959,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
> >       scan->empty_tuples = 0;
> >       scan->rs_base.lossy_pages = 0;
> >       scan->rs_base.exact_pages = 0;
> > +     scan->rs_base.shared_tbmiterator = NULL;
> > +     scan->rs_base.tbmiterator = NULL;
> >
> >       /*
> >        * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
> > @@ -1051,6 +1053,18 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
> >               ReleaseBuffer(scan->vmbuffer);
> >       scan->vmbuffer = InvalidBuffer;
> >
> > +     if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
> > +     {
> > +             if (scan->rs_base.shared_tbmiterator)
> > +                     tbm_end_shared_iterate(scan->rs_base.shared_tbmiterator);
> > +
> > +             if (scan->rs_base.tbmiterator)
> > +                     tbm_end_iterate(scan->rs_base.tbmiterator);
> > +     }
> > +
> > +     scan->rs_base.shared_tbmiterator = NULL;
> > +     scan->rs_base.tbmiterator = NULL;
> > +
> >       /*
> >        * reinitialize scan descriptor
> >        */
>
> If every AM would need to implement this, perhaps this shouldn't be done here,
> but in generic code?

Fixed.

> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -2114,17 +2114,49 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
> >
> >  static bool
> >  heapam_scan_bitmap_next_block(TableScanDesc scan,
> > -                                                       TBMIterateResult *tbmres)
> > +                                                       bool *recheck, BlockNumber *blockno)
> >  {
> >       HeapScanDesc hscan = (HeapScanDesc) scan;
> > -     BlockNumber block = tbmres->blockno;
> > +     BlockNumber block;
> >       Buffer          buffer;
> >       Snapshot        snapshot;
> >       int                     ntup;
> > +     TBMIterateResult *tbmres;
> >
> >       hscan->rs_cindex = 0;
> >       hscan->rs_ntuples = 0;
> >
> > +     *blockno = InvalidBlockNumber;
> > +     *recheck = true;
> > +
> > +     do
> > +     {
> > +             if (scan->shared_tbmiterator)
> > +                     tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
> > +             else
> > +                     tbmres = tbm_iterate(scan->tbmiterator);
> > +
> > +             if (tbmres == NULL)
> > +             {
> > +                     /* no more entries in the bitmap */
> > +                     Assert(hscan->empty_tuples == 0);
> > +                     return false;
> > +             }
> > +
> > +             /*
> > +              * Ignore any claimed entries past what we think is the end of the
> > +              * relation. It may have been extended after the start of our scan (we
> > +              * only hold an AccessShareLock, and it could be inserts from this
> > +              * backend).  We don't take this optimization in SERIALIZABLE
> > +              * isolation though, as we need to examine all invisible tuples
> > +              * reachable by the index.
> > +              */
> > +     } while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);
>
> Hm. Isn't it a problem that we have no CHECK_FOR_INTERRUPTS() in this loop?

Yes. fixed.
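
The loop now starts with an interrupt check, roughly:

do
{
	CHECK_FOR_INTERRUPTS();

	if (scan->shared_tbmiterator)
		tbmres = tbm_shared_iterate(scan->shared_tbmiterator);
	else
		tbmres = tbm_iterate(scan->tbmiterator);

	if (tbmres == NULL)
		return false;
} while (!IsolationIsSerializable() && tbmres->blockno >= hscan->rs_nblocks);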

> > @@ -2251,7 +2274,14 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
> >                       scan->lossy_pages++;
> >       }
> >
> > -     return ntup > 0;
> > +     /*
> > +      * Return true to indicate that a valid block was found and the bitmap is
> > +      * not exhausted. If there are no visible tuples on this page,
> > +      * hscan->rs_ntuples will be 0 and heapam_scan_bitmap_next_tuple() will
> > +      * return false returning control to this function to advance to the next
> > +      * block in the bitmap.
> > +      */
> > +     return true;
> >  }
>
> Why can't we fetch the next block immediately?

We don't know that we want another block until we've gone through this
page and seen there were no visible tuples, so we'd somehow have to
jump back up to the top of the function to get the next block -- which
is basically what is happening in my revised control flow. We call
heapam_scan_bitmap_next_tuple() and rs_ntuples is 0, so we end up
calling heapam_scan_bitmap_next_block() right away.
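
So the resulting shape in the executor is roughly this (argument lists
simplified):

for (;;)
{
	/* advances the iterator internally; false means the bitmap is exhausted */
	if (!table_scan_bitmap_next_block(scan, &recheck, &blockno))
		break;

	/* returns false right away when the block had no visible tuples */
	while (table_scan_bitmap_next_tuple(scan, slot))
	{
		/* recheck quals if required and return the tuple */
	}
}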

> > @@ -201,46 +197,23 @@ BitmapHeapNext(BitmapHeapScanState *node)
> >
> >                  can_skip_fetch);
> >               }
> >
> > -             node->tbmiterator = tbmiterator;
> > -             node->shared_tbmiterator = shared_tbmiterator;
> > +             scan->tbmiterator = tbmiterator;
> > +             scan->shared_tbmiterator = shared_tbmiterator;
>
> It seems a bit odd that this code modifies the scan descriptor, instead of
> passing the iterator, or perhaps better the bitmap itself, to
> table_beginscan_bm()?

On rescan we actually will have initialized = false and make new
iterators but have the old scan descriptor. So, we need to be able to
set the iterator in the scan to the new iterator.
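
That is, roughly (table_beginscan_bm() call elided):

if (scan == NULL)
{
	/* truly a new scan: create the descriptor with table_beginscan_bm() */
}

/* whether new scan or rescan, install the freshly built iterator(s) */
scan->tbmiterator = tbmiterator;
scan->shared_tbmiterator = shared_tbmiterator;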

> > diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
> > index b74e08dd745..bf7ee044268 100644
> > --- a/src/include/access/relscan.h
> > +++ b/src/include/access/relscan.h
> > @@ -16,6 +16,7 @@
> >
> >  #include "access/htup_details.h"
> >  #include "access/itup.h"
> > +#include "nodes/tidbitmap.h"
>
> I'd like to avoid exposing this to everything including relscan.h. I think we
> could just forward declare the structs and use them here to avoid that?

Done
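
Specifically, relscan.h now just forward declares the iterator structs,
roughly:

/* instead of #include "nodes/tidbitmap.h" */
struct TBMIterator;
struct TBMSharedIterator;

/* and in TableScanDescData */
struct TBMIterator *tbmiterator;
struct TBMSharedIterator *shared_tbmiterator;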

> > From aac60985d6bc70bfedf77a77ee3c512da87bfcb1 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Tue, 13 Feb 2024 14:27:57 -0500
> > Subject: [PATCH v1 11/11] BitmapHeapScan uses streaming read API
> >
> > Remove all of the code to do prefetching from BitmapHeapScan code and
> > rely on the streaming read API prefetching. Heap table AM implements a
> > streaming read callback which uses the iterator to get the next valid
> > block that needs to be fetched for the streaming read API.
> > ---
> >  src/backend/access/gin/ginget.c           |  15 +-
> >  src/backend/access/gin/ginscan.c          |   7 +
> >  src/backend/access/heap/heapam.c          |  71 +++++
> >  src/backend/access/heap/heapam_handler.c  |  78 +++--
> >  src/backend/executor/nodeBitmapHeapscan.c | 328 +---------------------
> >  src/backend/nodes/tidbitmap.c             |  80 +++---
> >  src/include/access/heapam.h               |   2 +
> >  src/include/access/tableam.h              |  14 +-
> >  src/include/nodes/execnodes.h             |  19 --
> >  src/include/nodes/tidbitmap.h             |   8 +-
> >  10 files changed, 178 insertions(+), 444 deletions(-)
> >
> > diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
> > index 0b4f2ebadb6..3ce28078a6f 100644
> > --- a/src/backend/access/gin/ginget.c
> > +++ b/src/backend/access/gin/ginget.c
> > @@ -373,7 +373,10 @@ restartScanEntry:
> >                       if (entry->matchBitmap)
> >                       {
> >                               if (entry->matchIterator)
> > +                             {
> >                                       tbm_end_iterate(entry->matchIterator);
> > +                                     pfree(entry->matchResult);
> > +                             }
> >                               entry->matchIterator = NULL;
> >                               tbm_free(entry->matchBitmap);
> >                               entry->matchBitmap = NULL;
> > @@ -386,6 +389,7 @@ restartScanEntry:
> >               if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
> >               {
> >                       entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
> > +                     entry->matchResult = palloc0(TBM_ITERATE_RESULT_SIZE);
>
> Do we actually have to use palloc0? TBM_ITERATE_RESULT_SIZE ain't small, so
> zeroing all of it isn't free.

Tests actually did fail when I didn't use palloc0.

This code is different now though. There are a few new patches in v2
that 1) make the offsets array in the TBMIterateResult fixed size and
then this makes it possible to 2) make matchResult an inline member of
the GinScanEntry. I have a TODO in the code asking if setting blockno
in the TBMIterateResult to InvalidBlockNumber is sufficient
"resetting".

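The "resetting" I mean is just something like this (whether it is
sufficient is the open TODO):

/*
 * matchResult is now an inline member of the GinScanEntry rather than a
 * separately palloc'd TBMIterateResult, so instead of freeing it we mark
 * it invalid.
 */
entry->matchResult.blockno = InvalidBlockNumber;
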
> > +static BlockNumber bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
> > +                                                     void *per_buffer_data);
>
> Is it correct to have _single in the name here? Aren't we also using for
> parallel scans?

Right. I had a separate parallel version and then deleted it. This is now fixed.

> > +static BlockNumber
> > +bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
> > +                                                     void *per_buffer_data)
> > +{
> > +     TBMIterateResult *tbmres = per_buffer_data;
> > +     HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
> > +
> > +     for (;;)
> > +     {
> > +             if (hdesc->rs_base.shared_tbmiterator)
> > +                     tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
> > +             else
> > +                     tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
> > +
> > +             /* no more entries in the bitmap */
> > +             if (!BlockNumberIsValid(tbmres->blockno))
> > +                     return InvalidBlockNumber;
> > +
> > +             /*
> > +              * Ignore any claimed entries past what we think is the end of the
> > +              * relation. It may have been extended after the start of our scan (we
> > +              * only hold an AccessShareLock, and it could be inserts from this
> > +              * backend).  We don't take this optimization in SERIALIZABLE
> > +              * isolation though, as we need to examine all invisible tuples
> > +              * reachable by the index.
> > +              */
> > +             if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
> > +                     continue;
> > +
> > +
> > +             if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
> > +                     !tbmres->recheck &&
> > +                     VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
> > +             {
> > +                     hdesc->empty_tuples += tbmres->ntuples;
> > +                     continue;
> > +             }
> > +
> > +             return tbmres->blockno;
> > +     }
> > +
> > +     /* not reachable */
> > +     Assert(false);
> > +}
>
> Need to check for interrupts somewhere here.

Done.

> > @@ -124,15 +119,6 @@ BitmapHeapNext(BitmapHeapScanState *node)
>
> There's still a comment in BitmapHeapNext talking about prefetching with two
> iterators etc. That seems outdated now.

Fixed.

> >  /*
> >   * tbm_iterate - scan through next page of a TIDBitmap
> >   *
> > - * Returns a TBMIterateResult representing one page, or NULL if there are
> > - * no more pages to scan.  Pages are guaranteed to be delivered in numerical
> > - * order.  If result->ntuples < 0, then the bitmap is "lossy" and failed to
> > - * remember the exact tuples to look at on this page --- the caller must
> > - * examine all tuples on the page and check if they meet the intended
> > - * condition.  If result->recheck is true, only the indicated tuples need
> > - * be examined, but the condition must be rechecked anyway.  (For ease of
> > - * testing, recheck is always set true when ntuples < 0.)
> > + * Caller must pass in a TBMIterateResult to be filled.
> > + *
> > + * Pages are guaranteed to be delivered in numerical order.  tbmres->blockno is
> > + * set to InvalidBlockNumber when there are no more pages to scan. If
> > + * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
> > + * exact tuples to look at on this page --- the caller must examine all tuples
> > + * on the page and check if they meet the intended condition.  If
> > + * tbmres->recheck is true, only the indicated tuples need be examined, but the
> > + * condition must be rechecked anyway.  (For ease of testing, recheck is always
> > + * set true when ntuples < 0.)
> >   */
> > -TBMIterateResult *
> > -tbm_iterate(TBMIterator *iterator)
> > +void
> > +tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
>
> Hm - it seems a tad odd that we later have to find out if the scan is done
> iterating by checking if blockno is valid, when tbm_iterate already knew. But
> I guess the code would be a bit uglier if we needed the result of
> tbm_[shared_]iterate(), due to the two functions.

Yes.

> Right now ExecEndBitmapHeapScan() frees the tbm before it does table_endscan()
> - which seems problematic, as heap_endscan() will do stuff like
> tbm_end_iterate(), which imo shouldn't be called after the tbm has been freed,
> even if that works today.

I've flipped the order -- I end the scan then free the bitmap.
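
So ExecEndBitmapHeapScan() now does, in this order (simplified):

if (scanDesc != NULL)
	table_endscan(scanDesc);	/* ends bitmap iteration via the table AM */

if (node->tbm != NULL)
	tbm_free(node->tbm);		/* only then free the bitmap itself */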

> It seems a bit confusing that your changes seem to treat
> BitmapHeapScanState->initialized as separate from ->scan, even though afaict
> scan should be NULL iff initialized is false and vice versa.

I thought so too, but on rescan node->initialized is set to false while
the scan descriptor is reused. So, we only want to make a new scan
descriptor if it is truly the beginning of a new scan.

- Melanie

