Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: BitmapHeapScan streaming read user and prelim refactoring
Date
Msg-id 20240323002211.on5vb5ulk6lsdb2u@liskov
Whole thread Raw
In response to Re: BitmapHeapScan streaming read user and prelim refactoring  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: BitmapHeapScan streaming read user and prelim refactoring
List pgsql-hackers
On Tue, Mar 19, 2024 at 02:33:35PM +0200, Heikki Linnakangas wrote:
> On 18/03/2024 17:19, Melanie Plageman wrote:
> > I've attached v7 rebased over this commit.
> 
> If we delayed table_beginscan_bm() call further, after starting the TBM
> iterator, we could skip it altogether when the iterator is empty.
> 
> That's a further improvement, doesn't need to be part of this patch set.
> Just caught my eye while reading this.

Hmm. You mean like until after the first call to tbm_[shared]_iterate()?
AFAICT, tbm_begin_iterate() doesn't tell us anything about whether or
not the iterator is "empty". Do you mean cases when the bitmap has no
blocks in it? It seems like we should be able to tell that from the
TIDBitmap.

> 
> > v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch
> 
> I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call the
> flag e.g. SO_NEED_TUPLE.

Agreed. Done in attached v8. Though I wondered if it was a bit weird
that the flag is set in the common case and not set in the uncommon
case...

> As yet another preliminary patch before the streaming read API, it would be
> nice to move the prefetching code to heapam.c too.

I've done this, but I can say it is not very pretty. see 0013. I had to
add a bunch of stuff to TableScanDescData and HeapScanDescData which are
only used for bitmapheapscans. I don't know if it makes the BHS
streaming read user patch easier to review, but I don't think what I
have in 0013 is committable to Postgres. Maybe there was another way I
could have approached it. Let me know what you think.

In addition to bloating the table descriptors, note that it was
difficult to avoid one semantic change -- with 0013, we no longer
prefetch or adjust prefetch target when emitting each empty tuple --
though I think this is could actually be desirable.

> What's the point of having separate table_scan_bitmap_next_block() and
> table_scan_bitmap_next_tuple() functions anymore? The AM owns the TBM
> iterator now. The executor node updates the lossy/exact page counts, but
> that's the only per-page thing it does now.

Oh, interesting. Good point. I've done this in 0015. If you like the way
it turned out, I can probably rebase this back into an earlier point in
the set and end up dropping some of the other incremental changes (e.g.
0008).

> >         /*
> >          * If this is the first scan of the underlying table, create the table
> >          * scan descriptor and begin the scan.
> >          */
> >         if (!scan)
> >         {
> >             uint32        extra_flags = 0;
> > 
> >             /*
> >              * We can potentially skip fetching heap pages if we do not need
> >              * any columns of the table, either for checking non-indexable
> >              * quals or for returning data.  This test is a bit simplistic, as
> >              * it checks the stronger condition that there's no qual or return
> >              * tlist at all. But in most cases it's probably not worth working
> >              * harder than that.
> >              */
> >             if (node->ss.ps.plan->qual == NIL && node->ss.ps.plan->targetlist == NIL)
> >                 extra_flags |= SO_CAN_SKIP_FETCH;
> > 
> >             scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
> >                                                                     node->ss.ss_currentRelation,
> >                                                                     node->ss.ps.state->es_snapshot,
> >                                                                     0,
> >                                                                     NULL,
> >                                                                     extra_flags);
> >         }
> > 
> >         scan->tbmiterator = tbmiterator;
> >         scan->shared_tbmiterator = shared_tbmiterator;
> 
> How about passing the iterator as an argument to table_beginscan_bm()? You'd
> then need some other function to change the iterator on rescan, though. Not
> sure what exactly to do here, but feels that this part of the API is not
> fully thought-out. Needs comments at least, to explain who sets tbmiterator
> / shared_tbmiterator and when. For comparison, for a TID scan there's a
> separate scan_set_tidrange() table AM function. Maybe follow that example
> and introduce scan_set_tbm_iterator().

I've spent quite a bit of time playing around with the code trying to
make it less terrible than what I had before.

On rescan, we have to actually make the whole bitmap and iterator. And,
we don't have what we need to do that in table_rescan()/heap_rescan().
From what I can tell, scan_set_tidrange() is useful because it can be
called from both the beginscan and rescan functions without invoking it
directly from TidNext().

In our case, any wrapper function we wrote would basically just assign
the iterator to the scan in BitmapHeapNext().

I've reorganized this code structure a bit, so see if you like it more
now. I rebased the changes into some of the other patches, so you'll
just have to look at the result and see what you think.

> It's bit awkward to have separate tbmiterator and shared_tbmiterator fields.
> Could regular and shared iterators be merged, or wrapped under a common
> interface?

This is a good idea. I've done that in 0014. It made the code nicer, but
I just wonder if it will add too much overhead.

- Melanie



pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: New Table Access Methods for Multi and Single Inserts
Next
From: Thomas Munro
Date:
Subject: Re: pg_upgrade --copy-file-range