Home > mailing lists

Re: Asynchronous and "direct" IO support for PostgreSQL. - Mailing list pgsql-hackers

From	Melanie Plageman
Subject	Re: Asynchronous and "direct" IO support for PostgreSQL.
Date	July 30, 2021 22:35:30
Msg-id	CAAKRu_YU27x8qOhhH6NW_9BeiwC3bioyR0oWA8973HP6ZrwGtg@mail.gmail.com Whole thread Raw
In response to	Re: Asynchronous and "direct" IO support for PostgreSQL. (Andres Freund <andres@anarazel.de>)
Responses	Re: Asynchronous and "direct" IO support for PostgreSQL.
List	pgsql-hackers

Tree view

On Wed, Jul 28, 2021 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-07-28 13:37:48 -0400, Melanie Plageman wrote:
>
> > Each IO will have its own TBMIterateResult allocated and returned by the
> > PgStreamingRead helper and freed later by
> > heapam_scan_bitmap_next_block() before requesting the next block.
> > Previously it was allocated once and saved in the TBMIterator in the
> > BitmapHeapScanState node and reused. Because of this, the table AM API
> > routine, table_scan_bitmap_next_block() now defines the TBMIterateResult
> > as an output parameter.
> >
> > I haven't made the GIN code reasonable yet either (it uses the TID
> > bitmap functions that I've changed).
>
> I don't quite understand the need to change the tidbitmap interface, or
> maybe rather I'm not convinced that pessimistically preallocating space
> is a good idea?

TBMIterator cannot contain a TBMIterateResult because it prefetches
blocks and calls tbm_iterate() for each one, which would overwrite the
relevant information in the TBMIterateResult before it has been returned
to heapam_scan_bitmap_next_block().*

Thus, we need at least as many TBMIterateResults as the size of the
prefetch window at its largest.

We could save some memory if we separated the data in TBMIterateResult
and made a new struct, let's call it BitmapBlockState, with just the
block number, buffer number, and recheck to be used and returned by
bitmapheapscan_pgsr_next_single().

We need both block and buffer because we need to distinguish between
hit_end, skip_fetch, and invalid block number conditions in the caller.
We need recheck before initiating IO to determine if we should
skip_fetch.

Then a separate struct which is much the same as the existing
TBMIterateResult could be maintained in the BitmapHeapScanState node and
passed into heapam_scan_bitmap_next_block() along with the bitmap (a new
parameter).

In heapam_scan_bitmap_next_block(), after getting the BitmapBlockState
from pg_streaming_read_get_next(), we could call tbm_find_pageentry()
with the block number and bitmap.
For a non-lossy page, we could then scrape the offsets and ntuples using
the PageTableEntry. If it is lossy, we would set recheck and ntuples
accordingly. (I do wonder if that allows us to distinguish between a
lossy page and a block number that is erroneous and isn't in the
bitmap--but maybe that can't happen.)

However, we would still have as many palloc() calls (one for every block
to create the BitmapBlockState. We would have less outstanding memory by
limiting the number of offsets arrays created.
We would still need to pass the recheck flag, ntuples, and buffer back
up to BitmapHeapNext(), so, at that point we would still need a data
structure that is basically the same as the existing TBMIterateResult.

Alternatively, we could keep an array of TBMIterateResults the size of
the prefetch window and reuse them -- though I'm not sure where to keep
it and how to manage it when the window gets resized.

In my current patch, I allocate and free one TBMIterateResult for each
block. The amount of outstanding memory will be #ios in prefetch window
* sizeof(TBMIterateResult).

We don't want to always palloc() memory for the TBMIterateResult inside
of tbm_iterate(), since other users (like GIN) still only need one
TBMIterateResult

So, if the TBMIterateResult is not inside of the TBMIterator and
tbm_iterate() does not allocate the memory, we need to pass it in as an
output parameter, and, if we do that, it felt odd to also return it --
hence the function signature change.

One alternative I tried was having the TBMIterator have a pointer to the
TBMIterateResult and then users of it can allocate the TBMIterateResult
and set it in the TBMIterator before calling tbm_iterate(). But, then we
need to expose the TBMIterator outside of the TID bitmap API. Also, it
felt weird to have a member of the iterator which must not be NULL when
tbm_iterate() is called but which isn't set up in tbm_begin_iterate().

>
>
> > static bool
> > heapam_scan_bitmap_next_block(TableScanDesc scan,
> > - TBMIterateResult *tbmres)
> > + TBMIterateResult **tbmres)
>
> ISTM that we possibly shouldn't expose the TBMIterateResult outside of
> the AM after this change? It feels somewhat like an implementation
> detail now. It seems somewhat odd to expose a ** to set a pointer that
> nodeBitmapHeapscan.c then doesn't really deal with itself.

All the members of the TBMIterateResult are populated in
bitmapheapscan_pgsr_next_single() and then
most of it is used by heapam_scan_bitmap_next_block() to
- detect error conditions and done-ness
- fill in the HeapScanDesc with the information needed by
heapam_scan_bitmap_next_tuple() (rs_cbuf [which is basically
redundant with TBMIterateResult->buffer] and rs_vistuples)

However, some of the information is used up in BitmapHeapNext() and in
heapam_scan_bitmap_next_tuple() and doesn't go in the HeapScanDesc:
- BitmapHeapNext() uses the state of the TBMIterateResult to determine
if the bitmap is exhausted, since the return value of
table_scan_bitmap_next_block() indicates an error condition and not
done-ness
- BitmapHeapNext() uses recheck to determine whether or not to
recheck qual conditions
- heapam_scan_bitmap_next_tuple() uses the validity of the buffer to
determine if it should return empty tuples
- heapam_scan_bitmap_next_tuple() uses ntuples to determine how many
empty tuples to return

So, if we don't want to pass around a TBMIterateResult, we would have to
1) change the return value of heapam_scan_bitmap_next_block() and 2)
find another appropriate place for the information above (or another way
to represent the encoded information).

It is also worth noting that heapam_scan_bitmap_next_tuple() took a
TBMIterateResult before without using it, so I assume your foresaw other
table AMs using it?

Overall, the whole thing still feels a bit yucky to me. It doesn't quite
feel like the right things are in the right places, but, I haven't put
my finger on the culprit.

I do think putting the buffer in the TBMIterateResult is an
inappropriate addition to the TID Bitmap API.

Also, I would like to move this code:

if (node->tbmres->ntuples >= 0)
node->exact_pages++;
else
node->lossy_pages++;

from where it is in BitmapHeapNext(). It seems odd that that is the only
part of BitmapHeapNext() that reaches inside of the TBMIterateResult.
Also, as it is, it is incorrect--it doesn't count the first page. I
could duplicate it under the first call to
table_scan_bitmap_next_block(), but I wasn't looking forward to doing
so.

>
> > @@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
> > * Create the TBMIterator struct, with enough trailing space to serve the
> > * needs of the TBMIterateResult sub-struct.
> > */
> > - iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
> > - MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
> > + iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
> > iterator->tbm = tbm;
>
> Hm?

I removed the TBMIterateResult from the TBMIterator, so, we should no
longer allocate memory for the offsets array when creating the
TBMIterator.

* I think that having TBMIterateResult inside of TBMIterator is not
well-defined C language behavior. In [1], it says

"Structures with flexible array members (or unions who have a
recursive-possibly structure member with flexible array member) cannot
appear as array elements or as members of other structures."

[1] https://en.cppreference.com/w/c/language/struct

pgsql-hackers by date:

From: Andres Freund
Date: 30 July 2021, 22:34:12
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Robert Haas
Date: 30 July 2021, 22:47:53
Subject: Re: Have I found an interval arithmetic bug?