Re: Use streaming read API in ANALYZE - Mailing list pgsql-hackers

From Nazir Bilal Yavuz
Subject Re: Use streaming read API in ANALYZE
Date
Msg-id CAN55FZ2OHDHQYRxg22rRVinUBgxKDYiqQKa2tjW4KNqPVCZ+7g@mail.gmail.com
Whole thread Raw
In response to Re: Use streaming read API in ANALYZE  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Use streaming read API in ANALYZE
List pgsql-hackers
Hi,

On Wed, 3 Apr 2024 at 23:44, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
>
> I've reviewed the patches inline below and attached a patch that has
> some of my ideas on top of your patch.

Thank you!

>
> > From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
> > From: Nazir Bilal Yavuz <byavuz81@gmail.com>
> > Date: Wed, 3 Apr 2024 15:14:15 +0300
> > Subject: [PATCH v6] Use streaming read API in ANALYZE
> >
> > ANALYZE command gets random tuples using BlockSampler algorithm. Use
> > streaming reads to get these tuples by using BlockSampler algorithm in
> > streaming read API prefetch logic.
> > ---
> >  src/include/access/heapam.h              |  6 +-
> >  src/backend/access/heap/heapam_handler.c | 22 +++---
> >  src/backend/commands/analyze.c           | 85 ++++++++----------------
> >  3 files changed, 42 insertions(+), 71 deletions(-)
> >
> > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> > index a307fb5f245..633caee9d95 100644
> > --- a/src/include/access/heapam.h
> > +++ b/src/include/access/heapam.h
> > @@ -25,6 +25,7 @@
> >  #include "storage/bufpage.h"
> >  #include "storage/dsm.h"
> >  #include "storage/lockdefs.h"
> > +#include "storage/read_stream.h"
> >  #include "storage/shm_toc.h"
> >  #include "utils/relcache.h"
> >  #include "utils/snapshot.h"
> > @@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
> >                                                                 struct GlobalVisState *vistest);
> >
> >  /* in heap/heapam_handler.c*/
> > -extern void heapam_scan_analyze_next_block(TableScanDesc scan,
> > -                                                                                BlockNumber blockno,
> > -                                                                                BufferAccessStrategy bstrategy);
> > +extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
> > +                                                                                ReadStream *stream);
> >  extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
> >                                                                                  TransactionId OldestXmin,
> >                                                                                  double *liverows, double
*deadrows,
> > diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> > index 0952d4a98eb..d83fbbe6af3 100644
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
> >  }
> >
> >  /*
> > - * Prepare to analyze block `blockno` of `scan`.  The scan has been started
> > - * with SO_TYPE_ANALYZE option.
> > + * Prepare to analyze block returned from streaming object.  If the block returned
> > + * from streaming object is valid, true is returned; otherwise false is returned.
> > + * The scan has been started with SO_TYPE_ANALYZE option.
> >   *
> >   * This routine holds a buffer pin and lock on the heap page.  They are held
> >   * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
> >   * items of the heap page are analyzed.
> >   */
> > -void
> > -heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
> > -                                                        BufferAccessStrategy bstrategy)
> > +bool
> > +heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
> >  {
> >       HeapScanDesc hscan = (HeapScanDesc) scan;
> >
> > @@ -1076,11 +1076,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
> >        * doing much work per tuple, the extra lock traffic is probably better
> >        * avoided.
>
> Personally I think heapam_scan_analyze_next_block() should be inlined.
> It only has a few lines. I would find it clearer inline. At the least,
> there is no reason for it (or heapam_scan_analyze_next_tuple()) to take
> a TableScanDesc instead of a HeapScanDesc.

I agree.

>
> >        */
> > -     hscan->rs_cblock = blockno;
> > -     hscan->rs_cindex = FirstOffsetNumber;
> > -     hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
> > -                                                                             blockno, RBM_NORMAL, bstrategy);
> > +     hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
> > +     if (hscan->rs_cbuf == InvalidBuffer)
> > +             return false;
> > +
> >       LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
> > +
> > +     hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
> > +     hscan->rs_cindex = FirstOffsetNumber;
> > +     return true;
> >  }
>
> >  /*
> > diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
> > index 2fb39f3ede1..764520d5aa2 100644
> > --- a/src/backend/commands/analyze.c
> > +++ b/src/backend/commands/analyze.c
> > @@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
> >       return stats;
> >  }
> >
> > +/*
> > + * Prefetch callback function to get next block number while using
> > + * BlockSampling algorithm
> > + */
> > +static BlockNumber
> > +block_sampling_streaming_read_next(ReadStream *stream,
> > +                                                                void *user_data,
> > +                                                                void *per_buffer_data)
> > +{
> > +     BlockSamplerData *bs = user_data;
> > +
> > +     return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
>
> I don't see the point of BlockSampler_HasMore() anymore. I removed it in
> the attached and made BlockSampler_Next() return InvalidBlockNumber
> under the same conditions. Is there a reason not to do this? There
> aren't other callers. If the BlockSampler_Next() wasn't part of an API,
> we could just make it the streaming read callback, but that might be
> weird as it is now.

I agree. There is no reason to have BlockSampler_HasMore() after
streaming read API changes.

> That and my other ideas in attached. Let me know what you think.

I agree with your changes but I am not sure if others agree with all
the changes you have proposed. So, I didn't merge 0001 and your ideas
yet, instead I wrote a commit message, added some comments, changed ->
'if (bs->t >= bs->N || bs->m >= bs->n)' to 'if (K <= 0 || k <= 0)' and
attached it as 0002.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: WIP Incremental JSON Parser
Next
From: Amit Kapila
Date:
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation