Re: Add pg_walinspect function with block info columns - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Add pg_walinspect function with block info columns
Date
Msg-id CAAKRu_bJvbcYBRj2cN6G2xV7B7-Ja+pjTO1nEnEhRR8OXYiABA@mail.gmail.com
Whole thread Raw
In response to Re: Add pg_walinspect function with block info columns  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: Add pg_walinspect function with block info columns  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
I'm excited to see that pg_get_wal_block_info() was merged. Thanks for
working on this!

Apologies for jumping back in here a bit late. I've been playing around
with it and wanted to comment on the performance of JOINing to
pg_get_wal_records_info().

On Tue, Mar 7, 2023 at 7:08 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Tue, 7 Mar 2023 at 01:34, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote:
> > > On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy
> > >> IMO, pg_get_wal_records_extended_info as proposed doesn't look good to
> > >> me as it outputs most of the columns that are already given by
> > >> pg_get_wal_records_info.What I think the best way at this point is to
> > >> make it return the following:
> > >> lsn pg_lsn
> > >> block_id int8
> > >> spcOid oid
> > >> dbOid oid
> > >> relNumber oid
> > >> forkNames text
> > >> fpi bytea
> > >> fpi_info text
> >
> > I would add the length of the block data (without the hole and
> > compressed, as the FPI data should always be presented as
> > uncompressed), and the block data if any (without the block data
> > length as one can guess it based on the bytea data length).  Note that
> > a block can have both a FPI and some data assigned to it, as far as I
> > recall.
> >
> > > The basic idea is to create a single entrypoint to all relevant data
> > > from DecodedXLogRecord in SQL, not multiple.
> >
> > While I would agree with this principle on simplicity's ground in
> > terms of minimizing the SQL interface and the pg_wal/ lookups, I
> > disagree about it on unsability ground, because we can avoid extra SQL
> > tweaks with more functions.  One recent example I have in mind is
> > partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE
> > on the catalogs.
>
> Correct, but in that case the user would build the same query (or at
> least with the same complexity) as what we're executing under the
> hood, right?
>
> > There are of course various degrees of complexity,
> > and perhaps unnest() cannot qualify as one, but having two functions
> > returning normalized records (one for the record information, and a
> > second for the block information), is a rather good balance between
> > usability and interface complexity, in my experience.
>
> I would agree, if it weren't for the reasons written below.
>
> >  If you have two
> > functions, a JOIN is enough to cross-check the block data and the
> > record data,
>
> Joins are expensive on large datasets; and because WAL is one of the
> largest datasets in the system, why would we want to force the user to
> JOIN them if we can produce the data in one pre-baked data structure
> without a need to join?

I wanted to experiment to see how much slower it is to do the join
between pg_get_wal_block_info() and pg_get_wal_records_info() and
profile where the time was spent.

I saved the wal lsn before and after a bunch of inserts (generates
~1,000,000 records).

On master, I did a join like this:

  SELECT count(*) FROM
    pg_get_wal_block_info(:start_lsn, :end_lsn) b JOIN
    pg_get_wal_records_info(:start_lsn, :end_lsn) w ON w.start_lsn = b.lsn;

which took 1191 ms.

After patching master to add in the columns from
pg_get_wal_records_info() which are not returned by
pg_get_wal_block_info() (except block_ref column of course), this query:

  SELECT COUNT(*) FROM pg_get_wal_block_info(:start_lsn, :end_lsn);

took 467 ms.

Perhaps this difference isn't important, but I found it noticeable.

A large chunk of the time is spent joining the tuplestores.

The second largest chunk of time seems to be in GetWalRecordInfo()'s
calls to XLogRecGetBlockRefInfo(), which spends quite a bit of time in
string construction -- mainly for strings we wouldn't end up needing
after joining to the block info function.

Surprisingly, the string construction seemed to overshadow the
performance impact of doubling the decoding passes over the WAL records.

So maybe it is worth including more record-level info?

On an unrelated note, I had thought that we should have some kind of
parameter to pg_get_wal_block_info() to control whether or not we output
the FPI to save us from wasting time decompressing the FPIs.

AFAIK, we don't have access to projection information from inside the
function, so a parameter like "output_fpi" or the like would have to do.

I wanted to share what I found in trying this in case someone else had
had that thought.

TL;DR, it doesn't seem to matter from a performance perspective if we
skip decompressing the FPIs in pg_get_wal_block_info().

I hacked "output_fpi" into pg_get_wal_block_info(), enabled pglz wal
compression, and generated a boatload of FPIs by dirtying buffers, doing
a checkpoint and then updating those pages again right after the
checkpoint. With output_fpi = true (same as master), my call to
pg_get_wal_block_info() took around 7 seconds and with output_fpi =
false, it took around 6 seconds. Not an impressive difference.

I noticed that only around 2% of the time is spent in pglz_decompress().
Most of the time (about half) is spent in building tuples for the
tuplestore, copying memory around, and writing the tuplestore out to a
file. Another 10-20% is spent decoding the records--which has to be done
regardless.

I wonder if there are cases where the decompression overhead would matter.
My conclusion is that it isn't worth bothering with such a parameter.

- Melanie



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Combine pg_walinspect till_end_of_wal functions with others
Next
From: Andres Freund
Date:
Subject: Re: Track IO times in pg_stat_io