Re: Online checksums patch - once again - Mailing list pgsql-hackers

From Daniel Gustafsson
Subject Re: Online checksums patch - once again
Date
Msg-id A6B7AB7C-138B-4A95-8CA6-864CC326D067@yesql.se
Whole thread Raw
In response to Re: Online checksums patch - once again  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
> On 28 Jul 2020, at 04:33, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Thu, Jun 25, 2020 at 11:43:00AM +0200, Daniel Gustafsson wrote:
>> The attached v19 fixes a few doc issues I had missed.
>
> +   They can also be enabled or disabled at a later timne, either as an offline
> => time

Fixed.

> +     * awaiting shutdown, but we can continue turning off checksums anyway
> => a waiting

This was intentional, as it refers to the impending requested shutdown and not
one which is blocked waiting.  I've reworded the comment to make this clearer.

> +     * We are starting a checksumming process scratch, and need to start by
> => FROM scratch

Fixed.

> +     * to inprogress new relations will set relhaschecksums in pg_class so it
> => inprogress COMMA

Fixed.

> +         * Relation no longer exist. We don't consider this an error since
> => exists

Fixed.

> +     * so when the cluster comes back up processing will habe to be resumed.
> => have

Fixed.

> +             "completed one pass over all databases for checksum enabling, %i databases processed",
> => I think this will be confusing to be hardcoded "one".  It'll say "one" over
> and over.

Good point, I've reworded this based on the number of processed databases to
indicate what will happen next.  This call should probably be removed in case
we merge and ship this feature, but it's handy for this part of the patch
process.

> +         * still exist.
> => exists

Fixed.

> In many places, you refer to "datachecksumsworker" (sums) but in nine places
> you refer to datachecksumworker (sum).

Good catch, they should all be using "sums". Fixed.

> +ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
> +{
> +    BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);
>
> => I think looping over numblocks is safe since new blocks are intended to be
> written with checksum, right?  Maybe it's good to say that here.

Fixed.

> +    BlockNumber b;
>
> blknum will be easier to grep for

Fixed.

> +            (errmsg("background worker \"datachecksumsworker\" starting for database oid %d",
> => Should be %u or similar (several of these)

Fixed.  As per Roberts review downthread, these will be reworded in a future
version.

> It looks like you rewrite every page, even if it already has correct checksum,
> to handle replicas.  I wonder if it's possible/reasonable/good to skip pages
> with correct checksum when wal_level=minimal ?

That would AFAICT be possible, but I'm not sure it's worth adding that before
the patch is deemed safe in its simpler form.  I've added a comment to record
this as a potential future optimization.

> It looks like it's not possible to change the checksum delay while a checksum
> worker is already running.  That may be important to allow: 1) decreased delay
> during slow periods; 2) increased delay if the process is significantly done,
> but needs to be throttled to avoid disrupting production environment.
>
> Have you collaborated with Julien about this one?  His patch adds new GUCs:
> https://www.postgresql.org/message-id/20200714090808.GA20780@nol
> checksum_cost_delay
> checksum_cost_page
> checksum_cost_limit

I honestly hadn't thought about that, but I very much agree that any controls
introduced should work the same for both of these patches.

> Maybe you'd say that Julien's pg_check_relation() should accept parameters
> instead of adding GUCs.  I think you should be in agreement on that.  It'd be
> silly if the verification function added three GUCs and allowed adjusting
> throttle midcourse, but the checksum writer process didn't use them.

Agreed.  I'm not a fan of using yet more GUCs for controlling this, but I don't
a good argument against it.  It's in line with the cost-based vacuum delays so
I guess it's the most appropriate interface.

> If you used something like that, I guess you'd also want to distinguish
> checksum_cost_page_read vs write.  Possibly, the GUCs part should be a
> preliminary shared patch 0001 that you both used.

+1.

Thanks for the review!  I will attach a v20 to Robers email with these changes
included as well.

cheers ./daniel




pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: clarify "rewritten" in pg_checksums docs
Next
From: Thomas Munro
Date:
Subject: Re: WIP: WAL prefetch (another approach)