Re: Document atthasmissing default optimization avoids verification table scan - Mailing list pgsql-hackers

From James Coleman
Subject Re: Document atthasmissing default optimization avoids verification table scan
Date
Msg-id CAAaqYe96Qhm9h7cvSasaQ4Oh1nKM41ALh85XmWv9F_GaANpqRA@mail.gmail.com
Whole thread Raw
In response to Re: Document atthasmissing default optimization avoids verification table scan  ("David G. Johnston" <david.g.johnston@gmail.com>)
List pgsql-hackers
On Sun, Mar 27, 2022 at 11:12 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Sun, Mar 27, 2022 at 11:17 AM James Coleman <jtc331@gmail.com> wrote:
>>
>> Hmm,  I didn't realize that was project policy,
>
>
> Guideline/Rule of Thumb is probably a better concept.

Ah, OK, thanks.

>>
>> but I'm a bit
>> surprised given that the sentence which 0001 replaces seems like a
>> direct violation of that also: "In neither case is a rewrite of the
>> table required."
>>
>
> IMO mostly due to the structuring of the paragraphs; something like the following makes it less problematic (and as
shownbelow may be sufficient to address the purpose of this patch) 
>
> """
> [...]
> The following alterations of the table require the entire table, and/or its indexes, to be rewritten; which may take
asignificant amount of time for a large table, and will temporarily require as much as double the disk space. 
>
> Changing the type of an existing column will require the entire table and its indexes to be rewritten. As an
exception,if the USING clause does not change the column contents and the old type is either binary coercible to the
newtype, or an unconstrained domain over the new type, a table rewrite is not needed; but any indexes on the affected
columnsmust still be rewritten. 
>
> Adding a column with a volatile DEFAULT also requires the entire table and its indexes to be rewritten.
>
> The reason a non-volatile (or absent) DEFAULT does not require a rewrite of the table is because the DEFAULT
expression(or NULL) is evaluated at the time of the statement and the result is stored in the table's metadata. 
>
> The following alterations of the table require that it be scanned in its entirety to ensure that no existing values
arecontrary to the new constraints placed on the table.  Constraints backed by indexes will scan the table as a
side-effectof populating the new index with data. 
>
> Adding a CHECK constraint requires scanning the table to verify that existing rows meet the constraint.  The same
goesfor adding a NOT NULL constraint to an existing column. 
>
> A newly attached partition requires scanning the table to verify that existing rows meet the partition constraint.
>
> A foreign key constraint requires scanning the table to verify that all existing values exist on the referenced
table.
>
> The main reason for providing the option to specify multiple changes in a single ALTER TABLE is that multiple table
scansor rewrites can thereby be combined into a single pass over the table. 
>
> Scanning a large table to verify a new constraint can take a long time, and other updates to the table are locked out
untilthe ALTER TABLE ADD CONSTRAINT command is committed. For CHECK and FOREIGN KEY constraints there is an option, NOT
VALID,that reduces the impact of adding a constraint on concurrent updates. With NOT VALID, the ADD CONSTRAINT command
doesnot scan the table and can be committed immediately. After that, a VALIDATE CONSTRAINT command can be issued to
verifythat existing rows satisfy the constraint. The validation step does not need to lock out concurrent updates,
sinceit knows that other transactions will be enforcing the constraint for rows that they insert or update; only
pre-existingrows need to be checked. Hence, validation acquires only a SHARE UPDATE EXCLUSIVE lock on the table being
altered.(If the constraint is a foreign key then a ROW SHARE lock is also required on the table referenced by the
constraint.)In addition to improving concurrency, it can be useful to use NOT VALID and VALIDATE CONSTRAINT in cases
wherethe table is known to contain pre-existing violations. Once the constraint is in place, no new violations can be
inserted,and the existing problems can be corrected at leisure until VALIDATE CONSTRAINT finally succeeds. 
>
> The DROP COLUMN form does not physically remove the column, but simply makes it invisible to SQL operations.
Subsequentinsert and update operations in the table will store a null value for the column. Thus, dropping a column is
quickbut it will not immediately reduce the on-disk size of your table, as the space occupied by the dropped column is
notreclaimed. The space will be reclaimed over time as existing rows are updated. 
>
> To force immediate reclamation of space occupied by a dropped column, you can execute one of the forms of ALTER TABLE
thatperforms a rewrite of the whole table. This results in reconstructing each row with the dropped column replaced by
anull value. 
>
> The rewriting forms of ALTER TABLE are not MVCC-safe. After a table rewrite, the table will appear empty to
concurrenttransactions, if they are using a snapshot taken before the rewrite occurred. See Section 13.5 for more
details.
> [...]
> """
>
> I'm liking the idea of breaking out multiple features into their own sentences or paragraphs instead of saying:
>
> "Adding a column with a volatile DEFAULT or changing the type of an existing column"
>
> "Adding a CHECK or NOT NULL constraint"
>
> This later combination probably doesn't catch my attention except for this discussion and the fact that there are
multipleways to add these constraints and we might as well be clear about whether ALTER COLUMN or ADD COLUMN makes a
difference. On that note, the behavior implied by this wording is that adding a check constraint even during ADD COLUMN
willresult in scanning the table even when a table rewrite is not required. If that is the case at present nothing
actuallysays that - if one agrees that the exact same sentence doesn't imply that a table scan is performed when adding
aNOT NULL constraint during ADD COLUMN (which doesn't happen). 
> That seems like enough material to extract out from the ALTER TABLE page and stick elsewhere if one is so motivated.
Theremay be other stuff too - but the next paragraph covers some SET DATA TYPE nuances which seem like a different
dynamic.

Coming back to this with fresh eyes in the morning and comparing your
idea above to the existing doc page, and I really like this approach.
I'll be marking this specific patch as withdrawn and opening a new
patch for the restructuring.

I also noticed an error in the existing docs (we no longer need to
rebuild indexes when a table rewrite is skipped), and I'll be sending
a separate patch to fix that separately.

Thanks,
James Coleman



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Add parameter jit_warn_above_fraction
Next
From: James Coleman
Date:
Subject: Correct docs re: rewriting indexes when table rewrite is skipped