Re: Document atthasmissing default optimization avoids verification table scan - Mailing list pgsql-hackers

From David G. Johnston
Subject Re: Document atthasmissing default optimization avoids verification table scan
Date
Msg-id CAKFQuwZyBaJjNepdTM3kO8PLaCpRdRd8+mtLT8QdE73oAsGv8Q@mail.gmail.com
Whole thread Raw
In response to Re: Document atthasmissing default optimization avoids verification table scan  (James Coleman <jtc331@gmail.com>)
Responses Re: Document atthasmissing default optimization avoids verification table scan
List pgsql-hackers
On Sun, Mar 27, 2022 at 11:17 AM James Coleman <jtc331@gmail.com> wrote:
Hmm,  I didn't realize that was project policy,

Guideline/Rule of Thumb is probably a better concept.
 
but I'm a bit
surprised given that the sentence which 0001 replaces seems like a
direct violation of that also: "In neither case is a rewrite of the
table required."


IMO mostly due to the structuring of the paragraphs; something like the following makes it less problematic (and as shown below may be sufficient to address the purpose of this patch)

"""
[...]
The following alterations of the table require the entire table, and/or its indexes, to be rewritten; which may take a significant amount of time for a large table, and will temporarily require as much as double the disk space.

Changing the type of an existing column will require the entire table and its indexes to be rewritten. As an exception, if the USING clause does not change the column contents and the old type is either binary coercible to the new type, or an unconstrained domain over the new type, a table rewrite is not needed; but any indexes on the affected columns must still be rewritten.

Adding a column with a volatile DEFAULT also requires the entire table and its indexes to be rewritten.  

The reason a non-volatile (or absent) DEFAULT does not require a rewrite of the table is because the DEFAULT expression (or NULL) is evaluated at the time of the statement and the result is stored in the table's metadata.

The following alterations of the table require that it be scanned in its entirety to ensure that no existing values are contrary to the new constraints placed on the table.  Constraints backed by indexes will scan the table as a side-effect of populating the new index with data.
 
Adding a CHECK constraint requires scanning the table to verify that existing rows meet the constraint.  The same goes for adding a NOT NULL constraint to an existing column.

A newly attached partition requires scanning the table to verify that existing rows meet the partition constraint.

A foreign key constraint requires scanning the table to verify that all existing values exist on the referenced table.

The main reason for providing the option to specify multiple changes in a single ALTER TABLE is that multiple table scans or rewrites can thereby be combined into a single pass over the table.

Scanning a large table to verify a new constraint can take a long time, and other updates to the table are locked out until the ALTER TABLE ADD CONSTRAINT command is committed. For CHECK and FOREIGN KEY constraints there is an option, NOT VALID, that reduces the impact of adding a constraint on concurrent updates. With NOT VALID, the ADD CONSTRAINT command does not scan the table and can be committed immediately. After that, a VALIDATE CONSTRAINT command can be issued to verify that existing rows satisfy the constraint. The validation step does not need to lock out concurrent updates, since it knows that other transactions will be enforcing the constraint for rows that they insert or update; only pre-existing rows need to be checked. Hence, validation acquires only a SHARE UPDATE EXCLUSIVE lock on the table being altered. (If the constraint is a foreign key then a ROW SHARE lock is also required on the table referenced by the constraint.) In addition to improving concurrency, it can be useful to use NOT VALID and VALIDATE CONSTRAINT in cases where the table is known to contain pre-existing violations. Once the constraint is in place, no new violations can be inserted, and the existing problems can be corrected at leisure until VALIDATE CONSTRAINT finally succeeds.

The DROP COLUMN form does not physically remove the column, but simply makes it invisible to SQL operations. Subsequent insert and update operations in the table will store a null value for the column. Thus, dropping a column is quick but it will not immediately reduce the on-disk size of your table, as the space occupied by the dropped column is not reclaimed. The space will be reclaimed over time as existing rows are updated.

To force immediate reclamation of space occupied by a dropped column, you can execute one of the forms of ALTER TABLE that performs a rewrite of the whole table. This results in reconstructing each row with the dropped column replaced by a null value.

The rewriting forms of ALTER TABLE are not MVCC-safe. After a table rewrite, the table will appear empty to concurrent transactions, if they are using a snapshot taken before the rewrite occurred. See Section 13.5 for more details.
[...]
"""

I'm liking the idea of breaking out multiple features into their own sentences or paragraphs instead of saying:

"Adding a column with a volatile DEFAULT or changing the type of an existing column"

"Adding a CHECK or NOT NULL constraint"

This later combination probably doesn't catch my attention except for this discussion and the fact that there are multiple ways to add these constraints and we might as well be clear about whether ALTER COLUMN or ADD COLUMN makes a difference.  On that note, the behavior implied by this wording is that adding a check constraint even during ADD COLUMN will result in scanning the table even when a table rewrite is not required. If that is the case at present nothing actually says that - if one agrees that the exact same sentence doesn't imply that a table scan is performed when adding a NOT NULL constraint during ADD COLUMN (which doesn't happen).
That seems like enough material to extract out from the ALTER TABLE page and stick elsewhere if one is so motivated.  There may be other stuff too - but the next paragraph covers some SET DATA TYPE nuances which seem like a different dynamic.

David J.

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: A test for replay of regression tests
Next
From: Julien Rouhaud
Date:
Subject: Re: Add a pg_get_query_def function (was Re: Deparsing rewritten query)