Re: Optional skipping of unchanged relations during ANALYZE? - Mailing list pgsql-hackers

From VASUKI M
Subject Re: Optional skipping of unchanged relations during ANALYZE?
Date
Msg-id CAE2r8H7vnihrOE5i+RvpU72jcyRkxMkhCY9mzBb+PBrTzSmo5g@mail.gmail.com
Whole thread
In response to Re: Optional skipping of unchanged relations during ANALYZE?  (Sami Imseih <samimseih@gmail.com>)
Responses Re: Optional skipping of unchanged relations during ANALYZE?
Re: Optional skipping of unchanged relations during ANALYZE?
List pgsql-hackers

Hi all,

First of all, sorry for the delay in posting the updated patch. I was occupied with some other work and holidays, and it took me some time to revisit the design carefully. After several iterations and reworking the logic based on the earlier feedback, I have now arrived at what I believe is a cleaner and more aligned implementation. I am attaching v4 of the patch for review.

The intention is to provide a SQL-level equivalent of vacuumdb --analyze-only --missing-stats-only, while keeping the default ANALYZE behavior completely unchanged. When this option is specified, ANALYZE will process only those relations that are missing statistics, and skip relations that already have complete statistics.

A relation is considered to be missing statistics if at least one analyzable attribute (as determined by examine_attribute()) does not have a corresponding row in pg_statistic, or if there is an extended statistics object in pg_statistic_ext without a matching row in pg_statistic_ext_data. In other words, the implementation relies on catalog inspection rather than counters or timestamps, and it reuses existing ANALYZE internals instead of redefining missing-stats logic independently.

The check is placed inside analyze_rel(), after the standard relation validation and skip conditions, so that it does not interfere with privilege checks or special-relation handling. The default ANALYZE path remains unchanged.

In terms of behavior:

* A brand new empty table is analyzed (since it has no statistics).
* Re-running on an empty table analyzes again, because there are still no pg_statistic rows.
* A table with data but no statistics is analyzed.
* Re-running after statistics exist causes the table to be skipped.
* If a new column is added and lacks statistics, the table is analyzed again.
* After statistics are created for that column, subsequent runs skip the table.
* If statistics are manually deleted or effectively lost (e.g., crash recovery scenarios affecting stats tracking), the table is analyzed again.

Repeated runs therefore converge toward a no-op once all relations have complete statistics.

Regression tests are included.

As discussed earlier in the thread, I plan to start a new discussion and patch series for a separate ANALYZE (MODIFIED_STATS) option that would reuse autoanalyze-style thresholds. I believe keeping MISSING_STATS_ONLY and MODIFIED_STATS as separate, clearly defined options makes the semantics easier to reason about.

I would greatly appreciate further review and feedback on this version. Thank you all for the detailed guidance and suggestions so far — especially regarding reuse of examine_attribute() and alignment with vacuumdb behavior. This process has been very educational for me.

Thanks,
Vasuki M
C-DAC, Chennai
Attachment

pgsql-hackers by date:

Previous
From: Nitin Motiani
Date:
Subject: Re: [PATCH] Support reading large objects with pg_read_all_data
Next
From: Ashutosh Bapat
Date:
Subject: Re: Report bytes and transactions actually sent downtream