Hi all,
First of all, sorry for the delay in posting the updated patch. I was occupied with some other work and holidays, and it took me some time to revisit the design carefully. After several iterations and reworking the logic based on the earlier feedback, I have now arrived at what I believe is a cleaner and more aligned implementation. I am attaching v4 of the patch for review.
The intention is to provide a SQL-level equivalent of vacuumdb --analyze-only --missing-stats-only, while keeping the default ANALYZE behavior completely unchanged. When this option is specified, ANALYZE will process only those relations that are missing statistics, and skip relations that already have complete statistics.
A relation is considered to be missing statistics if at least one analyzable attribute (as determined by examine_attribute()) does not have a corresponding row in pg_statistic, or if there is an extended statistics object in pg_statistic_ext without a matching row in pg_statistic_ext_data. In other words, the implementation relies on catalog inspection rather than counters or timestamps, and it reuses existing ANALYZE internals instead of redefining missing-stats logic independently.
The check is placed inside analyze_rel(), after the standard relation validation and skip conditions, so that it does not interfere with privilege checks or special-relation handling. The default ANALYZE path remains unchanged.
In terms of behavior:
* A brand new empty table is analyzed (since it has no statistics).
* Re-running on an empty table analyzes again, because there are still no pg_statistic rows.
* A table with data but no statistics is analyzed.
* Re-running after statistics exist causes the table to be skipped.
* If a new column is added and lacks statistics, the table is analyzed again.
* After statistics are created for that column, subsequent runs skip the table.
* If statistics are manually deleted or effectively lost (e.g., crash recovery scenarios affecting stats tracking), the table is analyzed again.
Repeated runs therefore converge toward a no-op once all relations have complete statistics.
Regression tests are included.
As discussed earlier in the thread, I plan to start a new discussion and patch series for a separate ANALYZE (MODIFIED_STATS) option that would reuse autoanalyze-style thresholds. I believe keeping MISSING_STATS_ONLY and MODIFIED_STATS as separate, clearly defined options makes the semantics easier to reason about.
I would greatly appreciate further review and feedback on this version. Thank you all for the detailed guidance and suggestions so far — especially regarding reuse of examine_attribute() and alignment with vacuumdb behavior. This process has been very educational for me.
Thanks,
Vasuki M
C-DAC, Chennai