Re: Optional skipping of unchanged relations during ANALYZE? - Mailing list pgsql-hackers
| From | VASUKI M |
|---|---|
| Subject | Re: Optional skipping of unchanged relations during ANALYZE? |
| Date | |
| Msg-id | CAE2r8H7hYGYi4QM85Q7bxs4RbT0Vn63c9ONFbwhAjuSGDzah_A@mail.gmail.com Whole thread |
| In response to | Re: Optional skipping of unchanged relations during ANALYZE? (Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>) |
| Responses |
Re: Optional skipping of unchanged relations during ANALYZE?
|
| List | pgsql-hackers |
Hello Robert, Sami, Ilia, and everyone,
Thank you all for the detailed review and thoughtful feedback. I have carefully gone through the comments and revised the patch accordingly. I truly appreciate the guidance — it helped clarify both the design and the implementation details.
Regarding Robert’s point about the two distinct use cases, I agree that MISSING_STATS_ONLY and MODIFIED_STATS represent separate concerns. MISSING_STATS_ONLY is catalog-driven and persistent in nature, ensuring that newly created tables, newly added columns, and newly defined extended statistics are not left without statistics. MODIFIED_STATS, on the other hand, is more closely aligned with modification thresholds and autoanalyze-like behavior, which is transient and threshold-based.
Keeping these concerns separate makes the semantics clearer and easier to reason about. In particular, as Robert mentioned, MISSING_STATS_ONLY must remain a standalone flag for scenarios such as integration into vacuumdb, where ensuring that missing statistics are generated is the primary goal.
To align with the CommitFest process, I have created a separate entry for this feature here:[1]
I will treat MODIFIED_STATS as a separate patch and discussion thread.
Addressing Sami’s technical comments:
1. Duplicate examine_attribute() calls
You were absolutely correct that the earlier version of the patch caused examine_attribute() to be invoked twice for each attribute: once in relation_has_missing_column_stats() and again during normal attribute processing. That approach was inefficient and awkward.
In the revised version, I have removed the early relation-level column scanning logic. Instead, the missing-statistics check is now integrated directly inside examine_attribute(). The function now takes a boolean missing_stats_only parameter. When this option is specified, examine_attribute() performs the pg_statistic lookup for that specific attribute. If statistics already exist, the function simply returns NULL, causing the column to be skipped naturally.
This ensures:
-Each attribute is examined only once.
-The missing-statistics logic fits naturally into the existing flow.
-We avoid redundant catalog lookups.
Thank you all for the detailed review and thoughtful feedback. I have carefully gone through the comments and revised the patch accordingly. I truly appreciate the guidance — it helped clarify both the design and the implementation details.
Regarding Robert’s point about the two distinct use cases, I agree that MISSING_STATS_ONLY and MODIFIED_STATS represent separate concerns. MISSING_STATS_ONLY is catalog-driven and persistent in nature, ensuring that newly created tables, newly added columns, and newly defined extended statistics are not left without statistics. MODIFIED_STATS, on the other hand, is more closely aligned with modification thresholds and autoanalyze-like behavior, which is transient and threshold-based.
Keeping these concerns separate makes the semantics clearer and easier to reason about. In particular, as Robert mentioned, MISSING_STATS_ONLY must remain a standalone flag for scenarios such as integration into vacuumdb, where ensuring that missing statistics are generated is the primary goal.
To align with the CommitFest process, I have created a separate entry for this feature here:[1]
I will treat MODIFIED_STATS as a separate patch and discussion thread.
Addressing Sami’s technical comments:
1. Duplicate examine_attribute() calls
You were absolutely correct that the earlier version of the patch caused examine_attribute() to be invoked twice for each attribute: once in relation_has_missing_column_stats() and again during normal attribute processing. That approach was inefficient and awkward.
In the revised version, I have removed the early relation-level column scanning logic. Instead, the missing-statistics check is now integrated directly inside examine_attribute(). The function now takes a boolean missing_stats_only parameter. When this option is specified, examine_attribute() performs the pg_statistic lookup for that specific attribute. If statistics already exist, the function simply returns NULL, causing the column to be skipped naturally.
This ensures:
-Each attribute is examined only once.
-The missing-statistics logic fits naturally into the existing flow.
-We avoid redundant catalog lookups.
Inside do_analyze_rel(), after collecting vacattrstats, we now check whether:
-attr_cnt == 0
-no expression index attributes remain
-and there are no missing extended statistics
If all of these are true under MISSING_STATS_ONLY, the relation is skipped cleanly.This follows the structure you suggested and simplifies the overall design.
2. Test coverage
I have expanded the regression tests to cover the scenarios you mentioned.This ensures that MISSING_STATS_ONLY behaves correctly across more complex schema configurations.
3. Logging behavior
-attr_cnt == 0
-no expression index attributes remain
-and there are no missing extended statistics
If all of these are true under MISSING_STATS_ONLY, the relation is skipped cleanly.This follows the structure you suggested and simplifies the overall design.
2. Test coverage
I have expanded the regression tests to cover the scenarios you mentioned.This ensures that MISSING_STATS_ONLY behaves correctly across more complex schema configurations.
3. Logging behavior
Thank you for the guidance on logging.I have removed the DEBUG-level elog() calls introduced in earlier versions. Logging is now consistent with existing ANALYZE behavior and uses ereport().
When VERBOSE mode is enabled and a relation is skipped due to MISSING_STATS_ONLY, the output now includes a single INFO-level message of the form:
INFO: Skipping analyzing "database.namespace.relation"
This matches the style used elsewhere in do_analyze_rel() and avoids unnecessary additional details when no statistics are collected.There is no additional logging when a relation is processed normally, keeping behavior aligned with existing ANALYZE semantics.
Regarding autoanalyze and MODIFIED_STATS
As Ilia clarified, autoanalyze decisions are purely threshold-driven and do not include missing-statistics logic. That reinforces the decision to keep MISSING_STATS_ONLY separate and explicit.[2]
If there are further suggestions regarding semantics, naming, or additional edge cases that should be tested, I would be very happy to incorporate them.
Thank you again for your time and detailed review.
Regards,
Vasuki M
C-DAC,Chennai
[1] https://commitfest.postgresql.org/patch/6516/
[2] https://www.postgresql.org/message-id/flat/aZSm77WEh8pxQYtf%40nathan#1ab5c06a7d2247d90e71fb995fa21a39
When VERBOSE mode is enabled and a relation is skipped due to MISSING_STATS_ONLY, the output now includes a single INFO-level message of the form:
INFO: Skipping analyzing "database.namespace.relation"
This matches the style used elsewhere in do_analyze_rel() and avoids unnecessary additional details when no statistics are collected.There is no additional logging when a relation is processed normally, keeping behavior aligned with existing ANALYZE semantics.
Regarding autoanalyze and MODIFIED_STATS
As Ilia clarified, autoanalyze decisions are purely threshold-driven and do not include missing-statistics logic. That reinforces the decision to keep MISSING_STATS_ONLY separate and explicit.[2]
If there are further suggestions regarding semantics, naming, or additional edge cases that should be tested, I would be very happy to incorporate them.
Thank you again for your time and detailed review.
Regards,
Vasuki M
C-DAC,Chennai
[1] https://commitfest.postgresql.org/patch/6516/
[2] https://www.postgresql.org/message-id/flat/aZSm77WEh8pxQYtf%40nathan#1ab5c06a7d2247d90e71fb995fa21a39
Attachment
pgsql-hackers by date: