
From: Simon Riggs
Subject: Re: PROC_IN_ANALYZE stillborn 13 years ago
Msg-id: CANP8+j+BNs35nFx-vyEuc_6E3kHn7JFm=o1aqaUtbZi_JjuT2g@mail.gmail.com
In response to: Re: PROC_IN_ANALYZE stillborn 13 years ago (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Thu, 6 Aug 2020 at 22:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> ... how
>> important is stability to ANALYZE? If you *either* retake your MVCC
>> snapshots periodically as you re-scan the table *or* use a non-MVCC
>> snapshot for the scan, you can get those same kinds of artifacts: you
>> might see two copies of a just-updated row, or none. Maybe this would
>> actually *break* something - e.g. could there be code that would get
>> confused if we sample multiple rows for the same value in a column
>> that has a UNIQUE index? But I think mostly the consequences would be
>> that you might get somewhat different results from the statistics.
>
> Yeah, that's an excellent point.  I can imagine somebody complaining
> "this query clearly matches a unique index, why is the planner estimating
> multiple rows out?".  But most of the time it wouldn't matter much.
> (And I think you can get cases like that anyway today.)
>
>> It's not clear to me that it would even be correct to categorize those
>> somewhat-different results as "less accurate."
>
> Estimating two rows where the correct answer is one row is clearly
> "less accurate".  But I suspect you'd have to be quite unlucky to
> get such a result in practice from Simon's proposal, as long as we
> weren't super-aggressive about changing ANALYZE's snapshot a lot.

Seems like we're agreed that we can use more than one snapshot; the only question is "how many?"

The more snapshots you take, the more weirdness you will see, so a one-snapshot-per-row approach seems like the worst case for accuracy, even if it does make ANALYZE faster.

(If we do want to speed up ANALYZE, we should use the system-style block sampling approach, as sketched below, but the argument against that is reduced independence of rows: rows sampled from the same block are correlated.)
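
To make that concrete, a hand-waved sketch, reusing the existing
BlockSampler machinery from src/include/utils/sampling.h. The function
name sample_whole_blocks is made up; only the BlockSampler calls and
RelationGetNumberOfBlocks() are real:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"
#include "utils/sampling.h"

/*
 * Hypothetical sketch, not existing ANALYZE code: choose ~targblocks
 * blocks uniformly at random and keep every live row on each chosen
 * block.  Far fewer pages are read than with Bernoulli row sampling,
 * but rows that share a block are correlated -- the "less independence
 * of rows" trade-off mentioned above.
 */
static void
sample_whole_blocks(Relation onerel, int targblocks)
{
	BlockSamplerData bs;
	BlockNumber nblocks = RelationGetNumberOfBlocks(onerel);

	BlockSampler_Init(&bs, nblocks, targblocks, random());
	while (BlockSampler_HasMore(&bs))
	{
		BlockNumber targblock = BlockSampler_Next(&bs);

		/* ... read targblock and collect every live tuple on it ... */
	}
}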

Keeping the discussion on reducing the impact of Bernoulli-sampled ANALYZE: I was imagining we would take one snapshot for each chunk of rows corresponding to the default statistics target, so that the default behaviour would be unaffected. Larger settings would be chunked according to the default, so statistics_target = 10000 (the maximum) would take 10000/100 = 100 snapshots. That approach allows people to vary the chunking using an existing parameter if needed.
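
Roughly like this, as a sketch only; the function and its arguments are
made up, while GetTransactionSnapshot(), PushActiveSnapshot() /
PopActiveSnapshot() and default_statistics_target are the existing ones:

#include "postgres.h"

#include "commands/vacuum.h"	/* default_statistics_target */
#include "utils/rel.h"
#include "utils/snapmgr.h"

/*
 * Hypothetical sketch, not existing code: take one fresh MVCC snapshot
 * per chunk of sampled rows, chunked by default_statistics_target.
 * With the default statistics target there is exactly one chunk, so
 * default behaviour is unchanged; statistics_target = 10000 gives
 * 10000/100 = 100 snapshots.
 */
static void
sample_rows_chunked(Relation onerel, int stats_target, int targrows)
{
	int			nchunks = Max(1, stats_target / default_statistics_target);
	int			chunkrows = Max(1, targrows / nchunks);
	int			done = 0;

	while (done < targrows)
	{
		int			thischunk = Min(chunkrows, targrows - done);

		PushActiveSnapshot(GetTransactionSnapshot());
		/* ... sample the next 'thischunk' rows under this snapshot ... */
		PopActiveSnapshot();

		done += thischunk;
	}
}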

--
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases
