Re: PROC_IN_ANALYZE stillborn 13 years ago - Mailing list pgsql-hackers

From Andres Freund
Subject Re: PROC_IN_ANALYZE stillborn 13 years ago
Date
Msg-id 20200807213727.qjfvltalwt63xbib@alap3.anarazel.de
Whole thread Raw
In response to Re: PROC_IN_ANALYZE stillborn 13 years ago  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2020-08-06 18:02:26 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > In fact using conceptually like a new snapshot for each sample tuple
> > actually seems like it'd be somewhat of an improvement over using a
> > single snapshot.
> 
> Dunno, that feels like a fairly bad idea to me.  It seems like it would
> overemphasize the behavior of whatever queries happened to be running
> concurrently with the ANALYZE.  I do follow the argument that using a
> single snapshot for the whole ANALYZE overemphasizes a single instant
> in time, but I don't think that leads to the conclusion that we shouldn't
> use a snapshot at all.

I didn't actually want to suggest that we should take a separate
snapshot for every sampled row - that'd be excessively costly. What I
wanted to say was that I don't think that I don't see a clear accuraccy
benefit. E.g. not seeing any of the values inserted more recently will
under-emphasize those in the histogram.

What precisely do you mean with "overemphasize" above? I mean those will
e the rows most likely to live after the analyze is done, so including
them doesn't seem like a bad thing to me?


> Another angle that would be worth considering, aside from the issue
> of whether the sample used for pg_statistic becomes more or less
> representative, is what impact all this would have on the tuple count
> estimates that go to the stats collector and pg_class.reltuples.
> Right now, we don't have a great story at all on how the stats collector's
> count is affected by combining VACUUM/ANALYZE table-wide counts with
> the incremental deltas reported by transactions happening concurrently
> with VACUUM/ANALYZE.  Would changing this behavior make that better,
> or worse, or about the same?

Hm. Vacuum already counts rows that are inserted concurrently with the
vacuum scan, if it encounters them. Analyze doesn't. Seems like we'd at
least be wrong in a more consistent manner than before...

IIUC both analyze and vacuum will overwrite concurrent changes to
n_live_tuples. So taking concurrently committed changes into account
seems like it'd be the right thing?

We probably could make this more accurate by accounting separately for
"recently inserted and committed" rows, and taking the difference of
n_live_tuples before/after into account.  But I'm a bit doubtful that
it's worth it?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: PROC_IN_ANALYZE stillborn 13 years ago
Next
From: Peter Geoghegan
Date:
Subject: Re: Should the nbtree page split REDO routine's locking work more like the locking on the primary?