Re: Use extended statistics to estimate (Var op Var) clauses - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Use extended statistics to estimate (Var op Var) clauses
Date
Msg-id 69ce226b-cfbb-2712-bfd7-ec62cf0e0b2a@enterprisedb.com
In response to Re: Use extended statistics to estimate (Var op Var) clauses  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: Use extended statistics to estimate (Var op Var) clauses  (Mark Dilger <mark.dilger@enterprisedb.com>)
Re: Use extended statistics to estimate (Var op Var) clauses  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
Hi Mark,

This thread inspired me to do something fairly similar - a generator 
that produces queries of varying complexity and executes them on a 
table, with and without extended statistics. I'd been thinking about 
this before, but this finally pushed me to do it, and some of the 
results are fairly interesting ...

I've pushed everything (generator and results) to this github repo:

   https://github.com/tvondra/stats-test

with a summary of all results here:

   https://github.com/tvondra/stats-test/blob/master/results.md


Some basic facts about the generator.py (see query_generator):

* It's using a fixed seed to make it deterministic.

* A small fraction (5%) of the generated queries is sampled and executed.

* Thanks to the fixed seed, we generate/sample the same set of queries 
in different runs, which makes runs easy to compare.

* The queries use 2 - 5 clauses, either (Var op Const) or (Var op Var).

* The operators are the usual equality/inequality ones.

* The clauses are combined using AND/OR (also randomly picked).

* A random set of parentheses is added to vary the operator precedence 
(otherwise it would be driven entirely by the AND/OR ordering).

* There are two datasets - a random one and a correlated one - with 
different numbers of distinct values in each column (10, 100, 1000, 10000).

* The statistics target is set to 10, 100, 1000, 10000.
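The core idea of the generator can be sketched roughly like this (a minimal, illustrative sketch - the column names, constant range, and probabilities are made up and are not the actual generator.py internals):

```python
import random

# Illustrative schema: columns of a hypothetical test table "t".
COLUMNS = ["a", "b", "c", "d"]
OPERATORS = ["=", "<>", "<", "<=", ">", ">="]

def random_clause(rng):
    """Build either a (Var op Const) or a (Var op Var) clause."""
    left = rng.choice(COLUMNS)
    op = rng.choice(OPERATORS)
    # Pick (Var op Var) or (Var op Const) with equal probability (assumed split).
    right = rng.choice(COLUMNS) if rng.random() < 0.5 else str(rng.randint(0, 100))
    return f"{left} {op} {right}"

def random_query(rng):
    """Combine 2-5 clauses with random AND/OR and random parentheses."""
    nclauses = rng.randint(2, 5)
    expr = random_clause(rng)
    for _ in range(nclauses - 1):
        conj = rng.choice(["AND", "OR"])
        clause = random_clause(rng)
        # Randomly parenthesize the expression built so far, to vary
        # the effective operator precedence.
        if rng.random() < 0.5:
            expr = f"({expr}) {conj} {clause}"
        else:
            expr = f"{expr} {conj} {clause}"
    return f"SELECT * FROM t WHERE {expr}"

# The fixed seed makes the generated/sampled set identical across runs.
rng = random.Random(42)
queries = [random_query(rng) for _ in range(1000)]
# Execute only a 5% sample of the generated queries.
sample = [q for q in queries if rng.random() < 0.05]
```

Because the seed is fixed, rerunning this produces the exact same query set, so results from different builds (master, patched, ...) line up query-by-query.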


It's a bit hacky, with various bits hard-coded at the moment. But it 
could be extended to do other stuff fairly easily, I think.

Anyway, the repository contains results for three cases:

1) master
2) patched: master with the (Var op Var) patch
3) fixed: patched, with a fix for "simple" clauses (a crude patch)

And for each case we have three row counts:

* actual (from explain analyze)
* estimate without extended stats
* estimate with extended stats

And then we can calculate "estimation error" as

   estimate / actual

both with and without statistics. Results for two cases can be plotted 
as a scatter plot, with the two estimation errors as (x,y) values. The 
idea is that this shows how a patch affects estimates - a point (100,10) 
means that it was 100x over-estimated, and with the patch it's just 10x, 
and similarly for other points.
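Concretely, each query yields one scatter point, computed like this (the numbers below are made-up illustrations, not results from the repository):

```python
# Hypothetical per-query results: (actual rows, estimate in run A, estimate in run B),
# where "actual" comes from EXPLAIN ANALYZE. The estimation error is
# estimate / actual; a scatter point is (error_A, error_B), and the
# diagonal x == y means the patch changed nothing for that query.
rows = [
    (1000, 100000, 1200),   # 100x over-estimate improved to 1.2x
    (500, 500, 500),        # already accurate, unchanged
    (200, 2, 2),            # 100x under-estimate, unchanged
]

points = [(est_a / actual, est_b / actual) for actual, est_a, est_b in rows]

for x, y in points:
    print(f"error before: {x:g}x, error after: {y:g}x")
```

A point above 1.0 is an over-estimate, below 1.0 an under-estimate, so the first row above would plot at (100, 1.2), i.e. well below the diagonal.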

This is what the charts at

   https://github.com/tvondra/stats-test/blob/master/results.md

do, for each combination of parameters (dataset, statistics target and 
number of distinct values). There's one chart without extended stats, 
one with extended stats.

An "ideal" chart would look like a single point (1,1), meaning 
"accurate estimates both without and with the patch", or (?,1), meaning 
"poor estimates before, accurate estimates now". Points on the diagonal 
mean "no change".

In principle, we expect the charts to look like this:

* random: diagonal charts, because there should be no extended stats 
built, hence no impact on estimates is expected

* correlated: getting closer to 1.0, which looks like a horizontal line 
in the chart

Consider for example this:

https://github.com/tvondra/stats-test/raw/master/correlated-1000-10.png

which clearly shows that the first patch is almost exactly the same as 
master, while with the fix the estimates improve significantly (and are 
almost perfect), at least with the statistics.

Without stats, there's a bunch of queries that suddenly go from 
"perfect" to much worse (this looks like a vertical line in the left chart).

But there are other "strange" cases with "interesting patterns", like 
for example

* 
https://raw.githubusercontent.com/tvondra/stats-test/master/correlated-100-100.png

* 
https://raw.githubusercontent.com/tvondra/stats-test/master/correlated-1000-100.png

* 
https://raw.githubusercontent.com/tvondra/stats-test/master/random-10000-10.png

This likely shows that the patches are a significant improvement for 
some queries (either getting better than master, or even making the 
estimates pretty accurate). But it's probably worth looking into the 
queries that got worse, etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

