Re: Additional improvements to extended statistics - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Additional improvements to extended statistics
Date
Msg-id 57eb142c-d5a4-cfca-675b-0dfd9ef7c2e2@2ndquadrant.com
Whole thread Raw
In response to Re: Additional improvements to extended statistics  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Additional improvements to extended statistics
Re: Additional improvements to extended statistics
List pgsql-hackers
Hi,

Here is an improved WIP version of the patch series, modified to address
the issue with repeatedly applying the extended statistics, as discussed
with Dean in this thread. It's a bit rough and not committable, but I
need some feedback so I'm posting it in this state.

(Note: The WIP patch is expected to fail regression tests. A couple
stats_ext regression tests fail due to changed estimate - I've left it
like that to make the changes more obvious for now.)

Earlier in this thread I used this example:


   CREATE TABLE t (a int, b int);
   INSERT INTO t SELECT mod(i, 10), mod(i, 10)
     FROM generate_series(1,100000) s(i);
   CREATE STATISTICS s (mcv) ON a,b FROM t;
   ANALYZE t;

   EXPLAIN SELECT * FROM t WHERE a = 0 OR b = 0;

which had this call graph with two statext_mcv_clauselist_selectivity
calls (which was kinda the issue):

   clauselist_selectivity
     statext_clauselist_selectivity
       statext_mcv_clauselist_selectivity  <--- (1)
         clauselist_selectivity_simple
           clause_selectivity
             clauselist_selectivity_or
               statext_clauselist_selectivity
                 statext_mcv_clauselist_selectivity  <--- (2)
                   clauselist_selectivity_simple_or
                     clause_selectivity
                     clause_selectivity
                   mcv_clauselist_selectivity
               clauselist_selectivity_simple_or
         mcv_clauselist_selectivity
       clauselist_selectivity_simple
         (already estimated)

with the patches applied, the call looks like this:

   clauselist_selectivity_internal (use_extended_stats=1)
     statext_clauselist_selectivity
       statext_mcv_clauselist_selectivity (is_or=0)
         clauselist_selectivity_simple
           clause_selectivity_internal (use_extended_stats=0)
             clauselist_selectivity_or (use_extended_stats=0)
               clauselist_selectivity_simple_or
                 clause_selectivity_internal (use_extended_stats=0)
                 clause_selectivity_internal (use_extended_stats=0)
         mcv_clauselist_selectivity (is_or=0)
       clauselist_selectivity_simple

The nested call is removed, which I think addresses the issue. As for
the effect on estimates, there's a couple regression tests where the
estimates change - not much though, an example is:

 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial
WHERE a = 0 OR b = 0 OR c = 10');
  estimated | actual
 -----------+--------
-       412 |    104
+       308 |    104
 (1 row)

This is on top of 0001, though. Interestingly enough, this ends up with
the same estimate as current master, but I consider that a coincidence.


As for the patches:

0001 is the original patch improving estimates of OR clauses

0002 adds thin wrappers for clause[list]_selectivity, with "internal"
functions allowing to specify whether to keep considering extended stats

0003 does the same for the "simple" functions


I've kept it like this to demonstrate that 0002 is not sufficient. In my
response from March 24 I wrote this:

> Isn't it the case that clauselist_selectivity_simple (and the OR 
> variant) should ignore extended stats entirely? That is, we'd need
> to add a flag (or _simple variant) to clause_selectivity, so that it
> calls causelist_selectivity_simple_or.
But that's actually wrong, as 0002 shows (as it breaks a couple of
regression tests), because of the way we handle OR clauses. At the top
level, an OR-clause is actually just a single clause and it may get
passed to clauselist_selectivity_simple. So entirely disabling extended
stats for the "simple" functions would also mean disabling extended
stats for a large number of OR clauses. Which is clearly wrong.

So 0003 addresses that, by adding a flag to the two "simple" functions.
Ultimately, this should probably do the same thing as 0002 and add thin
wrappers, because the existing functions are part of the public API.

Dean, does this address the issue you had in mind? Can you come up with
an example of that issue in the form of a regression test or something?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Eugen Konkov
Date:
Subject: Re: Proposition for autoname columns
Next
From: Dean Rasheed
Date:
Subject: Re: Additional improvements to extended statistics