Re: [PERFORM] OLAP/reporting queries fall into nested loops overseq scans or other horrible planner choices - Mailing list pgsql-performance

From Laurenz Albe
Subject Re: [PERFORM] OLAP/reporting queries fall into nested loops overseq scans or other horrible planner choices
Date
Msg-id 1509696112.2586.3.camel@cybertec.at
Whole thread Raw
In response to Re: [PERFORM] OLAP/reporting queries fall into nested loops over seqscans or other horrible planner choices  (Gunther <raj@gusw.net>)
List pgsql-performance
Gunther wrote:
> > Bad choices are almost always caused by bad estimates.
> > Granted, there is no way that estimates can ever be perfect.
> > ...
> > Looking deeper, I would say that wrongly chosen nested loop joins
> > often come from an underestimate that is close to zero.
> > PostgreSQL already clamps row count estimates to 1, that is, it will
> > choose an estimate of 1 whenever it thinks fewer rows will be returned.
> > 
> > Perhaps using a higher clamp like 2 would get rid of many of your
> > problems, but it is a difficult gamble as it will also prevent some
> > nested loop joins that would have been the best solution.
> 
> Wow, that is very interesting! Are you saying that if PgSQL can't know 
> what the cardinality is, it assumes a default of 1? That would be very 
> slanted a guess. I would think a couple of hundred would be more 
> appropriate, or 10% of the average of the base tables for which it does 
> have statistics. I would wonder if changing 1 to 2 would make much 
> difference, as Seq Search over 1 to 10 tuples should generally be better 
> than any other approach, as long as the 1-10 tuples are already readily 
> available.

No, it is not like that.
When PostgreSQL cannot come up with a "real" estimate, it uses
default selectivity estimates.

See include/utils/selfuncs.h:

/*
 * Note: the default selectivity estimates are not chosen entirely at random.
 * We want them to be small enough to ensure that indexscans will be used if
 * available, for typical table densities of ~100 tuples/page.  Thus, for
 * example, 0.01 is not quite small enough, since that makes it appear that
 * nearly all pages will be hit anyway.  Also, since we sometimes estimate
 * eqsel as 1/num_distinct, we probably want DEFAULT_NUM_DISTINCT to equal
 * 1/DEFAULT_EQ_SEL.
 */

/* default selectivity estimate for equalities such as "A = b" */
#define DEFAULT_EQ_SEL  0.005

/* default selectivity estimate for inequalities such as "A < b" */
#define DEFAULT_INEQ_SEL  0.3333333333333333

/* default selectivity estimate for range inequalities "A > b AND A < c" */
#define DEFAULT_RANGE_INEQ_SEL  0.005

/* default selectivity estimate for pattern-match operators such as LIKE */
#define DEFAULT_MATCH_SEL       0.005

/* default number of distinct values in a table */
#define DEFAULT_NUM_DISTINCT  200

/* default selectivity estimate for boolean and null test nodes */
#define DEFAULT_UNK_SEL                 0.005
#define DEFAULT_NOT_UNK_SEL             (1.0 - DEFAULT_UNK_SEL)

Those selectivity estimates are factors, not absolute numbers.

The clamp to 1 happens when, after applying all selectivity factors, the
result is less than 1, precisely to keep the optimizer from choosing a plan
that would become very expensive if a branch is executed *at all*.

> > Finally, even though the official line of PostgreSQL is to *not* have
> > query hints, and for a number of good reasons, this is far from being
> > an unanimous decision.  The scales may tip at some point, though I
> > personally hope that this point is not too close.
> 
> I am glad to hear that hints are not completely ruled out by the 
> development team. Definitely Oracle hints are painful and should not be 
> replicated as is.  Butmay be I can nudge your (and others') personal 
> tastes with the following.

Didn't work for me.
Your hints look just like what Oracle does.

There have been better proposals that aim at fixing the selectivity
estimates, e.g. "multiply your estimate for this join by three".

> In my Aqua 
> Data Studio, if I put the set statement before the select statement, the 
> combined statement doesn't return any results. May be I am doing 
> something wrong. If there is a way, then I would ave what I need.

Check the SQL statements that are generated by your Aqua Data Studio!

Yours,
Laurenz Albe


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

pgsql-performance by date:

Previous
From: legrand legrand
Date:
Subject: [PERFORM] Re: OLAP/reporting queries fall into nested loops over seq scans orother horrible planner choices
Next
From: Thomas Kellerer
Date:
Subject: [PERFORM] Re: OLAP/reporting queries fall into nested loops over seq scans orother horrible planner choices