How the planner uses statistics - Mailing list pgsql-docs

From Mark Kirkwood
Subject How the planner uses statistics
Date
Msg-id 42082E1C.8090408@coretech.co.nz
Whole thread Raw
Responses Re: How the planner uses statistics
List pgsql-docs
I wanted to understand how the planner 'knows' how many rows are likely
to be emitted in a given stage of a query, and wrote down some examples
for my own benefit - I then wondered if this would be a good addition to
the 'Performance Tips' chapter. So... err here it is.

Comments welcome.

best wishes

Mark

--- perform.sgml.orig    Sat Feb  5 12:45:36 2005
+++ perform.sgml    Tue Feb  8 15:08:02 2005
@@ -470,6 +470,288 @@

  </sect1>

+
+ <sect1 id="planner-stats-how">
+  <title>How the Planner Uses Statistics</title>
+
+  <indexterm zone="planner-stats-how">
+   <primary>statistics</primary>
+   <secondary>of the planner</secondary>
+  </indexterm>
+
+  <para>
+   This section builds on the material covered in the previous two and
+   shows how the planner uses the system statistics to estimate the number of
+   rows each stage of a query might return. We will adopt the approach of
+   showing by example, which should provide a good feel for how this works.
+  </para>
+
+  <para>
+   Continuing with the examples drawn from the regression test
+   database (and 8.0 sources), let's start with a simple query which has
+   one restriction in its <literal>WHERE</literal> clause :
+
+<programlisting>
+EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000;
+
+                         QUERY PLAN
+------------------------------------------------------------
+ Seq Scan on tenk1  (cost=0.00..470.00 rows=1031 width=244)
+   Filter: (unique1 < 1000)
+
+</programlisting>
+
+   The planner examines the <literal>WHERE</literal> clause condition:
+
+<programlisting>
+unique1 < 1000
+</programlisting>
+
+   and looks up the restriction function for the operator
+   <literal><</literal> in <classname>pg_operator</classname>.
+   This is held in the column <structfield>oprrest</structfield>,
+   and the result in this case is <function>scalarltsel</function>.
+   The <function>scalarltsel</function> function retrieves the histogram for
+   <structfield>unique1</structfield> from <classname>pg_statistics</classname>
+   - we can follow this by using the simpler <classname>pg_stats</classname>
+   view:
+
+<programlisting>
+SELECT histogram_bounds FROM pg_stats
+WHERE tablename='tenk1' AND attname='unique1';
+
+                   histogram_bounds
+------------------------------------------------------
+ {1,970,1943,2958,3971,5069,6028,7007,7919,8982,9995}
+</programlisting>
+
+   Next the fraction of the histogram occupied by <quote>< 1000</quote>
+   is worked out. This is the selectivity. The histogram divides the range
+   into equal frequency buckets, so all we have to do is locate the bucket
+   that our value is in and count <emphasis>part</emphasis> of it and
+   <emphasis>all</emphasis> of the ones before. The value 1000 is clearly in
+   the second (970 - 1943) bucket, so by assuming a linear distribution of
+   values inside each bucket we can calculate the selectivity as:
+
+<programlisting>
+selectivity = (1 + (1000 - 970)/(1943 - 970)) / 10
+            = 0.1031
+</programlisting>
+
+   that is, one whole bucket plus a linear fraction of the second, divided by
+   the number of buckets. The estimated number of rows can now be calculated as
+   the product of the selectivity and the cardinality of
+   <classname>tenk1</classname> :
+
+<programlisting>
+rows = 10000 * 0.1031
+     = 1031
+</programlisting>
+
+  </para>
+
+  <para>
+   Next let's consider an example with a <literal>WHERE</literal> clause using
+   the <literal>=</literal> operator :
+
+<programlisting>
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'ATAAAA';
+
+                        QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on tenk1  (cost=0.00..470.00 rows=31 width=244)
+   Filter: (stringu1 = 'ATAAAA'::name)
+</programlisting>
+
+   Again the planner examines the <literal>WHERE</literal> clause condition:
+
+<programlisting>
+stringu1 = 'ATAAAA'
+</programlisting>
+
+   and looks up the restriction function for <literal>=</literal>, which is
+   <function>eqsel</function>. This case is a bit different, as the most
+   common values - <acronym>MCV</acronym>s, are used to determine the
+   selectivity. Let's have a look at these, with some extra columns that will
+   be useful later :
+
+<programlisting>
+SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
+WHERE tablename='tenk1' AND attname='stringu1';
+
+ null_frac  | n_distinct |
+                           most_common_vals                              |
+                           most_common_freqs
+-------------------------------------------------------------------------+-----
+          0 |        672 |
+ {FDAAAA,NHAAAA,ATAAAA,BGAAAA,EBAAAA,MOAAAA,NDAAAA,OWAAAA,BHAAAA,BJAAAA} |
+ {0.00333333,0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.00266667,0.00266667}
+</programlisting>
+
+   The selectivity is merely the frequency corresponding to 'ATAAAA':
+
+<programlisting>
+selectivity = 0.003
+</programlisting>
+
+   The estimated number of rows is just the product of this with the
+   cardinality of <classname>tenk1</classname> as before :
+
+<programlisting>
+rows = 10000 * 0.003
+     = 30
+</programlisting>
+
+   The number displayed by <command>EXPLAIN</command> is one more than this,
+   due to some post estimation checks.
+  </para>
+
+  <para>
+   Now consider the same query, but with a constant that is not in the
+   <acronym>MCV</acronym> list :
+
+<programlisting>
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
+
+                        QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on tenk1  (cost=0.00..470.00 rows=15 width=244)
+   Filter: (stringu1 = 'xxx'::name)
+</programlisting>
+
+   This is quite a different problem, how to estimate the selectivity when the
+   value is <emphasis>not</emphasis> in the MCV list. The approach is to use
+   the fact that the value is not in the list, combined with the knowledge
+   of the frequencies for all of the MCVs :
+
+<programlisting>
+selectivity = (1.0 - (0.00333333 + 0.00333333 + 0.003 + 0.003 + 0.003
+            + 0.003 + 0.003 + 0.003 + 0.00266667 + 0.00266667)) / (672 - 10)
+            = 0.001465
+</programlisting>
+
+   That is, add up all the frequencies for the <acronym>MCV</acronym>s and subtract them from
+   one - because it is <emphasis>not</emphasis> one of these, and divide by
+   the <emphasis>remaining</emphasis> distinct values. Notice that there are no
+   null values so we don't have to worry about those. The estimated number of
+   rows is calculated as usual :
+
+<programlisting>
+rows = 10000 * 0.001465
+     = 15
+</programlisting>
+
+  </para>
+
+  <para>
+   In the case where there is more than one condition in the
+   <literal>WHERE</literal> clause, for example :
+
+<programlisting>
+EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx';
+
+                       QUERY PLAN
+-----------------------------------------------------------
+ Seq Scan on tenk1  (cost=0.00..495.00 rows=2 width=244)
+   Filter: ((unique1 < 1000) AND (stringu1 = 'xxx'::name))
+</programlisting>
+
+   then independence is assumed and the selectivities of the individual
+   restrictions are multiplied together :
+
+<programlisting>
+selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx')
+            = 0.1031 * 0.001465
+            = 0.00015104
+</programlisting>
+
+   The row estimates are calculated as before :
+
+<programlisting>
+rows = 10000 * 0.00015104
+     = 2
+</programlisting>
+  </para>
+
+  <para>
+   Let's examine a query that includes a <literal>JOIN</literal> :
+
+<programlisting>
+EXPLAIN SELECT *  FROM tenk1 t1, tenk2 t2
+WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2;
+
+                                      QUERY PLAN
+-----------------------------------------------------------------------------------------
+ Nested Loop  (cost=0.00..346.90 rows=51 width=488)
+   ->  Index Scan using tenk1_unique1 on tenk1 t1  (cost=0.00..192.57 rows=51 width=244)
+         Index Cond: (unique1 < 50)
+   ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..3.01 rows=1 width=244)
+         Index Cond: ("outer".unique2 = t2.unique2)
+</programlisting>
+
+   The restriction on <classname>tenk1</classname>
+   <quote>unique1 < 50</quote> is evaluated before the nested-loop join.
+   This is handled analogously to the initial example. The restriction operator
+   for <literal><</literal> is <function>scalarlteqsel</function> as before,
+   but this time the value 50 is in the first bucket of the
+   <structfield>unique1</structfield> histogram :
+
+<programlisting>
+selectivity = ((50 - 1) / (970 - 1)) / 10
+            = 0.005057
+
+rows        = 10000 * 0.005057
+            = 51
+</programlisting>
+
+   The restriction for the join is :
+
+<programlisting>
+t2.unique2 = t1.unique2
+</programlisting>
+
+   This is due to the join method being nested-loop, with
+   <classname>tenk1</classname> being in the outer loop. The operator is just
+   our familiar <literal>=<literal>, however the restriction function is
+   obtained from the <structfield>oprjoin</structfield> column of
+   <classname>pg_operator</classname> - and is <function>eqjoinsel</function>.
+   Additionally we use the statistical information for both
+   <classname>tenk2</classname> and <classname>tenk1</classname> :
+
+<programlisting>
+SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
+WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
+
+tablename  | null_frac | n_distinct | most_common_vals
+-----------+-----------+------------+------------------
+ tenk1     |         0 |         -1 |
+ tenk2     |         0 |         -1 |
+</programlisting>
+
+   In this case there is no <acronym>MCV</acronym> information for
+   <structfield>unique2</structfield> because all the values appear to be unique,
+   so we can use an algorithm that relies only on the number of distinct values
+   for both relations together with their null fractions :
+
+<programlisting>
+selectivity = (1 - 0) * (1 - 0) * min(1 / 10000, 1 / 1000)
+            = 0.0001
+</programlisting>
+
+   This is, subtract the null fraction from one for each of the relations,
+   and divide by the maximum  of the two distinct values. The number of rows
+   that the join is likely to emit is calculated as the cardinality of
+   cartesian product of the two nodes in the nested-loop, multiplied by the
+   selectivity :
+
+<programlisting>
+rows = 51 * 10000 * 0.0001
+     = 51
+</programlisting>
+  </para>
+
+ </sect1>
+
  <sect1 id="explicit-joins">
   <title>Controlling the Planner with Explicit <literal>JOIN</> Clauses</title>



pgsql-docs by date:

Previous
From: John DeSoi
Date:
Subject: Re: PostgreSQL 8.0.0 documentation for the MacOS X Help Viewer
Next
From: Rainer Brandt
Date:
Subject: raise not in index