Thread: Bogus ANALYZE results for an otherwise-unique column with many nulls

Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Tom Lane
Date:
I looked into the problem described at
https://www.postgresql.org/message-id/flat/VisenaEmail.26.df42f82acae38a58.156463942b8%40tc7-visena
and I believe I've reproduced it: the requirement is that the inner join
column for the antijoin must contain a lot of NULL values, and what isn't
NULL must be unique or nearly so.  If ANALYZE doesn't come across any
duplicated values, it will set the column's stadistinct value to "-1",
which causes the planner to believe that each row of the inner table
produces a unique value, resulting in a bogus answer from
get_variable_numdistinct() and thence a bogus join size estimate.

Here's an example in the regression database, making use of the existing
unique column tenk1.unique1:

regression=# create table manynulls as select case when random() < 0.1 then unique1 else null end as unique1 from tenk1;
SELECT 10000
regression=# analyze manynulls;
ANALYZE
regression=# explain analyze select * from tenk1 t where not exists(select 1 from manynulls m where m.unique1 = t.unique1);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=261.00..756.50 rows=1 width=244) (actual time=4.632..14.729 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.015..2.683 rows=10000 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=10000 width=4) (actual time=4.553..4.553 rows=1027 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 165kB
         ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=10000 width=4) (actual time=0.019..2.668 rows=10000 loops=1)
 Planning time: 0.808 ms
 Execution time: 15.670 ms
(8 rows)

So the antijoin size estimate is way off, but it's hardly the planner's
fault because the stats are insane:

regression=# select attname,null_frac,n_distinct from pg_stats where tablename = 'manynulls';
 attname | null_frac | n_distinct
---------+-----------+------------
 unique1 |    0.8973 |         -1
(1 row)

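For concreteness, here is roughly what the planner does with those numbers.
The query below only mimics how a negative n_distinct is interpreted (it is
an illustration, not the planner's actual code path):

    -- Illustrative only: a negative n_distinct is a multiplier on the row
    -- count, so n_distinct = -1 reads as "about as many distinct values as
    -- rows", i.e. ~10000 here, even though only ~1027 rows are non-null.
    SELECT s.null_frac,
           s.n_distinct,
           c.reltuples,
           CASE WHEN s.n_distinct < 0
                THEN -s.n_distinct * c.reltuples
                ELSE s.n_distinct
           END AS implied_distinct_values
    FROM pg_stats s
    JOIN pg_class c ON c.relname = s.tablename
    WHERE s.tablename = 'manynulls' AND s.attname = 'unique1';
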
With the patch attached below, ANALYZE produces

regression=# analyze manynulls;
ANALYZE
regression=# select attname,null_frac,n_distinct from pg_stats where tablename = 'manynulls';
 attname | null_frac | n_distinct
---------+-----------+------------
 unique1 |    0.8973 |    -0.1027
(1 row)

and now the join size estimate is dead on:

regression=# explain analyze select * from tenk1 t where not exists(select 1 from manynulls m where m.unique1 = t.unique1);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=261.00..847.69 rows=8973 width=244) (actual time=4.501..13.888 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.031..4.959 rows=10000 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=10000 width=4) (actual time=4.404..4.404 rows=1027 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 165kB
         ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=10000 width=4) (actual time=0.034..2.576 rows=10000 loops=1)
 Planning time: 1.388 ms
 Execution time: 14.542 ms
(8 rows)

What I did in the patch is to scale the formerly fixed "-1.0" stadistinct
estimate to discount the fraction of nulls we found.  An alternative
possibility might be to decree that a fractional stadistinct considers
only non-nulls, but then all the other paths through ANALYZE would be
wrong.  The spec for it in pg_statistic.h doesn't suggest any such
interpretation, either.

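To put numbers on it: with stanullfrac = 0.8973 the patch stores
stadistinct = -(1 - 0.8973) = -0.1027, get_variable_numdistinct() turns that
back into 0.1027 * 10000 ~ 1027 distinct values (matching the 1027 non-null
rows that actually reached the hash), and the antijoin estimate comes out
around 10000 - 1027 ~ 8973 rows.  The same arithmetic, as an illustrative
query against the stored stats (it mimics the planner, it is not the
planner's code):

    SELECT null_frac,                               -- 0.8973
           n_distinct,                              -- -0.1027 with the patch
           -n_distinct * 10000 AS nonnull_distinct  -- about 1027
    FROM pg_stats
    WHERE tablename = 'manynulls' AND attname = 'unique1';
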
Looking around, there are a couple of places outside commands/analyze.c
that are making the same mistake, so this patch isn't complete, but it
illustrates what needs to be done.

This is a bit reminiscent of the nulls-accounting problem we fixed in
commit be4b4dc75, though that tended to result in underestimates not
overestimates of the number of distinct values.  We didn't back-patch
that fix, so probably we shouldn't back-patch this either.  On the other
hand, it is far more open-and-shut that this is wrong.  Thoughts?

            regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*************** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 ****

          if (nmultiple == 0)
          {
!             /* If we found no repeated values, assume it's a unique column */
!             stats->stadistinct = -1.0;
          }
          else if (track_cnt < track_max && toowide_cnt == 0 &&
                   nmultiple == track_cnt)
--- 2049,2059 ----

          if (nmultiple == 0)
          {
!             /*
!              * If we found no repeated non-null values, assume it's a unique
!              * column; but be sure to discount for any nulls we found.
!              */
!             stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
          }
          else if (track_cnt < track_max && toowide_cnt == 0 &&
                   nmultiple == track_cnt)
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2426,2433 ****

          if (nmultiple == 0)
          {
!             /* If we found no repeated values, assume it's a unique column */
!             stats->stadistinct = -1.0;
          }
          else if (toowide_cnt == 0 && nmultiple == ndistinct)
          {
--- 2429,2439 ----

          if (nmultiple == 0)
          {
!             /*
!              * If we found no repeated non-null values, assume it's a unique
!              * column; but be sure to discount for any nulls we found.
!              */
!             stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
          }
          else if (toowide_cnt == 0 && nmultiple == ndistinct)
          {
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2753,2759 ****
          else
              stats->stawidth = stats->attrtype->typlen;
          /* Assume all too-wide values are distinct, so it's a unique column */
!         stats->stadistinct = -1.0;
      }
      else if (null_cnt > 0)
      {
--- 2759,2765 ----
          else
              stats->stawidth = stats->attrtype->typlen;
          /* Assume all too-wide values are distinct, so it's a unique column */
!         stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
      }
      else if (null_cnt > 0)
      {

Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Tom Lane
Date:
I wrote:
> Looking around, there are a couple of places outside commands/analyze.c
> that are making the same mistake, so this patch isn't complete, but it
> illustrates what needs to be done.

Here's a more complete patch.

            regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*************** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 ****

          if (nmultiple == 0)
          {
!             /* If we found no repeated values, assume it's a unique column */
!             stats->stadistinct = -1.0;
          }
          else if (track_cnt < track_max && toowide_cnt == 0 &&
                   nmultiple == track_cnt)
--- 2049,2059 ----

          if (nmultiple == 0)
          {
!             /*
!              * If we found no repeated non-null values, assume it's a unique
!              * column; but be sure to discount for any nulls we found.
!              */
!             stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
          }
          else if (track_cnt < track_max && toowide_cnt == 0 &&
                   nmultiple == track_cnt)
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2426,2433 ****

          if (nmultiple == 0)
          {
!             /* If we found no repeated values, assume it's a unique column */
!             stats->stadistinct = -1.0;
          }
          else if (toowide_cnt == 0 && nmultiple == ndistinct)
          {
--- 2429,2439 ----

          if (nmultiple == 0)
          {
!             /*
!              * If we found no repeated non-null values, assume it's a unique
!              * column; but be sure to discount for any nulls we found.
!              */
!             stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
          }
          else if (toowide_cnt == 0 && nmultiple == ndistinct)
          {
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2753,2759 ****
          else
              stats->stawidth = stats->attrtype->typlen;
          /* Assume all too-wide values are distinct, so it's a unique column */
!         stats->stadistinct = -1.0;
      }
      else if (null_cnt > 0)
      {
--- 2759,2765 ----
          else
              stats->stawidth = stats->attrtype->typlen;
          /* Assume all too-wide values are distinct, so it's a unique column */
!         stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
      }
      else if (null_cnt > 0)
      {
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 0f851ea..817453c 100644
*** a/src/backend/tsearch/ts_typanalyze.c
--- b/src/backend/tsearch/ts_typanalyze.c
*************** compute_tsvector_stats(VacAttrStats *sta
*** 295,301 ****
          stats->stawidth = total_width / (double) nonnull_cnt;

          /* Assume it's a unique column (see notes above) */
!         stats->stadistinct = -1.0;

          /*
           * Construct an array of the interesting hashtable items, that is,
--- 295,301 ----
          stats->stawidth = total_width / (double) nonnull_cnt;

          /* Assume it's a unique column (see notes above) */
!         stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);

          /*
           * Construct an array of the interesting hashtable items, that is,
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index fcb71d3..56504fc 100644
*** a/src/backend/utils/adt/rangetypes_typanalyze.c
--- b/src/backend/utils/adt/rangetypes_typanalyze.c
*************** compute_range_stats(VacAttrStats *stats,
*** 203,209 ****
          /* Do the simple null-frac and width stats */
          stats->stanullfrac = (double) null_cnt / (double) samplerows;
          stats->stawidth = total_width / (double) non_null_cnt;
!         stats->stadistinct = -1.0;

          /* Must copy the target values into anl_context */
          old_cxt = MemoryContextSwitchTo(stats->anl_context);
--- 203,211 ----
          /* Do the simple null-frac and width stats */
          stats->stanullfrac = (double) null_cnt / (double) samplerows;
          stats->stawidth = total_width / (double) non_null_cnt;
!
!         /* Estimate that non-null values are unique */
!         stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);

          /* Must copy the target values into anl_context */
          old_cxt = MemoryContextSwitchTo(stats->anl_context);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index cc2a9a1..56943f2 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** double
*** 4738,4743 ****
--- 4738,4744 ----
  get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
  {
      double        stadistinct;
+     double        stanullfrac = 0.0;
      double        ntuples;

      *isdefault = false;
*************** get_variable_numdistinct(VariableStatDat
*** 4745,4751 ****
      /*
       * Determine the stadistinct value to use.  There are cases where we can
       * get an estimate even without a pg_statistic entry, or can get a better
!      * value than is in pg_statistic.
       */
      if (HeapTupleIsValid(vardata->statsTuple))
      {
--- 4746,4753 ----
      /*
       * Determine the stadistinct value to use.  There are cases where we can
       * get an estimate even without a pg_statistic entry, or can get a better
!      * value than is in pg_statistic.  Grab stanullfrac too if we can find it
!      * (otherwise, assume no nulls, for lack of any better idea).
       */
      if (HeapTupleIsValid(vardata->statsTuple))
      {
*************** get_variable_numdistinct(VariableStatDat
*** 4754,4759 ****
--- 4756,4762 ----

          stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
          stadistinct = stats->stadistinct;
+         stanullfrac = stats->stanullfrac;
      }
      else if (vardata->vartype == BOOLOID)
      {
*************** get_variable_numdistinct(VariableStatDat
*** 4777,4783 ****
              {
                  case ObjectIdAttributeNumber:
                  case SelfItemPointerAttributeNumber:
!                     stadistinct = -1.0; /* unique */
                      break;
                  case TableOidAttributeNumber:
                      stadistinct = 1.0;    /* only 1 value */
--- 4780,4786 ----
              {
                  case ObjectIdAttributeNumber:
                  case SelfItemPointerAttributeNumber:
!                     stadistinct = -1.0; /* unique (and all non null) */
                      break;
                  case TableOidAttributeNumber:
                      stadistinct = 1.0;    /* only 1 value */
*************** get_variable_numdistinct(VariableStatDat
*** 4799,4808 ****
       * If there is a unique index or DISTINCT clause for the variable, assume
       * it is unique no matter what pg_statistic says; the statistics could be
       * out of date, or we might have found a partial unique index that proves
!      * the var is unique for this query.
       */
      if (vardata->isunique)
!         stadistinct = -1.0;

      /*
       * If we had an absolute estimate, use that.
--- 4802,4812 ----
       * If there is a unique index or DISTINCT clause for the variable, assume
       * it is unique no matter what pg_statistic says; the statistics could be
       * out of date, or we might have found a partial unique index that proves
!      * the var is unique for this query.  However, we'd better still believe
!      * the null-fraction statistic.
       */
      if (vardata->isunique)
!         stadistinct = -1.0 * (1.0 - stanullfrac);

      /*
       * If we had an absolute estimate, use that.

Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> What I did in the patch is to scale the formerly fixed "-1.0"
Tom> stadistinct estimate to discount the fraction of nulls we found.

This seems quite dubious to me. stadistinct representing only the
non-null values seems to me to be substantially more useful and less
confusing; it should be up to consumers to take stanullfrac into account
(in general they already do) since in many cases we explicitly do _not_
want to count nulls.

-- 
Andrew (irc:RhodiumToad)



Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Andreas Joseph Krogh
Date:
On Friday, 5 August 2016 at 01:01:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> Looking around, there are a couple of places outside commands/analyze.c
> that are making the same mistake, so this patch isn't complete, but it
> illustrates what needs to be done.

Here's a more complete patch.

regards, tom lane
 
(Sorry for HTML-email, but we live in 2016, and I think it improves readability)
 
Great!
 
I have tested it (with PG-9.6 from HEAD, e7caacf733f3ee77a555aa715ab6fbf4434e6b52) and it sure looks like it fixes the problem for my query.
 
The query:
 
explain analyze
SELECT
      log.relation_id as company_id
    , sum(log.duration) AS duration
FROM onp_crm_activity_log log
    JOIN onp_crm_person logfor ON logfor.onp_user_id = log.logged_for AND logfor.is_resource = FALSE
WHERE 1 = 1
      -- Filter out already invoiced
      AND NOT EXISTS(
          SELECT * FROM onp_crm_calendarentry_invoice_membership cem
              JOIN onp_crm_invoice_line il ON cem.invoice_line_id = il.id
              JOIN onp_crm_invoice inv ON il.invoice_id = inv.entity_id
          WHERE cem.calendar_entry_id = log.id
                AND NOT EXISTS(
                    SELECT * FROM onp_crm_invoice creditnote
                    WHERE il.invoice_id = creditnote.credit_against
                )
      )
GROUP BY
    log.relation_id
;
 
This gave the following explain plan (before the patch), with enable_nestloop=off (needed to prevent a nested-loop anti join which made this query take ~26 minutes):
 
                                                                                                          QUERY PLAN                                                                                                           
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=25201.62..25220.06 rows=1475 width=36) (actual time=381.191..381.281 rows=404 loops=1)
   Group Key: log.relation_id
   ->  Hash Anti Join  (cost=4782.34..25148.28 rows=10667 width=12) (actual time=103.683..361.409 rows=148330 loops=1)
         Hash Cond: (log.id = cem.calendar_entry_id)
         ->  Hash Join  (cost=81.46..20312.73 rows=10667 width=20) (actual time=0.100..156.432 rows=380617 loops=1)
               Hash Cond: (log.logged_for = logfor.onp_user_id)
               ->  Seq Scan on onp_crm_activity_log log  (cost=0.00..18696.71 rows=380771 width=24) (actual time=0.005..49.195 rows=380774 loops=1)
               ->  Hash  (cost=25.02..25.02 rows=4515 width=8) (actual time=0.078..0.078 rows=119 loops=1)
                     Buckets: 8192  Batches: 1  Memory Usage: 69kB
                     ->  Index Scan using onp_crm_person_onp_id_idx on onp_crm_person logfor  (cost=0.14..25.02 rows=4515 width=8) (actual time=0.005..0.063 rows=119 loops=1)
                           Filter: (NOT is_resource)
                           Rows Removed by Filter: 8
         ->  Hash  (cost=4700.87..4700.87 rows=1 width=4) (actual time=103.575..103.575 rows=232412 loops=1)
               Buckets: 262144 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 10219kB
               ->  Hash Join  (cost=474.41..4700.87 rows=1 width=4) (actual time=9.724..76.015 rows=232412 loops=1)
                     Hash Cond: (cem.invoice_line_id = il.id)
                     ->  Seq Scan on onp_crm_calendarentry_invoice_membership cem  (cost=0.00..3354.28 rows=232528 width=8) (actual time=0.004..17.626 rows=232528 loops=1)
                     ->  Hash  (cost=474.40..474.40 rows=1 width=4) (actual time=9.710..9.710 rows=11535 loops=1)
                           Buckets: 16384 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 534kB
                           ->  Merge Join  (cost=415.33..474.40 rows=1 width=4) (actual time=4.149..8.467 rows=11535 loops=1)
                                 Merge Cond: (il.invoice_id = inv.entity_id)
                                 ->  Sort  (cost=415.05..415.06 rows=1 width=8) (actual time=4.138..4.701 rows=11535 loops=1)
                                       Sort Key: il.invoice_id
                                       Sort Method: quicksort  Memory: 925kB
                                       ->  Hash Anti Join  (cost=77.13..415.04 rows=1 width=8) (actual time=0.257..2.716 rows=11535 loops=1)
                                             Hash Cond: (il.invoice_id = creditnote.credit_against)
                                             ->  Seq Scan on onp_crm_invoice_line il  (cost=0.00..294.30 rows=11630 width=8) (actual time=0.003..0.995 rows=11630 loops=1)
                                             ->  Hash  (cost=50.38..50.38 rows=2140 width=4) (actual time=0.247..0.247 rows=37 loops=1)
                                                   Buckets: 4096  Batches: 1  Memory Usage: 34kB
                                                   ->  Index Only Scan using origo_invoice_credit_against_idx on onp_crm_invoice creditnote  (cost=0.28..50.38 rows=2140 width=4) (actual time=0.013..0.182 rows=2140 loops=1)
                                                         Heap Fetches: 0
                                 ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.28..53.98 rows=2140 width=8) (actual time=0.008..1.274 rows=11576 loops=1)
                                       Heap Fetches: 0
 Planning time: 0.955 ms
 Execution time: 381.884 ms
(35 rows)
 
AFAIU, this is the problematic estimate: -> Hash Anti Join (cost=77.13..415.04 rows=1 width=8) (actual time=0.257..2.716 rows=11535 loops=1)
Right?
 
Now (after the patch, and without needing to disable nestloop) it gives the following explain plan:
                                                                                                       QUERY PLAN                                                                                                       
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=44409.89..44428.47 rows=1486 width=36) (actual time=366.502..366.594 rows=404 loops=1)
   Group Key: log.relation_id
   ->  Hash Anti Join  (cost=10157.30..43650.11 rows=151956 width=12) (actual time=99.420..347.309 rows=148327 loops=1)
         Hash Cond: (log.id = cem.calendar_entry_id)
         ->  Merge Join  (cost=0.56..26729.34 rows=379984 width=20) (actual time=0.011..150.415 rows=380614 loops=1)
               Merge Cond: (logfor.onp_user_id = log.logged_for)
               ->  Index Scan using onp_crm_person_onp_id_idx on onp_crm_person logfor  (cost=0.14..25.02 rows=4515 width=8) (actual time=0.006..0.030 rows=35 loops=1)
                     Filter: (NOT is_resource)
                     Rows Removed by Filter: 2
               ->  Index Scan using origo_activity_log_logged_for_idx on onp_crm_activity_log log  (cost=0.42..21977.16 rows=380771 width=24) (actual time=0.003..69.406 rows=380623 loops=1)
         ->  Hash  (cost=7300.48..7300.48 rows=228500 width=4) (actual time=99.226..99.226 rows=232412 loops=1)
               Buckets: 262144  Batches: 1  Memory Usage: 10219kB
               ->  Hash Join  (cost=789.13..7300.48 rows=228500 width=4) (actual time=8.043..67.079 rows=232412 loops=1)
                     Hash Cond: (cem.invoice_line_id = il.id)
                     ->  Seq Scan on onp_crm_calendarentry_invoice_membership cem  (cost=0.00..3354.28 rows=232528 width=8) (actual time=0.007..16.230 rows=232528 loops=1)
                     ->  Hash  (cost=646.27..646.27 rows=11429 width=4) (actual time=8.003..8.003 rows=11535 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 534kB
                           ->  Merge Anti Join  (cost=0.85..646.27 rows=11429 width=4) (actual time=0.021..6.393 rows=11535 loops=1)
                                 Merge Cond: (il.invoice_id = creditnote.credit_against)
                                 ->  Merge Join  (cost=0.57..613.99 rows=11630 width=8) (actual time=0.010..4.771 rows=11630 loops=1)
                                       Merge Cond: (inv.entity_id = il.invoice_id)
                                       ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.28..53.98 rows=2140 width=8) (actual time=0.006..0.318 rows=2140 loops=1)
                                             Heap Fetches: 0
                                       ->  Index Scan using onp_crm_invoice_line_invoice_id_lineno_key on onp_crm_invoice_line il  (cost=0.29..409.28 rows=11630 width=8) (actual time=0.003..2.323 rows=11630 loops=1)
                                 ->  Index Only Scan using onp_crm_invoice_credit_against_key on onp_crm_invoice creditnote  (cost=0.28..47.98 rows=2140 width=4) (actual time=0.011..0.016 rows=38 loops=1)
                                       Heap Fetches: 0
 Planning time: 0.931 ms
 Execution time: 366.710 ms
(28 rows)
 
And, for fun, with max_parallel_workers_per_gather = 6:
                                                                                                                   QUERY PLAN                                                                                                                   
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=39788.73..39851.89 rows=1486 width=36) (actual time=234.289..234.747 rows=404 loops=1)
   Group Key: log.relation_id
   ->  Sort  (cost=39788.73..39799.88 rows=4458 width=36) (actual time=234.276..234.307 rows=1000 loops=1)
         Sort Key: log.relation_id
         Sort Method: quicksort  Memory: 103kB
         ->  Gather  (cost=39054.15..39518.53 rows=4458 width=36) (actual time=233.012..234.110 rows=1000 loops=1)
               Workers Planned: 3
               Workers Launched: 3
               ->  Partial HashAggregate  (cost=38054.15..38072.73 rows=1486 width=36) (actual time=225.607..225.701 rows=250 loops=4)
                     Group Key: log.relation_id
                     ->  Hash Join  (cost=10238.19..37294.37 rows=151956 width=12) (actual time=163.060..219.652 rows=37082 loops=4)
                           Hash Cond: (log.logged_for = logfor.onp_user_id)
                           ->  Hash Anti Join  (cost=10156.73..28460.48 rows=152271 width=16) (actual time=162.618..212.351 rows=37121 loops=4)
                                 Hash Cond: (log.id = cem.calendar_entry_id)
                                 ->  Parallel Seq Scan on onp_crm_activity_log log  (cost=0.00..16117.29 rows=122829 width=24) (actual time=0.010..15.532 rows=95193 loops=4)
                                 ->  Hash  (cost=7300.48..7300.48 rows=228500 width=4) (actual time=161.270..161.270 rows=232412 loops=4)
                                       Buckets: 262144  Batches: 1  Memory Usage: 10219kB
                                       ->  Hash Join  (cost=789.13..7300.48 rows=228500 width=4) (actual time=18.276..112.034 rows=232412 loops=4)
                                             Hash Cond: (cem.invoice_line_id = il.id)
                                             ->  Seq Scan on onp_crm_calendarentry_invoice_membership cem  (cost=0.00..3354.28 rows=232528 width=8) (actual time=0.045..26.061 rows=232528 loops=4)
                                             ->  Hash  (cost=646.27..646.27 rows=11429 width=4) (actual time=18.116..18.116 rows=11535 loops=4)
                                                   Buckets: 16384  Batches: 1  Memory Usage: 534kB
                                                   ->  Merge Anti Join  (cost=0.85..646.27 rows=11429 width=4) (actual time=0.179..14.505 rows=11535 loops=4)
                                                         Merge Cond: (il.invoice_id = creditnote.credit_against)
                                                         ->  Merge Join  (cost=0.57..613.99 rows=11630 width=8) (actual time=0.113..10.871 rows=11630 loops=4)
                                                               Merge Cond: (inv.entity_id = il.invoice_id)
                                                               ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.28..53.98 rows=2140 width=8) (actual time=0.063..0.811 rows=2140 loops=4)
                                                                     Heap Fetches: 0
                                                               ->  Index Scan using onp_crm_invoice_line_invoice_id_lineno_key on onp_crm_invoice_line il  (cost=0.29..409.28 rows=11630 width=8) (actual time=0.046..5.213 rows=11630 loops=4)
                                                         ->  Index Only Scan using onp_crm_invoice_credit_against_key on onp_crm_invoice creditnote  (cost=0.28..47.98 rows=2140 width=4) (actual time=0.063..0.071 rows=38 loops=4)
                                                               Heap Fetches: 0
                           ->  Hash  (cost=25.02..25.02 rows=4515 width=8) (actual time=0.325..0.325 rows=119 loops=4)
                                 Buckets: 8192  Batches: 1  Memory Usage: 69kB
                                 ->  Index Scan using onp_crm_person_onp_id_idx on onp_crm_person logfor  (cost=0.14..25.02 rows=4515 width=8) (actual time=0.032..0.270 rows=119 loops=4)
                                       Filter: (NOT is_resource)
                                       Rows Removed by Filter: 8
 Planning time: 2.443 ms
 Execution time: 239.979 ms
(38 rows)
 
All in all, thanks for looking into this, and +1 for backpatching to 9.5.
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 

Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Andrew Gierth
Date:
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> What I did in the patch is to scale the formerly fixed "-1.0"
Tom> stadistinct estimate to discount the fraction of nulls we found.
Andrew> This seems quite dubious to me. stadistinct representing only
Andrew> the non-null values seems to me to be substantially more useful
Andrew> and less confusing; it should be up to consumers to take
Andrew> stanullfrac into account (in general they already do) since in
Andrew> many cases we explicitly do _not_ want to count nulls.

Hm. I am wrong about this, since it's the fact that consumers are taking
stanullfrac into account that makes the value wrong in the first place.
For example, if a million-row table has stanullfrac=0.9 and
stadistinct=-1, then get_variable_numdistinct is returning 1 million,
and (for example) var_eq_non_const divides 0.1 by that to give a
selectivity of 1 in 10 million, which is obviously wrong.

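A minimal sketch of that situation, with made-up table and column names, for
anyone who wants to reproduce the numbers locally:

    -- ~90% nulls, the remaining ~10% unique; before the fix ANALYZE stores
    -- n_distinct = -1 for this column, and the 0.1 / 1-million selectivity
    -- described above follows from that.
    CREATE TABLE mostly_null AS
      SELECT CASE WHEN g % 10 = 0 THEN g END AS v
      FROM generate_series(1, 1000000) AS g;
    ANALYZE mostly_null;
    SELECT null_frac, n_distinct
    FROM pg_stats
    WHERE tablename = 'mostly_null' AND attname = 'v';
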
But I think the fix is still wrong, because it changes the meaning of
ALTER TABLE ... ALTER col SET (n_distinct=...)  in a non-useful way; it
is no longer possible to nail down a useful negative n_distinct value if
the null fraction of the column is variable. Would it not make more
sense to do the adjustment in get_variable_numdistinct, instead?

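For reference, the knob in question is the per-column n_distinct storage
parameter; a sketch of how one would pin it today (the value here is purely
illustrative):

    -- Pin the distinct estimate by hand; the override is installed into
    -- pg_statistic by the next ANALYZE.  Negative values are multipliers
    -- on the row count.
    ALTER TABLE manynulls ALTER COLUMN unique1 SET (n_distinct = -0.1);
    ANALYZE manynulls;

    -- Remove the override again:
    ALTER TABLE manynulls ALTER COLUMN unique1 RESET (n_distinct);
    ANALYZE manynulls;
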
-- 
Andrew (irc:RhodiumToad)



Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Hm. I am wrong about this, since it's the fact that consumers are taking
> stanullfrac into account that makes the value wrong in the first place.

Also, the way that the value is calculated in the samples-not-all-distinct
case corresponds to the way I have it in the patch.  What you want to do
would correspond to leaving these edge cases alone and changing all the
other ANALYZE cases instead (*plus* changing the consumers).  I find that
a tad scary.

> But I think the fix is still wrong, because it changes the meaning of
> ALTER TABLE ... ALTER col SET (n_distinct=...)  in a non-useful way; it
> is no longer possible to nail down a useful negative n_distinct value if
> the null fraction of the column is variable.

I think that argument is bogus.  If we change the way that
get_variable_numdistinct (and other consumers) use the value, that will
break all existing custom settings of n_distinct, because they will no
longer mean what they did before.  There have been exactly zero field
complaints that people could not get the results they wanted, so I do
not think that's justified.

In short, what you want to do constitutes a redefinition of stadistinct,
while my patch doesn't.  That is far more invasive and I fear it will
break things that are not broken today.
        regards, tom lane



Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> Also, the way that the value is calculated in the
Tom> samples-not-all-distinct case corresponds to the way I have it in
Tom> the patch.

Ahh, gotcha.  You're referring to this:
       /*
        * If we estimated the number of distinct values at more than 10% of
        * the total row count (a very arbitrary limit), then assume that
        * stadistinct should scale with the row count rather than be a fixed
        * value.
        */
       if (stats->stadistinct > 0.1 * totalrows)
           stats->stadistinct = -(stats->stadistinct / totalrows);

where "totalrows" includes nulls obviously. So this expects negative
stadistinct to be scaled by the total table size, and the all-distinct
case should do the same.

Objection withdrawn.

-- 
Andrew (irc:RhodiumToad)



Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Objection withdrawn.

OK, thanks.  What shall we do about Andreas' request to back-patch this?
I'm personally willing to do it, but there is the old bugaboo of "maybe
it will destabilize a plan that someone is happy with".
        regards, tom lane



Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Dean Rasheed
Date:
On 5 August 2016 at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> OK, thanks.  What shall we do about Andreas' request to back-patch this?
> I'm personally willing to do it, but there is the old bugaboo of "maybe
> it will destabilize a plan that someone is happy with".
>

My inclination would be to back-patch it because arguably it's a
bug-fix -- at the very least the old behaviour didn't match the docs
for stadistinct:
     The number of distinct nonnull data values in the column.
     A value greater than zero is the actual number of distinct values.
     A value less than zero is the negative of a multiplier for the number
     of rows in the table; for example, a column in which values appear about
     twice on the average could be represented by
     <structfield>stadistinct</> = -0.5.
 

Additionally, I think that example is misleading because it's only
really true if there are no null values in the column. Perhaps it
would help to have a more explicit example to illustrate how nulls
affect stadistinct, for example:
     ... for example, a column in which about 80% of the values are nonnull
     and each nonnull value appears about twice on average could be
     represented by <structfield>stadistinct</> = -0.4.
 
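A rough way to see that number come out of ANALYZE (hypothetical table,
built so that ~80% of the rows are non-null and each non-null value appears
about twice):

    -- ~80000 non-null rows covering ~40000 distinct values plus 20000 nulls;
    -- expect null_frac ~ 0.2 and n_distinct ~ -0.4 (the distinct estimator
    -- is itself approximate, so the figure will not be exact).
    CREATE TABLE mostly_paired AS
      SELECT g / 2 AS v              -- integer division: values repeat in pairs
      FROM generate_series(1, 80000) AS g
      UNION ALL
      SELECT NULL::int
      FROM generate_series(1, 20000);
    ANALYZE mostly_paired;
    SELECT null_frac, n_distinct
    FROM pg_stats
    WHERE tablename = 'mostly_paired' AND attname = 'v';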

Regards,
Dean



Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Andreas Joseph Krogh
Date:
On Sunday, 7 August 2016 at 09:01:40, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
On 5 August 2016 at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> OK, thanks.  What shall we do about Andreas' request to back-patch this?
> I'm personally willing to do it, but there is the old bugaboo of "maybe
> it will destabilize a plan that someone is happy with".
>

My inclination would be to back-patch it because arguably it's a
bug-fix -- at the very least the old behaviour didn't match the docs
for stadistinct:
[snip]
 
 
Will this then make it into the soon-to-be-released 9.5.4?
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 

Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 5 August 2016 at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> OK, thanks.  What shall we do about Andreas' request to back-patch this?
>> I'm personally willing to do it, but there is the old bugaboo of "maybe
>> it will destabilize a plan that someone is happy with".

> My inclination would be to back-patch it because arguably it's a
> bug-fix -- at the very least the old behaviour didn't match the docs
> for stadistinct:

Yeah.  I suspect that situations like this are pretty rare, or we'd have
recognized the problem sooner.  And at least for Andreas, it'd be fixing
a real problem.  I'll apply the back-patch unless I hear objections in
the next couple of hours.

> Additionally, I think that example is misleading because it's only
> really true if there are no null values in the column. Perhaps it
> would help to have a more explicit example to illustrate how nulls
> affect stadistinct, for example:

Good idea, will do.
        regards, tom lane



Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From
Andreas Joseph Krogh
Date:
On Sunday, 7 August 2016 at 18:35:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 5 August 2016 at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> OK, thanks.  What shall we do about Andreas' request to back-patch this?
>> I'm personally willing to do it, but there is the old bugaboo of "maybe
>> it will destabilize a plan that someone is happy with".

> My inclination would be to back-patch it because arguably it's a
> bug-fix -- at the very least the old behaviour didn't match the docs
> for stadistinct:

Yeah.  I suspect that situations like this are pretty rare, or we'd have
recognized the problem sooner.  And at least for Andreas, it'd be fixing
a real problem.  I'll apply the back-patch unless I hear objections in
the next couple of hours.

> Additionally, I think that example is misleading because it's only
> really true if there are no null values in the column. Perhaps it
> would help to have a more explicit example to illustrate how nulls
> affect stadistinct, for example:

Good idea, will do.

regards, tom lane
 
 
(Sorry for the HTML formatting; I'm using a web-based client which isn't very good at converting HTML to plain text.)
 
Just wanted to thank you for fixing this.
 
Take this query (pg-9.5):
 
explain analyze SELECT
      com.entity_id AS company_id
    , sum((coalesce(log.adjusted_time, log.duration) / (3600 * 1000)::numeric) * coalesce(log.adjusted_hourly_rate, log.hourly_rate)) as not_invoiced_hours_amount_adjusted
    , sum((log.duration / (3600 * 1000)::numeric) * log.hourly_rate) as not_invoiced_hours_amount_original
    , sum(coalesce(log.adjusted_time, log.duration)) AS not_invoiced_hours_duration_adjusted
    , sum(log.duration) AS not_invoiced_hours_duration_original
FROM onp_crm_relation com
    JOIN onp_crm_activity_log log ON com.entity_id = log.relation_id
    JOIN onp_crm_person logfor ON logfor.onp_user_id = log.logged_for AND logfor.is_resource = FALSE
    LEFT OUTER JOIN (
            onp_crm_activity act INNER JOIN onp_crm_project proj ON act.project_id = proj.id
        ) ON log.activity_id = act.id
WHERE 1 = 1
      AND log.entry_type = 'TIMESHEET_ENTRY'
      AND log.start_date IS NOT NULL
      AND log.is_chargable = TRUE
      AND log.status IN (1,2,3,5)
      AND log.start_date <= '2016-06-27'
      --           AND com.entity_id = 2246068
      AND NOT EXISTS(
          SELECT * FROM onp_crm_calendarentry_invoice_membership cem
              JOIN onp_crm_invoice_line il ON cem.invoice_line_id = il.id
              JOIN onp_crm_invoice inv ON il.invoice_id = inv.entity_id
          WHERE cem.calendar_entry_id = log.id
                AND inv.status_key = 'INVOICE_STATUS_INVOICED'
                AND inv.sent_date <= '2016-06-27'
                AND NOT EXISTS(
                    SELECT * FROM onp_crm_invoice creditnote
                    WHERE il.invoice_id = creditnote.credit_against
                          AND creditnote.status_key = 'INVOICE_STATUS_INVOICED'
                          AND creditnote.sent_date <= '2016-06-27'
                )
      )

GROUP BY
    com.entity_id;
 
Before the fix, the plan was like this:
 
                                                                                                                        QUERY PLAN                                                                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=13874.94..13894.38 rows=972 width=32) (actual time=420188.465..420189.232 rows=462 loops=1)
   Group Key: com.entity_id
   ->  Nested Loop Left Join  (cost=1710.21..13848.21 rows=972 width=32) (actual time=1004.858..420166.736 rows=3882 loops=1)
         ->  Nested Loop Anti Join  (cost=1709.51..12990.02 rows=972 width=36) (actual time=1003.001..419356.749 rows=3882 loops=1)
               Join Filter: (cem.calendar_entry_id = log.id)
               Rows Removed by Join Filter: 3996874113
               ->  Hash Join  (cost=1172.69..10612.53 rows=972 width=44) (actual time=29.085..333.323 rows=89431 loops=1)
                     Hash Cond: (log.relation_id = com.entity_id)
                     ->  Hash Join  (cost=1063.39..10489.85 rows=976 width=40) (actual time=27.598..261.822 rows=89431 loops=1)
                           Hash Cond: (log.logged_for = logfor.onp_user_id)
                           ->  Bitmap Heap Scan on onp_crm_activity_log log  (cost=976.83..10055.99 rows=90010 width=44) (actual time=27.402..223.407 rows=89486 loops=1)
                                 Recheck Cond: ((status = ANY ('{1,2,3,5}'::integer[])) AND is_chargable AND (start_date IS NOT NULL) AND ((entry_type)::text = 'TIMESHEET_ENTRY'::text))
                                 Filter: (start_date <= '2016-06-27'::date)
                                 Rows Removed by Filter: 991
                                 Heap Blocks: exact=6095
                                 ->  Bitmap Index Scan on origo_calendar_entry_status_1_idx  (cost=0.00..954.33 rows=93008 width=0) (actual time=26.339..26.339 rows=101274 loops=1)
                                       Index Cond: (status = ANY ('{1,2,3,5}'::integer[]))
                           ->  Hash  (cost=39.46..39.46 rows=3768 width=8) (actual time=0.148..0.148 rows=36 loops=1)
                                 Buckets: 4096  Batches: 1  Memory Usage: 34kB
                                 ->  Bitmap Heap Scan on onp_crm_person logfor  (cost=3.69..39.46 rows=3768 width=8) (actual time=0.049..0.137 rows=36 loops=1)
                                       Recheck Cond: (onp_user_id IS NOT NULL)
                                       Filter: (NOT is_resource)
                                       Rows Removed by Filter: 5
                                       Heap Blocks: exact=11
                                       ->  Bitmap Index Scan on onp_crm_person_onp_id_idx  (cost=0.00..2.75 rows=41 width=0) (actual time=0.021..0.021 rows=41 loops=1)
                     ->  Hash  (cost=80.66..80.66 rows=2291 width=8) (actual time=1.449..1.449 rows=2292 loops=1)
                           Buckets: 4096  Batches: 1  Memory Usage: 122kB
                           ->  Index Only Scan using onp_crm_relation_pkey on onp_crm_relation com  (cost=0.28..80.66 rows=2291 width=8) (actual time=0.221..1.036 rows=2292 loops=1)
                                 Heap Fetches: 1127
               ->  Materialize  (cost=536.82..2362.91 rows=1 width=4) (actual time=0.000..1.564 rows=44693 loops=89431)
                     ->  Nested Loop  (cost=536.82..2362.91 rows=1 width=4) (actual time=10.480..382.816 rows=85668 loops=1)
                           ->  Nested Loop  (cost=536.53..2362.59 rows=1 width=8) (actual time=10.437..176.495 rows=88731 loops=1)
                                 ->  Hash Anti Join  (cost=536.24..2362.05 rows=1 width=8) (actual time=9.914..44.329 rows=76347 loops=1)
                                       Hash Cond: (il.invoice_id = creditnote.credit_against)
                                       ->  Index Only Scan using origo_invoice_line_id_invoice_idx on onp_crm_invoice_line il  (cost=0.29..1530.95 rows=78707 width=8) (actual time=0.213..20.450 rows=78787 loops=1)
                                             Heap Fetches: 3131
                                       ->  Hash  (cost=390.36..390.36 rows=11647 width=8) (actual time=9.589..9.589 rows=372 loops=1)
                                             Buckets: 16384  Batches: 1  Memory Usage: 143kB
                                             ->  Index Only Scan using origo_invoice_credit_against_status_sent_idx on onp_crm_invoice creditnote  (cost=0.29..390.36 rows=11647 width=8) (actual time=0.205..8.953 rows=11507 loops=1)
                                                   Index Cond: ((status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))
                                                   Heap Fetches: 999
                                 ->  Index Only Scan using onp_crm_calendarentry_invoice_invoice_line_id_calendar_entr_key on onp_crm_calendarentry_invoice_membership cem  (cost=0.29..0.45 rows=9 width=8) (actual time=0.001..0.001 rows=1 loops=76347)
                                       Index Cond: (invoice_line_id = il.id)
                                       Heap Fetches: 4951
                           ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.29..0.31 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=88731)
                                 Index Cond: ((entity_id = il.invoice_id) AND (status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))
                                 Heap Fetches: 9293
         ->  Nested Loop  (cost=0.71..0.87 rows=1 width=4) (actual time=0.207..0.207 rows=1 loops=3882)
               ->  Index Scan using onp_crm_activity_pkey on onp_crm_activity act  (cost=0.42..0.56 rows=1 width=8) (actual time=0.199..0.199 rows=1 loops=3882)
                     Index Cond: (log.activity_id = id)
               ->  Index Only Scan using onp_crm_project_pkey on onp_crm_project proj  (cost=0.29..0.30 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=3882)
                     Index Cond: (id = act.project_id)
                     Heap Fetches: 2782
 Planning time: 33.132 ms
 Execution time: 420195.041 ms
(55 rows)
 
After the fix (using pg-9.5.4), the plan is this:
 
                                                                                                                                        QUERY PLAN                                                                                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=44223.20..44269.02 rows=2291 width=32) (actual time=604.480..604.729 rows=462 loops=1)
   Group Key: com.entity_id
   ->  Hash Left Join  (cost=40805.80..42869.54 rows=49224 width=32) (actual time=591.912..597.284 rows=3882 loops=1)
         Hash Cond: (log.activity_id = act.id)
         ->  Hash Join  (cost=24077.93..25498.04 rows=49224 width=36) (actual time=309.816..313.542 rows=3882 loops=1)
               Hash Cond: (log.relation_id = com.entity_id)
               ->  Merge Join  (cost=23968.63..24711.11 rows=49435 width=32) (actual time=304.949..307.474 rows=3882 loops=1)
                     Merge Cond: (logfor.onp_user_id = log.logged_for)
                     ->  Index Scan using onp_crm_person_onp_id_idx on onp_crm_person logfor  (cost=0.14..41.05 rows=3768 width=8) (actual time=0.194..1.321 rows=36 loops=1)
                           Filter: (NOT is_resource)
                           Rows Removed by Filter: 5
                     ->  Sort  (cost=23968.49..24092.29 rows=49520 width=36) (actual time=304.579..305.137 rows=3893 loops=1)
                           Sort Key: log.logged_for
                           Sort Method: quicksort  Memory: 401kB
                           ->  Hash Anti Join  (cost=9997.74..20106.99 rows=49520 width=36) (actual time=147.213..302.609 rows=3893 loops=1)
                                 Hash Cond: (log.id = cem.calendar_entry_id)
                                 ->  Bitmap Heap Scan on onp_crm_activity_log log  (cost=976.83..10055.99 rows=90010 width=44) (actual time=28.286..158.599 rows=89486 loops=1)
                                       Recheck Cond: ((status = ANY ('{1,2,3,5}'::integer[])) AND is_chargable AND (start_date IS NOT NULL) AND ((entry_type)::text = 'TIMESHEET_ENTRY'::text))
                                       Filter: (start_date <= '2016-06-27'::date)
                                       Rows Removed by Filter: 991
                                       Heap Blocks: exact=6095
                                       ->  Bitmap Index Scan on origo_calendar_entry_status_1_idx  (cost=0.00..954.33 rows=93008 width=0) (actual time=27.251..27.251 rows=101274 loops=1)
                                             Index Cond: (status = ANY ('{1,2,3,5}'::integer[]))
                                 ->  Hash  (cost=7964.73..7964.73 rows=84494 width=4) (actual time=118.106..118.106 rows=85668 loops=1)
                                       Buckets: 131072  Batches: 1  Memory Usage: 4036kB
                                       ->  Hash Join  (cost=1069.29..7964.73 rows=84494 width=4) (actual time=15.605..102.721 rows=85668 loops=1)
                                             Hash Cond: (il.invoice_id = inv.entity_id)
                                             ->  Hash Anti Join  (cost=536.55..6265.54 rows=85741 width=8) (actual time=5.467..71.124 rows=88731 loops=1)
                                                   Hash Cond: (il.invoice_id = creditnote.credit_against)
                                                   ->  Merge Join  (cost=0.61..4501.71 rows=89266 width=8) (actual time=0.441..53.014 rows=89301 loops=1)
                                                         Merge Cond: (il.id = cem.invoice_line_id)
                                                         ->  Index Only Scan using origo_invoice_line_id_invoice_idx on onp_crm_invoice_line il  (cost=0.29..1530.95 rows=78707 width=8) (actual time=0.226..15.901 rows=78787 loops=1)
                                                               Heap Fetches: 3131
                                                         ->  Index Only Scan using onp_crm_calendarentry_invoice_invoice_line_id_calendar_entr_key on onp_crm_calendarentry_invoice_membership cem  (cost=0.29..1662.98 rows=89266 width=8) (actual time=0.194..16.489 rows=89301 loops=1)
                                                               Heap Fetches: 5027
                                                   ->  Hash  (cost=390.36..390.36 rows=11647 width=8) (actual time=4.785..4.785 rows=372 loops=1)
                                                         Buckets: 16384  Batches: 1  Memory Usage: 143kB
                                                         ->  Index Only Scan using origo_invoice_credit_against_status_sent_idx on onp_crm_invoice creditnote  (cost=0.29..390.36 rows=11647 width=8) (actual time=0.020..4.213 rows=11507 loops=1)
                                                               Index Cond: ((status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))
                                                               Heap Fetches: 999
                                             ->  Hash  (cost=387.15..387.15 rows=11647 width=8) (actual time=10.063..10.063 rows=11507 loops=1)
                                                   Buckets: 16384  Batches: 1  Memory Usage: 578kB
                                                   ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.29..387.15 rows=11647 width=8) (actual time=0.233..8.133 rows=11507 loops=1)
                                                         Index Cond: ((status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))
                                                         Heap Fetches: 999
               ->  Hash  (cost=80.66..80.66 rows=2291 width=8) (actual time=4.831..4.831 rows=2292 loops=1)
                     Buckets: 4096  Batches: 1  Memory Usage: 122kB
                     ->  Index Only Scan using onp_crm_relation_pkey on onp_crm_relation com  (cost=0.28..80.66 rows=2291 width=8) (actual time=0.235..4.370 rows=2292 loops=1)
                           Heap Fetches: 1127
         ->  Hash  (cost=14012.31..14012.31 rows=217245 width=4) (actual time=281.015..281.015 rows=220147 loops=1)
               Buckets: 262144  Batches: 1  Memory Usage: 9788kB
               ->  Hash Join  (cost=433.74..14012.31 rows=217245 width=4) (actual time=29.243..235.289 rows=220147 loops=1)
                     Hash Cond: (act.project_id = proj.id)
                     ->  Seq Scan on onp_crm_activity act  (cost=0.00..10591.45 rows=217245 width=8) (actual time=0.186..139.016 rows=220147 loops=1)
                     ->  Hash  (cost=262.35..262.35 rows=13711 width=4) (actual time=28.980..28.980 rows=13760 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 612kB
                           ->  Index Only Scan using onp_crm_project_pkey on onp_crm_project proj  (cost=0.29..262.35 rows=13711 width=4) (actual time=0.067..26.575 rows=13760 loops=1)
                                 Heap Fetches: 6146
 Planning time: 41.021 ms
 Execution time: 605.530 ms
(60 rows)
 
That's 420195 / 605 = 695x speedup!
Thanks!
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 