Re: Bogus ANALYZE results for an otherwise-unique column with many nulls - Mailing list pgsql-hackers

From Andreas Joseph Krogh
Subject Re: Bogus ANALYZE results for an otherwise-unique column with many nulls
Date
Msg-id VisenaEmail.0.4fb0246758650501.1568ac7ca0e@tc7-visena
Whole thread Raw
In response to Re: Bogus ANALYZE results for an otherwise-unique column with many nulls  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
På søndag 07. august 2016 kl. 18:35:56, skrev Tom Lane <tgl@sss.pgh.pa.us>:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On 5 August 2016 at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> OK, thanks.  What shall we do about Andreas' request to back-patch this?
>> I'm personally willing to do it, but there is the old bugaboo of "maybe
>> it will destabilize a plan that someone is happy with".

> My inclination would be to back-patch it because arguably it's a
> bug-fix -- at the very least the old behaviour didn't match the docs
> for stadistinct:

Yeah.  I suspect that situations like this are pretty rare, or we'd have
recognized the problem sooner.  And at least for Andreas, it'd be fixing
a real problem.  I'll apply the back-patch unless I hear objections in
the next couple of hours.

> Additionally, I think that example is misleading because it's only
> really true if there are no null values in the column. Perhaps it
> would help to have a more explicit example to illustrate how nulls
> affect stadistinct, for example:

Good idea, will do.

regards, tom lane
 
 
(Sorry for the HTML-formatting, I'm using a web-based client which isn't very good at converting the HTML to plain-text)
 
Just wanted to thank you for fixing this.
 
Take this query (pg-9.5):
 
explain analyze SELECT
      com.entity_id                AS company_id    , sum((coalesce(log.adjusted_time, log.duration) / (3600 * 1000)::numeric) * coalesce(log.adjusted_hourly_rate, log.hourly_rate)) as not_invoiced_hours_amount_adjusted    , sum((log.duration / (3600 * 1000)::numeric) * log.hourly_rate) as not_invoiced_hours_amount_original    , sum(coalesce(log.adjusted_time, log.duration)) AS not_invoiced_hours_duration_adjusted    , sum(log.duration) AS not_invoiced_hours_duration_original
FROM onp_crm_relation com    JOIN onp_crm_activity_log log ON com.entity_id = log.relation_id    JOIN onp_crm_person logfor ON logfor.onp_user_id = log.logged_for AND logfor.is_resource = FALSE
    LEFT OUTER JOIN (            onp_crm_activity act INNER JOIN onp_crm_project proj ON act.project_id = proj.id        ) ON log.activity_id = act.id
WHERE 1 = 1
      AND log.entry_type = 'TIMESHEET_ENTRY'
      AND log.start_date IS NOT NULL
      AND log.is_chargable = TRUE
      AND log.status IN (1,2,3,5)      AND log.start_date <= '2016-06-27'
      --           AND com.entity_id = 2246068
      AND NOT EXISTS(    SELECT * FROM onp_crm_calendarentry_invoice_membership cem        JOIN onp_crm_invoice_line il ON cem.invoice_line_id = il.id        JOIN onp_crm_invoice inv ON il.invoice_id = inv.entity_id    WHERE cem.calendar_entry_id = log.id          AND inv.status_key = 'INVOICE_STATUS_INVOICED'
          AND inv.sent_date <= '2016-06-27'
          AND NOT EXISTS(        SELECT * FROM onp_crm_invoice creditnote        WHERE il.invoice_id = creditnote.credit_against              AND creditnote.status_key = 'INVOICE_STATUS_INVOICED'
              AND creditnote.sent_date <= '2016-06-27'
    )
)

GROUP BY
    com.entity_id;
 
Before the fix, the plan was like this:
 
                                                                                                                        QUERY PLAN                                                                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=13874.94..13894.38 rows=972 width=32) (actual time=420188.465..420189.232 rows=462 loops=1)   Group Key: com.entity_id   ->  Nested Loop Left Join  (cost=1710.21..13848.21 rows=972 width=32) (actual time=1004.858..420166.736 rows=3882 loops=1)         ->  Nested Loop Anti Join  (cost=1709.51..12990.02 rows=972 width=36) (actual time=1003.001..419356.749 rows=3882 loops=1)               Join Filter: (cem.calendar_entry_id = log.id)               Rows Removed by Join Filter: 3996874113
               ->  Hash Join  (cost=1172.69..10612.53 rows=972 width=44) (actual time=29.085..333.323 rows=89431 loops=1)                     Hash Cond: (log.relation_id = com.entity_id)                     ->  Hash Join  (cost=1063.39..10489.85 rows=976 width=40) (actual time=27.598..261.822 rows=89431 loops=1)                           Hash Cond: (log.logged_for = logfor.onp_user_id)                           ->  Bitmap Heap Scan on onp_crm_activity_log log  (cost=976.83..10055.99 rows=90010 width=44) (actual time=27.402..223.407 rows=89486 loops=1)                                 Recheck Cond: ((status = ANY ('{1,2,3,5}'::integer[])) AND is_chargable AND (start_date IS NOT NULL) AND ((entry_type)::text = 'TIMESHEET_ENTRY'::text))                                 Filter: (start_date <= '2016-06-27'::date)                                 Rows Removed by Filter: 991
                                 Heap Blocks: exact=6095
                                 ->  Bitmap Index Scan on origo_calendar_entry_status_1_idx  (cost=0.00..954.33 rows=93008 width=0) (actual time=26.339..26.339 rows=101274 loops=1)                                       Index Cond: (status = ANY ('{1,2,3,5}'::integer[]))                           ->  Hash  (cost=39.46..39.46 rows=3768 width=8) (actual time=0.148..0.148 rows=36 loops=1)                                 Buckets: 4096  Batches: 1  Memory Usage: 34kB                                 ->  Bitmap Heap Scan on onp_crm_person logfor  (cost=3.69..39.46 rows=3768 width=8) (actual time=0.049..0.137 rows=36 loops=1)                                       Recheck Cond: (onp_user_id IS NOT NULL)                                       Filter: (NOT is_resource)                                       Rows Removed by Filter: 5
                                       Heap Blocks: exact=11
                                       ->  Bitmap Index Scan on onp_crm_person_onp_id_idx  (cost=0.00..2.75 rows=41 width=0) (actual time=0.021..0.021 rows=41 loops=1)                     ->  Hash  (cost=80.66..80.66 rows=2291 width=8) (actual time=1.449..1.449 rows=2292 loops=1)                           Buckets: 4096  Batches: 1  Memory Usage: 122kB                           ->  Index Only Scan using onp_crm_relation_pkey on onp_crm_relation com  (cost=0.28..80.66 rows=2291 width=8) (actual time=0.221..1.036 rows=2292 loops=1)                                 Heap Fetches: 1127
               ->  Materialize  (cost=536.82..2362.91 rows=1 width=4) (actual time=0.000..1.564 rows=44693 loops=89431)                     ->  Nested Loop  (cost=536.82..2362.91 rows=1 width=4) (actual time=10.480..382.816 rows=85668 loops=1)                           ->  Nested Loop  (cost=536.53..2362.59 rows=1 width=8) (actual time=10.437..176.495 rows=88731 loops=1)                                 ->  Hash Anti Join  (cost=536.24..2362.05 rows=1 width=8) (actual time=9.914..44.329 rows=76347 loops=1)                                       Hash Cond: (il.invoice_id = creditnote.credit_against)                                       ->  Index Only Scan using origo_invoice_line_id_invoice_idx on onp_crm_invoice_line il  (cost=0.29..1530.95 rows=78707 width=8) (actual time=0.213..20.450 rows=78787 loops=1)                                             Heap Fetches: 3131
                                       ->  Hash  (cost=390.36..390.36 rows=11647 width=8) (actual time=9.589..9.589 rows=372 loops=1)                                             Buckets: 16384  Batches: 1  Memory Usage: 143kB                                             ->  Index Only Scan using origo_invoice_credit_against_status_sent_idx on onp_crm_invoice creditnote  (cost=0.29..390.36 rows=11647 width=8) (actual time=0.205..8.953 rows=11507 loops=1)                                                   Index Cond: ((status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))                                                   Heap Fetches: 999
                                 ->  Index Only Scan using onp_crm_calendarentry_invoice_invoice_line_id_calendar_entr_key on onp_crm_calendarentry_invoice_membership cem  (cost=0.29..0.45 rows=9 width=8) (actual time=0.001..0.001 rows=1 loops=76347)                                       Index Cond: (invoice_line_id = il.id)                                       Heap Fetches: 4951
                           ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.29..0.31 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=88731)                                 Index Cond: ((entity_id = il.invoice_id) AND (status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))                                 Heap Fetches: 9293
         ->  Nested Loop  (cost=0.71..0.87 rows=1 width=4) (actual time=0.207..0.207 rows=1 loops=3882)               ->  Index Scan using onp_crm_activity_pkey on onp_crm_activity act  (cost=0.42..0.56 rows=1 width=8) (actual time=0.199..0.199 rows=1 loops=3882)                     Index Cond: (log.activity_id = id)               ->  Index Only Scan using onp_crm_project_pkey on onp_crm_project proj  (cost=0.29..0.30 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=3882)                     Index Cond: (id = act.project_id)                     Heap Fetches: 2782
 Planning time: 33.132 ms Execution time: 420195.041 ms
(55 rows)
 
After the fix (using pg-9.5.4), the plan is this:
 
                                                                                                                                        QUERY PLAN                                                                                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=44223.20..44269.02 rows=2291 width=32) (actual time=604.480..604.729 rows=462 loops=1)   Group Key: com.entity_id   ->  Hash Left Join  (cost=40805.80..42869.54 rows=49224 width=32) (actual time=591.912..597.284 rows=3882 loops=1)         Hash Cond: (log.activity_id = act.id)         ->  Hash Join  (cost=24077.93..25498.04 rows=49224 width=36) (actual time=309.816..313.542 rows=3882 loops=1)               Hash Cond: (log.relation_id = com.entity_id)               ->  Merge Join  (cost=23968.63..24711.11 rows=49435 width=32) (actual time=304.949..307.474 rows=3882 loops=1)                     Merge Cond: (logfor.onp_user_id = log.logged_for)                     ->  Index Scan using onp_crm_person_onp_id_idx on onp_crm_person logfor  (cost=0.14..41.05 rows=3768 width=8) (actual time=0.194..1.321 rows=36 loops=1)                           Filter: (NOT is_resource)                           Rows Removed by Filter: 5
                     ->  Sort  (cost=23968.49..24092.29 rows=49520 width=36) (actual time=304.579..305.137 rows=3893 loops=1)                           Sort Key: log.logged_for                           Sort Method: quicksort  Memory: 401kB                           ->  Hash Anti Join  (cost=9997.74..20106.99 rows=49520 width=36) (actual time=147.213..302.609 rows=3893 loops=1)                                 Hash Cond: (log.id = cem.calendar_entry_id)                                 ->  Bitmap Heap Scan on onp_crm_activity_log log  (cost=976.83..10055.99 rows=90010 width=44) (actual time=28.286..158.599 rows=89486 loops=1)                                       Recheck Cond: ((status = ANY ('{1,2,3,5}'::integer[])) AND is_chargable AND (start_date IS NOT NULL) AND ((entry_type)::text = 'TIMESHEET_ENTRY'::text))                                       Filter: (start_date <= '2016-06-27'::date)                                       Rows Removed by Filter: 991
                                       Heap Blocks: exact=6095
                                       ->  Bitmap Index Scan on origo_calendar_entry_status_1_idx  (cost=0.00..954.33 rows=93008 width=0) (actual time=27.251..27.251 rows=101274 loops=1)                                             Index Cond: (status = ANY ('{1,2,3,5}'::integer[]))                                 ->  Hash  (cost=7964.73..7964.73 rows=84494 width=4) (actual time=118.106..118.106 rows=85668 loops=1)                                       Buckets: 131072  Batches: 1  Memory Usage: 4036kB                                       ->  Hash Join  (cost=1069.29..7964.73 rows=84494 width=4) (actual time=15.605..102.721 rows=85668 loops=1)                                             Hash Cond: (il.invoice_id = inv.entity_id)                                             ->  Hash Anti Join  (cost=536.55..6265.54 rows=85741 width=8) (actual time=5.467..71.124 rows=88731 loops=1)                                                   Hash Cond: (il.invoice_id = creditnote.credit_against)                                                   ->  Merge Join  (cost=0.61..4501.71 rows=89266 width=8) (actual time=0.441..53.014 rows=89301 loops=1)                                                         Merge Cond: (il.id = cem.invoice_line_id)                                                         ->  Index Only Scan using origo_invoice_line_id_invoice_idx on onp_crm_invoice_line il  (cost=0.29..1530.95 rows=78707 width=8) (actual time=0.226..15.901 rows=78787 loops=1)                                                               Heap Fetches: 3131
                                                         ->  Index Only Scan using onp_crm_calendarentry_invoice_invoice_line_id_calendar_entr_key on onp_crm_calendarentry_invoice_membership cem  (cost=0.29..1662.98 rows=89266 width=8) (actual time=0.194..16.489 rows=89301 loops=1)                                                               Heap Fetches: 5027
                                                   ->  Hash  (cost=390.36..390.36 rows=11647 width=8) (actual time=4.785..4.785 rows=372 loops=1)                                                         Buckets: 16384  Batches: 1  Memory Usage: 143kB                                                         ->  Index Only Scan using origo_invoice_credit_against_status_sent_idx on onp_crm_invoice creditnote  (cost=0.29..390.36 rows=11647 width=8) (actual time=0.020..4.213 rows=11507 loops=1)                                                               Index Cond: ((status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))                                                               Heap Fetches: 999
                                             ->  Hash  (cost=387.15..387.15 rows=11647 width=8) (actual time=10.063..10.063 rows=11507 loops=1)                                                   Buckets: 16384  Batches: 1  Memory Usage: 578kB                                                   ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.29..387.15 rows=11647 width=8) (actual time=0.233..8.133 rows=11507 loops=1)                                                         Index Cond: ((status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))                                                         Heap Fetches: 999
               ->  Hash  (cost=80.66..80.66 rows=2291 width=8) (actual time=4.831..4.831 rows=2292 loops=1)                     Buckets: 4096  Batches: 1  Memory Usage: 122kB                     ->  Index Only Scan using onp_crm_relation_pkey on onp_crm_relation com  (cost=0.28..80.66 rows=2291 width=8) (actual time=0.235..4.370 rows=2292 loops=1)                           Heap Fetches: 1127
         ->  Hash  (cost=14012.31..14012.31 rows=217245 width=4) (actual time=281.015..281.015 rows=220147 loops=1)               Buckets: 262144  Batches: 1  Memory Usage: 9788kB               ->  Hash Join  (cost=433.74..14012.31 rows=217245 width=4) (actual time=29.243..235.289 rows=220147 loops=1)                     Hash Cond: (act.project_id = proj.id)                     ->  Seq Scan on onp_crm_activity act  (cost=0.00..10591.45 rows=217245 width=8) (actual time=0.186..139.016 rows=220147 loops=1)                     ->  Hash  (cost=262.35..262.35 rows=13711 width=4) (actual time=28.980..28.980 rows=13760 loops=1)                           Buckets: 16384  Batches: 1  Memory Usage: 612kB                           ->  Index Only Scan using onp_crm_project_pkey on onp_crm_project proj  (cost=0.29..262.35 rows=13711 width=4) (actual time=0.067..26.575 rows=13760 loops=1)                                 Heap Fetches: 6146
 Planning time: 41.021 ms Execution time: 605.530 ms
(60 rows)
 
That's 420195 / 605 = 695x speedup!
Thanks!
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Why --backup-and-modify-in-place in perltidy config?
Next
From: Jim Nasby
Date:
Subject: Improve handling of ^C in psql / top-level transaction abort