Thread: Row estimates off by two orders of magnitude with hstore

Row estimates off by two orders of magnitude with hstore

From: Patrick Krecker
Date: Wed, Jun 10, 2015
Hi everyone --

I had an issue the other day where a relatively simple query went from
taking about 1 minute to execute to taking 19 hours. It seems that the
planner chooses to use a materialize sometimes [1] and not other times
[2]. I think the issue is that the row count estimate for the result
of the condition "type_id = 23 and ref.attributes ? 'reference'" is
about 10k rows, but the actual result is 4624280. It seems the
estimate varies slightly over time, and if it drops low enough then
the planner decides to materialize the result of the bitmap heap scan
and the query takes forever.

As an exercise, I tried removing the clause "ref.attributes ?
'reference'" and the estimates are very accurate [3].

It seems to me that improving the row estimates would be prudent, but
I can't figure out how postgres figures out the estimate for the
hstore ? operator. I suspect it is making a wild guess, since it has
no data on the contents of the hstore in its estimates.
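
One way to dig further is to inspect the operator's restriction estimator in the catalogs (a sketch, assuming the hstore extension is installed; if the estimator turns out to be a constant-selectivity function such as contsel, the planner applies a fixed fraction to the input row count no matter what the data looks like, which would explain the behavior):

```sql
-- Look up which selectivity-estimator function backs hstore's ? operator.
-- A constant estimator would explain why the guess never tracks the data.
SELECT oprname, oprleft::regtype AS left_type, oprrest
FROM pg_operator
WHERE oprname = '?'
  AND oprleft = 'hstore'::regtype;
```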

[1] Query plan with materialize:
# explain (analyze, buffers) declare "foo_cursor" cursor for  SELECT
  ref.case_id,     array_agg(ref.attributes -> 'reference') FROM
component ref JOIN document c ON c.id = ref.case_id WHERE ref.type_id
= 23 AND ref.attributes ? 'reference' AND NOT 0 = ANY(c.types) GROUP
BY ref.case_id;

         QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=390241.39..16973874.43 rows=10 width=35)
(actual time=81253.653..69928443.721 rows=90969 loops=1)
   Buffers: shared hit=54539 read=1551654, temp read=11959529274 written=126377
   ->  Nested Loop  (cost=390241.39..16973835.54 rows=5169 width=35)
(actual time=16157.261..69925063.268 rows=2488142 loops=1)
         Join Filter: (ref.case_id = c.id)
         Rows Removed by Join Filter: 437611625378
         Buffers: shared hit=54539 read=1551654, temp read=11959529274
written=126377
         ->  Index Scan using document_pkey on document c
(cost=0.42..314999.89 rows=99659 width=4) (actual
time=0.016..59255.527 rows=94634 loops=1)
               Filter: (0 <> ALL (types))
               Rows Removed by Filter: 70829
               Buffers: shared hit=54539 read=113549
         ->  Materialize  (cost=390240.97..3189944.33 rows=9010
width=35) (actual time=0.088..450.898 rows=4624280 loops=94634)
               Buffers: shared read=1438105, temp read=11959529274
written=126377
               ->  Bitmap Heap Scan on component ref
(cost=390240.97..3189899.28 rows=9010 width=35) (actual
time=8107.625..79508.136 rows=4624280 loops=1)
                     Recheck Cond: (type_id = 23)
                     Rows Removed by Index Recheck: 49237707
                     Filter: (attributes ? 'reference'::text)
                     Rows Removed by Filter: 4496624
                     Buffers: shared read=1438105
                     ->  Bitmap Index Scan on component_type_id
(cost=0.00..390238.72 rows=9009887 width=0) (actual
time=8105.963..8105.963 rows=9666659 loops=1)
                           Index Cond: (type_id = 23)
                           Buffers: shared read=156948
 Total runtime: 69928552.870 ms

[2] Query plan with simple bitmap heap scan:
# explain (analyze, buffers) declare "foo_cursor" cursor for  SELECT
  ref.case_id,     array_agg(ref.attributes -> 'reference') FROM
component ref JOIN document c ON c.id = ref.case_id WHERE ref.type_id
= 23 AND ref.attributes ? 'reference' AND NOT 0 = ANY(c.types) GROUP
BY ref.case_id;

        QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=3724614.33..3724614.46 rows=10 width=34) (actual
time=248900.627..249030.530 rows=90969 loops=1)
   Buffers: shared hit=16962106 read=5551197 written=5806
   ->  Nested Loop  (cost=488092.46..3724570.36 rows=5863 width=34)
(actual time=30839.635..246416.327 rows=2488142 loops=1)
         Buffers: shared hit=16962106 read=5551197 written=5806
         ->  Bitmap Heap Scan on component ref
(cost=488088.03..3638070.85 rows=10220 width=34) (actual
time=30833.215..196239.109 rows=4624280 loops=1)
               Recheck Cond: (type_id = 23)
               Rows Removed by Index Recheck: 57730489
               Filter: (attributes ? 'reference'::text)
               Rows Removed by Filter: 4496624
               Buffers: shared hit=6922 read=1901840 written=2252
               ->  Bitmap Index Scan on component_type_id
(cost=0.00..488085.48 rows=10220388 width=0) (actual
time=30811.185..30811.185 rows=13292968 loops=1)
                     Index Cond: (type_id = 23)
                     Buffers: shared hit=6922 read=162906 written=1529
         ->  Bitmap Heap Scan on document c  (cost=4.43..8.45 rows=1
width=4) (actual time=0.010..0.010 rows=1 loops=4624280)
               Recheck Cond: (id = ref.case_id)
               Filter: (0 <> ALL (types))
               Rows Removed by Filter: 0
               Buffers: shared hit=16955184 read=3649357 written=3554
               ->  Bitmap Index Scan on document_pkey
(cost=0.00..4.43 rows=1 width=0) (actual time=0.004..0.004 rows=1
loops=4624280)
                     Index Cond: (id = ref.case_id)
                     Buffers: shared hit=14090230 read=1890031 written=1819
 Total runtime: 249051.265 ms

[3] Query plan with hstore clause removed:
# explain (analyze, buffers) declare "foo_cursor" cursor for  SELECT
  ref.case_id,     array_agg(ref.attributes -> 'reference') FROM
component ref JOIN document c ON c.id = ref.case_id WHERE ref.type_id
= 23 AND  NOT 0 = ANY(c.types) GROUP BY ref.case_id;

              QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=5922838.29..12804427.63 rows=9817 width=34)
(actual time=168896.503..181202.278 rows=93580 loops=1)
   Buffers: shared hit=13847 read=2104804 written=26, temp read=902336
written=902336
   ->  Merge Join  (cost=5922838.29..12760329.44 rows=5863397
width=34) (actual time=168896.459..180103.335 rows=5115136 loops=1)
         Merge Cond: (c.id = ref.case_id)
         Buffers: shared hit=13847 read=2104804 written=26, temp
read=902336 written=902336
         ->  Index Scan using document_pkey on document c
(cost=0.43..6696889.20 rows=2128590 width=4) (actual
time=0.006..7684.681 rows=94634 loops=1)
               Filter: (0 <> ALL (types))
               Rows Removed by Filter: 70829
               Buffers: shared hit=13847 read=196042
         ->  Materialize  (cost=5922836.37..5973938.31 rows=10220388
width=34) (actual time=168896.449..171403.773 rows=9120904 loops=1)
               Buffers: shared read=1908762 written=26, temp
read=902336 written=902336
               ->  Sort  (cost=5922836.37..5948387.34 rows=10220388
width=34) (actual time=168896.447..170586.341 rows=9120904 loops=1)
                     Sort Key: ref.case_id
                     Sort Method: external merge  Disk: 1392648kB
                     Buffers: shared read=1908762 written=26, temp
read=902336 written=902336
                     ->  Bitmap Heap Scan on component ref
(cost=490640.57..3615072.42 rows=10220388 width=34) (actual
time=21652.779..148333.012 rows=9120904 loops=1)
                           Recheck Cond: (type_id = 23)
                           Rows Removed by Index Recheck: 57730489
                           Buffers: shared read=1908762 written=26
                           ->  Bitmap Index Scan on component_type_id
(cost=0.00..488085.48 rows=10220388 width=0) (actual
time=21649.716..21649.716 rows=13292968 loops=1)
                                 Index Cond: (type_id = 23)
                                 Buffers: shared read=169828
 Total runtime: 181378.101 ms


Re: Row estimates off by two orders of magnitude with hstore

From: Merlin Moncure
Date: Wed, Jun 10, 2015
On Wed, Jun 10, 2015 at 12:32 PM, Patrick Krecker <patrick@judicata.com> wrote:
> Hi everyone --
>
> I had an issue the other day where a relatively simple query went from
> taking about 1 minute to execute to taking 19 hours. It seems that the
> planner chooses to use a materialize sometimes [1] and not other times
> [2]. I think the issue is that the row count estimate for the result
> of the condition "type_id = 23 and ref.attributes ? 'reference'" is
> about 10k rows, but the actual result is 4624280. It seems the
> estimate varies slightly over time, and if it drops low enough then
> the planner decides to materialize the result of the bitmap heap scan
> and the query takes forever.
>
> As an exercise, I tried removing the clause "ref.attributes ?
> 'reference'" and the estimates are very accurate [3].

This is a fundamental issue with using 'database in a box' datatypes
like hstore and jsonb.  They are opaque to the statistics gathering
system and so are unable to give reasonable estimates beyond broad
assumptions.  Speaking generally, the workarounds are:

*) disable particular plan choices for this query
(materialize/nestloop are common culprits)

*) create btree indexes around specific extraction clauses

*) refactor some of the query into a set-returning function with a
custom ROWS clause

*) try alternate indexing strategy such as jsonb/jsquery

*) move out of hstore and into a more standard relational structure

none of the above may be ideal in your particular case.
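
As an illustration of the set-returning-function workaround, something along these lines (a sketch; the function name and the ROWS figure are invented for this example, the tables are the ones from the thread):

```sql
-- A wrapper function lets us hand the planner a row estimate it cannot
-- derive itself: the ROWS clause replaces the default guess for the
-- opaque hstore filter.
CREATE OR REPLACE FUNCTION components_with_reference(wanted_type int)
RETURNS SETOF component AS $$
  SELECT *
  FROM component
  WHERE type_id = wanted_type
    AND attributes ? 'reference'
$$ LANGUAGE sql STABLE
ROWS 4500000;  -- hand-tuned estimate, close to the observed row count
```

The query would then join against `components_with_reference(23)` instead of filtering `component` directly.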

merlin


Re: Row estimates off by two orders of magnitude with hstore

From: Josh Berkus
Date: Wed, Jun 10, 2015
On 06/10/2015 11:32 AM, Merlin Moncure wrote:
> This is a fundamental issue with using 'database in a box' datatypes
> like hstore and jsonb.  They are opaque to the statistics gathering
> system and so are unable to give reasonable estimates beyond broad
> assumptions.  Speaking generally, the workarounds are:
>
> *) disable particular plan choices for this query
> (materialize/nestloop are common culprits)
>
> *) create btree indexes around specific extraction clauses
>
> *) refactor some of the query into set returning function with a
> custom ROWS clause
>
> *) try alternate indexing strategy such as jsonb/jsquery
>
> *) move out of hstore and into a more standard relational structure

You forgot:

*) Fund a PostgreSQL developer to add selectivity estimation and stats
to hstore.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Row estimates off by two orders of magnitude with hstore

From: Merlin Moncure
Date: Wed, Jun 10, 2015
On Wed, Jun 10, 2015 at 2:40 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 06/10/2015 11:32 AM, Merlin Moncure wrote:
>> This is a fundamental issue with using 'database in a box' datatypes
>> like hstore and jsonb.  They are opaque to the statistics gathering
>> system and so are unable to give reasonable estimates beyond broad
>> assumptions.  Speaking generally, the workarounds are:
>>
>> *) disable particular plan choices for this query
>> (materialize/nestloop are common culprits)
>>
>> *) create btree indexes around specific extraction clauses
>>
>> *) refactor some of the query into set returning function with a
>> custom ROWS clause
>>
>> *) try alternate indexing strategy such as jsonb/jsquery
>>
>> *) move out of hstore and into a more standard relational structure
>
> You forgot:
>
> *) Fund a PostgreSQL developer to add selectivity estimation and stats
> to hstore.

Well, I don't know.  That's really complex to the point of making me
wonder if it's worth doing even given infinite time and resources.  If
it was my money, I'd be researching a clean way to inject estimate
returning expressions into the query that the planner could utilize.
Not 'hints' which are really about managing the output of the planner,
just what feeds in.  Also lots of various solutions of alcohol to
lubricate the attendant -hackers discussions.

merlin


Re: Row estimates off by two orders of magnitude with hstore

From: Patrick Krecker
Date: Wed, Jun 10, 2015
OK. Well, fortunately for us, we have a lot of possible solutions to this
problem, and it sounds like actually getting statistics for attributes
? 'reference' is not realistic. I just wanted to make sure it wasn't
some configuration error on our part.

Can anyone explain where exactly the estimate for that clause comes from?

I tried adding an index, and I don't think it improved the estimation;
the planner still thinks there will be 9k rows as a result of type_id
= 23 and attributes ? 'reference' [1]. It might make the pathological
plan less likely, though. It's not clear to me that it reduces the
risk of a pathological plan to zero.

I also tried wrapping it in a subquery [2]. The estimate is, of
course, still awful, but it doesn't matter anymore because it can't
pick a plan that leverages its low estimate. Its only choice is a
simple filter on the results.

[1]
# CREATE INDEX foobarbaz ON component((attributes -> 'reference'))
WHERE ( attributes ? 'reference' );

CREATE INDEX

judicata=# explain (analyze, buffers) declare "foo_cursor" cursor for
SELECT     ref.case_id,     array_agg(ref.attributes -> 'reference')
FROM component ref JOIN document c ON c.id = ref.case_id WHERE
ref.type_id = 23 AND ref.attributes ? 'reference' AND NOT 0 =
ANY(c.types) GROUP BY ref.case_id;

 QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=419667.86..419667.99 rows=10 width=34) (actual
time=97074.773..97197.487 rows=90969 loops=1)
   Buffers: shared hit=16954389 read=4533956 dirtied=2963 written=4759
   ->  Nested Loop  (cost=5472.44..419628.76 rows=5213 width=34)
(actual time=537.202..94710.844 rows=2488142 loops=1)
         Buffers: shared hit=16954389 read=4533956 dirtied=2963 written=4759
         ->  Bitmap Heap Scan on component ref
(cost=5468.01..342716.88 rows=9087 width=34) (actual
time=534.862..49617.945 rows=4624280 loops=1)
               Recheck Cond: (attributes ? 'reference'::text)
               Rows Removed by Index Recheck: 28739170
               Filter: (type_id = 23)
               Rows Removed by Filter: 165268
               Buffers: shared hit=25 read=921758 dirtied=2963 written=906
               ->  Bitmap Index Scan on foobarbaz  (cost=0.00..5465.74
rows=98636 width=0) (actual time=532.215..532.215 rows=4789548
loops=1)
                     Buffers: shared read=59300 written=57
         ->  Bitmap Heap Scan on document c  (cost=4.43..8.45 rows=1
width=4) (actual time=0.009..0.009 rows=1 loops=4624280)
               Recheck Cond: (id = ref.case_id)
               Filter: (0 <> ALL (types))
               Rows Removed by Filter: 0
               Buffers: shared hit=16954364 read=3612198 written=3853
               ->  Bitmap Index Scan on document_pkey
(cost=0.00..4.43 rows=1 width=0) (actual time=0.003..0.003 rows=1
loops=4624280)
                     Index Cond: (id = ref.case_id)
                     Buffers: shared hit=14082540 read=1859742 written=1974
 Total runtime: 97217.718 ms

[2]
# explain (analyze, buffers) declare "foo_cursor" cursor for  SELECT *
FROM (SELECT     ref.case_id as case_id,     array_agg(ref.attributes
-> 'reference') as reference FROM component ref JOIN document c ON
c.id = ref.case_id WHERE ref.type_id = 23 AND NOT 0 = ANY(c.types)
GROUP BY ref.case_id) as t WHERE reference IS NOT NULL;

              QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=5636347.52..12524155.45 rows=9817 width=34)
(actual time=165466.502..195035.433 rows=93580 loops=1)
   Filter: (array_agg((ref.attributes -> 'reference'::text)) IS NOT NULL)
   Buffers: shared hit=13884 read=2085572 written=2952, temp
read=902337 written=902337
   ->  Merge Join  (cost=5636347.52..12458841.11 rows=5213367
width=34) (actual time=165383.814..193813.490 rows=5115136 loops=1)
         Merge Cond: (c.id = ref.case_id)
         Buffers: shared hit=13884 read=2085572 written=2952, temp
read=902337 written=902337
         ->  Index Scan using document_pkey on document c
(cost=0.43..6696889.20 rows=2128590 width=4) (actual
time=0.009..24720.726 rows=94634 loops=1)
               Filter: (0 <> ALL (types))
               Rows Removed by Filter: 70829
               Buffers: shared hit=13852 read=195821
         ->  Materialize  (cost=5636345.76..5681782.42 rows=9087332
width=34) (actual time=165383.798..168027.149 rows=9120904 loops=1)
               Buffers: shared hit=32 read=1889751 written=2952, temp
read=902337 written=902337
               ->  Sort  (cost=5636345.76..5659064.09 rows=9087332
width=34) (actual time=165383.793..167173.325 rows=9120904 loops=1)
                     Sort Key: ref.case_id
                     Sort Method: external merge  Disk: 1392648kB
                     Buffers: shared hit=32 read=1889751 written=2952,
temp read=902337 written=902337
                     ->  Bitmap Heap Scan on component ref
(cost=481859.39..3592128.04 rows=9087332 width=34) (actual
time=20950.899..145515.599 rows=9120904 loops=1)
                           Recheck Cond: (type_id = 23)
                           Rows Removed by Index Recheck: 57286889
                           Buffers: shared hit=32 read=1889751 written=2952
                           ->  Bitmap Index Scan on component_type_id
(cost=0.00..479587.56 rows=9087332 width=0) (actual
time=20947.739..20947.739 rows=12143019 loops=1)
                                 Index Cond: (type_id = 23)
                                 Buffers: shared read=164918 written=2816
 Total runtime: 195213.232 ms

On Wed, Jun 10, 2015 at 1:01 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Jun 10, 2015 at 2:40 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 06/10/2015 11:32 AM, Merlin Moncure wrote:
>>> This is a fundamental issue with using 'database in a box' datatypes
>>> like hstore and jsonb.  They are opaque to the statistics gathering
>>> system and so are unable to give reasonable estimates beyond broad
>>> assumptions.  Speaking generally, the workarounds are:
>>>
>>> *) disable particular plan choices for this query
>>> (materialize/nestloop are common culprits)
>>>
>>> *) create btree indexes around specific extraction clauses
>>>
>>> *) refactor some of the query into set returning function with a
>>> custom ROWS clause
>>>
>>> *) try alternate indexing strategy such as jsonb/jsquery
>>>
>>> *) move out of hstore and into a more standard relational structure
>>
>> You forgot:
>>
>> *) Fund a PostgreSQL developer to add selectivity estimation and stats
>> to hstore.
>
> Well, I don't know.  That's really complex to the point of making me
> wonder if it's worth doing even given infinite time and resources.  If
> it was my money, I'd be researching a clean way to inject estimate
> returning expressions into the query that the planner could utilize.
> Not 'hints' which are really about managing the output of the planner,
> just what feeds in.  Also lots of various solutions of alcohol to
> lubricate the attendant -hackers discussions.
>
> merlin


Re: Row estimates off by two orders of magnitude with hstore

From: Merlin Moncure
Date: Wed, Jun 10, 2015
On Wed, Jun 10, 2015 at 3:55 PM, Patrick Krecker <patrick@judicata.com> wrote:
> OK. Well, fortunately for us, we have a lot of possible solutions to this
> problem, and it sounds like actually getting statistics for attributes
> ? 'reference' is not realistic. I just wanted to make sure it wasn't
> some configuration error on our part.
>
> Can anyone explain where exactly the estimate for that clause comes from?
>
> I tried adding an index and I don't think it improved the estimation,
> the planner still thinks there will be 9k rows as a result of type_id
> = 23 and attributes ? 'reference'. [1]. It might make the pathological
> plan less likely though. It's not clear to me that it reduces the risk
> of a pathological plan to zero.

no, but done in conjunction with disabling nestloop and materialize
plans for this query (say, via SET LOCAL) it will probably be fast
and future-proof.
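
Concretely, that might look like the following (a sketch; enable_nestloop and enable_material are the relevant planner settings, and SET LOCAL scopes them to the enclosing transaction):

```sql
BEGIN;
-- Rule out the plan shapes that become pathological when the hstore
-- estimate is badly off; the settings revert at COMMIT/ROLLBACK.
SET LOCAL enable_nestloop = off;
SET LOCAL enable_material = off;

SELECT ref.case_id, array_agg(ref.attributes -> 'reference')
FROM component ref
JOIN document c ON c.id = ref.case_id
WHERE ref.type_id = 23
  AND ref.attributes ? 'reference'
  AND NOT 0 = ANY (c.types)
GROUP BY ref.case_id;
COMMIT;
```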

merlin


Re: Row estimates off by two orders of magnitude with hstore

From: Patrick Krecker
Date: Wed, Jun 10, 2015
On Wed, Jun 10, 2015 at 2:08 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Jun 10, 2015 at 3:55 PM, Patrick Krecker <patrick@judicata.com> wrote:
>> OK. Well, fortunately for us, we have a lot of possible solutions to this
>> problem, and it sounds like actually getting statistics for attributes
>> ? 'reference' is not realistic. I just wanted to make sure it wasn't
>> some configuration error on our part.
>>
>> Can anyone explain where exactly the estimate for that clause comes from?
>>
>> I tried adding an index and I don't think it improved the estimation,
>> the planner still thinks there will be 9k rows as a result of type_id
>> = 23 and attributes ? 'reference'. [1]. It might make the pathological
>> plan less likely though. It's not clear to me that it reduces the risk
>> of a pathological plan to zero.
>
> no, but done in conjunction with disabling nestloop and materialize
> plans for this query (say, via SET LOCAL) it will probably be fast
> and future-proof.
>
> merlin

Wouldn't wrapping it in an optimization fence (e.g. SELECT * FROM
(...) AS t WHERE t.attributes ? 'reference') have the same effect as
disabling materialize, but allow the planner to optimize the inner
query however it wants?


Re: Row estimates off by two orders of magnitude with hstore

From: Merlin Moncure
Date:
On Wed, Jun 10, 2015 at 4:37 PM, Patrick Krecker <patrick@judicata.com> wrote:
> On Wed, Jun 10, 2015 at 2:08 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Wed, Jun 10, 2015 at 3:55 PM, Patrick Krecker <patrick@judicata.com> wrote:
>>> OK. Well, fortunately for us, we have a lot of possible solutions to this
>>> problem, and it sounds like actually getting statistics for attributes
>>> ? 'reference' is not realistic. I just wanted to make sure it wasn't
>>> some configuration error on our part.
>>>
>>> Can anyone explain where exactly the estimate for that clause comes from?
>>>
>>> I tried adding an index and I don't think it improved the estimation,
>>> the planner still thinks there will be 9k rows as a result of type_id
>>> = 23 and attributes ? 'reference'. [1]. It might make the pathological
>>> plan less likely though. It's not clear to me that it reduces the risk
>>> of a pathological plan to zero.
>>
>> no, but done in conjunction with disabling nestloop and materialize
>> plans for this query (say, via SET LOCAL) it will probably be fast
>> and future-proof.
>>
>> merlin
>
> Wouldn't wrapping it in an optimization fence (e.g. SELECT * FROM
> (...) AS t WHERE t.attributes ? 'reference') have the same effect as
> disabling materialize, but allow the planner to optimize the inner
> query however it wants?

yes, but
select * from (query) q;

is not an optimization fence. the server is smarter than you and I and
will immediately flatten that back out :-).  however,

select * from (query ... OFFSET 0) q;
and the more portable
with data as (query) select ... from data;

can fix up the estimates.  they are both materialization fences
essentially though.
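
Applied to the query upthread, the OFFSET 0 variant might look like this (a sketch; it mirrors the subquery already tried in this thread, with the fence made explicit):

```sql
-- OFFSET 0 keeps the subquery from being flattened, so the outer
-- filter's bad estimate cannot reshape the inner join plan.
SELECT t.case_id, t.reference
FROM (
    SELECT ref.case_id,
           array_agg(ref.attributes -> 'reference') AS reference
    FROM component ref
    JOIN document c ON c.id = ref.case_id
    WHERE ref.type_id = 23
      AND NOT 0 = ANY (c.types)
    GROUP BY ref.case_id
    OFFSET 0   -- fence: prevents flattening into the outer query
) t
WHERE t.reference IS NOT NULL;
```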

merlin