row estimate very wrong for array type - Mailing list pgsql-performance

From Denis de Bernardy
Subject row estimate very wrong for array type
Date
Msg-id 625605.72736.qm@web112406.mail.gq1.yahoo.com
Whole thread Raw
Responses Re: row estimate very wrong for array type
List pgsql-performance
Hi,

I'm running into erroneous row estimates when using an array type, and I'm running out of ideas on how to steer
postgresinto the right direction... I've tried setting statistics to 1000, vacuuming and analyzing over and over,
rewritingthe query differently... to no avail. 

The table looks like this:

create table test (
id serial primary key,
sortcol float unique,
intarr1 int[] not null default '{}',
intarr2 int[] not null default '{}'

);

It contains 40k rows of random data which, for the sake of the queries that follow, aren't too far from what they'd
containin production. 


# select intarr1, intarr2, count(*) from test group by intarr1, intarr2;


 intarr1 | intarr2 | count 
--------------+-------------+-------
 {0}          | {2,3}       | 40018


The stats seem right:

# select * from pg_stats where tablename = 'test' and attname = 'intarr1';

 schemaname | tablename |   attname    | inherited | null_frac | avg_width | n_distinct | most_common_vals |
most_common_freqs| histogram_bounds | correlation  

------------+-----------+--------------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 test       | test      | intarr1 | f         |         0 |        25 |          1 | {"{0}"}          | {1}            
 |                  |           1 


A query without any condition on the array results in reasonable estimates and the proper use of the index on the sort
column:

# explain analyze select * from test order by sortcol limit 10;

                                                             QUERY PLAN                                                
            

------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.00 rows=10 width=217) (actual time=0.098..0.109 rows=10 loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..12019.08 rows=40018 width=217) (actual time=0.096..0.105
rows=10loops=1) 
 Total runtime: 0.200 ms


After adding a condition on the array, however, the row estimate is completely off:


# explain analyze select * from test where intarr1 && '{0,1}' order by sortcol limit 10;

                                                            QUERY PLAN                                                
           

----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..605.96 rows=10 width=217) (actual time=0.131..0.175 rows=10 loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..12119.13 rows=200 width=217) (actual time=0.129..0.169
rows=10loops=1) 
         Filter: (intarr1 && '{0,1}'::integer[])


When there's a condition on both arrays, this then leads to a seq scan:

# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 && '{2,4}' order by sortcol limit 10;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3241.28..3241.29 rows=1 width=217) (actual time=88.260..88.265 rows=10 loops=1)
   ->  Sort  (cost=3241.28..3241.29 rows=1 width=217) (actual time=88.258..88.260 rows=10 loops=1)
         Sort Key: sortcol
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Seq Scan on test  (cost=0.00..3241.27 rows=1 width=217) (actual time=0.169..68.785 rows=40018 loops=1)
               Filter: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && '{2,4}'::integer[]))
 Total runtime: 88.339 ms


Adding a GIN index on the two arrays results in similar ugliness:

# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 && '{2,4}' order by lft limit 10;
                                                                         QUERY PLAN                                    
                                    

------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.29..8.29 rows=1 width=217) (actual time=56.122..56.127 rows=10 loops=1)
   ->  Sort  (cost=8.29..8.29 rows=1 width=217) (actual time=56.120..56.122 rows=10 loops=1)
         Sort Key: sortcol
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Bitmap Heap Scan on test  (cost=4.26..8.28 rows=1 width=217) (actual time=19.635..39.824 rows=40018
loops=1)
               Recheck Cond: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && '{2,4}'::integer[]))
               ->  Bitmap Index Scan on test_intarr1_intarr2_idx  (cost=0.00..4.26 rows=1 width=0) (actual
time=19.387..19.387rows=40018 loops=1) 
                     Index Cond: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && '{2,4}'::integer[]))
 Total runtime: 56.210 ms


Might this be a bug in the operator's selectivity, or am I doing something wrong?

Thanks in advance,
Denis


pgsql-performance by date:

Previous
From: Jim Nasby
Date:
Subject: Re: REINDEX takes half a day (and still not complete!)
Next
From: Shaun Thomas
Date:
Subject: Re: amazon ec2