Re: count(*), EXISTS, indexes - Mailing list pgsql-sql

From Itai Zukerman
Subject Re: count(*), EXISTS, indexes
Date
Msg-id 873ckoqs7m.fsf@matt.w80.math-hat.com
Whole thread Raw
In response to Re: count(*), EXISTS, indexes  (Josh Berkus <josh@agliodbs.com>)
List pgsql-sql
>> They are my own operators and functions.  profile is a integer array
>> and the ~'s are subset operators.  psig is a bit signature, "~" is
>> complement, and the ~ operators again are subset operators.
>
> You're going to have to work on your question-posting skills.  

*blush*, sorry.

> This explains why the planner thinks that the exists clause will return 
> 255,000 rows instead of the handful it actually does return.

Actually:
 # explain select sum(weight) from rprofile rp where exists (select 1 from rcount_prof rcp where rcp.profile ~<=
rp.profileand ~rcp.psig ~<= rp.psig and rcp.filter='{734,1944}');                                              QUERY
PLAN
---------------------------------------------------------------------------------------------------- Aggregate
(cost=1544943.75..1544943.75rows=1 width=4)    ->  Seq Scan on rprofile rp  (cost=0.00..1544255.00 rows=275500 width=4)
        Filter: (subplan)          SubPlan            ->  Seq Scan on rcount_prof rcp  (cost=0.00..2.70 rows=1 width=0)
                Filter: ((profile ~<= $0) AND ((~ psig) ~<= $1) AND (filter = '{734,1944}'::text))
 

I sort-of don't know what I'm doing reading these query plans, but I
think the estimates are more-or-less right: few rows from rcount_prof,
many rows from rprofile.  275,500 may be a bit high (it's actually
around 24,000 in this case), but def. within an order of magnitude.

> Or build your own custom index types to take advantage of your custom 
> operators.    B-tree indexes are optimized for =, LIKE, <, and > queries; I 
> don't think they know what to do with "~<="

I created GiST indexes for the ~ operators (but, no, I haven't tuned
the selectivity estimation functions).  The problem was, the query
wasn't using the index on rprofile.  Stephan Szabo's suggestion worked
beautifully, though:
 dl=# EXPLAIN ANALYZE dl-# SELECT sum(weight) dl-# FROM (SELECT DISTINCT ON(rp.rid) rp.rid, rp.weight dl(#       FROM
rprofilerp, rcount_prof rcp dl(#       WHERE rcp.profile ~<= rp.profile dl(#       AND ~rcp.psig ~<= rp.psig dl(#
ANDrcp.filter='{734,1944}') AS FOO;
QUERYPLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=2245.08..2245.08 rows=1 width=1026) (actual time=954.45..954.45 rows=1 loops=1)    ->  Subquery Scan
foo (cost=2245.07..2245.08 rows=1 width=1026) (actual time=824.85..935.71 rows=23619 loops=1)          ->  Unique
(cost=2245.07..2245.08rows=1 width=1026) (actual time=824.85..898.24 rows=23619 loops=1)                ->  Sort
(cost=2245.07..2245.08rows=1 width=1026) (actual time=824.84..840.52 rows=23619 loops=1)                      Sort Key:
rp.rid                     ->  Nested Loop  (cost=0.00..2245.06 rows=1 width=1026) (actual time=0.28..520.41 rows=23619
loops=1)                           Join Filter: ("outer".profile ~<= "inner".profile)                            ->
SeqScan on rcount_prof rcp  (cost=0.00..2.44 rows=1 width=287) (actual time=0.06..0.33 rows=1 loops=1)
               Filter: (filter = '{734,1944}'::text)                            ->  Index Scan using
rprofile_profile_idxon rprofile rp  (cost=0.00..2232.98 rows=551 width=739) (actual time=0.14..453.11 rows=23666
loops=1)                                 Index Cond: ((~ "outer".psig) ~<= rp.psig)  Total runtime: 958.62 msec
 

The row estimates are way off, though.  My fault...

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>



pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: count(*), EXISTS, indexes
Next
From: "Randolph Jones"
Date:
Subject: sort up then down