Thread: count(*), EXISTS, indexes

count(*), EXISTS, indexes

From
Itai Zukerman
Date:
Define:
 CREATE TABLE A (x int PRIMARY KEY, real v); CREATE TABLE B (x int);

I'd like to calculate:
 SELECT sum(v) FROM A WHERE EXISTS (SELECT 1 FROM B WHERE A.x=B.x);

...but then it won't use the primary key index on A.x.

It *will* use the index with:
 SELECT sum(A.v) FROM A,B WHERE A.x=B.x;

...but that's not the same thing.  This is soooo close:
 SELECT count(DISTINCT A.x) FROM A,B WHERE A.x=B.x;

Am I going to have to write a plpgsql function for this?

PS.  B is relatively small, a few thousand rows, while A has well over
500,000 rows.  The DISTINCT A.x should be about 10,000-50,000.

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>



Re: count(*), EXISTS, indexes

From
Josh Berkus
Date:
Itai,

Interesting.  Can you post your Postges version, and EXPLAIN ANALYZE for each
of those queries?

--
-Josh BerkusAglio Database SolutionsSan Francisco



Re: count(*), EXISTS, indexes

From
Itai Zukerman
Date:
> Interesting.  Can you post your Postges version, and EXPLAIN ANALYZE for each 
> of those queries?

Sure.

Here's what I want:
 # explain select sum(weight) from rprofile rp where exists (select 1 from rcount_prof rcp where rcp.profile ~<=
rp.profileand ~rcp.psig ~<= rp.psig and rcp.filter='{734,1944}');                                              QUERY
PLAN
---------------------------------------------------------------------------------------------------- Aggregate
(cost=1544943.75..1544943.75rows=1 width=4)    ->  Seq Scan on rprofile rp  (cost=0.00..1544255.00 rows=275500 width=4)
        Filter: (subplan)          SubPlan            ->  Seq Scan on rcount_prof rcp  (cost=0.00..2.70 rows=1 width=0)
                Filter: ((profile ~<= $0) AND ((~ psig) ~<= $1) AND (filter = '{734,1944}'::text))
 

Here's a version that uses the index, but over-counts:
 # explain analyze select sum(weight) from rprofile rp, rcount_prof rcp where rcp.profile ~<= rp.profile and ~rcp.psig
~<=rp.psig and rcp.filter='{734,1944}';                                                                         QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=2245.07..2245.07 rows=1 width=1001) (actual time=1183.53..1183.53 rows=1 loops=1)    ->  Nested Loop
(cost=0.00..2245.06rows=1 width=1001) (actual time=0.44..1156.98 rows=23338 loops=1)          Join Filter:
("outer".profile~<= "inner".profile)          ->  Seq Scan on rcount_prof rcp  (cost=0.00..2.44 rows=1 width=287)
(actualtime=0.08..0.17 rows=1 loops=1)                Filter: (filter = '{734,1944}'::text)          ->  Index Scan
usingrprofile_profile_idx on rprofile rp  (cost=0.00..2232.98 rows=551 width=714) (actual time=0.25..1083.15 rows=23385
loops=1)               Index Cond: ((~ "outer".psig) ~<= rp.psig)  Total runtime: 1183.67 msec
 

$ psql --version
psql (PostgreSQL) 7.3.2

Running on RedHat.

(It takes a long time to run the first select, so I left off the
analyze.)

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>



Re: count(*), EXISTS, indexes

From
Tom Lane
Date:
Itai Zukerman <zukerman@math-hat.com> writes:
> Define:
>   CREATE TABLE A (x int PRIMARY KEY, real v);
>   CREATE TABLE B (x int);

> I'd like to calculate:
>   SELECT sum(v) FROM A WHERE EXISTS (SELECT 1 FROM B WHERE A.x=B.x);
> ...but then it won't use the primary key index on A.x.

In CVS tip (7.4-to-be) I think

SELECT sum(v) FROM A WHERE A.x IN (SELECT B.x FROM B);

would probably work well.  In current releases I think all you can do is
add an index to B.x and settle for the EXISTS() approach.

> PS.  B is relatively small, a few thousand rows, while A has well over
> 500,000 rows.  The DISTINCT A.x should be about 10,000-50,000.

BTW, how can a PRIMARY KEY column have fewer DISTINCT values than there
are rows?
        regards, tom lane



Re: count(*), EXISTS, indexes

From
Itai Zukerman
Date:
>> Define:
>>   CREATE TABLE A (x int PRIMARY KEY, real v);
>>   CREATE TABLE B (x int);
>
>> I'd like to calculate:
>>   SELECT sum(v) FROM A WHERE EXISTS (SELECT 1 FROM B WHERE A.x=B.x);
>> ...but then it won't use the primary key index on A.x.
>
> In CVS tip (7.4-to-be) I think
>
> SELECT sum(v) FROM A WHERE A.x IN (SELECT B.x FROM B);

Oops.  I guess I should've changed the example to "WHERE A.x>=B.x"
since that's what I'm really doing.  Oh, well, my goof.

> would probably work well.  In current releases I think all you can do is
> add an index to B.x and settle for the EXISTS() approach.

I don't think indexing B will help speed-wise if it still does the seq
scan through A.

>> PS.  B is relatively small, a few thousand rows, while A has well over
>> 500,000 rows.  The DISTINCT A.x should be about 10,000-50,000.
>
> BTW, how can a PRIMARY KEY column have fewer DISTINCT values than there
> are rows?

Oh, I just meant, "the DISTINCT A.x that are returned by:
 SELECT DISTINCT A.x FROM A,B WHERE A.x=B.x

".  Sorry for my lack of clarity...

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>



Re: count(*), EXISTS, indexes

From
Itai Zukerman
Date:
>   CREATE TABLE A (x int PRIMARY KEY, real v);
>   CREATE TABLE B (x int);
>
> I'd like to calculate:
>
>   SELECT sum(v) FROM A WHERE EXISTS (SELECT 1 FROM B WHERE A.x=B.x);

This seems to be a reasonably-performing workaround:
 SELECT DISTINCT x INTO TEMP C FROM A,B WHERE A.x=B.x; SELECT sum(v) FROM A,C WHERE A.x=C.x;

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>



Re: count(*), EXISTS, indexes

From
Josh Berkus
Date:
Itai,

>   # explain select sum(weight) from rprofile rp where exists (select 1 from
rcount_prof rcp where rcp.profile ~<= rp.profile and ~rcp.psig ~<= rp.psig
and rcp.filter='{734,1944}');

I'm not familiar with the "~" that you have in the query.  What are those for?

Do you have an index on rcp.profile, rcp.psig, rcp.filter?

--
-Josh BerkusAglio Database SolutionsSan Francisco



Re: count(*), EXISTS, indexes

From
Itai Zukerman
Date:
>>   # explain select sum(weight) from rprofile rp where exists (select 1 from 
> rcount_prof rcp where rcp.profile ~<= rp.profile and ~rcp.psig ~<= rp.psig 
> and rcp.filter='{734,1944}');
>
> I'm not familiar with the "~" that you have in the query.  What are those for?

They are my own operators and functions.  profile is a integer array
and the ~'s are subset operators.  psig is a bit signature, "~" is
complement, and the ~ operators again are subset operators.

> Do you have an index on rcp.profile, rcp.psig, rcp.filter?

Yes, yes, and yes.  ATM, though, there are only about 50 rows in
rcount_prof.  The vast majority of time is spent scanning the
600,000-row rprofile table.

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>



SELECT INTO TEMP in Trigger?

From
Itai Zukerman
Date:
This seems to not be possible is plpgsql:
 SELECT DISTINCT A.x INTO TEMP X FROM A,B WHERE A.x=B.x;

Is there some way to create a temporary table in plpgsql?

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>



Re: count(*), EXISTS, indexes

From
Stephan Szabo
Date:
On Fri, 11 Apr 2003, Itai Zukerman wrote:

> >   CREATE TABLE A (x int PRIMARY KEY, real v);
> >   CREATE TABLE B (x int);
> >
> > I'd like to calculate:
> >
> >   SELECT sum(v) FROM A WHERE EXISTS (SELECT 1 FROM B WHERE A.x=B.x);
>
> This seems to be a reasonably-performing workaround:
>
>   SELECT DISTINCT x INTO TEMP C FROM A,B WHERE A.x=B.x;
>   SELECT sum(v) FROM A,C WHERE A.x=C.x;

Hmm, given that, would something like:

select sum(v) from(select distinct on(x) x,v from a,b where a.x=b.x) as foo;

work?



Re: count(*), EXISTS, indexes

From
Tom Lane
Date:
Itai Zukerman <zukerman@math-hat.com> writes:
>>> I'd like to calculate:
>>> SELECT sum(v) FROM A WHERE EXISTS (SELECT 1 FROM B WHERE A.x=B.x);
>>> ...but then it won't use the primary key index on A.x.
>> 
>> In CVS tip (7.4-to-be) I think
>> SELECT sum(v) FROM A WHERE A.x IN (SELECT B.x FROM B);

> Oops.  I guess I should've changed the example to "WHERE A.x>=B.x"
> since that's what I'm really doing.  Oh, well, my goof.

Anything else you didn't bother to mention?  Because in that case I
think your query reduces to

SELECT sum(v) FROM A WHERE A.x >= (SELECT min(x) FROM B);

which ought to run reasonably well (I'm sure it'll still want to
do a seqscan --- but not a nested loop).
        regards, tom lane



Re: SELECT INTO TEMP in Trigger?

From
"Dan Langille"
Date:
On 11 Apr 2003 at 18:05, Itai Zukerman wrote:

> This seems to not be possible is plpgsql:
> 
>   SELECT DISTINCT A.x INTO TEMP X FROM A,B WHERE A.x=B.x;
> 
> Is there some way to create a temporary table in plpgsql?

Yes.

http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=sql-
selectinto.html

-- 
Dan Langille : http://www.langille.org/



Re: SELECT INTO TEMP in Trigger?

From
Itai Zukerman
Date:
> This seems to not be possible is plpgsql:
>
>   SELECT DISTINCT A.x INTO TEMP X FROM A,B WHERE A.x=B.x;
>
> Is there some way to create a temporary table in plpgsql?

Just to clarify, this sort-of works HOWEVER:
 SELECT count(*) FROM X; DROP X;

The first time through the plpgsql the SELECT is fine.  Further times
it complains:
 ERROR:  pg_class_aclcheck: relation 8845807 not found

I guess it doesn't like dealing with a different X.  Should I use
FOR-IN-EXECUTE?  That seems ugly...

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>



Re: count(*), EXISTS, indexes

From
Itai Zukerman
Date:
>> >   CREATE TABLE A (x int PRIMARY KEY, real v);
>> >   CREATE TABLE B (x int);
>> >
>> >   SELECT sum(v) FROM A WHERE EXISTS (SELECT 1 FROM B WHERE A.x=B.x);
>>
>> This seems to be a reasonably-performing workaround:
>>
>>   SELECT DISTINCT x INTO TEMP C FROM A,B WHERE A.x=B.x;
>>   SELECT sum(v) FROM A,C WHERE A.x=C.x;
>
> Hmm, given that, would something like:
>
> select sum(v) from
>  (select distinct on(x) x,v from a,b where a.x=b.x) as foo;

Excellent.  Thanks!

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>



Re: count(*), EXISTS, indexes

From
Josh Berkus
Date:
Itai,

> They are my own operators and functions.  profile is a integer array
> and the ~'s are subset operators.  psig is a bit signature, "~" is
> complement, and the ~ operators again are subset operators.

You're going to have to work on your question-posting skills.

Your query problem is that basically you have custom operators which the
planner doesn't know how to evaluate the return results on correctly.    This
is a radically different situation from how you presented it in your first
posting.

This explains why the planner thinks that the exists clause will return
255,000 rows instead of the handful it actually does return.   I'd suggest
re-building the query in several different syntaxes, until you find the one
the planner gets right.

Or build your own custom index types to take advantage of your custom
operators.    B-tree indexes are optimized for =, LIKE, <, and > queries; I
don't think they know what to do with "~<="

At least, I think so.  I'm not much of an expert on custom operators.

>  > Do you have an index on rcp.profile, rcp.psig, rcp.filter?
>
> Yes, yes, and yes.  ATM, though, there are only about 50 rows in
> rcount_prof.  The vast majority of time is spent scanning the
> 600,000-row rprofile table.

Um, three seperate indexes on those three columns is not the same as a single
index on all three columns.

I was basically fishing for the reason why the planner got the row count so
radically wrong; now I think I know the reason ....

--
-Josh BerkusAglio Database SolutionsSan Francisco



Re: count(*), EXISTS, indexes

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Your query problem is that basically you have custom operators which the 
> planner doesn't know how to evaluate the return results on correctly.  This 
> is a radically different situation from how you presented it in your first 
> posting.

> This explains why the planner thinks that the exists clause will return 
> 255,000 rows instead of the handful it actually does return.   I'd suggest 
> re-building the query in several different syntaxes, until you find the one 
> the planner gets right.

Or more likely to work: build some custom selectivity estimation
functions to attach to the custom operators.

> Or build your own custom index types to take advantage of your custom 
> operators.    B-tree indexes are optimized for =, LIKE, <, and > queries; I 
> don't think they know what to do with "~<="

They certainly don't.  Possibly GIST could be taught what to do with
such things, but it won't happen by magic.
        regards, tom lane



Re: count(*), EXISTS, indexes

From
Itai Zukerman
Date:
>> They are my own operators and functions.  profile is a integer array
>> and the ~'s are subset operators.  psig is a bit signature, "~" is
>> complement, and the ~ operators again are subset operators.
>
> You're going to have to work on your question-posting skills.  

*blush*, sorry.

> This explains why the planner thinks that the exists clause will return 
> 255,000 rows instead of the handful it actually does return.

Actually:
 # explain select sum(weight) from rprofile rp where exists (select 1 from rcount_prof rcp where rcp.profile ~<=
rp.profileand ~rcp.psig ~<= rp.psig and rcp.filter='{734,1944}');                                              QUERY
PLAN
---------------------------------------------------------------------------------------------------- Aggregate
(cost=1544943.75..1544943.75rows=1 width=4)    ->  Seq Scan on rprofile rp  (cost=0.00..1544255.00 rows=275500 width=4)
        Filter: (subplan)          SubPlan            ->  Seq Scan on rcount_prof rcp  (cost=0.00..2.70 rows=1 width=0)
                Filter: ((profile ~<= $0) AND ((~ psig) ~<= $1) AND (filter = '{734,1944}'::text))
 

I sort-of don't know what I'm doing reading these query plans, but I
think the estimates are more-or-less right: few rows from rcount_prof,
many rows from rprofile.  275,500 may be a bit high (it's actually
around 24,000 in this case), but def. within an order of magnitude.

> Or build your own custom index types to take advantage of your custom 
> operators.    B-tree indexes are optimized for =, LIKE, <, and > queries; I 
> don't think they know what to do with "~<="

I created GiST indexes for the ~ operators (but, no, I haven't tuned
the selectivity estimation functions).  The problem was, the query
wasn't using the index on rprofile.  Stephan Szabo's suggestion worked
beautifully, though:
 dl=# EXPLAIN ANALYZE dl-# SELECT sum(weight) dl-# FROM (SELECT DISTINCT ON(rp.rid) rp.rid, rp.weight dl(#       FROM
rprofilerp, rcount_prof rcp dl(#       WHERE rcp.profile ~<= rp.profile dl(#       AND ~rcp.psig ~<= rp.psig dl(#
ANDrcp.filter='{734,1944}') AS FOO;
QUERYPLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=2245.08..2245.08 rows=1 width=1026) (actual time=954.45..954.45 rows=1 loops=1)    ->  Subquery Scan
foo (cost=2245.07..2245.08 rows=1 width=1026) (actual time=824.85..935.71 rows=23619 loops=1)          ->  Unique
(cost=2245.07..2245.08rows=1 width=1026) (actual time=824.85..898.24 rows=23619 loops=1)                ->  Sort
(cost=2245.07..2245.08rows=1 width=1026) (actual time=824.84..840.52 rows=23619 loops=1)                      Sort Key:
rp.rid                     ->  Nested Loop  (cost=0.00..2245.06 rows=1 width=1026) (actual time=0.28..520.41 rows=23619
loops=1)                           Join Filter: ("outer".profile ~<= "inner".profile)                            ->
SeqScan on rcount_prof rcp  (cost=0.00..2.44 rows=1 width=287) (actual time=0.06..0.33 rows=1 loops=1)
               Filter: (filter = '{734,1944}'::text)                            ->  Index Scan using
rprofile_profile_idxon rprofile rp  (cost=0.00..2232.98 rows=551 width=739) (actual time=0.14..453.11 rows=23666
loops=1)                                 Index Cond: ((~ "outer".psig) ~<= rp.psig)  Total runtime: 958.62 msec
 

The row estimates are way off, though.  My fault...

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>



Re: SELECT INTO TEMP in Trigger?

From
Bruce Momjian
Date:
See the FAQ on creating temp objects in functions ... use EXECUTE.

---------------------------------------------------------------------------

Itai Zukerman wrote:
> > This seems to not be possible is plpgsql:
> >
> >   SELECT DISTINCT A.x INTO TEMP X FROM A,B WHERE A.x=B.x;
> >
> > Is there some way to create a temporary table in plpgsql?
> 
> Just to clarify, this sort-of works HOWEVER:
> 
>   SELECT count(*) FROM X;
>   DROP X;
> 
> The first time through the plpgsql the SELECT is fine.  Further times
> it complains:
> 
>   ERROR:  pg_class_aclcheck: relation 8845807 not found
> 
> I guess it doesn't like dealing with a different X.  Should I use
> FOR-IN-EXECUTE?  That seems ugly...
> 
> -- 
> Itai Zukerman  <http://www.math-hat.com/~zukerman/>
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
> http://archives.postgresql.org
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073