Thread: Query for filtering records

Query for filtering records

From
eric soroos
Date:
I'm having trouble subtracting groups from other groups. 


I've got a data model that has the following essential features:

create table contacts (num int, properties....);
create table groups (groupNum int, contactNum int);

Where not all contacts will be in a group, some groups will contain most contacts, and there will be something like
hundredsof groups and tens of thousands of contacts.  I allow people to build groups using criteria, which I need to
programaticallytranslate to sql.  
 

One somewhat common pattern is:

Select all contacts in group a, who have property b, and who aren't in groups c,d,e,f...

My first shot was subqueries:

select num, p1,p2 ... from contacts    inner join groups using (contacts.num=groups.contactNum)   where
groups.groupNum=a  and contact.p3=b   and not num in (select contactNum from groups where groupNum=c)   and not num in
(selectcontactNum from groups where groupNum=d)   and not num in (select contactNum from groups where groupNum=e)   and
notnum in (select contactNum from groups where groupNum=f)
 

This is .... slow.  agonizingly so. 

With an inner join, I'm not convinced that the subtraction is actually correct., but it is much faster. Unfortunatley,
fasterincorrect answers are rarely helpful. 
 

Outer joins seem even worse than subselects for speed, but it does appear to give the correct answer. (example with a
singlejoin.)
 

select num from contacts   left outer join groups        on (contacts.num=groups.contactNum    and  groups.groupNum=b)
where dl_groupDonor._groupNum is null  and p3=c
 

I've got to be missing something here, because this is much slower from the (slow) procedural system that I'm porting
from.
 

I've been avoiding using union / intersect since I don't really ever know what columns are going to be in the query.
perhapsI should revisit that decision and try to work around it.
 

eric





Re: Query for filtering records

From
Joel Burton
Date:
On Tue, Dec 03, 2002 at 11:01:33AM -0800, eric soroos wrote:
> I'm having trouble subtracting groups from other groups. 
> 
> 
> I've got a data model that has the following essential features:
> 
> create table contacts (num int, properties....);
> create table groups (groupNum int, contactNum int);
> 
> Where not all contacts will be in a group, some groups will contain most contacts, and there will be something like
hundredsof groups and tens of thousands of contacts.  I allow people to build groups using criteria, which I need to
programaticallytranslate to sql.  
 
> 
> One somewhat common pattern is:
> 
> Select all contacts in group a, who have property b, and who aren't in groups c,d,e,f...
> 
> My first shot was subqueries:
> 
> select num, p1,p2 ... from contacts 
>     inner join groups using (contacts.num=groups.contactNum)
>     where groups.groupNum=a
>     and contact.p3=b
>     and not num in (select contactNum from groups where groupNum=c)
>     and not num in (select contactNum from groups where groupNum=d)
>     and not num in (select contactNum from groups where groupNum=e)
>     and not num in (select contactNum from groups where groupNum=f)
> 
> This is .... slow.  agonizingly so. 

I'd say so!

Something like:

SELECT * ...FROM ...WHERE NOT IN (SELECT contactnum FROM groups WHERE groupnum='c' orgroupnum='d' OR ... )

is bound to be _much_ faster!

And even better is

SELECT *FROM ... contacts c1WHERE NOT EXISTS (SELECT * FROM groups WHERE groupname='c' orgroupnum='d' or groupnume='e'
...AND groups.contactnum=c1.contactnum)
 


EXISTS is almost always faster in PG.

-- 

Joel BURTON  |  joel@joelburton.com  |  joelburton.com  |  aim: wjoelburton
Independent Knowledge Management Consultant


Re: Query for filtering records

From
Jean-Luc Lachance
Date:
Eric try:


select num, p1,p2 ... from contacts inner join groups using (contacts.num=groups.contactNum) where groups.groupNum=a
andcontact.p3=b and not exists (    select 1 from groups g2    where g2.contactNum = groups.contactNum and
g2.groupNum!= a);
 

or 

select num, p1,p2 ... from contacts inner join groups using (contacts.num=groups.contactNum) where groups.groupNum=a
andcontact.p3=b and groups.groupNum in (    select contactNum from groups group by contactNum having count(*) =
 
1);

The IN version may still be faster as the sub-select should be evaluated
only once;


JLL



eric soroos wrote:
> 
> I'm having trouble subtracting groups from other groups.
> 
> I've got a data model that has the following essential features:
> 
> create table contacts (num int, properties....);
> create table groups (groupNum int, contactNum int);
> 
> Where not all contacts will be in a group, some groups will contain most contacts, and there will be something like
hundredsof groups and tens of thousands of contacts.  I allow people to build groups using criteria, which I need to
programaticallytranslate to sql.
 
> 
> One somewhat common pattern is:
> 
> Select all contacts in group a, who have property b, and who aren't in groups c,d,e,f...
> 
> My first shot was subqueries:
> 
> select num, p1,p2 ... from contacts
>     inner join groups using (contacts.num=groups.contactNum)
>     where groups.groupNum=a
>     and contact.p3=b
>     and not num in (select contactNum from groups where groupNum=c)
>     and not num in (select contactNum from groups where groupNum=d)
>     and not num in (select contactNum from groups where groupNum=e)
>     and not num in (select contactNum from groups where groupNum=f)
> 
> This is .... slow.  agonizingly so.
> 
> With an inner join, I'm not convinced that the subtraction is actually correct., but it is much faster.
Unfortunatley,faster incorrect answers are rarely helpful.
 
> 
> Outer joins seem even worse than subselects for speed, but it does appear to give the correct answer. (example with a
singlejoin.)
 
> 
> select num from contacts
>    left outer join groups
>         on (contacts.num=groups.contactNum
>             and  groups.groupNum=b)
>    where
>    dl_groupDonor._groupNum is null
>    and p3=c
> 
> I've got to be missing something here, because this is much slower from the (slow) procedural system that I'm porting
from.
> 
> I've been avoiding using union / intersect since I don't really ever know what columns are going to be in the query.
perhapsI should revisit that decision and try to work around it.
 
> 
> eric
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)


Re: Query for filtering records

From
eric soroos
Date:
> SELECT * ...
>  FROM ...
>  WHERE NOT IN (SELECT contactnum FROM groups WHERE groupnum='c' or
>  groupnum='d' OR ... )
> 
> is bound to be _much_ faster!

Yeah, that's an obvious optimization. Unfortunately, due to needing to match semantics of a previous non-sql version
andsome pathological group specifications from clients, it can't be applied as often as I'd like. Yes, I call client
specspathological sometimes. No, they don't know that. 
 
> And even better is
> 
> SELECT *
>  FROM ... contacts c1
>  WHERE NOT EXISTS (SELECT * FROM groups WHERE groupname='c' or
>  groupnum='d' or groupnume='e' ... AND groups.contactnum=c1.contactnum)
> 
> 
> EXISTS is almost always faster in PG.

Exists seems to be the answer. It even gives the right answer, which has been a problem for queries of this sort. 

Rewriting the queries so that the subtraction clauses use exists are giving me reasonable runtimes (~5 sec) and
believableanswers. (instead of the other two extremes of 5 minutes and either no answers or everyone in the database)
What'suseful here is that I'm getting multiple exists index scans instead of nested loops or table scans. What's more,
existsclauses are really easy to integrate into my query generation routine. 
 

thanks

eric