Thread: Expensive query

Expensive query

From
Christophe Pettus
Date:
I have two tables, structured as:

users:

    id        varchar(70) not null unique

events:

    userid        varchar(70) not null,
    code        char(10) not null,
    when        datetime not null

The query I need to perform answers the question, "Which users do NOT
have a particular event (selected by code), and which do not have ANY
event for the last day?"  The query I use is:

    select id from users
       where
          id not in (
             select unique id from events
             where code = 'some code'
          )
          and
          id not in (
             select unique id from events
             where age('now',when) <= '1 day'
          );

This query is *very* expensive.  With 10,000 users and 40,000 events,
it can take up to 20 minutes (!) to execute, and the postgres process
grows to 40 megabytes (!!) of memory (on a Pentium II system running
BSDI).  This seems surprising, since each of the subqueries only needs
to be evaluated once, rather than once per row of users.  Is there a
way to reformulate this query to make it less expensive?

Re: [GENERAL] Expensive query

From
Dustin Sallings
Date:
On Fri, 30 Oct 1998, Christophe Pettus wrote:

    That users table looks kinda useless.  I'd recommend changing id
to name and making id an int, then putting your events users in as ints.
Currently, it doesn't do anything at all, and you might as well be doing
this query:

    select distinct id from events where code = 'Whatever'
        and age('now',when) <= ' 1 day';

    I would also recommend not making code a char(10).  If you put an
index on code and an index on date, you should be able to get your results
pretty quick.  If you change the userid and the code to integers and do a
three table join, it might be faster, but it would use *much* less disk
space.

# I have two tables, structured as:
#
# users:
#
#     id        varchar(70) not null unique
#
# events:
#
#     userid        varchar(70) not null,
#     code        char(10) not null,
#     when        datetime not null
#
# The query I need to perform answers the question, "Which users do NOT
# have a particular event (selected by code), and which do not have ANY
# event for the last day?"  The query I use is:
#
#     select id from users
#        where
#           id not in (
#              select unique id from events
#              where code = 'some code'
#           )
#           and
#           id not in (
#              select unique id from events
#              where age('now',when) <= '1 day'
#           );
#
# This query is *very* expensive.  With 10,000 users and 40,000 events,
# it can take up to 20 minutes (!) to execute, and the postgres process
# grows to 40 megabytes (!!) of memory (on a Pentium II system running
# BSDI).  This seems surprising, since each of the subqueries only needs
# to be evaluated once, rather than once per row of users.  Is there a
# way to reformulate this query to make it less expensive?
#
#

--
SA, beyond.com           My girlfriend asked me which one I like better.
pub  1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
|    Key fingerprint =  87 02 57 08 02 D0 DA D6  C8 0F 3E 65 51 98 D8 BE
L_______________________ I hope the answer won't upset her. ____________


Re: [GENERAL] Expensive query

From
Christophe Pettus
Date:
At 02:13 PM 10/30/98 , you wrote:
>    That users table looks kinda useless.  I'd recommend changing id
>to name and making id an int, then putting your events users in as ints.
>Currently, it doesn't do anything at all, and you might as well be doing
>this query [...]

Actually, the real-life 'users' table has lots of fields, none of them
relevant to the question at hand.

>    I would also recommend not making code a char(10).  If you put an
>index on code and an index on date, you should be able to get your results
>pretty quick.

They are indexed.

Using ints is not unreasonable, but does not explain the pathological behavior.

Re: [GENERAL] Expensive query

From
dustin sallings
Date:
On Fri, 30 Oct 1998, Christophe Pettus wrote:

// At 02:13 PM 10/30/98 , you wrote:
// >    That users table looks kinda useless.  I'd recommend changing id
// >to name and making id an int, then putting your events users in as ints.
// >Currently, it doesn't do anything at all, and you might as well be doing
// >this query [...]
//
// Actually, the real-life 'users' table has lots of fields, none of
// them relevant to the question at hand.

    Right, the point was the table you were using in the query the way
you were using it in the query wouldn't do anything but slow down the
query at best.  You were doing three queries with no chance of the
optimizer doing its job.  Also using ``is not in'' on those search results
just sounds like it'd be horribly slow when you could just ask the whole
thing in one query.

--
SA, beyond.com                            The world is watching America,
pub  1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
|    Key fingerprint =  87 02 57 08 02 D0 DA D6  C8 0F 3E 65 51 98 D8 BE
L________________________________________ and America is watching TV. __


Re: [GENERAL] Expensive query

From
Herouth Maoz
Date:
At 0:13 +0200 on 31/10/98, Dustin Sallings wrote:


>
>     That users table looks kinda useless.  I'd recommend changing id
> to name and making id an int, then putting your events users in as ints.
> Currently, it doesn't do anything at all, and you might as well be doing
> this query:
>
>     select distinct id from events where code = 'Whatever'
>         and age('now',when) <= ' 1 day';
>
>     I would also recommend not making code a char(10).  If you put an
> index on code and an index on date, you should be able to get your results
> pretty quick.  If you change the userid and the code to integers and do a
> three table join, it might be faster, but it would use *much* less disk
> space.

I don't get your solution. Doesn't it return the exact opposite of what is
needed? He needed the users which did NOT have code 'Whatever' and also
didn't have an event the last day - and your query asks which users have a
'Whatever' event less than a day ago...

If what you intended is to put this query into a NOT IN subquery, then it's
still not what was required, because we didn't want to exclude all
'Whatever' events of the last day. We wanted to exclude both 'Whatever'
events and events of the last day.

No, it's either two subqueries or a subquery with an OR, which is not
optimized. Perhaps changing to a NOT EXISTS rather than a NOT IN query
would help:

    select id from users
       where
          NOT EXISTS (
             select * from events
             where userid = id
               and code = 'some code'
          )
          and
          NOT EXISTS (
             select * from events
             where userid = id
               and age('now',when) <= '1 day'
          );

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma