Thread: Query plan for NOT IN

Query plan for NOT IN

From
Matthew Wakeling
Date:
mnw21-modmine-r13features-copy=# select count(*) from project;
  count
-------
     10
(1 row)

mnw21-modmine-r13features-copy=# select count(*) from intermineobject;
   count
----------
  26344616
(1 row)

mnw21-modmine-r13features-copy=# \d intermineobject;
Table "public.intermineobject"
  Column |  Type   | Modifiers
--------+---------+-----------
  object | text    |
  id     | integer | not null
  class  | text    |
Indexes:
     "intermineobject_pkey" UNIQUE, btree (id)

mnw21-modmine-r13features-copy=# explain select * from project where id
NOT IN (SELECT id FROM intermineobject);
                                      QUERY PLAN
------------------------------------------------------------------------------------
  Seq Scan on project  (cost=1476573.93..1476575.05 rows=5 width=183)
    Filter: (NOT (hashed SubPlan 1))
    SubPlan 1
      ->  Seq Scan on intermineobject  (cost=0.00..1410720.74 rows=26341274 width=4)
(4 rows)

This query plan seems to me to be a little slow. Surely it could iterate
through the ten project rows and perform ten index lookups in the big
table?

Matthew

--
 Riker: Our memory pathways have become accustomed to your sensory input.
 Data:  I understand - I'm fond of you too, Commander. And you too Counsellor

Re: Query plan for NOT IN

From
Grzegorz Jaśkiewicz
Date:


On Mon, Oct 5, 2009 at 2:52 PM, Matthew Wakeling <matthew@flymine.org> wrote:

mnw21-modmine-r13features-copy=# select count(*) from project;
 count
-------
   10
(1 row)

mnw21-modmine-r13features-copy=# select count(*) from intermineobject;
 count
----------
 26344616
(1 row)

mnw21-modmine-r13features-copy=# \d intermineobject;
Table "public.intermineobject"
 Column |  Type   | Modifiers
--------+---------+-----------
 object | text    |
 id     | integer | not null
 class  | text    |
Indexes:
   "intermineobject_pkey" UNIQUE, btree (id)

mnw21-modmine-r13features-copy=# explain select * from project where id NOT IN (SELECT id FROM intermineobject);
                                    QUERY PLAN
------------------------------------------------------------------------------------
 Seq Scan on project  (cost=1476573.93..1476575.05 rows=5 width=183)
  Filter: (NOT (hashed SubPlan 1))
  SubPlan 1
    ->  Seq Scan on intermineobject  (cost=0.00..1410720.74 rows=26341274 width=4)
(4 rows)

This query plan seems to me to be a little slow. Surely it could iterate through the ten project rows and perform ten index lookups in the big table?

 
try using join instead of 'not in'..


select p.* from project p left join intermineobject i on i.id=p.id where i.id is null;


--
GJ

Re: Query plan for NOT IN

From
Matthew Wakeling
Date:
On Mon, 5 Oct 2009, Grzegorz Jaśkiewicz wrote:
> On Mon, Oct 5, 2009 at 2:52 PM, Matthew Wakeling <matthew@flymine.org> wrote:
>       Table "public.intermineobject"
>        Column |  Type   | Modifiers
>       --------+---------+-----------
>        object | text    |
>        id     | integer | not null
>        class  | text    |
>       Indexes:
>          "intermineobject_pkey" UNIQUE, btree (id)
>
>       mnw21-modmine-r13features-copy=# explain select * from project where id NOT
>       IN (SELECT id FROM intermineobject);
>  
> try using join instead of 'not in'..
>
> select p.* from project p left join intermineobject i on i.id=p.id where i.id is null;

Yes, that does work, but only because id is NOT NULL. I thought Postgres
8.4 had had a load of these join types unified to make it less important
how the query is written?

Matthew

--
 I'm always interested when [cold callers] try to flog conservatories.
 Anyone who can actually attach a conservatory to a fourth floor flat
 stands a marginally better than average chance of winning my custom.
 (Seen on Usenet)

Re: Query plan for NOT IN

From
Grzegorz Jaśkiewicz
Date:
2009/10/5 Matthew Wakeling <matthew@flymine.org>

Yes, that does work, but only because id is NOT NULL. I thought Postgres 8.4 had had a load of these join types unified to make it less important how the query is written?

well, as a rule of thumb - unless you can't think of a default value of column - don't use nulls. So using nulls as default 'idunno' - is a bad practice, but everybody's opinion on that differ.

But back on a subject, postgresql is very very poor performance wise with [NOT] IN () type of constructs. So if you can, avoid them, and learn to use joins.



--
GJ

Re: Query plan for NOT IN

From
Tom Lane
Date:
Matthew Wakeling <matthew@flymine.org> writes:
> Yes, that does work, but only because id is NOT NULL. I thought Postgres
> 8.4 had had a load of these join types unified to make it less important
> how the query is written?

NOT IN is not easily optimizable because of its odd behavior in the
presence of nulls.  Use NOT EXISTS instead, or that left join hack.

            regards, tom lane

Re: Query plan for NOT IN

From
Guy Rouillier
Date:
Grzegorz Jaśkiewicz wrote:
>
> well, as a rule of thumb - unless you can't think of a default value of
> column - don't use nulls. So using nulls as default 'idunno' - is a bad
> practice, but everybody's opinion on that differ.

I don't understand this point of view.  The concept of null was
introduced into the SQL vernacular by Codd and Date expressly to
represent unknown values.

--
Guy Rouillier

Re: Query plan for NOT IN

From
Grzegorz Jaśkiewicz
Date:


On Mon, Oct 5, 2009 at 8:35 PM, Guy Rouillier <guyr-ml1@burntmail.com> wrote:
Grzegorz Jaśkiewicz wrote:

well, as a rule of thumb - unless you can't think of a default value of column - don't use nulls. So using nulls as default 'idunno' - is a bad practice, but everybody's opinion on that differ.

I don't understand this point of view.  The concept of null was introduced into the SQL vernacular by Codd and Date expressly to represent unknown values.

Yes, unknown. So as long as you know the default value of field, you should set it to such. 

For instance, if by default your account balance is 0, you should set it to 0, not leave it as null, etc. Other example, if client doesn't have description - leave it as blank '' string, instead of null. 

On the other hand, if you want to denote that the value wasn't set - use null, but use it wisely. Hence, I personally think that DEFAULT value (in create table) should be compulsory, and 'DEFAULT NULL' an option, that you would have to choose. 

Not to mention other (valid in this case) argument, that you would mostly use IN/EXISTS, and/or join keys on fields that are either PK, or at least NOT NULL. Hence, using JOIN instead of IN/EXISTS most of the times. 
One of My few personal wishes, ever since I started to use postgresql - is that it could rewrite IN/EXISTS into JOIN - when possible (that is, when columns are NOT NULL).

  
--
GJ

Re: Query plan for NOT IN

From
"Kevin Grittner"
Date:
Grzegorz Jaœkiewicz<gryzman@gmail.com> wrote:
> Guy Rouillier <guyr-ml1@burntmail.com>wrote:
>> Grzegorz Jaœkiewicz wrote:

>>> using nulls as default 'idunno' - is a bad practice

>> I don't understand this point of view.  The concept of null was
>> introduced into the SQL vernacular by Codd and Date expressly to
>> represent unknown values.

> if by default your account balance is 0, you should set it to 0, not
> leave it as null

If your business rules are that a new account is created with a zero
balance and then deposits are made, sure -- insert the account row
with a zero balance, *because you know it to be zero*.  It's been rare
that I've seen anyone err on the side of using NULL in place of a
default for such cases.  Much more common is using, for example, 'NMI'
in the middle name column to denote "No Middle Initial".  Such "magic
values" can cause no end of trouble.

A failing of the SQL standard is that it uses the same mark (NULL) to
show the absence of a value because it is unknown as for the case
where it is known that no value exists (not applicable).  Codd argued
for a distinction there, but it hasn't come to pass, at least in the
standard.  If anyone could suggest a way to support standard syntax
and semantics and add extensions to support this distinction, it might
be another advance that would distinguish PostgreSQL from "less
evolved" products.   :-)

None of that changes the requirement that NOT IN must result in
UNKNOWN if any of the values involved are NULL.  You can't say that my
birthday is not in the set of birthdays for other subscribers to this
list without knowing the birthdays of all subscribers.  This
definition of the operator makes it hard to optimize, but setting
unknown birthdays to some date far in the past or future, to avoid
using NULL, would just result in bogus results for this query as well
as, for example, queries attempting to calculate aggregates on age.

-Kevin

Re: Query plan for NOT IN

From
Guy Rouillier
Date:
Kevin Grittner wrote:
> Grzegorz Jaœkiewicz<gryzman@gmail.com> wrote:

> A failing of the SQL standard is that it uses the same mark (NULL) to
> show the absence of a value because it is unknown as for the case
> where it is known that no value exists (not applicable).  Codd argued
> for a distinction there, but it hasn't come to pass, at least in the
> standard.  If anyone could suggest a way to support standard syntax
> and semantics and add extensions to support this distinction, it might
> be another advance that would distinguish PostgreSQL from "less
> evolved" products.   :-)

Theoretically, the distinction already exists.  If you don't know a
person's middle initial, then set it to null; if you know the person
doesn't have one, set it to the empty string.

But from a practical point of view, that wouldn't go very far.  Most
*people* equate an empty string to mean the same as null.  When I wrote
my own data access layer years ago, I expressly checked for empty
strings on input and changed them to null.  I did this because empty
strings had a nasty way of creeping into our databases; writing queries
to produce predictable results got to be very messy.

--
Guy Rouillier

Re: Query plan for NOT IN

From
"Kevin Grittner"
Date:
Guy Rouillier <guyr-ml1@burntmail.com> wrote:
> Kevin Grittner wrote:

>> A failing of the SQL standard is that it uses the same mark (NULL)
>> to show the absence of a value because it is unknown as for the
>> case where it is known that no value exists (not applicable).  Codd
>> argued for a distinction there, but it hasn't come to pass, at
>> least in the standard.  If anyone could suggest a way to support
>> standard syntax and semantics and add extensions to support this
>> distinction, it might be another advance that would distinguish
>> PostgreSQL from "less evolved" products.   :-)
>
> Theoretically, the distinction already exists.  If you don't know a
> person's middle initial, then set it to null; if you know the
> person doesn't have one, set it to the empty string.

Well, it is arguable whether an empty string is the proper way to
indicate that a character string based column is not applicable to a
given row, but it certainly falls flat for any other types, such as
dates or numbers; and I think there's value in having a consistent way
to handle this.

> But from a practical point of view, that wouldn't go very far.
> Most *people* equate an empty string to mean the same as null.  When
> I wrote my own data access layer years ago, I expressly checked for
> empty strings on input and changed them to null.  I did this because
> empty strings had a nasty way of creeping into our databases;
> writing queries to produce predictable results got to be very messy.

Yeah, there's that, too.

Which leaves the issue open -- a flexible way to flag the *reason* (or
*reasons*) for the absence of a value could be a nice enhancement, if
someone could invent a good implementation.  Of course, one could
always add a column to indicate the reason for a NULL; and perhaps
that would be as good as any scheme to attach reason flags to NULL.
You'd just have to make sure the reason column was null capable for
those rows where there *was* a value, which would make the reason "not
applicable"....

-Kevin

Re: Query plan for NOT IN

From
Craig James
Date:
Kevin Grittner wrote:
> Which leaves the issue open -- a flexible way to flag the *reason* (or
> *reasons*) for the absence of a value could be a nice enhancement, if
> someone could invent a good implementation.  Of course, one could
> always add a column to indicate the reason for a NULL; and perhaps
> that would be as good as any scheme to attach reason flags to NULL.
> You'd just have to make sure the reason column was null capable for
> those rows where there *was* a value, which would make the reason "not
> applicable"....

I'd argue that this is just a special case of a broader problem of metadata: Data about the data.  For example, I could
havea temperature, 40 degrees, and an error bounds, +/- 0.25 degrees.  Nobody would think twice about making these
separatecolumns.  I don't see how this is any different from a person's middle initial of NULL, plus a separate column
indicating"not known" versus "doesn't have one" if that distinction is important.  There are many examples like this,
wherea simple value in one column isn't sufficient, so another column contains metadata that qualifies or clarifies the
information. NULL is just one such case. 

But, this should probably be on an SQL discussion board, not PG performance...

Craig

Re: Query plan for NOT IN

From
Guy Rouillier
Date:
Craig James wrote:
> Kevin Grittner wrote:
>> Which leaves the issue open -- a flexible way to flag the *reason* (or
>> *reasons*) for the absence of a value could be a nice enhancement, if
>> someone could invent a good implementation.  Of course, one could
>> always add a column to indicate the reason for a NULL; and perhaps
>> that would be as good as any scheme to attach reason flags to NULL.
>> You'd just have to make sure the reason column was null capable for
>> those rows where there *was* a value, which would make the reason "not
>> applicable"....
>
> I'd argue that this is just a special case of a broader problem of
> metadata: Data about the data.  For example, I could have a temperature,
> 40 degrees, and an error bounds, +/- 0.25 degrees.  Nobody would think
> twice about making these separate columns.  I don't see how this is any
> different from a person's middle initial of NULL, plus a separate column
> indicating "not known" versus "doesn't have one" if that distinction is
> important.  There are many examples like this, where a simple value in
> one column isn't sufficient, so another column contains metadata that
> qualifies or clarifies the information.  NULL is just one such case.
>
> But, this should probably be on an SQL discussion board, not PG
> performance...

Most DBMSs I'm aware of use a null *byte* attached to a nullable column
to indicate whether the column is null or not.  yes/no takes one *bit*.
  That leaves 255 other possible values to describe the state of the
column.  That seems preferable to adding an additional column to every
nullable column.

But as you say, that would have to be taken up with the SQL
standardization bodies, and not PostgreSQL.

--
Guy Rouillier