Thread: Re: NOT IN doesn't use index? (fwd)

Re: NOT IN doesn't use index? (fwd)

From

Becky Neville

Date:

03 May 2003, 13:58:01

Here is the EXPLAIN output from the two queries.  The first is the one
that uses WHERE field NOT IN ( 'a','b' etc ).  The second is the (much
faster) one
that uses WHERE NOT (field = 'a' and field = 'b' etc).

I don't understand why the query planner thinks there are only 38055 rows
in the table on the slow one.  I didn't run analyze in between them and the
second try seems to know (correctly) that there are 1799976 rows.

Also, why does the first (slow) one think there are 38055 rows and only
evaluate 48 rows - and yet it still takes longer. ? I assume it's due to
the lack of a sort, but I don't understand why using NOT IN should
prohibit a sort.

-------------slow one - ~9 minutes-----------------------
/home/accts/ran26/cs437/Proj/code/scripts/sql
test=# \i query3.sql
psql:query3.sql:76: NOTICE:  QUERY PLAN:

Seq Scan on uabopen  (cost=0.00..3305914.86 rows=38055 width=7) (actual
time=36577.26..494243.37 rows=48 loops=1)
Total runtime: 494243.67 msec

--------------faster one - 2 minutes-----------------
psql:query3Mod2.sql:77: NOTICE:  QUERY PLAN:

Unique  (cost=3592408.28..3596908.22 rows=179998 width=7) (actual
time=104959.31..114131.22 rows=101 loops=1)
  ->  Sort  (cost=3592408.28..3592408.28 rows=1799976 width=7) (actual
time=104959.30..108425.61 rows=1799976 loops=1)
        ->  Seq Scan on uabopen  (cost=0.00..3305914.86 rows=1799976
width=7) (actual time=30.13..14430.99 rows=1799976 loops=1)
Total runtime: 114220.66 msec

---------- Forwarded message ----------
Date: Sat, 3 May 2003 13:09:22 -0400 (EDT)
From: Becky Neville <ran26@pantheon.yale.edu>
To: Andrew Sullivan <andrew@libertyrms.info>
Subject: Re: [PERFORM] NOT IN doesn't use index?

I didn't post it because the rest of the query is exactly the same (and
the NOT IN list is about a page long - although it's
apparently still shorter than the IN list.)

I need to verify something and then can send the EXPLAIN output.

I am running my own server and have no idea what parameters I should use
to speed things up.  Everything is dog slow.

On Sat, 3 May 2003, Andrew Sullivan wrote:

> On Sat, May 03, 2003 at 01:56:02AM -0400, Becky Neville wrote:
> > Does the use of WHERE field  NOT IN ('A','B' etc) prevent the use of an
> > index?
>
> That '&c.' is hiding a lot.  Why not post your query and the explain
> analyse output?
>
> A
>

Re: NOT IN doesn't use index? (fwd)

From

Josh Berkus

Date:

03 May 2003, 14:36:00

Becky,

> Here is the EXPLAIN output from the two queries.  The first is the one
> that uses WHERE field NOT IN ( 'a','b' etc ).  The second is the (much
> faster) one
> that uses WHERE NOT (field = 'a' and field = 'b' etc).

We still can't help you if you're not posting your actual queries.  We have no
idea what's in query3.sql.    We're not clairvoyant, y'know.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: NOT IN doesn't use index? (fwd)

From

Joe Conway

Date:

03 May 2003, 14:36:05

Becky Neville wrote:
> Here is the EXPLAIN output from the two queries.  The first is the one
> that uses WHERE field NOT IN ( 'a','b' etc ).  The second is the (much

Unless you are working with Postgres 7.4devel (i.e. cvs HEAD), the IN
construct is notoriously slow in Postgres. In cvs it is vastly improved.

Also, as I mentioned in the other reply, send in "EXPLAIN ANALYZE"
results instead of "EXPLAIN" (and make sure you run "VACUUM ANALYZE" first).

Joe

Re: NOT IN doesn't use index? (fwd)

From

Becky Neville

Date:

03 May 2003, 15:08:40

Well I think you answered my question already, but just in case
here are the explain results again and the query follows (I warned, it is
long.)  And I did run VACUUM ANALYZE beforehand.

psql:sql/query3.sql:76: NOTICE:  QUERY PLAN:

Seq Scan on uabopen  (cost=0.00..3305914.86 rows=56580 width=7) (actual
time=36077.26..491592.22 rows=48 loops=1)
Total runtime: 491592.52 msec
-------------------------------------------

explain analyze
select uabopen_srat_code
  FROM UABOPEN
 where uabopen_srat_code not in
('1A','1B','1C','1E','1AC','1BC','1CC','1EC','PG1A',

'PG1B','PG1C','PG1E','R1A','R1B',

'R1C','R1E','RD1A','RD1B','RD1C','RD1E','TRF','WN1A',
                                            'WN1B','WN1C','WN1E', 'APS')
   AND uabopen_srat_code not in
('1F','1FD','3A','3AD','3B','3B1','3BD','3C','3CD','3F',

'3FD','3G','3GD','3H','3HD','4A','4AD','5A','5AD','5B','5BD','5C','5CD',

'5D','5DD','5E','5ED','5F','5FD','5G','5GD','6A','6AD','6B','6BD','6C',

'6CD','6D','6DD','8A','8B','8AD','9A','9TA','9AD','9B','9BD','9C','9CD','9D','9D\
D',

'9E','9ED','9F','9FD','9G','9GD','9H','9I','9T','ACC','CM3A','CM3B','CM3C','CM3F\
',

'CM3G','CM3H','DEM','GR3A','GR3B','GR3C','GR3H','GR4A','GR5A','GR5B','GR5C',

'GR5D','GR5E','GR5F','GR6A','GR6B','GR6C','GR6D','GR9A','GR9B','GR9C','GR9D','GR\
9E',

'GR9F','GR9G','GR9H','GR9T','MT3B','MT3C','MT3G','MT3H','MT4A','MT9A','MT9B','MT\
9C',

'MT9D','MT9E','MT9F','MT9G','N1','N10','N100','N101','N102','N103','N104','N105'\
,

'N106','N107','N108','N109','ITCP','1FC','3AP','3CP','5AC',

'5AP','5BC','5BP','5CC','5CP','5DC','5DP','5GC','6AC','6AP','6BC','6BP','6CC','6\
CP',

'6DC','6DP','MT5A','MT5B','MT6A','MT6B','MT5H','MT6I','MT6H',

'5HP','6H','6HC','6HP','6I','6IC','6IP','3BP','5H','5HC',

'5I','5IC','5IP','GR5H','GR5I','GR6H','GR6I',

'MT5I','PG5H','PG5I','PG6H','PG6I','WN5H','WN5I','WN6H','WN6I',

'5CT','6CT','6DT','MT6C','MT6D','MT5C','MT5D','5DT','5HD')
    AND UABOPEN_SRAT_CODE NOT IN
('N11','N110','N111','N112','N113','N114','N115','N116','N117','N118','N119','N12',

'N120','N121','N122','N123','N124','N125','N126','N127','N128','N129','N13','N13\
0',

'N131','N132','N133','N134','N135','N136','N137','N138','N139','N14','N140',

'N141','N142','N143','N144','N145','N146','N147','N148','N149','N15','N150',

'N151','N152','N153','N154','N155','N156','N157','N158'


On Sat, 3 May 2003, Joe Conway wrote:

> Becky Neville wrote:
> > Here is the EXPLAIN output from the two queries.  The first is the one
> > that uses WHERE field NOT IN ( 'a','b' etc ).  The second is the (much
>
> Unless you are working with Postgres 7.4devel (i.e. cvs HEAD), the IN
> construct is notoriously slow in Postgres. In cvs it is vastly improved.
>
> Also, as I mentioned in the other reply, send in "EXPLAIN ANALYZE"
> results instead of "EXPLAIN" (and make sure you run "VACUUM ANALYZE" first).
>
> Joe
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

Re: NOT IN doesn't use index? (fwd)

From

Joe Conway

Date:

03 May 2003, 15:31:56

Becky Neville wrote:
> Well I think you answered my question already, but just in case
> here are the explain results again and the query follows (I warned, it is
> long.)  And I did run VACUUM ANALYZE beforehand.

[snipped ugly query with three NOT IN clauses]

Hmmm, no surprise that's slow. How are those three lists of constants
generated? One idea is to recast this as a left join with a FROM clause
subselect, e.g.

select
  uabopen_srat_code
from
  uabopen u left join
  (select '1F' as uabopen_srat_code union all
          '1FD' union all
          '3A' ...) as ss
  on u.uabopen_srat_code = ss.uabopen_srat_code
where ss.uabopen_srat_code is null;

But I'm not sure that will be much quicker. If the list of
uabopen_srat_code you're filtering on comes from one of the other
tables, you might be able to do better -- back to the question above,
how is that list generated? What do the other table look like?

Joe

Re: NOT IN doesn't use index? (fwd)

From

Becky Neville

Date:

03 May 2003, 15:57:05

I think that list is actually (gulp) hard coded.  It's not my query.  I am
trying to speed it up for someone else - to hopefully learn something in
the process that isn't dependent on what version of postgres i'm
running :)

I assume it's from another table but can't find it on their data model at
the moment.  Those are all valid billing codes.  The query is checking to see if
anyone was billed under an invalid code.  So if everything is ok, the query
returns nothing.

But there must be more to it than that...otherwise, they could just add a
Valid flag to the lookup table.

If you have any ideas for speeding it up other than using another table
please let me know.  It only takes me 9 min to run with 2 mil rows but it
takes them 7 hours (51 mil rows in Oracle with many other jobs running and
poor system maintenance.)

 On Sat, 3 May 2003, Joe Conway
wrote:

> Becky Neville wrote:
> > Well I think you answered my question already, but just in case
> > here are the explain results again and the query follows (I warned, it is
> > long.)  And I did run VACUUM ANALYZE beforehand.
>
> [snipped ugly query with three NOT IN clauses]
>
> Hmmm, no surprise that's slow. How are those three lists of constants
> generated? One idea is to recast this as a left join with a FROM clause
> subselect, e.g.
>
> select
>   uabopen_srat_code
> from
>   uabopen u left join
>   (select '1F' as uabopen_srat_code union all
>           '1FD' union all
>           '3A' ...) as ss
>   on u.uabopen_srat_code = ss.uabopen_srat_code
> where ss.uabopen_srat_code is null;
>
> But I'm not sure that will be much quicker. If the list of
> uabopen_srat_code you're filtering on comes from one of the other
> tables, you might be able to do better -- back to the question above,
> how is that list generated? What do the other table look like?
>
> Joe
>

Re: NOT IN doesn't use index? (fwd)

From

Joe Conway

Date:

03 May 2003, 16:25:47

Becky Neville wrote:
> I think that list is actually (gulp) hard coded.  It's not my query.
> I am trying to speed it up for someone else - to hopefully learn something
> in the process that isn't dependent on what version of postgres i'm
> running :)
>
> I assume it's from another table but can't find it on their data
> model at the moment.  Those are all valid billing codes.  The query
> is checking to see if anyone was billed under an invalid code.  So if
> everything is ok, the query returns nothing.

Yeah -- that sounds like there has to be a table of valid codes
somewhere. In that case you can substitute the "valid_codes" table in
the left join where I had the subselect with all the UNIONs.
Alternatively you might find a NOT EXISTS method would work best. If
there isn't a "valid_codes" table, but that hard coded list is static,
perhaps you could build one and use that.

> But there must be more to it than that...otherwise, they could just
> add a Valid flag to the lookup table.

Well I certainly wouldn't query a whole table of historical information
over and over. Can you use and date column (suitably indexed) to just
check recent transactions (like since the last time you checked)?

> If you have any ideas for speeding it up other than using another
> table please let me know.  It only takes me 9 min to run with 2 mil
> rows but it takes them 7 hours (51 mil rows in Oracle with many other
> jobs running and poor system maintenance.)

As above, are all 51 million rows recent transactions, or is that all of
eternity? If its the latter, I'd scan the whole thing once and produce a
report, or maybe a "transactions_with_invalid_codes" table.

 From that point on, I'd only check the transactions since the last time
I'd checked, either based on a timestamp or even a sequence generated id
field. All you need to do is save off the max value each time you run,
and then use that as the starting point next time.

HTH,

Joe

Re: NOT IN doesn't use index? (fwd)

From

Rod Taylor

Date:

03 May 2003, 17:17:22

On Sat, 2003-05-03 at 15:56, Becky Neville wrote:
> I think that list is actually (gulp) hard coded.  It's not my query.  I am
> trying to speed it up for someone else - to hopefully learn something in
> the process that isn't dependent on what version of postgres i'm
> running :)

An interesting test might be to see if the overhead of doing a character
based comparison (as opposed to integer based) is significant.  If it
is, previous tests show it can be significant for CPU bound queries,
convert all of those codes into integers and use a lookup table table to
do the conversion.

Another interesting thought, since you have a long running query would
be to attempt an inversion.  Create a temporary table with the *valid*
codes if count(valid codes) < 2 * count(invalid codes).  Run the query
replacing NOT IN with a join to the temporary table.  This will reduce
the number of comparisons required, as a match can move onto the next
datum, but a NOT IN must check all values. If this helps, try indexing
(and analyzing) the temporary table.

By far the fastest results can be achieved by not allowing invalid
billing codes to be inserted into the table via a constraint of somekind
(check or fkey to summary table).

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc

Attachment

signature.asc

Re: NOT IN doesn't use index? (fwd)

From

Christopher Kings-Lynne

Date:

04 May 2003, 00:06:13

AFAIK, it's only the IN (large subquery) form that is slow...

Chris

On Sat, 3 May 2003, Joe Conway wrote:

> Becky Neville wrote:
> > Here is the EXPLAIN output from the two queries.  The first is the one
> > that uses WHERE field NOT IN ( 'a','b' etc ).  The second is the (much
>
> Unless you are working with Postgres 7.4devel (i.e. cvs HEAD), the IN
> construct is notoriously slow in Postgres. In cvs it is vastly improved.
>
> Also, as I mentioned in the other reply, send in "EXPLAIN ANALYZE"
> results instead of "EXPLAIN" (and make sure you run "VACUUM ANALYZE" first).
>
> Joe
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>