Thread: Re: NOT IN doesn't use index? (fwd)
Here is the EXPLAIN output from the two queries. The first is the one that uses WHERE field NOT IN ( 'a','b' etc ). The second is the (much faster) one that uses WHERE NOT (field = 'a' and field = 'b' etc). I don't understand why the query planner thinks there are only 38055 rows in the table on the slow one. I didn't run analyze in between them and the second try seems to know (correctly) that there are 1799976 rows. Also, why does the first (slow) one think there are 38055 rows and only evaluate 48 rows - and yet it still takes longer. ? I assume it's due to the lack of a sort, but I don't understand why using NOT IN should prohibit a sort. -------------slow one - ~9 minutes----------------------- /home/accts/ran26/cs437/Proj/code/scripts/sql test=# \i query3.sql psql:query3.sql:76: NOTICE: QUERY PLAN: Seq Scan on uabopen (cost=0.00..3305914.86 rows=38055 width=7) (actual time=36577.26..494243.37 rows=48 loops=1) Total runtime: 494243.67 msec --------------faster one - 2 minutes----------------- psql:query3Mod2.sql:77: NOTICE: QUERY PLAN: Unique (cost=3592408.28..3596908.22 rows=179998 width=7) (actual time=104959.31..114131.22 rows=101 loops=1) -> Sort (cost=3592408.28..3592408.28 rows=1799976 width=7) (actual time=104959.30..108425.61 rows=1799976 loops=1) -> Seq Scan on uabopen (cost=0.00..3305914.86 rows=1799976 width=7) (actual time=30.13..14430.99 rows=1799976 loops=1) Total runtime: 114220.66 msec ---------- Forwarded message ---------- Date: Sat, 3 May 2003 13:09:22 -0400 (EDT) From: Becky Neville <ran26@pantheon.yale.edu> To: Andrew Sullivan <andrew@libertyrms.info> Subject: Re: [PERFORM] NOT IN doesn't use index? I didn't post it because the rest of the query is exactly the same (and the NOT IN list is about a page long - although it's apparently still shorter than the IN list.) I need to verify something and then can send the EXPLAIN output. I am running my own server and have no idea what parameters I should use to speed things up. Everything is dog slow. On Sat, 3 May 2003, Andrew Sullivan wrote: > On Sat, May 03, 2003 at 01:56:02AM -0400, Becky Neville wrote: > > Does the use of WHERE field NOT IN ('A','B' etc) prevent the use of an > > index? > > That '&c.' is hiding a lot. Why not post your query and the explain > analyse output? > > A >
Becky, > Here is the EXPLAIN output from the two queries. The first is the one > that uses WHERE field NOT IN ( 'a','b' etc ). The second is the (much > faster) one > that uses WHERE NOT (field = 'a' and field = 'b' etc). We still can't help you if you're not posting your actual queries. We have no idea what's in query3.sql. We're not clairvoyant, y'know. -- Josh Berkus Aglio Database Solutions San Francisco
Becky Neville wrote: > Here is the EXPLAIN output from the two queries. The first is the one > that uses WHERE field NOT IN ( 'a','b' etc ). The second is the (much Unless you are working with Postgres 7.4devel (i.e. cvs HEAD), the IN construct is notoriously slow in Postgres. In cvs it is vastly improved. Also, as I mentioned in the other reply, send in "EXPLAIN ANALYZE" results instead of "EXPLAIN" (and make sure you run "VACUUM ANALYZE" first). Joe
Well I think you answered my question already, but just in case here are the explain results again and the query follows (I warned, it is long.) And I did run VACUUM ANALYZE beforehand. psql:sql/query3.sql:76: NOTICE: QUERY PLAN: Seq Scan on uabopen (cost=0.00..3305914.86 rows=56580 width=7) (actual time=36077.26..491592.22 rows=48 loops=1) Total runtime: 491592.52 msec ------------------------------------------- explain analyze select uabopen_srat_code FROM UABOPEN where uabopen_srat_code not in ('1A','1B','1C','1E','1AC','1BC','1CC','1EC','PG1A', 'PG1B','PG1C','PG1E','R1A','R1B', 'R1C','R1E','RD1A','RD1B','RD1C','RD1E','TRF','WN1A', 'WN1B','WN1C','WN1E', 'APS') AND uabopen_srat_code not in ('1F','1FD','3A','3AD','3B','3B1','3BD','3C','3CD','3F', '3FD','3G','3GD','3H','3HD','4A','4AD','5A','5AD','5B','5BD','5C','5CD', '5D','5DD','5E','5ED','5F','5FD','5G','5GD','6A','6AD','6B','6BD','6C', '6CD','6D','6DD','8A','8B','8AD','9A','9TA','9AD','9B','9BD','9C','9CD','9D','9D\ D', '9E','9ED','9F','9FD','9G','9GD','9H','9I','9T','ACC','CM3A','CM3B','CM3C','CM3F\ ', 'CM3G','CM3H','DEM','GR3A','GR3B','GR3C','GR3H','GR4A','GR5A','GR5B','GR5C', 'GR5D','GR5E','GR5F','GR6A','GR6B','GR6C','GR6D','GR9A','GR9B','GR9C','GR9D','GR\ 9E', 'GR9F','GR9G','GR9H','GR9T','MT3B','MT3C','MT3G','MT3H','MT4A','MT9A','MT9B','MT\ 9C', 'MT9D','MT9E','MT9F','MT9G','N1','N10','N100','N101','N102','N103','N104','N105'\ , 'N106','N107','N108','N109','ITCP','1FC','3AP','3CP','5AC', '5AP','5BC','5BP','5CC','5CP','5DC','5DP','5GC','6AC','6AP','6BC','6BP','6CC','6\ CP', '6DC','6DP','MT5A','MT5B','MT6A','MT6B','MT5H','MT6I','MT6H', '5HP','6H','6HC','6HP','6I','6IC','6IP','3BP','5H','5HC', '5I','5IC','5IP','GR5H','GR5I','GR6H','GR6I', 'MT5I','PG5H','PG5I','PG6H','PG6I','WN5H','WN5I','WN6H','WN6I', '5CT','6CT','6DT','MT6C','MT6D','MT5C','MT5D','5DT','5HD') AND UABOPEN_SRAT_CODE NOT IN ('N11','N110','N111','N112','N113','N114','N115','N116','N117','N118','N119','N12', 'N120','N121','N122','N123','N124','N125','N126','N127','N128','N129','N13','N13\ 0', 'N131','N132','N133','N134','N135','N136','N137','N138','N139','N14','N140', 'N141','N142','N143','N144','N145','N146','N147','N148','N149','N15','N150', 'N151','N152','N153','N154','N155','N156','N157','N158' On Sat, 3 May 2003, Joe Conway wrote: > Becky Neville wrote: > > Here is the EXPLAIN output from the two queries. The first is the one > > that uses WHERE field NOT IN ( 'a','b' etc ). The second is the (much > > Unless you are working with Postgres 7.4devel (i.e. cvs HEAD), the IN > construct is notoriously slow in Postgres. In cvs it is vastly improved. > > Also, as I mentioned in the other reply, send in "EXPLAIN ANALYZE" > results instead of "EXPLAIN" (and make sure you run "VACUUM ANALYZE" first). > > Joe > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >
Becky Neville wrote: > Well I think you answered my question already, but just in case > here are the explain results again and the query follows (I warned, it is > long.) And I did run VACUUM ANALYZE beforehand. [snipped ugly query with three NOT IN clauses] Hmmm, no surprise that's slow. How are those three lists of constants generated? One idea is to recast this as a left join with a FROM clause subselect, e.g. select uabopen_srat_code from uabopen u left join (select '1F' as uabopen_srat_code union all '1FD' union all '3A' ...) as ss on u.uabopen_srat_code = ss.uabopen_srat_code where ss.uabopen_srat_code is null; But I'm not sure that will be much quicker. If the list of uabopen_srat_code you're filtering on comes from one of the other tables, you might be able to do better -- back to the question above, how is that list generated? What do the other table look like? Joe
I think that list is actually (gulp) hard coded. It's not my query. I am trying to speed it up for someone else - to hopefully learn something in the process that isn't dependent on what version of postgres i'm running :) I assume it's from another table but can't find it on their data model at the moment. Those are all valid billing codes. The query is checking to see if anyone was billed under an invalid code. So if everything is ok, the query returns nothing. But there must be more to it than that...otherwise, they could just add a Valid flag to the lookup table. If you have any ideas for speeding it up other than using another table please let me know. It only takes me 9 min to run with 2 mil rows but it takes them 7 hours (51 mil rows in Oracle with many other jobs running and poor system maintenance.) On Sat, 3 May 2003, Joe Conway wrote: > Becky Neville wrote: > > Well I think you answered my question already, but just in case > > here are the explain results again and the query follows (I warned, it is > > long.) And I did run VACUUM ANALYZE beforehand. > > [snipped ugly query with three NOT IN clauses] > > Hmmm, no surprise that's slow. How are those three lists of constants > generated? One idea is to recast this as a left join with a FROM clause > subselect, e.g. > > select > uabopen_srat_code > from > uabopen u left join > (select '1F' as uabopen_srat_code union all > '1FD' union all > '3A' ...) as ss > on u.uabopen_srat_code = ss.uabopen_srat_code > where ss.uabopen_srat_code is null; > > But I'm not sure that will be much quicker. If the list of > uabopen_srat_code you're filtering on comes from one of the other > tables, you might be able to do better -- back to the question above, > how is that list generated? What do the other table look like? > > Joe >
Becky Neville wrote: > I think that list is actually (gulp) hard coded. It's not my query. > I am trying to speed it up for someone else - to hopefully learn something > in the process that isn't dependent on what version of postgres i'm > running :) > > I assume it's from another table but can't find it on their data > model at the moment. Those are all valid billing codes. The query > is checking to see if anyone was billed under an invalid code. So if > everything is ok, the query returns nothing. Yeah -- that sounds like there has to be a table of valid codes somewhere. In that case you can substitute the "valid_codes" table in the left join where I had the subselect with all the UNIONs. Alternatively you might find a NOT EXISTS method would work best. If there isn't a "valid_codes" table, but that hard coded list is static, perhaps you could build one and use that. > But there must be more to it than that...otherwise, they could just > add a Valid flag to the lookup table. Well I certainly wouldn't query a whole table of historical information over and over. Can you use and date column (suitably indexed) to just check recent transactions (like since the last time you checked)? > If you have any ideas for speeding it up other than using another > table please let me know. It only takes me 9 min to run with 2 mil > rows but it takes them 7 hours (51 mil rows in Oracle with many other > jobs running and poor system maintenance.) As above, are all 51 million rows recent transactions, or is that all of eternity? If its the latter, I'd scan the whole thing once and produce a report, or maybe a "transactions_with_invalid_codes" table. From that point on, I'd only check the transactions since the last time I'd checked, either based on a timestamp or even a sequence generated id field. All you need to do is save off the max value each time you run, and then use that as the starting point next time. HTH, Joe
On Sat, 2003-05-03 at 15:56, Becky Neville wrote: > I think that list is actually (gulp) hard coded. It's not my query. I am > trying to speed it up for someone else - to hopefully learn something in > the process that isn't dependent on what version of postgres i'm > running :) An interesting test might be to see if the overhead of doing a character based comparison (as opposed to integer based) is significant. If it is, previous tests show it can be significant for CPU bound queries, convert all of those codes into integers and use a lookup table table to do the conversion. Another interesting thought, since you have a long running query would be to attempt an inversion. Create a temporary table with the *valid* codes if count(valid codes) < 2 * count(invalid codes). Run the query replacing NOT IN with a join to the temporary table. This will reduce the number of comparisons required, as a match can move onto the next datum, but a NOT IN must check all values. If this helps, try indexing (and analyzing) the temporary table. By far the fastest results can be achieved by not allowing invalid billing codes to be inserted into the table via a constraint of somekind (check or fkey to summary table). -- Rod Taylor <rbt@rbt.ca> PGP Key: http://www.rbt.ca/rbtpub.asc
Attachment
AFAIK, it's only the IN (large subquery) form that is slow... Chris On Sat, 3 May 2003, Joe Conway wrote: > Becky Neville wrote: > > Here is the EXPLAIN output from the two queries. The first is the one > > that uses WHERE field NOT IN ( 'a','b' etc ). The second is the (much > > Unless you are working with Postgres 7.4devel (i.e. cvs HEAD), the IN > construct is notoriously slow in Postgres. In cvs it is vastly improved. > > Also, as I mentioned in the other reply, send in "EXPLAIN ANALYZE" > results instead of "EXPLAIN" (and make sure you run "VACUUM ANALYZE" first). > > Joe > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >