Re: "SELECT .. WHERE NOT IN" query running for hours - Mailing list pgsql-performance

From Γιωργος Βαλκανας
Subject Re: "SELECT .. WHERE NOT IN" query running for hours
Date
Msg-id AANLkTimCc6U-xHigzNcjF3xABadg-yKV_1+kTqXRXhnL@mail.gmail.com
Whole thread Raw
In response to Re: "SELECT .. WHERE NOT IN" query running for hours  (Mladen Gogala <mladen.gogala@vmsinfo.com>)
Responses Re: "SELECT .. WHERE NOT IN" query running for hours  (Mladen Gogala <mladen.gogala@vmsinfo.com>)
Re: "SELECT .. WHERE NOT IN" query running for hours  (Scott Carey <scott@richrelevance.com>)
List pgsql-performance
Fair enough!

I also turned seqscan off, so the new plan (for the NOT EXISTS) is:

Merge Anti Join (cost=0.00..212686.89 rows=1 width=313) (actual time=0.426..14921.344 rows=63836 loops=1)
 Merge Cond: ((d2.hwdocid)::text = (d.hwdocid)::text)
 -> Index Scan using hwdocid2_uniq on "Doc2" d2 (cost=0.00..19442.87 rows=96454 width=313) (actual time=0.130..1248.783 rows=96454 loops=1)
 -> Index Scan using hwdocid_uniq on "Document" d (cost=0.00..189665.17 rows=949272 width=12) (actual time=0.085..11158.740 rows=948336 loops=1)
Total runtime: 15062.925 ms

Hmm.. doesn't really seem to be such a great boost on performance. But i guess I'll be sticking to this one.

So my follow-up question on the subject is this:

Are there any particular semantics for the "NOT IN" statement that cause the correlated query to execute for every row of the outter query, as opposed to the "NOT EXISTS" ? Or are there any other practical reasons, related to "IN / NOT IN", for this to be happening? Or is it simply due to implementation details of each RDBMS? I guess the former (or the 2nd one), since, as you say, this is common in most databases, but I would most appreciate an answer to clarify this.

Thanks again!

Best regards,
George



2011/1/7 Mladen Gogala <mladen.gogala@vmsinfo.com>
On 1/6/2011 9:36 PM, Γιωργος Βαλκανας wrote:

1) Why is it taking *so* long for the first query (with the "NOT IN" ) to do even the simple select?
Because NOT IN has to execute the correlated subquery for every row and then check whether the requested value is in the result set, usually by doing sequential comparison. The NOT EXIST plan is also bad because there is no index but at least it can use very fast and efficient hash algorithm. Indexing the "hwdocid" column on the "Document" table or, ideally, making it a primary key, should provide an additional boost to your query. If you already do have an index, you may consider using enable_seqscan=false for this session, so that the "hwdocid" index will be used. It's a common wisdom that in the most cases NOT EXISTS will beat NOT IN. That is so all over the database world. I've seen that in Oracle applications, MS SQL applications and, of course MySQL applications. Optimizing queries is far from trivial.

Μλαδεν Γογαλα

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


pgsql-performance by date:

Previous
From: marc47marc47
Date:
Subject: Re: How to turn autovacuum prevent wrap around run faster?
Next
From: Florian Weimer
Date:
Subject: Wrong docs on checkpoint_segments?