Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance

From John A Meinel
Subject Re: Performance problems testing with Spamassassin 3.1.0
Date
Msg-id 42F22F59.7090901@arbash-meinel.com
Whole thread Raw
In response to Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Performance problems testing with Spamassassin 3.1.0  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Tom Lane wrote:
> Matthew Schumacher <matt.s@aptalaska.net> writes:
>
>>  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
>>  LOOP
>>    _token := intokenary[i];
>>    INSERT INTO bayes_token_tmp VALUES (_token);
>>  END LOOP;
>
>
>>  UPDATE
>>    bayes_token
>>  SET
>>    spam_count = greatest_int(spam_count + inspam_count, 0),
>>    ham_count = greatest_int(ham_count + inham_count , 0),
>>    atime = greatest_int(atime, 1000)
>>  WHERE
>>    id = inuserid
>>  AND
>>    (token) IN (SELECT intoken FROM bayes_token_tmp);
>
>
> I don't really see why you think that this path is going to lead to
> better performance than where you were before.  Manipulation of the
> temp table is never going to be free, and IN (sub-select) is always
> inherently not fast, and NOT IN (sub-select) is always inherently
> awful.  Throwing a pile of simple queries at the problem is not
> necessarily the wrong way ... especially when you are doing it in
> plpgsql, because you've already eliminated the overhead of network
> round trips and repeated planning of the queries.

So for an IN (sub-select), does it actually pull all of the rows from
the other table, or is the planner smart enough to stop once it finds
something?

Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)?

What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y)

I would guess that the EXISTS/NOT EXISTS would be faster, though it
probably would necessitate using a nested loop (at least that seems to
be the way the query is written).

I did some tests on a database with 800k rows, versus a temp table with
2k rows. I did one sequential test (1-2000, with 66 rows missing), and
one sparse test (1-200, 100000-100200, 200000-200200, ... with 658 rows
missing).

If found that NOT IN did indeed have to load the whole table. IN was
smart enough to do a nested loop.
EXISTS and NOT EXISTS did a sequential scan on my temp table, with a
SubPlan filter (which looks a whole lot like a Nested Loop).

What I found was that IN performed about the same as EXISTS (since they
are both effectively doing a nested loop), but that NOT IN took 4,000ms
while NOT EXISTS was the same speed as EXISTS at around 166ms.

Anyway, so it does seem like NOT IN is not a good choice, but IN seems
to be equivalent to EXISTS, and NOT EXISTS is also very fast.

Is this generally true, or did I just get lucky on my data?

John
=:->



>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>

mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id IN (SELECT id FROM object_t);
                                                           QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..9851.68 rows=2140 width=4) (actual time=0.085..183.889 rows=1351 loops=1)
   ->  Seq Scan on ids  (cost=0.00..31.40 rows=2140 width=4) (actual time=0.014..24.032 rows=2009 loops=1)
   ->  Index Scan using object_t_pkey on object_t  (cost=0.00..4.58 rows=1 width=4) (actual time=0.071..0.071 rows=1
loops=2009)
         Index Cond: ("outer".id = object_t.id)
 Total runtime: 184.823 ms
(5 rows)

Time: 186.931 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id);
                                                             QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on ids  (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.086..165.053 rows=1351 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Index Scan using object_t_pkey on object_t o  (cost=0.00..4.58 rows=1 width=4) (actual time=0.025..0.025
rows=1loops=2009) 
           Index Cond: (id = $0)
 Total runtime: 165.995 ms
(6 rows)

Time: 167.795 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id NOT IN (SELECT id FROM object_t);
                                                        QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on ids  (cost=36410.51..36447.26 rows=1070 width=4) (actual time=4168.247..4172.080 rows=658 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
     ->  Seq Scan on object_t  (cost=0.00..34381.81 rows=811481 width=4) (actual time=0.044..2464.296 rows=811481
loops=1)
 Total runtime: 4210.784 ms
(5 rows)

Time: 4212.276 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE NOT EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id);
                                                             QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on ids  (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.372..164.510 rows=658 loops=1)
   Filter: (NOT (subplan))
   SubPlan
     ->  Index Scan using object_t_pkey on object_t o  (cost=0.00..4.58 rows=1 width=4) (actual time=0.064..0.064
rows=1loops=2009) 
           Index Cond: (id = $0)
 Total runtime: 165.016 ms
(6 rows)

Time: 166.786 ms

mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id IN (SELECT id FROM object_t);
                                                           QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..9851.68 rows=2140 width=4) (actual time=0.069..97.567 rows=1934 loops=1)
   ->  Seq Scan on ids  (cost=0.00..31.40 rows=2140 width=4) (actual time=0.012..1.868 rows=2000 loops=1)
   ->  Index Scan using object_t_pkey on object_t  (cost=0.00..4.58 rows=1 width=4) (actual time=0.045..0.045 rows=1
loops=2000)
         Index Cond: ("outer".id = object_t.id)
 Total runtime: 98.236 ms
(5 rows)

Time: 99.921 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id);
                                                             QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on ids  (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.071..74.158 rows=1934 loops=1)
   Filter: (subplan)
   SubPlan
     ->  Index Scan using object_t_pkey on object_t o  (cost=0.00..4.58 rows=1 width=4) (actual time=0.013..0.013
rows=1loops=2000) 
           Index Cond: (id = $0)
 Total runtime: 74.798 ms
(6 rows)

Time: 86.287 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id NOT IN (SELECT id FROM object_t);
                                                        QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on ids  (cost=36410.51..36447.26 rows=1070 width=4) (actual time=4024.613..4028.774 rows=66 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
     ->  Seq Scan on object_t  (cost=0.00..34381.81 rows=811481 width=4) (actual time=0.040..2503.374 rows=811481
loops=1)
 Total runtime: 4122.327 ms
(5 rows)

Time: 4134.659 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE NOT EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id);
                                                             QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on ids  (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.220..92.611 rows=66 loops=1)
   Filter: (NOT (subplan))
   SubPlan
     ->  Index Scan using object_t_pkey on object_t o  (cost=0.00..4.58 rows=1 width=4) (actual time=0.043..0.043
rows=1loops=2000) 
           Index Cond: (id = $0)
 Total runtime: 92.743 ms
(6 rows)

Time: 94.190 ms
mifar07=#


Attachment

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Next
From: Matthew Schumacher
Date:
Subject: Re: Performance problems testing with Spamassassin 3.1.0