Re: Performance problems testing with Spamassassin 3.1.0 - Mailing list pgsql-performance
From | John A Meinel |
---|---|
Subject | Re: Performance problems testing with Spamassassin 3.1.0 |
Date | |
Msg-id | 42F22F59.7090901@arbash-meinel.com Whole thread Raw |
In response to | Re: Performance problems testing with Spamassassin 3.1.0 (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Performance problems testing with Spamassassin 3.1.0
|
List | pgsql-performance |
Tom Lane wrote: > Matthew Schumacher <matt.s@aptalaska.net> writes: > >> for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) >> LOOP >> _token := intokenary[i]; >> INSERT INTO bayes_token_tmp VALUES (_token); >> END LOOP; > > >> UPDATE >> bayes_token >> SET >> spam_count = greatest_int(spam_count + inspam_count, 0), >> ham_count = greatest_int(ham_count + inham_count , 0), >> atime = greatest_int(atime, 1000) >> WHERE >> id = inuserid >> AND >> (token) IN (SELECT intoken FROM bayes_token_tmp); > > > I don't really see why you think that this path is going to lead to > better performance than where you were before. Manipulation of the > temp table is never going to be free, and IN (sub-select) is always > inherently not fast, and NOT IN (sub-select) is always inherently > awful. Throwing a pile of simple queries at the problem is not > necessarily the wrong way ... especially when you are doing it in > plpgsql, because you've already eliminated the overhead of network > round trips and repeated planning of the queries. So for an IN (sub-select), does it actually pull all of the rows from the other table, or is the planner smart enough to stop once it finds something? Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)? What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y) I would guess that the EXISTS/NOT EXISTS would be faster, though it probably would necessitate using a nested loop (at least that seems to be the way the query is written). I did some tests on a database with 800k rows, versus a temp table with 2k rows. I did one sequential test (1-2000, with 66 rows missing), and one sparse test (1-200, 100000-100200, 200000-200200, ... with 658 rows missing). If found that NOT IN did indeed have to load the whole table. IN was smart enough to do a nested loop. EXISTS and NOT EXISTS did a sequential scan on my temp table, with a SubPlan filter (which looks a whole lot like a Nested Loop). What I found was that IN performed about the same as EXISTS (since they are both effectively doing a nested loop), but that NOT IN took 4,000ms while NOT EXISTS was the same speed as EXISTS at around 166ms. Anyway, so it does seem like NOT IN is not a good choice, but IN seems to be equivalent to EXISTS, and NOT EXISTS is also very fast. Is this generally true, or did I just get lucky on my data? John =:-> > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id IN (SELECT id FROM object_t); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Nested Loop IN Join (cost=0.00..9851.68 rows=2140 width=4) (actual time=0.085..183.889 rows=1351 loops=1) -> Seq Scan on ids (cost=0.00..31.40 rows=2140 width=4) (actual time=0.014..24.032 rows=2009 loops=1) -> Index Scan using object_t_pkey on object_t (cost=0.00..4.58 rows=1 width=4) (actual time=0.071..0.071 rows=1 loops=2009) Index Cond: ("outer".id = object_t.id) Total runtime: 184.823 ms (5 rows) Time: 186.931 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.086..165.053 rows=1351 loops=1) Filter: (subplan) SubPlan -> Index Scan using object_t_pkey on object_t o (cost=0.00..4.58 rows=1 width=4) (actual time=0.025..0.025 rows=1loops=2009) Index Cond: (id = $0) Total runtime: 165.995 ms (6 rows) Time: 167.795 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id NOT IN (SELECT id FROM object_t); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=36410.51..36447.26 rows=1070 width=4) (actual time=4168.247..4172.080 rows=658 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on object_t (cost=0.00..34381.81 rows=811481 width=4) (actual time=0.044..2464.296 rows=811481 loops=1) Total runtime: 4210.784 ms (5 rows) Time: 4212.276 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE NOT EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.372..164.510 rows=658 loops=1) Filter: (NOT (subplan)) SubPlan -> Index Scan using object_t_pkey on object_t o (cost=0.00..4.58 rows=1 width=4) (actual time=0.064..0.064 rows=1loops=2009) Index Cond: (id = $0) Total runtime: 165.016 ms (6 rows) Time: 166.786 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id IN (SELECT id FROM object_t); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Nested Loop IN Join (cost=0.00..9851.68 rows=2140 width=4) (actual time=0.069..97.567 rows=1934 loops=1) -> Seq Scan on ids (cost=0.00..31.40 rows=2140 width=4) (actual time=0.012..1.868 rows=2000 loops=1) -> Index Scan using object_t_pkey on object_t (cost=0.00..4.58 rows=1 width=4) (actual time=0.045..0.045 rows=1 loops=2000) Index Cond: ("outer".id = object_t.id) Total runtime: 98.236 ms (5 rows) Time: 99.921 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.071..74.158 rows=1934 loops=1) Filter: (subplan) SubPlan -> Index Scan using object_t_pkey on object_t o (cost=0.00..4.58 rows=1 width=4) (actual time=0.013..0.013 rows=1loops=2000) Index Cond: (id = $0) Total runtime: 74.798 ms (6 rows) Time: 86.287 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id NOT IN (SELECT id FROM object_t); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=36410.51..36447.26 rows=1070 width=4) (actual time=4024.613..4028.774 rows=66 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on object_t (cost=0.00..34381.81 rows=811481 width=4) (actual time=0.040..2503.374 rows=811481 loops=1) Total runtime: 4122.327 ms (5 rows) Time: 4134.659 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE NOT EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on ids (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.220..92.611 rows=66 loops=1) Filter: (NOT (subplan)) SubPlan -> Index Scan using object_t_pkey on object_t o (cost=0.00..4.58 rows=1 width=4) (actual time=0.043..0.043 rows=1loops=2000) Index Cond: (id = $0) Total runtime: 92.743 ms (6 rows) Time: 94.190 ms mifar07=#
Attachment
pgsql-performance by date: