Re: Inefficient query plan - Mailing list pgsql-performance
From | Jann Röder |
---|---|
Subject | Re: Inefficient query plan |
Date | |
Msg-id | i50fuk$vg4$1@dough.gmane.org Whole thread Raw |
In response to | Re: Inefficient query plan ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>) |
Responses |
Re: Inefficient query plan
|
List | pgsql-performance |
So that took a while... I'm currently running ANALYZE on the PaperReferences table again (the one where I changed the data type). The plan however is still the same: "Hash Join (cost=280.88..24330800.08 rows=670602240 width=16)" " Hash Cond: (r.itemid = p.itemid)" " -> Seq Scan on paperreferences r (cost=0.00..15109738.40 rows=670602240 width=64)" " -> Hash (cost=274.53..274.53 rows=508 width=16)" " -> Index Scan using idx_papers_issueid on papers p (cost=0.00..274.53 rows=508 width=16)" " Index Cond: (issueid = 'A1983PW823'::bpchar)" But I can now force it to use an index scan instead of a seqScan: "Merge Join (cost=0.00..2716711476.57 rows=670602240 width=16)" " Merge Cond: (p.itemid = r.itemid)" " -> Index Scan using papers_pkey on papers p (cost=0.00..21335008.47 rows=508 width=16)" " Filter: (issueid = 'A1983PW823'::bpchar)" " -> Index Scan using paperreferences_pkey on paperreferences r (cost=0.00..2686993938.83 rows=670602240 width=64)" Unfortunately this is not faster than the other one. I did not wait until it returned because I want this query to take less than 5 seconds or so. Here is my query again: SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE p.itemID = r.ItemID AND p.issueID = 'A1983PW823'; I can also write it as: SELECT ItemID FROM PaperReferences WHERE ItemID IN (SELECT ItemID FROM Papers WHERE IssueID = 'A1983PW823') Which is more what I would do if I was the database. Unfortunately this is not fast either: "Hash Semi Join (cost=280.88..24330800.08 rows=670602240 width=64)" " Hash Cond: (paperreferences.itemid = papers.itemid)" " -> Seq Scan on paperreferences (cost=0.00..15109738.40 rows=670602240 width=64)" " -> Hash (cost=274.53..274.53 rows=508 width=16)" " -> Index Scan using idx_papers_issueid on papers (cost=0.00..274.53 rows=508 width=16)" " Index Cond: (issueid = 'A1983PW823'::bpchar)" The sub-query SELECT ItemID FROM Papers WHERE IssueID = 'A1983PW823' is really fast, though and returns 16 rows. If I unroll the query by hand like this: SELECT ItemID FROM PaperReferences WHERE (ItemID = 'A1983PW82300001' OR ItemID = 'A1983PW82300002' OR ItemID = 'A1983PW82300003' OR ItemID = 'A1983PW82300004' OR ItemID = 'A1983PW82300005' OR ItemID = 'A1983PW82300006' OR ...) (All the ORed stuff is the result of the sub-query) I get my result really fast. So what I need now is a way to tell postgres to do it that way automatically. If everything else fails I will have to put that logic into my application in java code, which I don't want to do because then I will also have to remove my constraints so I can delete stuff at a reasonable speed. Thanks, Jann Am 23.08.10 15:33, schrieb Kevin Grittner: > Jann Röder<roederja@ethz.ch> wrote: > >> Table "public.papers" >> Column | Type | Modifiers >> ------------------+-------------------------+----------- >> itemid | character(15) | not null > >> wos-db=> \d PaperReferences >> Table "public.paperreferences" >> Column | Type | Modifiers >> --------------------+-----------------------+----------- >> itemid | character varying(15) | not null > >> I just noticed that PaperReferences uses character varying (15) >> and Papers uses character(15). Stupid mistake of mine. Do you >> think this might cause the bad query planning? > > Absolutely. These are *not* the same type and don't compare all > that well. > >> I will alter the table to use character(15) in both cases and see >> if that helps. > > I suspect that making them the same will cure the problem, but I > would recommend you make any character(n) columns character > varying(n) instead of the other way around. The the character(n) > data type has many surprising behaviors and tends to perform worse. > Avoid using it if possible. > >> postgresql.conf: >> max_connections = 20 >> shared_buffers = 256MB >> work_mem = 10MB >> maintenance_work_mem = 128MB >> max_stack_depth = 4MB >> synchronous_commit = off >> wal_buffers = 1MB >> checkpoint_segments = 10 >> effective_cache_size = 768MB >> default_statistics_target = 200 >> datestyle = 'iso, mdy' >> lc_messages = 'C' >> lc_monetary = 'C' >> lc_numeric = 'C' >> lc_time = 'C' >> default_text_search_config = 'pg_catalog.simple' > >> Do you need an EXPLAIN ANALYZE output? Since it takes so long I >> can't easily post one right now. But maybe I can get one over >> night. > > Not necessary; you've already identified the cause and the fix. > >> My Hardware is an iMac running OS X 10.6.4 with 1.5 GB RAM and a >> 2.1 GHz (or so) core 2 Duo processor. > > OK. If you still don't get a good plan, you might want to try > edging up effective_cache_size, if the sum of your shared_buffers > and OS cache is larger than 768MB (which I would expect it might > be). If the active part of your database (the part which is > frequently referenced) fits within cache space, or even a > significant portion of it fits, you might need to adjust > random_page_cost and perhaps seq_page_cost to reflect the lower > average cost of fetching from cache rather than disk -- but you want > to fix your big problem (the type mismatch) first, and then see if > you need further adjustments. > > -Kevin >
pgsql-performance by date: