Thread: Inefficient query plan
I have two tables:
A: ItemID (PK), IssueID (indexed)
B: ItemID (FK), IndexNumber; PK(ItemID, IndexNumber)

Both tables have several million rows, but B has many more than A.

Now if I run

SELECT A.ItemID FROM A, B WHERE A.ItemID = B.ItemID AND A.IssueID = <some id>

the query takes extremely long (several hours). I ran EXPLAIN and got:

"Hash Join  (cost=516.66..17710110.47 rows=8358225 width=16)"
"  Hash Cond: ((b.itemid)::bpchar = a.itemid)"
"  ->  Seq Scan on b  (cost=0.00..15110856.68 rows=670707968 width=16)"
"  ->  Hash  (cost=504.12..504.12 rows=1003 width=16)"
"        ->  Index Scan using idx_issueid on a  (cost=0.00..504.12 rows=1003 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

We can see that the problem is the seq scan on B. However, only a handful of rows in A match a specific IssueID. So my question is: why doesn't it just check, for each of the ItemIDs in A that have the correct IssueID, whether there is a matching ItemID in B? That should be really fast, because ItemID in B is indexed since it is part of the primary key. Why is Postgres not doing this, and is there a way I can make it do that?

I'm using PostgreSQL 8.4.4 and yes, I did run ANALYZE on the entire DB. I have

work_mem = 10MB
shared_buffers = 256MB
effective_cache_size = 768MB

The database is basically for a single user.

Thanks a lot,
Jann
On Sun, Aug 22, 2010 at 10:23 PM, Jann Röder <roederja@ethz.ch> wrote:
> I have two tables:
> A: ItemID (PK), IssueID (indexed)
> B: ItemID (FK), IndexNumber; PK(ItemID, IndexNumber)
>
> Both tables have several million rows, but B has many more than A.
>
> Now if I run
>
> SELECT A.ItemID FROM A, B WHERE A.ItemID = B.ItemID AND A.IssueID = <some id>
>
> the query takes extremely long (several hours). I ran EXPLAIN and got:
>
> "Hash Join  (cost=516.66..17710110.47 rows=8358225 width=16)"
> "  Hash Cond: ((b.itemid)::bpchar = a.itemid)"
> "  ->  Seq Scan on b  (cost=0.00..15110856.68 rows=670707968 width=16)"
> "  ->  Hash  (cost=504.12..504.12 rows=1003 width=16)"
> "        ->  Index Scan using idx_issueid on a  (cost=0.00..504.12 rows=1003 width=16)"
> "              Index Cond: (issueid = 'A1983PW823'::bpchar)"

Have you tried adding an index on b.indexid?
Also are a.indexid and b.indexid the same type?
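(As a hedged aside, one quick way to check whether the join columns really are the same type is to query the catalog directly. The table name 'a' and 'b' below are just the placeholder names from the first message; substitute the real table and column names.)

-- Compare the declared types of the join columns in both tables
-- ('a' and 'b' are placeholders from the example above).
SELECT table_name, column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name IN ('a', 'b')
  AND column_name = 'itemid';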
On 23.08.10 07:52, Scott Marlowe wrote:
> Also are a.indexid and b.indexid the same type?

You mean ItemID? Fields of the same name are of the same type - so yes. According to the documentation, PostgreSQL adds indexes for primary keys automatically, so (b.ItemID, b.IndexNumber) is indexed. Or do you think adding an extra index on b.ItemID alone will help? If I understand the documentation correctly, PostgreSQL can use the first column of a multi-column index as if it were indexed individually... but maybe I'm wrong here.

>> I have two tables:
>> A: ItemID (PK), IssueID (indexed)
>> B: ItemID (FK), IndexNumber; PK(ItemID, IndexNumber)
On Mon, Aug 23, 2010 at 4:15 AM, Jann Röder <roederja@ethz.ch> wrote:
> On 23.08.10 07:52, Scott Marlowe wrote:
>> Also are a.indexid and b.indexid the same type?
>
> You mean ItemID? Fields of the same name are of the same type - so yes.
> According to the documentation, PostgreSQL adds indexes for primary keys
> automatically, so (b.ItemID, b.IndexNumber) is indexed. Or do you think
> adding an extra index on b.ItemID alone will help? If I understand the
> documentation correctly, PostgreSQL can use the first column of a
> multi-column index as if it were indexed individually... but maybe I'm
> wrong here.

It can, but that doesn't mean it will. A multi-column index is often quite a bit bigger than a single-column one.

What happens if you try

set enable_seqscan=off;
(your query here)
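(Spelled out, the test being suggested might look like the sketch below; it uses the generic A/B table names from the first message, and enable_seqscan only affects the current session.)

-- Disable sequential scans for this session only, then re-run the query
-- to see whether the planner can produce an index-based plan at all.
SET enable_seqscan = off;

EXPLAIN
SELECT A.ItemID
FROM A, B
WHERE A.ItemID = B.ItemID
  AND A.IssueID = 'A1983PW823';

-- Restore the default afterwards.
RESET enable_seqscan;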
On 23.08.10 12:18, Scott Marlowe wrote:
> On Mon, Aug 23, 2010 at 4:15 AM, Jann Röder <roederja@ethz.ch> wrote:
>> On 23.08.10 07:52, Scott Marlowe wrote:
>>> Also are a.indexid and b.indexid the same type?
>>
>> You mean ItemID? Fields of the same name are of the same type - so yes.
>> According to the documentation, PostgreSQL adds indexes for primary keys
>> automatically, so (b.ItemID, b.IndexNumber) is indexed. Or do you think
>> adding an extra index on b.ItemID alone will help? If I understand the
>> documentation correctly, PostgreSQL can use the first column of a
>> multi-column index as if it were indexed individually... but maybe I'm
>> wrong here.
>
> It can, but that doesn't mean it will. A multi-column index is often
> quite a bit bigger than a single-column one.
>
> What happens if you try
>
> set enable_seqscan=off;
> (your query here)

Tried that already. The query plan is exactly the same.
Jann Röder wrote:
> On 23.08.10 12:18, Scott Marlowe wrote:
>> What happens if you try
>>
>> set enable_seqscan=off;
>> (your query here)
>
> Tried that already. The query plan is exactly the same.

Exactly? Not even the cost shown for the seq scan changed?

You are almost certainly omitting some crucial piece of information in your report. Please look over this page and post a more complete report. In particular, please show the results of \d for both tables (or of pg_dump -s -t 'tablename'), your complete postgresql.conf stripped of comments, and a description of your hardware and OS.

-Kevin
I forgot to paste link: http://wiki.postgresql.org/wiki/SlowQueryQuestions -Kevin
Thanks for your help, here is the information you requested:

Table information: A = Papers, B = PaperReferences

wos-db=> \d Papers
                    Table "public.papers"
      Column      |          Type           | Modifiers
------------------+-------------------------+-----------
 itemid           | character(15)           | not null
 t9id             | integer                 |
 artn             | character varying       |
 doi              | character varying       |
 pii              | character varying       |
 unsp             | character varying       |
 issueid          | character(10)           | not null
 title            | character varying(1500) | not null
 titleenhancement | character varying(500)  |
 beginningpage    | character varying(19)   |
 pagecount        | integer                 | not null
 documenttype     | character(1)            | not null
 abstract         | text                    |
Indexes:
    "papers_pkey" PRIMARY KEY, btree (itemid)
    "idx_papers_issueid" btree (issueid)
Foreign-key constraints:
    "papers_issueid_fkey" FOREIGN KEY (issueid) REFERENCES journals(issueid) ON DELETE CASCADE
Referenced by:
    TABLE "authorkeywords" CONSTRAINT "authorkeywords_itemid_fkey" FOREIGN KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE
    TABLE "authors" CONSTRAINT "authors_itemid_fkey" FOREIGN KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE
    TABLE "grantnumbers" CONSTRAINT "grantnumbers_itemid_fkey" FOREIGN KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE
    TABLE "keywordsplus" CONSTRAINT "keywordsplus_itemid_fkey" FOREIGN KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE
    TABLE "languages" CONSTRAINT "languages_itemid_fkey" FOREIGN KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE
    TABLE "paperreferences" CONSTRAINT "paperreferences_fromitemid_fkey" FOREIGN KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE

wos-db=> \d PaperReferences
             Table "public.paperreferences"
       Column       |         Type          | Modifiers
--------------------+-----------------------+-----------
 itemid             | character varying(15) | not null
 t9id               | integer               |
 citedauthor        | character varying(75) |
 citedartn          | character varying     |
 citeddoi           | character varying     |
 citedpii           | character varying     |
 citedunsp          | character varying     |
 citedreferenceyear | integer               |
 citedtitle         | character varying(20) | not null
 citedvolume        | character varying(4)  |
 citedpage          | character varying(5)  |
 referenceindex     | integer               | not null
Indexes:
    "paperreferences_pkey" PRIMARY KEY, btree (itemid, referenceindex)
Foreign-key constraints:
    "paperreferences_fromitemid_fkey" FOREIGN KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE

I just noticed that PaperReferences uses character varying(15) and Papers uses character(15). Stupid mistake of mine. Do you think this might cause the bad query planning? I will alter the table to use character(15) in both cases and see if that helps.
postgresql.conf:

max_connections = 20
shared_buffers = 256MB
work_mem = 10MB
maintenance_work_mem = 128MB
max_stack_depth = 4MB
synchronous_commit = off
wal_buffers = 1MB
checkpoint_segments = 10
effective_cache_size = 768MB
default_statistics_target = 200
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.simple'

The query I run:

SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE p.itemID = r.ItemID AND p.issueID = 'A1983PW823'

Query plan with seqscan enabled:

"Hash Join  (cost=512.71..17709356.53 rows=8283226 width=16)"
"  Hash Cond: ((r.itemid)::bpchar = p.itemid)"
"  ->  Seq Scan on paperreferences r  (cost=0.00..15110856.68 rows=670707968 width=16)"
"  ->  Hash  (cost=500.29..500.29 rows=994 width=16)"
"        ->  Index Scan using idx_papers_issueid on papers p  (cost=0.00..500.29 rows=994 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

Query plan with seqscan disabled:

"Hash Join  (cost=10000000280.88..10017668625.22 rows=4233278 width=16)"
"  Hash Cond: ((r.itemid)::bpchar = p.itemid)"
"  ->  Seq Scan on paperreferences r  (cost=10000000000.00..10015110856.68 rows=670707968 width=16)"
"  ->  Hash  (cost=274.53..274.53 rows=508 width=16)"
"        ->  Index Scan using idx_papers_issueid on papers p  (cost=0.00..274.53 rows=508 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

Do you need an EXPLAIN ANALYZE output? Since it takes so long I can't easily post one right now, but maybe I can get one overnight.

My hardware is an iMac running OS X 10.6.4 with 1.5 GB RAM and a 2.1 GHz (or so) Core 2 Duo processor.

Jann

On 23.08.10 14:08, Kevin Grittner wrote:
> Jann Röder wrote:
>> On 23.08.10 12:18, Scott Marlowe wrote:
>>> What happens if you try
>>>
>>> set enable_seqscan=off;
>>> (your query here)
>>
>> Tried that already. The query plan is exactly the same.
>
> Exactly? Not even the cost shown for the seq scan changed?
>
> You are almost certainly omitting some crucial piece of information
> in your report. Please look over this page and post a more complete
> report. In particular, please show the results of \d for both tables
> (or of pg_dump -s -t 'tablename'), your complete postgresql.conf
> stripped of comments, and a description of your hardware and OS.
>
> -Kevin
Joining on varchars is always going to be expensive; the longer the value, the more expensive it gets. Consider going for surrogate keys.
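(For readers unfamiliar with the suggestion, a surrogate-key version of this schema might look roughly like the sketch below. It is purely illustrative - the paper_id column is an invented name, only a few columns are shown, and as the replies below argue, it is not necessarily the right first step here.)

-- Hypothetical surrogate-key variant: join on a compact integer
-- instead of a 15-character string. Names here are illustrative only.
CREATE TABLE papers (
    paper_id bigserial PRIMARY KEY,          -- added surrogate key (hypothetical)
    itemid   character(15) NOT NULL UNIQUE,  -- natural key kept for lookups
    issueid  character(10) NOT NULL
    -- ... remaining columns as in the real table
);

CREATE TABLE paperreferences (
    paper_id       bigint  NOT NULL REFERENCES papers(paper_id) ON DELETE CASCADE,
    referenceindex integer NOT NULL,
    -- ... remaining columns as in the real table
    PRIMARY KEY (paper_id, referenceindex)
);

-- The join then compares 8-byte integers instead of 15-character strings:
SELECT p.itemid
FROM papers p
JOIN paperreferences r ON r.paper_id = p.paper_id
WHERE p.issueid = 'A1983PW823';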
Jann Röder <roederja@ethz.ch> wrote:

>                     Table "public.papers"
>      Column      |          Type           | Modifiers
> ------------------+-------------------------+-----------
>  itemid           | character(15)           | not null

> wos-db=> \d PaperReferences
>              Table "public.paperreferences"
>        Column       |         Type          | Modifiers
> --------------------+-----------------------+-----------
>  itemid             | character varying(15) | not null

> I just noticed that PaperReferences uses character varying (15)
> and Papers uses character(15). Stupid mistake of mine. Do you
> think this might cause the bad query planning?

Absolutely. These are *not* the same type and don't compare all that well.

> I will alter the table to use character(15) in both cases and see
> if that helps.

I suspect that making them the same will cure the problem, but I would recommend you make any character(n) columns character varying(n) instead of the other way around. The character(n) data type has many surprising behaviors and tends to perform worse. Avoid using it if possible.

> postgresql.conf:
> max_connections = 20
> shared_buffers = 256MB
> work_mem = 10MB
> maintenance_work_mem = 128MB
> max_stack_depth = 4MB
> synchronous_commit = off
> wal_buffers = 1MB
> checkpoint_segments = 10
> effective_cache_size = 768MB
> default_statistics_target = 200
> datestyle = 'iso, mdy'
> lc_messages = 'C'
> lc_monetary = 'C'
> lc_numeric = 'C'
> lc_time = 'C'
> default_text_search_config = 'pg_catalog.simple'

> Do you need an EXPLAIN ANALYZE output? Since it takes so long I
> can't easily post one right now, but maybe I can get one overnight.

Not necessary; you've already identified the cause and the fix.

> My hardware is an iMac running OS X 10.6.4 with 1.5 GB RAM and a
> 2.1 GHz (or so) Core 2 Duo processor.

OK. If you still don't get a good plan, you might want to try edging up effective_cache_size, if the sum of your shared_buffers and OS cache is larger than 768MB (which I would expect it might be). If the active part of your database (the part which is frequently referenced) fits within cache space, or even a significant portion of it fits, you might need to adjust random_page_cost and perhaps seq_page_cost to reflect the lower average cost of fetching from cache rather than disk -- but you want to fix your big problem (the type mismatch) first, and then see if you need further adjustments.

-Kevin
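(A hedged sketch of what that recommendation could look like in practice - converting papers.itemid to varchar rather than changing the other table to character(15). This is only one possible direction, and on a table this size the rewrite and foreign-key re-checks will take a while.)

-- Sketch: make the join columns the same type by moving papers.itemid
-- from character(15) to varchar(15), per the advice above, instead of
-- changing paperreferences.itemid to character(15).
-- Note: this rewrites the table and re-validates the foreign keys that
-- reference papers(itemid), so expect it to be slow.
ALTER TABLE papers ALTER COLUMN itemid TYPE varchar(15);

-- Refresh statistics afterwards so the planner has data for the new type.
ANALYZE papers;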
Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:

> Joining on varchars is always going to be expensive; the longer
> the value, the more expensive it gets. Consider going for
> surrogate keys.

Surrogate keys come with their own set of costs and introduce quite a few problems of their own. I don't want to start a flame war or go into an overly long diatribe on the evils of surrogate keys on this thread; suffice it to say that it's not the first thing to try here.

As an example of the performance we get using natural keys, with compound keys on almost every table, check out this 1.3TB database, being updated live by 3000 users as you view it:

http://wcca.wicourts.gov/

Some tables have hundreds of millions of rows. No partitioning.

-Kevin
On Mon, Aug 23, 2010 at 2:47 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
>
>> Joining on varchars is always going to be expensive; the longer
>> the value, the more expensive it gets. Consider going for
>> surrogate keys.
>
> Surrogate keys come with their own set of costs and introduce quite
> a few problems of their own. I don't want to start a flame war or
> go into an overly long diatribe on the evils of surrogate keys on
> this thread; suffice it to say that it's not the first thing to try
> here.
>
> As an example of the performance we get using natural keys, with
> compound keys on almost every table, check out this 1.3TB database,
> being updated live by 3000 users as you view it:
>
> http://wcca.wicourts.gov/
>
> Some tables have hundreds of millions of rows. No partitioning.

True, but as far as joining is concerned, joining on single-column fixed-length fields is always going to be a win. Hence surrogate keys make sense in this particular example - or the guy here should at least test it to see, rather than believe in one or the other.

-- 
GJ
Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:

> True, but as far as joining is concerned, joining on single-column
> fixed-length fields is always going to be a win. Hence surrogate
> keys make sense in this particular example - or the guy here should
> at least test it to see, rather than believe in one or the other.

How about we start by just having him use the same data type in both tables?

If you insist on getting into a discussion of the merits of surrogate keys, you need to look at not just this one query and its response time, where surrogate keys might give a percentage point or two of increase in performance, but at the integrity challenges they introduce, and at what happens when you've got dozens of other tables which would be containing the natural data, but which now need to navigate through particular linkage paths to get to it to generate summary reports and such. It's easy to construct a narrow case where a surrogate key is a short-term marginal win; it's just about as easy to show data corruption vulnerabilities and huge performance hits on complex queries when surrogate keys are used. They have a place, but it's a pretty narrow set of use cases in my book. For every place they're not used where they should be, there are at least 100 places they are used where they shouldn't be.

-Kevin
Excerpts from Jann Röder's message of Mon Aug 23 00:23:38 -0400 2010:

> "Hash Join  (cost=516.66..17710110.47 rows=8358225 width=16)"
> "  Hash Cond: ((b.itemid)::bpchar = a.itemid)"
> "  ->  Seq Scan on b  (cost=0.00..15110856.68 rows=670707968 width=16)"
> "  ->  Hash  (cost=504.12..504.12 rows=1003 width=16)"
> "        ->  Index Scan using idx_issueid on a  (cost=0.00..504.12 rows=1003 width=16)"
> "              Index Cond: (issueid = 'A1983PW823'::bpchar)"

Hmm, I'm placing bets on the bpchar weirdness. I'd try getting rid of that and using plain varchar for all the columns.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
I am not a fan of 'do this - this is best' responses to queries like that. Rather: this is what you should try, and choose whichever one suits you better. So, rather than 'natural keys ftw', I am giving him another option to choose from.

You see, in my world I was able to improve the performance of some large DBs more than tenfold by going for surrogate keys. But in those cases the joins were performed on 2+ varchar PK fields, and the whole thing was crawling. So, don't narrow it down to one solution because it worked for you. Keep an open book.
Oh, and I second using same types in joins especially, very much so :)
Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:

> So, don't narrow it down to one solution because it worked for you.
> Keep an open book.

What I was trying to do was advise on what would most directly fix the problem. Adding surrogate keys goes way beyond adding the columns and using them as keys, as I'm sure you're aware if you've done this on a large scale. I wouldn't tell someone not to ever use them; I would advise not to try them when there is a natural key unless there are problems which are not solved without them, as appears to have been the case with your database.

I may be a little bit over-sensitive on the topic, because I've seen so many people who consider it "wrong" to use natural keys on any table *ever*. About one out of every four or five programmers who gets hired here feels compelled to argue that we should add surrogate keys to all our tables for no reason beyond "it's the thing to do". I've been at this for 38 years, most of that as a consultant to a wide variety of businesses, government agencies, and NPOs; and in my experience it usually is *not* the right thing to do.

Don't worry -- when I see evidence that surrogate keys will solve a problem which has not yielded to more conservative solutions, I'll suggest using them.

-Kevin
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Hmm, I'm placing bets on the bpchar weirdness. I'd try getting rid of
> that and using plain varchar for all the columns.

That's certainly what's inhibiting it from considering an indexscan on the larger table. I'm not as convinced as the OP that a nestloop indexscan is really going to win compared to the hash plan, but if the comparison value is varchar then an index on a bpchar column is simply not useful --- at least not unless you stick an explicit cast into the query, so that the comparison will have bpchar rather than varchar semantics.

			regards, tom lane
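(As a hedged illustration of the cast idea: the direction of the cast decides which comparison semantics, and therefore which index, the planner can consider. With papers.itemid declared character(15) and paperreferences.itemid declared varchar(15), casting the papers side to varchar is one way to express the join so that the index on the big table is at least usable, without altering any columns. This is only a sketch; the planner may still decide the hash plan is cheaper.)

-- Sketch: cast the char(15) column so the join comparison is done with
-- varchar semantics, which the primary-key index on
-- paperreferences(itemid, referenceindex) can support.
SELECT p.itemid
FROM papers AS p
JOIN paperreferences AS r
  ON r.itemid = p.itemid::varchar
WHERE p.issueid = 'A1983PW823';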
> I may be a little bit over-sensitive on the topic, because I've seen
> so many people who consider it "wrong" to use natural keys on any
> table *ever*. About one out of every four or five programmers who
> gets hired here feels compelled to argue that we should add
> surrogate keys to all our tables for no reason beyond "it's the
> thing to do". I've been at this for 38 years, most of that as a
> consultant to a wide variety of businesses, government agencies, and
> NPOs; and in my experience it usually is *not* the right thing to do.
>
> Don't worry -- when I see evidence that surrogate keys will solve a
> problem which has not yielded to more conservative solutions, I'll
> suggest using them.
>
> -Kevin

Ah feeel your pain, brother. Been there, done that. In almost every case, those who make the "thou shalt always, and only, use a surrogate key" argument cite folks like Ambler as authoritative rather than folks like Date. The Kiddie Koder Krew are woefully uninformed about the history and development of RDBMSs, and about why some approaches are more intelligent than others. Ambler et al., coders all, have poisoned more minds than can be counted. Dijkstra called out BASIC and COBOL for polluting the minds of young coders. Allowing Ambler and the like to be "thought leaders" in RDBMS is just as polluting.

There, I said it.

-- Robert
On Mon, Aug 23, 2010 at 7:19 AM, Jann Röder <roederja@ethz.ch> wrote:
> Thanks for your help, here is the information you requested:
>
> Table information: A = Papers, B = PaperReferences
>
> wos-db=> \d Papers
>                     Table "public.papers"
>      Column      |          Type           | Modifiers
> ------------------+-------------------------+-----------
>  itemid           | character(15)           | not null
>
> wos-db=> \d PaperReferences
>              Table "public.paperreferences"
>        Column       |         Type          | Modifiers
> --------------------+-----------------------+-----------
>  itemid             | character varying(15) | not null
>
> I just noticed that PaperReferences uses character varying(15) and
> Papers uses character(15). Stupid mistake of mine. Do you think this
> might cause the bad query planning? I will alter the table to use
> character(15) in both cases and see if that helps.

Almost certainly it's not helping. If the planner doesn't choose an indexed lookup when you turn seq scans off, then either an index plan is WAY expensive (enable_seqscan=off just "tricks" the planner by pricing seq scans extremely high) or you don't have a useful index. When I asked whether they were the same type, and whether you'd tried with seqscan off, that's what I was getting at: the types might not match.

Now, it may or may not be much faster with an index scan, depending on your data distribution and the number of rows to be returned, but at least if they're the same type the planner has a choice. If they're not, it has no choice; it has to go with the seq scan.

Let us know how it runs when you've got the types matched up. BTW, I'd generally go with text over char or varchar, but that's just me.
So that took a while... I'm currently running ANALYZE on the PaperReferences table again (the one where I changed the data type).

The plan, however, is still the same:

"Hash Join  (cost=280.88..24330800.08 rows=670602240 width=16)"
"  Hash Cond: (r.itemid = p.itemid)"
"  ->  Seq Scan on paperreferences r  (cost=0.00..15109738.40 rows=670602240 width=64)"
"  ->  Hash  (cost=274.53..274.53 rows=508 width=16)"
"        ->  Index Scan using idx_papers_issueid on papers p  (cost=0.00..274.53 rows=508 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

But I can now force it to use an index scan instead of a seq scan:

"Merge Join  (cost=0.00..2716711476.57 rows=670602240 width=16)"
"  Merge Cond: (p.itemid = r.itemid)"
"  ->  Index Scan using papers_pkey on papers p  (cost=0.00..21335008.47 rows=508 width=16)"
"        Filter: (issueid = 'A1983PW823'::bpchar)"
"  ->  Index Scan using paperreferences_pkey on paperreferences r  (cost=0.00..2686993938.83 rows=670602240 width=64)"

Unfortunately this is not faster than the other one. I did not wait until it returned, because I want this query to take less than 5 seconds or so.

Here is my query again:

SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE p.itemID = r.ItemID AND p.issueID = 'A1983PW823';

I can also write it as:

SELECT ItemID FROM PaperReferences WHERE ItemID IN (SELECT ItemID FROM Papers WHERE IssueID = 'A1983PW823')

which is more what I would do if I were the database. Unfortunately this is not fast either:

"Hash Semi Join  (cost=280.88..24330800.08 rows=670602240 width=64)"
"  Hash Cond: (paperreferences.itemid = papers.itemid)"
"  ->  Seq Scan on paperreferences  (cost=0.00..15109738.40 rows=670602240 width=64)"
"  ->  Hash  (cost=274.53..274.53 rows=508 width=16)"
"        ->  Index Scan using idx_papers_issueid on papers  (cost=0.00..274.53 rows=508 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

The sub-query SELECT ItemID FROM Papers WHERE IssueID = 'A1983PW823' is really fast, though, and returns 16 rows. If I unroll the query by hand like this:

SELECT ItemID FROM PaperReferences WHERE
(ItemID = 'A1983PW82300001' OR
 ItemID = 'A1983PW82300002' OR
 ItemID = 'A1983PW82300003' OR
 ItemID = 'A1983PW82300004' OR
 ItemID = 'A1983PW82300005' OR
 ItemID = 'A1983PW82300006' OR
 ...)

(all the ORed stuff is the result of the sub-query) I get my result really fast. So what I need now is a way to tell Postgres to do it that way automatically. If everything else fails I will have to put that logic into my application in Java code, which I don't want to do, because then I will also have to remove my constraints so I can delete stuff at a reasonable speed.

Thanks,
Jann

On 23.08.10 15:33, Kevin Grittner wrote:
> Jann Röder <roederja@ethz.ch> wrote:
>
>>                     Table "public.papers"
>>      Column      |          Type           | Modifiers
>> ------------------+-------------------------+-----------
>>  itemid           | character(15)           | not null
>
>> wos-db=> \d PaperReferences
>>              Table "public.paperreferences"
>>        Column       |         Type          | Modifiers
>> --------------------+-----------------------+-----------
>>  itemid             | character varying(15) | not null
>
>> I just noticed that PaperReferences uses character varying(15)
>> and Papers uses character(15). Stupid mistake of mine. Do you
>> think this might cause the bad query planning?
>
> Absolutely. These are *not* the same type and don't compare all
> that well.
>
>> I will alter the table to use character(15) in both cases and see
>> if that helps.
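(As a hedged aside, the hand-unrolled OR list can be written more compactly by feeding the sub-query result into = ANY(...), which a btree index scan can also match against. This is just a sketch of the rewrite against the tables in this thread, and as the next message shows, the original join plan becomes fine on its own once ANALYZE has completed.)

-- Sketch: collect the handful of matching ItemIDs into an array and
-- probe paperreferences with = ANY, which paperreferences_pkey can serve.
SELECT r.itemid
FROM paperreferences r
WHERE r.itemid = ANY (
    ARRAY(SELECT p.itemid
          FROM papers p
          WHERE p.issueid = 'A1983PW823')
);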
Thanks everyone, the problem just solved itself. After the ANALYZE had finished, Postgres started doing what I wanted it to do all along:

EXPLAIN SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE p.itemID = r.ItemID AND p.issueID = 'A1983PW823';

"Nested Loop  (cost=0.00..4515980.97 rows=2071811 width=16)"
"  ->  Index Scan using idx_papers_issueid on papers p  (cost=0.00..274.53 rows=508 width=16)"
"        Index Cond: (issueid = 'A1983PW823'::bpchar)"
"  ->  Index Scan using paperreferences_pkey on paperreferences r  (cost=0.00..8838.21 rows=4078 width=16)"
"        Index Cond: (r.itemid = p.itemid)"

So thanks again. I'm starting to grasp the Postgres quirks :)

Jann

On 24.08.10 15:03, Jann Röder wrote:
> So that took a while... I'm currently running ANALYZE on the
> PaperReferences table again (the one where I changed the data type).
>
> The plan, however, is still the same:
> [...]
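(For anyone landing on this thread later: the fix, in short, was making the join columns the same type and then letting ANALYZE finish so the planner had statistics for the changed column. A hedged reconstruction of the sequence, following the direction the poster actually chose:)

-- Roughly what was done in this thread: make both join columns the same
-- type (here character(15), the direction the poster picked), then
-- refresh planner statistics for the rewritten column.
ALTER TABLE paperreferences ALTER COLUMN itemid TYPE character(15);
ANALYZE paperreferences;

-- After ANALYZE completes, the join can use a nested-loop plan with an
-- index scan on paperreferences_pkey, as shown in the plan above.
EXPLAIN
SELECT p.itemid
FROM papers p
JOIN paperreferences r ON r.itemid = p.itemid
WHERE p.issueid = 'A1983PW823';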