Thread: Inefficient query plan

Inefficient query plan

From
Jann Röder
Date:
I have two tables:
A: ItemID (PK), IssueID (Indexed)
B: ItemID (FK), IndexNumber : PK(ItemID, IndexNumber)

Both tables have several million rows, but B has many more than A.

Now if I run

SELECT A.ItemID FROM A, B WHERE A.ItemID = B.itemID AND A.issueID =
<some id>

The query takes extremely long (several hours). I ran EXPLAIN and got:

"Hash Join  (cost=516.66..17710110.47 rows=8358225 width=16)"
"  Hash Cond: ((b.itemid)::bpchar = a.itemid)"
"  ->  Seq Scan on b  (cost=0.00..15110856.68 rows=670707968 width=16)"
"  ->  Hash  (cost=504.12..504.12 rows=1003 width=16)"
"        ->  Index Scan using idx_issueid on a  (cost=0.00..504.12
rows=1003 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

Now we see the problem is the seq scan on B. However, there are only a
handful of rows in A that match a specific issueID. So my question is:
why doesn't it just check, for each of the ItemIDs in A that have the
correct IssueID, whether there is a matching itemID in B? This should be
really fast because ItemID in B is indexed, since it is part of the
primary key.

What is the reason for postgres not doing this, and is there a way I can
make it do that? I'm using PostgreSQL 8.4.4 and yes, I did run ANALYZE
on the entire DB.

I have
work_mem = 10MB
shared_buffers = 256MB
effective_cache_size = 768MB

The database is basically for a single user.

Thanks a lot,
Jann

Re: Inefficient query plan

From
Scott Marlowe
Date:
On Sun, Aug 22, 2010 at 10:23 PM, Jann Röder <roederja@ethz.ch> wrote:
> I have two tables:
> A: ItemID (PK), IssueID (Indexed)
> B: ItemID (FK), IndexNumber : PK(ItemID, IndexNumber)
>
> Both tables have several million rows, but B has many more than A.
>
> Now if I run
>
> SELECT A.ItemID FROM A, B WHERE A.ItemID = B.itemID AND A.issueID =
> <some id>
>
> The query takes extremely long (several hours). I ran EXPLAIN and got:
>
> "Hash Join  (cost=516.66..17710110.47 rows=8358225 width=16)"
> "  Hash Cond: ((b.itemid)::bpchar = a.itemid)"
> "  ->  Seq Scan on b  (cost=0.00..15110856.68 rows=670707968 width=16)"
> "  ->  Hash  (cost=504.12..504.12 rows=1003 width=16)"
> "        ->  Index Scan using idx_issueid on a  (cost=0.00..504.12
> rows=1003 width=16)"
> "              Index Cond: (issueid = 'A1983PW823'::bpchar)"

Have you tried adding an index on b.indexid?

Re: Inefficient query plan

From
Scott Marlowe
Date:
Also are a.indexid and b.indexid the same type?

Re: Inefficient query plan

From
Jann Röder
Date:
On 23.08.10 07:52, Scott Marlowe wrote:
> Also are a.indexid and b.indexid the same type?
>

You mean ItemID? Fields of the same name are of the same type - so yes.
According to the documentation pgsql adds indexes for primary keys
automatically so (b.itemID, b.indexNumber) is indexed. Or do you think
adding an extra index for b.itemID alone will help? If I understand the
documentation correctly, pgsql can use the first column of a
multi-column index as if it was indexed individually... but maybe I'm
multi-column index as if it was indexed individually... but maybe I'm
wrong here.
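
A quick check I could run (just a sketch -- the literal is a placeholder,
I haven't tried this against the real data yet):

EXPLAIN SELECT * FROM B WHERE ItemID = '<some item id>';
-- an index scan on B's primary key here would confirm that the leading
-- column of the composite index is usable on its own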

>> I have two tables:
>> A: ItemID (PK), IssueID (Indexed)
>> B: ItemID (FK), IndexNumber : PK(ItemID, IndexNumber)


Re: Inefficient query plan

From
Scott Marlowe
Date:
On Mon, Aug 23, 2010 at 4:15 AM, Jann Röder <roederja@ethz.ch> wrote:
> On 23.08.10 07:52, Scott Marlowe wrote:
>> Also are a.indexid and b.indexid the same type?
>>
>
> You mean ItemID? Fields of the same name are of the same type - so yes.
> According to the documentation pgsql adds indexes for primary keys
> automatically so (b.itemID, b.indexNumber) is indexed. Or do you think
> adding an extra index for b.itemID alone will help? If I understand the
> documentation correctly, pgsql can use the first column of a
> multi-column index as if it was indexed individually... but maybe I'm
> wrong here.

It can but that doesn't mean it will.  A multi-column index is often
quite a bit bigger than a single column one.

What happens if you try

set enable_seqscan=off;
(your query here)

Re: Inefficient query plan

From
Jann Röder
Date:
On 23.08.10 12:18, Scott Marlowe wrote:
> On Mon, Aug 23, 2010 at 4:15 AM, Jann Röder <roederja@ethz.ch> wrote:
>> On 23.08.10 07:52, Scott Marlowe wrote:
>>> Also are a.indexid and b.indexid the same type?
>>>
>>
>> You mean ItemID? Fields of the same name are of the same type - so yes.
>> According to the documentation pgsql adds indexes for primary keys
>> automatically so (b.itemID, b.indexNumber) is indexed. Or do you think
>> adding an extra index for b.itemID alone will help? If I understand the
>> documentation correctly, pgsql can use the first column of a
>> multi-column index as if it was indexed individually... but maybe I'm
>> wrong here.
>
> It can but that doesn't mean it will.  A multi-column index is often
> quite a bit bigger than a single column one.
>
> What happens if you try
>
> set enable_seqscan=off;
> (your query here)
>
Tried that already. The query plan is exactly the same.

Re: Inefficient query plan

From
"Kevin Grittner"
Date:
Jann Röder wrote:
On 23.08.10 12:18, Scott Marlowe wrote:

>> What happens if you try
>>
>> set enable_seqscan=off;
>> (your query here)
>>
> Tried that already. The query plan is exactly the same.

Exactly?  Not even the cost shown for the seq scan changed?

You are almost certainly omitting some crucial piece of information
in your report.  Please look over this page and post a more complete
report.  In particular, please show the results of \d for both tables
(or of pg_dump -s -t 'tablename'), your complete postgresql.conf file
stripped of comments, and a description of your hardware and OS.

-Kevin

Re: Inefficient query plan

From
Jann Röder
Date:
Thanks for your help,
here is the information you requested:

Table information: A = Papers, B = PaperReferences

wos-db=> \d Papers
                 Table "public.papers"
      Column      |          Type           | Modifiers
------------------+-------------------------+-----------
 itemid           | character(15)           | not null
 t9id             | integer                 |
 artn             | character varying       |
 doi              | character varying       |
 pii              | character varying       |
 unsp             | character varying       |
 issueid          | character(10)           | not null
 title            | character varying(1500) | not null
 titleenhancement | character varying(500)  |
 beginningpage    | character varying(19)   |
 pagecount        | integer                 | not null
 documenttype     | character(1)            | not null
 abstract         | text                    |
Indexes:
    "papers_pkey" PRIMARY KEY, btree (itemid)
    "idx_papers_issueid" btree (issueid)
Foreign-key constraints:
    "papers_issueid_fkey" FOREIGN KEY (issueid) REFERENCES
journals(issueid) ON DELETE CASCADE
Referenced by:
    TABLE "authorkeywords" CONSTRAINT "authorkeywords_itemid_fkey"
FOREIGN KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE
    TABLE "authors" CONSTRAINT "authors_itemid_fkey" FOREIGN KEY
(itemid) REFERENCES papers(itemid) ON DELETE CASCADE
    TABLE "grantnumbers" CONSTRAINT "grantnumbers_itemid_fkey" FOREIGN
KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE
    TABLE "keywordsplus" CONSTRAINT "keywordsplus_itemid_fkey" FOREIGN
KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE
    TABLE "languages" CONSTRAINT "languages_itemid_fkey" FOREIGN KEY
(itemid) REFERENCES papers(itemid) ON DELETE CASCADE
    TABLE "paperreferences" CONSTRAINT "paperreferences_fromitemid_fkey"
FOREIGN KEY (itemid) REFERENCES papers(itemid) ON DELETE CASCADE

wos-db=> \d PaperReferences
             Table "public.paperreferences"
       Column       |         Type          | Modifiers
--------------------+-----------------------+-----------
 itemid             | character varying(15) | not null
 t9id               | integer               |
 citedauthor        | character varying(75) |
 citedartn          | character varying     |
 citeddoi           | character varying     |
 citedpii           | character varying     |
 citedunsp          | character varying     |
 citedreferenceyear | integer               |
 citedtitle         | character varying(20) | not null
 citedvolume        | character varying(4)  |
 citedpage          | character varying(5)  |
 referenceindex     | integer               | not null
Indexes:
    "paperreferences_pkey" PRIMARY KEY, btree (itemid, referenceindex)
Foreign-key constraints:
    "paperreferences_fromitemid_fkey" FOREIGN KEY (itemid) REFERENCES
papers(itemid) ON DELETE CASCADE

I just noticed that PaperReferences uses character varying (15) and
Papers uses character(15). Stupid mistake of mine. Do you think this
might cause the bad query planning? I will alter the table to use
character(15) in both cases and see if that helps.
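
Concretely, something like this is what I have in mind (a sketch only -- I
expect it to rewrite the whole table, so it will take a while on 600M+ rows):

ALTER TABLE PaperReferences ALTER COLUMN itemid TYPE character(15);
ANALYZE PaperReferences;  -- refresh statistics after the rewrite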

postgresql.conf:
max_connections = 20
shared_buffers = 256MB
work_mem = 10MB
maintenance_work_mem = 128MB
max_stack_depth = 4MB
synchronous_commit = off
wal_buffers = 1MB
checkpoint_segments = 10
effective_cache_size = 768MB
default_statistics_target = 200
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.simple'

The query I run:
SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE p.itemID =
r.ItemID AND p.issueID = 'A1983PW823'

Query plan with seqscan enabled:

"Hash Join  (cost=512.71..17709356.53 rows=8283226 width=16)"
"  Hash Cond: ((r.itemid)::bpchar = p.itemid)"
"  ->  Seq Scan on paperreferences r  (cost=0.00..15110856.68
rows=670707968 width=16)"
"  ->  Hash  (cost=500.29..500.29 rows=994 width=16)"
"        ->  Index Scan using idx_papers_issueid on papers p
(cost=0.00..500.29 rows=994 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

Query plan with seqscan disabled:

"Hash Join  (cost=10000000280.88..10017668625.22 rows=4233278 width=16)"
"  Hash Cond: ((r.itemid)::bpchar = p.itemid)"
"  ->  Seq Scan on paperreferences r
(cost=10000000000.00..10015110856.68 rows=670707968 width=16)"
"  ->  Hash  (cost=274.53..274.53 rows=508 width=16)"
"        ->  Index Scan using idx_papers_issueid on papers p
(cost=0.00..274.53 rows=508 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

Do you need an EXPLAIN ANALYZE output? Since it takes so long I can't
easily post one right now. But maybe I can get one over night.

My Hardware is an iMac running OS X 10.6.4 with 1.5 GB RAM and a 2.1 GHz
(or so) core 2 Duo processor.

Jann

Re: Inefficient query plan

From
Grzegorz Jaśkiewicz
Date:
Joining on varchars is always going to be very expensive. The longer
the value is, the more expensive it will be. Consider going for
surrogate keys.
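
Something along these lines, for instance (only a sketch, with an invented
column name -- back-filling and re-indexing tables this size is a project of
its own):

ALTER TABLE Papers ADD COLUMN paper_id bigserial;        -- surrogate key
ALTER TABLE Papers ADD UNIQUE (paper_id);
ALTER TABLE PaperReferences ADD COLUMN paper_id bigint;  -- back-fill via itemid
-- then add the FK and join on the integer column instead of the char key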

Re: Inefficient query plan

From
"Kevin Grittner"
Date:
Jann Röder <roederja@ethz.ch> wrote:

>                  Table "public.papers"
>       Column      |          Type           | Modifiers
> ------------------+-------------------------+-----------
>  itemid           | character(15)           | not null

> wos-db=> \d PaperReferences
>              Table "public.paperreferences"
>        Column       |         Type          | Modifiers
> --------------------+-----------------------+-----------
>  itemid             | character varying(15) | not null

> I just noticed that PaperReferences uses character varying (15)
> and Papers uses character(15). Stupid mistake of mine. Do you
> think this might cause the bad query planning?

Absolutely.  These are *not* the same type and don't compare all
that well.

> I will alter the table to use character(15) in both cases and see
> if that helps.

I suspect that making them the same will cure the problem, but I
would recommend you make any character(n) columns character
varying(n) instead of the other way around.  The character(n)
data type has many surprising behaviors and tends to perform worse.
Avoid using it if possible.
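
For example (a sketch; like any type change this rewrites the table, and
referenced/referencing columns should be kept in step):

ALTER TABLE Papers ALTER COLUMN itemid TYPE varchar(15);
-- repeat for the other character(n) columns (issueid, documenttype, ...)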

> postgresql.conf:
> max_connections = 20
> shared_buffers = 256MB
> work_mem = 10MB
> maintenance_work_mem = 128MB
> max_stack_depth = 4MB
> synchronous_commit = off
> wal_buffers = 1MB
> checkpoint_segments = 10
> effective_cache_size = 768MB
> default_statistics_target = 200
> datestyle = 'iso, mdy'
> lc_messages = 'C'
> lc_monetary = 'C'
> lc_numeric = 'C'
> lc_time = 'C'
> default_text_search_config = 'pg_catalog.simple'

> Do you need an EXPLAIN ANALYZE output? Since it takes so long I
> can't easily post one right now. But maybe I can get one over
> night.

Not necessary; you've already identified the cause and the fix.

> My Hardware is an iMac running OS X 10.6.4 with 1.5 GB RAM and a
> 2.1 GHz (or so) core 2 Duo processor.

OK.  If you still don't get a good plan, you might want to try
edging up effective_cache_size, if the sum of your shared_buffers
and OS cache is larger than 768MB (which I would expect it might
be).  If the active part of your database (the part which is
frequently referenced) fits within cache space, or even a
significant portion of it fits, you might need to adjust
random_page_cost and perhaps seq_page_cost to reflect the lower
average cost of fetching from cache rather than disk -- but you want
to fix your big problem (the type mismatch) first, and then see if
you need further adjustments.
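
For example (illustrative numbers only, not a prescription for your machine):

effective_cache_size = 1GB   # roughly shared_buffers plus the OS cache
random_page_cost = 2.0       # below the 4.0 default when most reads hit cache
seq_page_cost = 1.0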

-Kevin

Re: Inefficient query plan

From
"Kevin Grittner"
Date:
Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:

> Joining on varchars is always going to be very expensive. The longer
> the value is, the more expensive it will be. Consider going for
> surrogate keys.

Surrogate keys come with their own set of costs and introduce quite
a few problems of their own.  I don't want to start a flame war or
go into an overly long diatribe on the evils of surrogate keys on
this thread; suffice it to say that it's not the first thing to try
here.

As an example of the performance we get using natural keys, with
compound keys on almost every table, check out this 1.3TB database,
being updated live by 3000 users as you view it:

http://wcca.wicourts.gov/

Some tables have hundreds of millions of rows.  No partitioning.

-Kevin

Re: Inefficient query plan

From
Grzegorz Jaśkiewicz
Date:
On Mon, Aug 23, 2010 at 2:47 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
>
>> Joining on varchars is always going to be very expensive. The longer
>> the value is, the more expensive it will be. Consider going for
>> surrogate keys.
>
> Surrogate keys come with their own set of costs and introduce quite
> a few problems of their own.  I don't want to start a flame war or
> go into an overly long diatribe on the evils of surrogate keys on
> this thread; suffice it to say that it's not the first thing to try
> here.
>
> As an example of the performance we get using natural keys, with
> compound keys on almost every table, check out this 1.3TB database,
> being updated live by 3000 users as you view it:
>
> http://wcca.wicourts.gov/
>
> Some tables have hundreds of millions of rows.  No partitioning.
>

True, but as far as joining is concerned, joining on single-column
fixed-length fields is always going to be a win. That's why surrogate
keys make sense in this particular example, or the guy here should at
least test it to see, rather than believe in one or the other.


--
GJ

Re: Inefficient query plan

From
"Kevin Grittner"
Date:
Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:

> True, but as far as joining is concerned, joining on single-column
> fixed-length fields is always going to be a win. That's why surrogate
> keys make sense in this particular example, or the guy here should at
> least test it to see, rather than believe in one or the other.

How about we start by just having him use the same data type in both
tables?

If you insist on getting into a discussion of the merits of
surrogate keys, you need to look at not just this one query and its
response time, where surrogate keys might give a percentage point or
two increase in performance, but at the integrity challenges they
introduce, and at what happens when you've got dozens of other
tables which would be containing the natural data, but which now
need to navigate through particular linkage paths to get to it to
generate summary reports and such.  It's easy to construct a narrow
case where a surrogate key is a short-term marginal win; it's just
about as easy to show data corruption vulnerabilities and huge
performance hits on complex queries when surrogate keys are used.
They have a place, but it's a pretty narrow set of use-cases in my
book.  For every place they're not used where they should be, there
are at least 100 places they are used where they shouldn't be.

-Kevin

Re: Inefficient query plan

From
Alvaro Herrera
Date:
Excerpts from Jann Röder's message of Mon Aug 23 00:23:38 -0400 2010:

> "Hash Join  (cost=516.66..17710110.47 rows=8358225 width=16)"
> "  Hash Cond: ((b.itemid)::bpchar = a.itemid)"
> "  ->  Seq Scan on b  (cost=0.00..15110856.68 rows=670707968 width=16)"
> "  ->  Hash  (cost=504.12..504.12 rows=1003 width=16)"
> "        ->  Index Scan using idx_issueid on a  (cost=0.00..504.12
> rows=1003 width=16)"
> "              Index Cond: (issueid = 'A1983PW823'::bpchar)"

Hmm, I'm placing bets on the bpchar weirdness.  I'd try getting rid of
that and using plain varchar for all the columns.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Inefficient query plan

From
Grzegorz Jaśkiewicz
Date:
I am not a fan of 'do this - this is best' responses to queries like that.
Rather: this is what you should try, and choose whichever one suits you better.
So, rather than 'natural keys ftw', I am giving him another option to
choose from.

You see, in my world, I was able to improve some large DBs' performance
more than tenfold by going for surrogate keys. But in those cases, joins
were performed on 2+ varchar PK fields, and the whole thing was
crawling.

So, don't narrow down to one solution because it worked for you. Keep
an open book.

Re: Inefficient query plan

From
Grzegorz Jaśkiewicz
Date:
Oh, and I second using the same types in joins especially, very much so :)

Re: Inefficient query plan

From
"Kevin Grittner"
Date:
Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:

> So, don't narrow down to one solution because it worked for you.
> Keep an open book.

What I was trying to do was advise on what would most directly fix
the problem.  Adding surrogate keys goes way beyond adding the
columns and using them as keys, as I'm sure you're aware if you've
done this on a large scale.  I wouldn't tell someone not to ever use
them; I would advise not to try them when there is a natural key
unless there are problems which are not solved without them, as
appears to have been the case with your database.

I may be a little bit over-sensitive on the topic, because I've seen
so many people who consider it "wrong" to use natural keys on any
table *ever*.  About one out of every four or five programmers who
gets hired here feels compelled to argue that we should add
surrogate keys to all our tables for no reason beyond "it's the
thing to do".  I've been at this for 38 years, most of that as a
consultant to a wide variety of businesses, government agencies, and
NPOs; and in my experience it usually is *not* the right thing to
do.

Don't worry -- when I see evidence that surrogate keys will solve a
problem which has not yielded to more conservative solutions, I'll
suggest using them.

-Kevin

Re: Inefficient query plan

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Hmm, I'm placing bets on the bpchar weirdness.  I'd try getting rid of
> that and using plain varchar for all the columns.

That's certainly what's inhibiting it from considering an indexscan
on the larger table.  I'm not as convinced as the OP that a nestloop
indexscan is really going to win compared to the hash plan, but if
the comparison value is varchar then an index on a bpchar column
is simply not useful --- at least not unless you stick an explicit
cast into the query, so that the comparison will have bpchar rather
than varchar semantics.

            regards, tom lane

Re: Inefficient query plan

From
Date:
>I may be a little bit over-sensitive on the topic, because I've seen
>so many people who consider it "wrong" to use natural keys on any
>table *ever*.  About one out of every four or five programmers who
>gets hired here feels compelled to argue that we should add
>surrogate keys to all our tables for no reason beyond "it's the
>thing to do".  I've been at this for 38 years, most of that as a
>consultant to a wide variety of businesses, government agencies, and
>NPOs; and in my experience it usually is *not* the right thing to
>do.
>
>Don't worry -- when I see evidence that surrogate keys will solve a
>problem which has not yielded to more conservative solutions, I'll
>suggest using them.
>
>-Kevin
>

Ah feeel your pain, brother.  Been there, done that.  In almost every case, those who make the "thou shalt always, and
only, use a surrogate key" argument cite folks like Ambler as authoritative rather than folks like Date.  The Kiddie
Koder Krew are woefully uninformed about the history and development of RDBMS, and why some approaches are more
intelligent than others.  Ambler, et al, coders all, have poisoned more minds than can be counted.

Dijkstra called out BASIC and COBOL for polluting the minds of young coders.  Allowing Ambler and the like to be
"thought leaders" in RDBMS is just as polluting.

There, I said it.

-- Robert

Re: Inefficient query plan

From
Scott Marlowe
Date:
On Mon, Aug 23, 2010 at 7:19 AM, Jann Röder <roederja@ethz.ch> wrote:
> Thanks for your help,
> here is the information you requested:
>
> Table information: A = Papers, B = PaperReferences
>
> wos-db=> \d Papers
>                 Table "public.papers"
>      Column      |          Type           | Modifiers
> ------------------+-------------------------+-----------
>  itemid           | character(15)           | not null

> wos-db=> \d PaperReferences
>             Table "public.paperreferences"
>       Column       |         Type          | Modifiers
> --------------------+-----------------------+-----------
>  itemid             | character varying(15) | not null
>
> I just noticed that PaperReferences uses character varying (15) and
> Papers uses character(15). Stupid mistake of mine. Do you think this
> might cause the bad query planning? I will alter the table to use
> character(15) in both cases and see if that helps.

Almost certainly it's not helping.  If the planner doesn't choose an
indexed lookup when you turn seq scans off, then either an index plan
is WAY expensive (turning seq scans off only tricks the planner by
setting their cost to something very high) or you don't have a
useful index.

When I asked if they were the same and if you'd tried with seqscan off,
that's what I was getting at: that the types might not match.

Now, it may or may not be much faster with an index scan, depending on
your data distribution and the number of rows to be returned, but at
least if they're the same type the planner has a choice.  If they're
not, it has no choice, it has to go with the seq scan.

Let us know how it runs when you've got the types matched up.  BTW,
I'd generally go with text over char or varchar, but that's just me.

Re: Inefficient query plan

From
Jann Röder
Date:
So that took a while... I'm currently running ANALYZE on the
PaperReferences table again (the one where I changed the data type).

The plan however is still the same:
"Hash Join  (cost=280.88..24330800.08 rows=670602240 width=16)"
"  Hash Cond: (r.itemid = p.itemid)"
"  ->  Seq Scan on paperreferences r  (cost=0.00..15109738.40
rows=670602240 width=64)"
"  ->  Hash  (cost=274.53..274.53 rows=508 width=16)"
"        ->  Index Scan using idx_papers_issueid on papers p
(cost=0.00..274.53 rows=508 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

But I can now force it to use an index scan instead of a seq scan:
"Merge Join  (cost=0.00..2716711476.57 rows=670602240 width=16)"
"  Merge Cond: (p.itemid = r.itemid)"
"  ->  Index Scan using papers_pkey on papers p  (cost=0.00..21335008.47
rows=508 width=16)"
"        Filter: (issueid = 'A1983PW823'::bpchar)"
"  ->  Index Scan using paperreferences_pkey on paperreferences r
(cost=0.00..2686993938.83 rows=670602240 width=64)"

Unfortunately this is not faster than the other one. I did not wait
until it returned because I want this query to take less than 5 seconds
or so.

Here is my query again:
SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE p.itemID =
r.ItemID AND p.issueID = 'A1983PW823';

I can also write it as:
SELECT ItemID FROM PaperReferences WHERE ItemID IN (SELECT ItemID FROM
Papers WHERE IssueID = 'A1983PW823')

Which is more what I would do if I was the database. Unfortunately this
is not fast either:

"Hash Semi Join  (cost=280.88..24330800.08 rows=670602240 width=64)"
"  Hash Cond: (paperreferences.itemid = papers.itemid)"
"  ->  Seq Scan on paperreferences  (cost=0.00..15109738.40
rows=670602240 width=64)"
"  ->  Hash  (cost=274.53..274.53 rows=508 width=16)"
"        ->  Index Scan using idx_papers_issueid on papers
(cost=0.00..274.53 rows=508 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

The sub-query SELECT ItemID FROM Papers WHERE IssueID = 'A1983PW823' is
really fast, though, and returns 16 rows. If I unroll the query by hand
like this:
SELECT ItemID FROM PaperReferences WHERE
(ItemID = 'A1983PW82300001' OR
ItemID = 'A1983PW82300002' OR
ItemID = 'A1983PW82300003' OR
ItemID = 'A1983PW82300004' OR
ItemID = 'A1983PW82300005' OR
ItemID = 'A1983PW82300006' OR
...)

(All the ORed stuff is the result of the sub-query) I get my result
really fast. So what I need now is a way to tell postgres to do it that
way automatically. If everything else fails I will have to put that
logic into my application in Java code, which I don't want to do because
then I will also have to remove my constraints so I can delete stuff at
a reasonable speed.
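
Maybe something like this would express that without unrolling by hand (an
untested sketch on my part):

SELECT ItemID FROM PaperReferences
WHERE ItemID = ANY (ARRAY(SELECT ItemID FROM Papers
                          WHERE IssueID = 'A1983PW823'));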

Thanks,
Jann


Re: Inefficient query plan

From
Jann Röder
Date:
Thanks everyone,
the problem just solved itself. After the ANALYZE had finished, postgres
started doing what I wanted it to do all along:
EXPLAIN SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE
p.itemID = r.ItemID AND p.issueID = 'A1983PW823';

"Nested Loop  (cost=0.00..4515980.97 rows=2071811 width=16)"
"  ->  Index Scan using idx_papers_issueid on papers p
(cost=0.00..274.53 rows=508 width=16)"
"        Index Cond: (issueid = 'A1983PW823'::bpchar)"
"  ->  Index Scan using paperreferences_pkey on paperreferences r
(cost=0.00..8838.21 rows=4078 width=16)"
"        Index Cond: (r.itemid = p.itemid)"

So thanks again. I'm starting to grasp the postgres quirks :)

Jann
