Re: Inefficient query plan - Mailing list pgsql-performance

From Jann Röder
Subject Re: Inefficient query plan
Date
Msg-id i50fuk$vg4$1@dough.gmane.org
Whole thread Raw
In response to Re: Inefficient query plan  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: Inefficient query plan
List pgsql-performance
So that took a while... I'm currently running ANALYZE on the
PaperReferences table again (the one where I changed the data type).

The plan however is still the same:
"Hash Join  (cost=280.88..24330800.08 rows=670602240 width=16)"
"  Hash Cond: (r.itemid = p.itemid)"
"  ->  Seq Scan on paperreferences r  (cost=0.00..15109738.40
rows=670602240 width=64)"
"  ->  Hash  (cost=274.53..274.53 rows=508 width=16)"
"        ->  Index Scan using idx_papers_issueid on papers p
(cost=0.00..274.53 rows=508 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

But I can now force it to use an index scan instead of a seqScan:
"Merge Join  (cost=0.00..2716711476.57 rows=670602240 width=16)"
"  Merge Cond: (p.itemid = r.itemid)"
"  ->  Index Scan using papers_pkey on papers p  (cost=0.00..21335008.47
rows=508 width=16)"
"        Filter: (issueid = 'A1983PW823'::bpchar)"
"  ->  Index Scan using paperreferences_pkey on paperreferences r
(cost=0.00..2686993938.83 rows=670602240 width=64)"

Unfortunately this is not faster than the other one. I did not wait
until it returned because I want this query to take less than 5 seconds
or so.

Here is my query again:
SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE p.itemID =
r.ItemID AND p.issueID = 'A1983PW823';

I can also write it as:
SELECT ItemID FROM PaperReferences WHERE ItemID IN (SELECT ItemID FROM
Papers WHERE IssueID = 'A1983PW823')

Which is more what I would do if I was the database. Unfortunately this
is not fast either:

"Hash Semi Join  (cost=280.88..24330800.08 rows=670602240 width=64)"
"  Hash Cond: (paperreferences.itemid = papers.itemid)"
"  ->  Seq Scan on paperreferences  (cost=0.00..15109738.40
rows=670602240 width=64)"
"  ->  Hash  (cost=274.53..274.53 rows=508 width=16)"
"        ->  Index Scan using idx_papers_issueid on papers
(cost=0.00..274.53 rows=508 width=16)"
"              Index Cond: (issueid = 'A1983PW823'::bpchar)"

The sub-query SELECT ItemID FROM Papers WHERE IssueID = 'A1983PW823' is
really fast, though and returns 16 rows. If I unroll the query by hand
like this:
SELECT ItemID FROM PaperReferences WHERE
(ItemID = 'A1983PW82300001' OR
ItemID = 'A1983PW82300002' OR
ItemID = 'A1983PW82300003' OR
ItemID = 'A1983PW82300004' OR
ItemID = 'A1983PW82300005' OR
ItemID = 'A1983PW82300006' OR
...)

(All the ORed stuff is the result of the sub-query) I get my result
really fast. So what I need now is a way to tell postgres to do it that
way automatically. If everything else fails I will have to put that
logic into my application in java code, which I don't want to do because
then I will also have to remove my constraints so I can delete stuff at
a reasonable speed.

Thanks,
Jann


Am 23.08.10 15:33, schrieb Kevin Grittner:
> Jann Röder<roederja@ethz.ch> wrote:
>
>>                  Table "public.papers"
>>       Column      |          Type           | Modifiers
>> ------------------+-------------------------+-----------
>>  itemid           | character(15)           | not null
>
>> wos-db=> \d PaperReferences
>>              Table "public.paperreferences"
>>        Column       |         Type          | Modifiers
>> --------------------+-----------------------+-----------
>>  itemid             | character varying(15) | not null
>
>> I just noticed that PaperReferences uses character varying (15)
>> and Papers uses character(15). Stupid mistake of mine. Do you
>> think this might cause the bad query planning?
>
> Absolutely.  These are *not* the same type and don't compare all
> that well.
>
>> I will alter the table to use character(15) in both cases and see
>> if that helps.
>
> I suspect that making them the same will cure the problem, but I
> would recommend you make any character(n) columns character
> varying(n) instead of the other way around.  The the character(n)
> data type has many surprising behaviors and tends to perform worse.
> Avoid using it if possible.
>
>> postgresql.conf:
>> max_connections = 20
>> shared_buffers = 256MB
>> work_mem = 10MB
>> maintenance_work_mem = 128MB
>> max_stack_depth = 4MB
>> synchronous_commit = off
>> wal_buffers = 1MB
>> checkpoint_segments = 10
>> effective_cache_size = 768MB
>> default_statistics_target = 200
>> datestyle = 'iso, mdy'
>> lc_messages = 'C'
>> lc_monetary = 'C'
>> lc_numeric = 'C'
>> lc_time = 'C'
>> default_text_search_config = 'pg_catalog.simple'
>
>> Do you need an EXPLAIN ANALYZE output? Since it takes so long I
>> can't easily post one right now. But maybe I can get one over
>> night.
>
> Not necessary; you've already identified the cause and the fix.
>
>> My Hardware is an iMac running OS X 10.6.4 with 1.5 GB RAM and a
>> 2.1 GHz (or so) core 2 Duo processor.
>
> OK.  If you still don't get a good plan, you might want to try
> edging up effective_cache_size, if the sum of your shared_buffers
> and OS cache is larger than 768MB (which I would expect it might
> be).  If the active part of your database (the part which is
> frequently referenced) fits within cache space, or even a
> significant portion of it fits, you might need to adjust
> random_page_cost and perhaps seq_page_cost to reflect the lower
> average cost of fetching from cache rather than disk -- but you want
> to fix your big problem (the type mismatch) first, and then see if
> you need further adjustments.
>
> -Kevin
>


pgsql-performance by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: PARSE WAITING
Next
From: Jann Röder
Date:
Subject: Re: Inefficient query plan