Re: Query performance - Mailing list pgsql-performance

From S Arvind
Subject Re: Query performance
Date
Msg-id abf9211d0910120623i49d3789ah38d2ac2ff9fe4e2d@mail.gmail.com
Whole thread Raw
In response to Re: Query performance  (Matthew Wakeling <matthew@flymine.org>)
Responses Re: Query performance
List pgsql-performance
Thanks very much Matthew , its more then my expectation...

Without changing the query is there any way to optimize it, like by changing the pg configuration for handling these kind queries?

-Arvind S


On Mon, Oct 12, 2009 at 6:31 PM, Matthew Wakeling <matthew@flymine.org> wrote:
On Mon, 12 Oct 2009, S Arvind wrote:
I can understand left join, actually can any one tell me why sort operation is carried
out and wat Materialize means...
Can anyone explain me the mentioned plan with reason(s)?

 Merge Left Join  (cost=62451.86..67379.08 rows=286789 width=0)
    Merge Cond: (a.id = b.id)

    ->  Sort  (cost=18610.57..18923.27 rows=125077 width=8)
         Sort Key: a.id
         ->  Seq Scan on a  (cost=0.00..6309.77 rows=125077 width=8)

    ->  Materialize  (cost=43841.28..47426.15 rows=286789 width=8)
         ->  Sort  (cost=43841.28..44558.26 rows=286789 width=8)
             Sort Key: b.id
            ->  Seq Scan on b (cost=0.00..13920.89 rows=286789 width=8)

This is a merge join. A merge join joins together two streams of data, where both streams are sorted, by placing the two streams side by side and advancing through both streams finding matching rows. The algorithm can use a pointer to a position in both of the streams, and advance the pointer of the stream that has the earlier value according to the sort order, and therefore get all the matches.

You are performing a query over the whole of both of the tables, so the cheapest way to obtain a sorted stream of data is to do a full sequential scan of the whole table, bring it into memory, and sort it. An alternative would be to follow a B-tree index if one was available on the correct column, but that is usually more expensive unless the table is clustered on the index or only a small portion of the table is to be read. If you had put a "LIMIT 10" clause on the end of the query and had such an index, it would probably switch to that strategy instead.

The materialise step is effectively a buffer that allows one of the streams to be rewound cheaply, which will be necessary if there are multiple rows with the same value.

Does that answer your question?

Matthew

--
The only secure computer is one that's unplugged, locked in a safe,
and buried 20 feet under the ground in a secret location...and i'm not
even too sure about that one.                         --Dennis Huges, FBI

pgsql-performance by date:

Previous
From: Matthew Wakeling
Date:
Subject: Re: Query performance
Next
From: Grzegorz Jaśkiewicz
Date:
Subject: Re: Query performance