Thread: optimising UNION performance

optimising UNION performance

From
Rafal Pietrak
Date:
Hi all,

Is there a way to speed up the query to my 'grand total' logfile,
constructed as a UNION of smaller (specialised) logfiles?

Access to log1/log2 is quick (If I'm reading ANALYSE log correctly, it's
c.a. 100ms each - and it feels like that, so presumebly I'm reading
ANALYSE just OK), but the UNION is quite slow (3.5 sec, which I can
confirm - it's the real response time of the server in that case)

Again, if I'm reading the ANALYSE correctly (and I'm not really sure of
that), it looks like the performence is hit by the sort performed by the
UNION. The sort is not really needed in this case and don't really know
why it's there (my comlog VIEW definition does not request one - or does
it implicitly?).

One point not shown in the attachment, is that 'brands' and 'clty' are
foreing references. Should foreign reference matter here?

Can someone give me a hint on how to write a VIEW, that returns
concatenation of both log-tables within a sum of their respective access
times (as of current, that would be c.a. 200ms)

The ANALYSE is taken from postgres v8.1.4 installed from *.deb on
Debian-sid, hosted by 800MHz/512MB Duron machine. It really strickes me,
that 20k+30k rows is returned in 3.5sec!

--
-R


Attachment

Re: optimising UNION performance

From
Peter Eisentraut
Date:
Am Montag, 28. August 2006 10:23 schrieb Rafal Pietrak:
> Is there a way to speed up the query to my 'grand total' logfile,
> constructed as a UNION of smaller (specialised) logfiles?

If it is sufficient for your purpose, you will find UNION ALL to be
significantly faster.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: optimising UNION performance

From
Ragnar
Date:
On mán, 2006-08-28 at 10:23 +0200, Rafal Pietrak wrote:
> Hi all,
>
> Is there a way to speed up the query to my 'grand total' logfile,
> constructed as a UNION of smaller (specialised) logfiles?
>

I do not know if this is relevant to your case, but
possibly you can use a UNION ALL instead of a UNION.

In many cases the UNION ALL gives petter performance,
as the unique step can be skipped, as well as a sort
needed by the unique.

gnari



Re: optimising UNION performance

From
Rafal Pietrak
Date:
Thank you All for explanations. Looks loke that's what I was looking
for.

UNION ALL is quite satisfactory (830ms).

And yet, somwhere I loose c.a. 600ms (as compared to 120ms+80ms of each
respective 'raw' subquery).... which as percentage seem signifficant.
Does anybody know where the processing goes now?

Currently, the ANALYSE looks like this:
                                                            QUERY
PLAN

----------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan comlog  (cost=0.00..2269.71 rows=51400 width=0) (actual
time=0.053..755.649 rows=51400 loops=1)
   ->  Append  (cost=0.00..1755.71 rows=51400 width=59) (actual
time=0.048..607.437 rows=51400 loops=1)
->  Subquery Scan "*SELECT* 1"  (cost=0.00..1084.61 rows=30916 width=59)
(actual time=0.046..278.802 rows=30916 loops=1)
->  Seq Scan on log1 c  (cost=0.00..775.45 rows=30916 width=59) (actual
time=0.042..170.193 rows=30916 loops=1)
->  Subquery Scan "*SELECT* 2"  (cost=0.00..671.10 rows=20484 width=26)
(actual time=0.055..200.223 rows=20484 loops=1)
->  Seq Scan on log2 s  (cost=0.00..466.26 rows=20484 width=26) (actual
time=0.044..127.301 rows=20484 loops=1)
Total runtime: 822.901 ms
(7 rows)
-----------------------------------------

On Mon, 2006-08-28 at 09:11 +0000, Ragnar wrote:
> On mán, 2006-08-28 at 10:23 +0200, Rafal Pietrak wrote:
> > Hi all,
> >
> > Is there a way to speed up the query to my 'grand total' logfile,
> > constructed as a UNION of smaller (specialised) logfiles?
> >
>
> I do not know if this is relevant to your case, but
> possibly you can use a UNION ALL instead of a UNION.
>
> In many cases the UNION ALL gives petter performance,
> as the unique step can be skipped, as well as a sort
> needed by the unique.
>
> gnari
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
--
-R

Re: optimising UNION performance

From
Alban Hertroys
Date:
Rafal Pietrak wrote:
> Thank you All for explanations. Looks loke that's what I was looking
> for.
>
> UNION ALL is quite satisfactory (830ms).
>
> And yet, somwhere I loose c.a. 600ms (as compared to 120ms+80ms of each
> respective 'raw' subquery).... which as percentage seem signifficant.
> Does anybody know where the processing goes now?
>
> Currently, the ANALYSE looks like this:
>                                                             QUERY
> PLAN
>
----------------------------------------------------------------------------------------------------------------------------------
>  Subquery Scan comlog  (cost=0.00..2269.71 rows=51400 width=0) (actual
> time=0.053..755.649 rows=51400 loops=1)
>    ->  Append  (cost=0.00..1755.71 rows=51400 width=59) (actual
> time=0.048..607.437 rows=51400 loops=1)
> ->  Subquery Scan "*SELECT* 1"  (cost=0.00..1084.61 rows=30916 width=59)
> (actual time=0.046..278.802 rows=30916 loops=1)
> ->  Seq Scan on log1 c  (cost=0.00..775.45 rows=30916 width=59) (actual
> time=0.042..170.193 rows=30916 loops=1)
> ->  Subquery Scan "*SELECT* 2"  (cost=0.00..671.10 rows=20484 width=26)
> (actual time=0.055..200.223 rows=20484 loops=1)
> ->  Seq Scan on log2 s  (cost=0.00..466.26 rows=20484 width=26) (actual
> time=0.044..127.301 rows=20484 loops=1)
> Total runtime: 822.901 ms
> (7 rows)
> -----------------------------------------

Just to make sure: You do have an appropriate index over the tables in
that UNION?

 From experience, it seems that PostgreSQL chooses a sequential scan
over unioned sets instead of an index scan - the details escape me, but
there is a good reason for that. I'm sure it's not for performance
reasons, though.

There have been some discussions about inheritance performance, which
boils down to exactly this problem (inheritance basically is a UNION
over all the tables involved). You may want to check the archives.

Regards,
--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
    7500 AK Enschede

// Integrate Your World //

Re: optimising UNION performance

From
Rafal Pietrak
Date:
On Mon, 2006-08-28 at 13:04 +0200, Alban Hertroys wrote:
> Rafal Pietrak wrote:
> > Total runtime: 822.901 ms
> > (7 rows)
> > -----------------------------------------
>
> Just to make sure: You do have an appropriate index over the tables in
> that UNION?

Well. The logfiles don't have their own indexes but make foreign key
references over brand1/brand2/clty columns. Unique constreins are on the
target tables.

But that particual ANALYSED query was: "SELECT 1 FROM comlog".

So, the use of seq-scan looks quite adequate to me (I haven't posted
results to avoid cuttering of my original query with too many details,
but the ANALYSE of "SELECT * FROM comlog" gives almost exactly the same
cost and time, and *that* is what I will actually be doing in the
application).

So: "SELECT 1" on comlog costs 830ms and is done by two saq-scans.

but: "SELECT 1" on log1 gives 120ms, and "SELECT 1" on log2 gives 80ms.

All three queries are executed as seq-scans.... which look OK, as I
intend to fetch *all* the rows.

And yet, there is this 600ms 'leak'.

Now, as I try to  read the ANALYSE output (which I'm not very proficient
at - just blindly comparing them): for queries of raw logs (log1/log2),
I can see just one server task: "seq-scan" for each respective query.
But when I look at ANALYSE output of comlog SELECT, I can see, that:
1. the seq-scans is more expensive here: 170ms and 120ms respectively.
Any reasons for that?
2. each scan has an additional job of: Subquery Scan "*SELECT* 1" ...
which costs even more (280ms and 230ms respectively), although it's
purpose it not very clear to me.
3. only on top of that, there is an Append process, which looks cheap as
expected.

So I can risk an opinion, that I can seek the missing 600ms, I only
don't understand why it's there.

Or to put it the other way around: is there a way to write a UNION where
the 200ms to 800ms cost increase does not occure.

-R

>  From experience, it seems that PostgreSQL chooses a sequential scan
> over unioned sets instead of an index scan - the details escape me, but
> there is a good reason for that. I'm sure it's not for performance
> reasons, though.
>
> There have been some discussions about inheritance performance, which
> boils down to exactly this problem (inheritance basically is a UNION
> over all the tables involved). You may want to check the archives.
>
> Regards,
--
-R

Re: optimising UNION performance

From
Alban Hertroys
Date:
Rafal Pietrak wrote:
> Well. The logfiles don't have their own indexes but make foreign key
> references over brand1/brand2/clty columns. Unique constreins are on the
> target tables.

So there's no index on the logfiles then? (A foreign key constraint
doesn't create an index). It doesn't seem like in your case an index
would cause any benefit, just so you know.

It also means that your problem is different from mine; PostgreSQL using
a seq-scan over a UNION (ALL) where indices were available over the
parts (mind you, it would use the appropriate index on the seperate
union parts).

> But that particual ANALYSED query was: "SELECT 1 FROM comlog".
>
> So, the use of seq-scan looks quite adequate to me (I haven't posted
> results to avoid cuttering of my original query with too many details,
> but the ANALYSE of "SELECT * FROM comlog" gives almost exactly the same
> cost and time, and *that* is what I will actually be doing in the
> application).

There's practically no difference between SELECT 1 FROM ... and SELECT *
FROM ...; the only added costs (AFAIK) are for actually fetching the
column values and such. Pretty cheap operations.

> So: "SELECT 1" on comlog costs 830ms and is done by two saq-scans.
>
> but: "SELECT 1" on log1 gives 120ms, and "SELECT 1" on log2 gives 80ms.
>
> All three queries are executed as seq-scans.... which look OK, as I
> intend to fetch *all* the rows.
>
> And yet, there is this 600ms 'leak'.

...

> But when I look at ANALYSE output of comlog SELECT, I can see, that:
> 1. the seq-scans is more expensive here: 170ms and 120ms respectively.
> Any reasons for that?
> 2. each scan has an additional job of: Subquery Scan "*SELECT* 1" ...
> which costs even more (280ms and 230ms respectively), although it's
> purpose it not very clear to me.

This is probably caused by using UNION as opposed to UNION ALL (as other
people already mentioned).

To merge duplicate results (one from either subquery) the database
sorts[1] the results. To do that, it needs to compare with other records
- hence the extra subquery, and probably the added 50ms as well.

Regards,
--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
    7500 AK Enschede

// Integrate Your World //

Re: optimising UNION performance

From
Rafal Pietrak
Date:
On Mon, 2006-08-28 at 14:50 +0200, Alban Hertroys wrote:
> Rafal Pietrak wrote:
>
> > But when I look at ANALYSE output of comlog SELECT, I can see, that:
> > 1. the seq-scans is more expensive here: 170ms and 120ms respectively.
> > Any reasons for that?
> > 2. each scan has an additional job of: Subquery Scan "*SELECT* 1" ...
> > which costs even more (280ms and 230ms respectively), although it's
> > purpose it not very clear to me.
>
> This is probably caused by using UNION as opposed to UNION ALL (as other
> people already mentioned).
>
> To merge duplicate results (one from either subquery) the database
> sorts[1] the results. To do that, it needs to compare with other records
> - hence the extra subquery, and probably the added 50ms as well.

No no no.

The above 1. 2. 3. is read from UNION ALL analysis - the results of
UNION per se are only in my initial post, and after I've read of the
'ALL' option I make no further reference to the original construct
(where the cost of "SELECT 1" was 3600ms as oposed to 830ms for current
"UNION ALL").

Currently I'm digging why the SELECT on UNION takes 830ms, while SELECT
on respective raw log-tables take just 120ms and 80ms respectively -
where does the remaining 600ms go.

I have notices the spurious "Subquery Scan "*SELECT* 1" ..." server
task, which takes more then the indispensable "seq-scan" on respective
table while does not serve any purpose .... to my unexperienced eye at
least.

And why the same seq-scan taken by select on my log-table *within* a
UNION is more expensive, than when it's taken on that table by itself:
120ms rises to 170ms, and 80ms rises to 120ms for log1/log2 tables
respectively.
--
-R

Re: optimising UNION performance

From
Tom Lane
Date:
Alban Hertroys <alban@magproductions.nl> writes:
> There's practically no difference between SELECT 1 FROM ... and SELECT *
> FROM ...; the only added costs (AFAIK) are for actually fetching the
> column values and such. Pretty cheap operations.

You're both glossing over exactly the wrong thing, particularly seeing
that Rafal appears to be using 8.0 or older which hasn't got 8.1's
significant reductions in targetlist evaluation costs.

            regards, tom lane

Re: optimising UNION performance

From
Rafal Pietrak
Date:
This is a little strange - my response to this post apparently got lost
in the net?? I haven't received it back through the list nor it's
visible in the archieve. Yet, my exim logfile contains entry indicating
'delivery complited'???

But to the point.

All the EXPLAIN ANALISE I did on posggres v8.1.4 - plain *.deb binary
install from debian network repository.

So if version v8.1 was expected to behave any differently, it doesn't.
But may be the reduction improvement are in some pre-8.2 versions?

-R

On Mon, 2006-08-28 at 10:46 -0400, Tom Lane wrote:
> Alban Hertroys <alban@magproductions.nl> writes:
> > There's practically no difference between SELECT 1 FROM ... and SELECT *
> > FROM ...; the only added costs (AFAIK) are for actually fetching the
> > column values and such. Pretty cheap operations.
>
> You're both glossing over exactly the wrong thing, particularly seeing
> that Rafal appears to be using 8.0 or older which hasn't got 8.1's
> significant reductions in targetlist evaluation costs.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
--
-R