Thread: unnesesary sorting after Merge Full Join

unnesesary sorting after Merge Full Join

From
Alexey Nalbat
Date:
Hello.

I'd like to use ORDER BY in any specified order and LIMIT, OFFSET for paging query results.
The query is FULL OUTER JOIN of two tables by field id. I think the results of Merge Full Join
to be ordered by some "combined id". And there is no need in extra Sort if I specify ORDER BY
that "combined id". But unfortunately it is not so.

Here is a simple example:
-- BEGIN
create table t1 as select generate_series(1,1000000,2) as id;
create table t2 as select generate_series(1,1000000,3) as id;

create index i1 on t1 ( id );
create index i2 on t2 ( id );

analyze t1;
analyze t2;

explain analyze
select id, t1.*, t2.*
from t1 natural full join t2
order by 1 limit 10 offset 10;

drop table t1;
drop table t2;
-- END

Postgresql chooses such plan:
 Limit  (cost=44080.12..44080.15 rows=10 width=8) (actual time=6724.850..6724.906 rows=10 loops=1)
   ->  Sort  (cost=44080.10..45330.10 rows=500000 width=8) (actual time=6724.806..6724.845 rows=20 loops=1)
         Sort Key: (COALESCE(t1.id, t2.id))
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Merge Full Join  (cost=0.00..30775.28 rows=500000 width=8) (actual time=0.142..5237.289 rows=666667
loops=1)
               Merge Cond: (t1.id = t2.id)
               ->  Index Scan using i1 on t1  (cost=0.00..15212.30 rows=500000 width=4) (actual time=0.079..1188.601
rows=500000loops=1) 
               ->  Index Scan using i2 on t2  (cost=0.00..10146.30 rows=333334 width=4) (actual time=0.051..793.635
rows=333334loops=1) 

The desired plan is much faster:
 Limit  (cost=0.62..1.23 rows=10 width=8) (actual time=0.262..0.366 rows=10 loops=1)
   ->  Merge Full Join  (cost=0.00..30775.28 rows=500000 width=8) (actual time=0.156..0.303 rows=20 loops=1)
         Merge Cond: (t1.id = t2.id)
         ->  Index Scan using i1 on t1  (cost=0.00..15212.30 rows=500000 width=4) (actual time=0.088..0.120 rows=15
loops=1)
         ->  Index Scan using i2 on t2  (cost=0.00..10146.30 rows=333334 width=4) (actual time=0.056..0.078 rows=11
loops=1)

I found comment in src/backend/optimizer/path/pathkeys.c:
* EXCEPTION: in a FULL or RIGHT join, we cannot treat the result as
* having the outer path's path keys, because null lefthand rows may be
* inserted at random points. It must be treated as unsorted.

How can I get rid of this sorting? Or could this behavior of Merge Full Join be improved?

--
Alexey A. Nalbat

Price Express
http://www.price.ru/
http://www.tyndex.ru/

Re: unnesesary sorting after Merge Full Join

From
Decibel!
Date:
On Feb 21, 2008, at 4:08 AM, Alexey Nalbat wrote:
> I'd like to use ORDER BY in any specified order and LIMIT, OFFSET
> for paging query results.
> The query is FULL OUTER JOIN of two tables by field id. I think the
> results of Merge Full Join
> to be ordered by some "combined id". And there is no need in extra
> Sort if I specify ORDER BY
> that "combined id". But unfortunately it is not so.
>
> Here is a simple example:
> -- BEGIN
> create table t1 as select generate_series(1,1000000,2) as id;
> create table t2 as select generate_series(1,1000000,3) as id;
>
> create index i1 on t1 ( id );
> create index i2 on t2 ( id );
>
> analyze t1;
> analyze t2;
>
> explain analyze
> select id, t1.*, t2.*
> from t1 natural full join t2
> order by 1 limit 10 offset 10;
>
> drop table t1;
> drop table t2;
> -- END
>
> Postgresql chooses such plan:
>  Limit  (cost=44080.12..44080.15 rows=10 width=8) (actual
> time=6724.850..6724.906 rows=10 loops=1)
>    ->  Sort  (cost=44080.10..45330.10 rows=500000 width=8) (actual
> time=6724.806..6724.845 rows=20 loops=1)
>          Sort Key: (COALESCE(t1.id, t2.id))
>          Sort Method:  top-N heapsort  Memory: 25kB
>          ->  Merge Full Join  (cost=0.00..30775.28 rows=500000
> width=8) (actual time=0.142..5237.289 rows=666667 loops=1)
>                Merge Cond: (t1.id = t2.id)
>                ->  Index Scan using i1 on t1  (cost=0.00..15212.30
> rows=500000 width=4) (actual time=0.079..1188.601 rows=500000 loops=1)
>                ->  Index Scan using i2 on t2  (cost=0.00..10146.30
> rows=333334 width=4) (actual time=0.051..793.635 rows=333334 loops=1)
>
> The desired plan is much faster:
>  Limit  (cost=0.62..1.23 rows=10 width=8) (actual time=0.262..0.366
> rows=10 loops=1)
>    ->  Merge Full Join  (cost=0.00..30775.28 rows=500000 width=8)
> (actual time=0.156..0.303 rows=20 loops=1)
>          Merge Cond: (t1.id = t2.id)
>          ->  Index Scan using i1 on t1  (cost=0.00..15212.30
> rows=500000 width=4) (actual time=0.088..0.120 rows=15 loops=1)
>          ->  Index Scan using i2 on t2  (cost=0.00..10146.30
> rows=333334 width=4) (actual time=0.056..0.078 rows=11 loops=1)
>
> I found comment in src/backend/optimizer/path/pathkeys.c:
> * EXCEPTION: in a FULL or RIGHT join, we cannot treat the result as
> * having the outer path's path keys, because null lefthand rows may be
> * inserted at random points. It must be treated as unsorted.
>
> How can I get rid of this sorting? Or could this behavior of Merge
> Full Join be improved?

Theoretically, this can be improved, but I suspect it would be non-
trivial. I suspect that the problem is the planner doesn't realize
that the join key could never be null, which is often not the case.
Consider this example:

decibel=# create table t1 as select generate_series(1,1000000,2) as id1;
SELECT
decibel=# create table t2 as select generate_series(1,1000000,3) as id2;
SELECT

Create index, etc.

explain analyze
select id1, id2
from t1 full join t2 on (t1.id1=t2.id2)
order by 1 limit 10 offset 10;

Note that in this case you have to re-sort, because NULLs sort
differently.

As a workaround, I suggest creating a table that contains all the IDs
from t1 and t2. You could maintain this table via a trigger if you
wanted. You could then quickly determine the exact IDs you wanted,
and then join against the two real tables.
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Attachment

Re: unnesesary sorting after Merge Full Join

From
Simon Riggs
Date:
On Sat, 2008-02-23 at 14:49 -0600, Decibel! wrote:
> On Feb 21, 2008, at 4:08 AM, Alexey Nalbat wrote:

> > I found comment in src/backend/optimizer/path/pathkeys.c:
> > * EXCEPTION: in a FULL or RIGHT join, we cannot treat the result as
> > * having the outer path's path keys, because null lefthand rows may be
> > * inserted at random points. It must be treated as unsorted.
> >
> > How can I get rid of this sorting? Or could this behavior of Merge
> > Full Join be improved?
>
> Theoretically, this can be improved

I don't see how. The ORDER BY ... LIMIT ... code is already optimised.

If there are NULLs in the left hand side then it needs to be treated as
unsorted, which forces a sort.

If you know there are no NULLs then don't do a FULL join.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: unnesesary sorting after Merge Full Join

From
"Alexey A. Nalbat"
Date:
>> > I found comment in src/backend/optimizer/path/pathkeys.c:
>> > * EXCEPTION: in a FULL or RIGHT join, we cannot treat the result as
>> > * having the outer path's path keys, because null lefthand rows may be
>> > * inserted at random points. It must be treated as unsorted.
>> >
>> > How can I get rid of this sorting? Or could this behavior of Merge
>> > Full Join be improved?
>>
>> Theoretically, this can be improved
>
> I don't see how. The ORDER BY ... LIMIT ... code is already optimised.

Yes. But may be the FULL MERGE JOIN could be improved, because it
is ordered, it actually has "outer path's path key": "coalesce(id1,id2)".

> If there are NULLs in the left hand side then it needs to be treated as
> unsorted, which forces a sort.

Yes, it is not ordered by the id1 from the left table because of NULLs.
And it is also not ordered by the id2 from the right table because of NULLs.
But it is ordered by coalesce(id1,id2). Could postgresql have sense about
this fact?

> If you know there are no NULLs then don't do a FULL join.

Full join is right choice for my task. There are images of products stored
on HDD, their IDs are in table pics_arch. And there are image IDs mentioned
in the pricelist, they are in table pr_img. Thus some images could be both
on HDD and pricelist, some only on HDD, other only in the pricelist. I use
full join of these two tables to show HTML-table consists of all images
with remark "both on HDD and in pricelist", "only on HDD" or "only in
pricelist".


Re: unnesesary sorting after Merge Full Join

From
Gregory Stark
Date:
"Alexey A. Nalbat" <alexey_nalbat@hotbox.ru> writes:

> Yes. But may be the FULL MERGE JOIN could be improved, because it
> is ordered, it actually has "outer path's path key": "coalesce(id1,id2)".

Hm, there's more going on here than meets the eye. If you changed the query
slightly Postgres would actually _almost_ do the right thing here.

The immediate blocker is that currently in build_join_pathkeys() for FULL
OUTER JOIN we don't note any path keys at all. We could note COALESCE(id1,id2)
as a path key, though we would have to create an equivalence class and add
COALESCE(id2,id1) to it as well I think.

But if you changed the query to use USING you would get something like this:

postgres=# explain select * from a1 full outer join a2 as a2(i) using (i) order by i;
                               QUERY PLAN
------------------------------------------------------------------------
 Sort  (cost=2914.68..2986.68 rows=28800 width=8)
   Sort Key: (COALESCE(a1.i, a2.i))
   ->  Merge Full Join  (cost=337.49..781.49 rows=28800 width=8)
         Merge Cond: (a1.i = a2.i)
         ->  Sort  (cost=168.75..174.75 rows=2400 width=4)
               Sort Key: a1.i
               ->  Seq Scan on a1  (cost=0.00..34.00 rows=2400 width=4)
         ->  Sort  (cost=168.75..174.75 rows=2400 width=4)
               Sort Key: a2.i
               ->  Seq Scan on a2  (cost=0.00..34.00 rows=2400 width=4)
(10 rows)


Note that it does recognize that "i" is in fact equivalent to COALESCE(). It
just doesn't recognize that the merge join is producing such an ordering.

Even if it wasn't hard to add that at least for this case in
reconsider_outer_join_clauses() we explicitly don't consider join clauses
unless they're against a constant. I haven't quite absorbed the logic here but
I think it only applies to path keys useful for joining, not for ordering:

 * For full-join cases, we can only do something useful if it's a FULL JOIN
 * USING and a merged column has an equivalence MERGEDVAR = CONSTANT.
 * By the time it gets here, the merged column will look like
 *        COALESCE(LEFTVAR, RIGHTVAR)
 * and we will have a full-join clause LEFTVAR = RIGHTVAR that we can match
 * the COALESCE expression to. In this situation we can push LEFTVAR = CONSTANT
 * and RIGHTVAR = CONSTANT into the input relations, since any rows not
 * meeting these conditions cannot contribute to the join result.
 *
 * Again, there isn't any traction to be gained by trying to deal with
 * clauses comparing a mergedvar to a non-pseudoconstant.  So we can make
 * use of the EquivalenceClasses to search for matching variables that were
 * equivalenced to constants.  The interesting outer-join clauses were
 * accumulated for us by distribute_qual_to_rels.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: unnesesary sorting after Merge Full Join

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Alexey A. Nalbat" <alexey_nalbat@hotbox.ru> writes:
>> Yes. But may be the FULL MERGE JOIN could be improved, because it
>> is ordered, it actually has "outer path's path key": "coalesce(id1,id2)".

No, it does not have the outer path's path key.  The outer path's key
is just id1.

> The immediate blocker is that currently in build_join_pathkeys() for FULL
> OUTER JOIN we don't note any path keys at all. We could note COALESCE(id1,id2)
> as a path key, though we would have to create an equivalence class

Right ...

> and add COALESCE(id2,id1) to it as well I think.

No, because those two expressions are not equivalent.  (Hmm ... squint
... but full merge join is pretty much symmetric, so it's not clear
why it should matter which side is left or right.  Maybe COALESCE isn't
exactly the right concept with which to describe the merged variable?)

> Even if it wasn't hard to add that at least for this case in
> reconsider_outer_join_clauses() we explicitly don't consider join clauses
> unless they're against a constant. I haven't quite absorbed the logic here but

That stuff is irrelevant for sort-order considerations.

It strikes me that there's another bit of smarts that could be added
here: in a Merge Right Join the correct output pathkey is the righthand
input's path key, rather than nil.  Again this is because mergejoin is
symmetric in the two inputs.

            regards, tom lane

Re: unnesesary sorting after Merge Full Join

From
Simon Riggs
Date:
On Tue, 2008-02-26 at 10:34 -0500, Tom Lane wrote:
> > and add COALESCE(id2,id1) to it as well I think.
>
> No, because those two expressions are not equivalent.  (Hmm ... squint
> ... but full merge join is pretty much symmetric, so it's not clear
> why it should matter which side is left or right.  Maybe COALESCE
> isn't
> exactly the right concept with which to describe the merged variable?)

It is, in this case only, since when id2 is not null id2 == id1.

So in this case its OK to express a symmetric relationship as a left
handed function.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: unnesesary sorting after Merge Full Join

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2008-02-26 at 10:34 -0500, Tom Lane wrote:
> and add COALESCE(id2,id1) to it as well I think.
>> No, because those two expressions are not equivalent.  (Hmm ... squint
>> ... but full merge join is pretty much symmetric, so it's not clear
>> why it should matter which side is left or right.  Maybe COALESCE
>> isn't
>> exactly the right concept with which to describe the merged variable?)

> It is, in this case only, since when id2 is not null id2 == id1.

> So in this case its OK to express a symmetric relationship as a left
> handed function.

Well, it gives the right answer, but it fails to capture the property
that the expression is really symmetric.  Which is something we need
to capture here so that the planner doesn't think that x mergejoin y
and y mergejoin x produce different output orderings.  I think I had
done the COALESCE hack to avoid putting very much effort into FULL
JOIN, but it might be time to put in some more work, if we really
care about improving this.

            regards, tom lane

Re: unnesesary sorting after Merge Full Join

From
Tom Lane
Date:
I wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> as a path key, though we would have to create an equivalence class
>> and add COALESCE(id2,id1) to it as well I think.

> No, because those two expressions are not equivalent.  (Hmm ... squint
> ... but full merge join is pretty much symmetric, so it's not clear
> why it should matter which side is left or right.  Maybe COALESCE isn't
> exactly the right concept with which to describe the merged variable?)

Wait ... after consuming more caffeine, I think you were right.  The
point of an EquivalenceClass is that it asserts the contained
expressions must have the same values, and in the case of a full join
it actually is the case that COALESCE(id1,id2) = COALESCE(id2,id1)
at every output row.  So it's legitimate to put both expressions in
the same eclass, even though their values might be different in other
circumstances.  And that solves our symmetry problem because the eclass
is the same whichever way you build it.

It might still be interesting sometime to have a more bespoke
representation for a merged variable, but I guess we don't need
it just for this.

            regards, tom lane

Re: unnesesary sorting after Merge Full Join

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> I wrote:
>> Gregory Stark <stark@enterprisedb.com> writes:
>>> as a path key, though we would have to create an equivalence class
>>> and add COALESCE(id2,id1) to it as well I think.
>
>> No, because those two expressions are not equivalent.  (Hmm ... squint
>> ... but full merge join is pretty much symmetric, so it's not clear
>> why it should matter which side is left or right.  Maybe COALESCE isn't
>> exactly the right concept with which to describe the merged variable?)
>
> Wait ... after consuming more caffeine, I think you were right.  The
> point of an EquivalenceClass is that it asserts the contained
> expressions must have the same values, and in the case of a full join
> it actually is the case that COALESCE(id1,id2) = COALESCE(id2,id1)
> at every output row.  So it's legitimate to put both expressions in
> the same eclass, even though their values might be different in other
> circumstances.  And that solves our symmetry problem because the eclass
> is the same whichever way you build it.
>
> It might still be interesting sometime to have a more bespoke
> representation for a merged variable, but I guess we don't need
> it just for this.

It might have an advantage if you're doing a three-way outer join and neither
C(id1, C(id2,id3)) nor C(C(id2,id3), id1) match the requested order of
C(C(id1,id2), id3).

It doesn't see too important to get that right since the outer joins can't be
reordered and the user could reasonably be expected to match the grouping of
his joins with his coalesce expression.

I'm normally a proponent of guaranteeing that equivalent queries produce
identical plans but avoiding factorial-order growth of equivalence classes is
also nice. Even putting both orders in there will have exponential growth
which is a bit frightful.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: unnesesary sorting after Merge Full Join

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> It might still be interesting sometime to have a more bespoke
>> representation for a merged variable, but I guess we don't need
>> it just for this.

> It might have an advantage if you're doing a three-way outer join and neither
> C(id1, C(id2,id3)) nor C(C(id2,id3), id1) match the requested order of
> C(C(id1,id2), id3).

Hmm ... couldn't we fix that if the COALESCE-builder were smart enough
to flatten nested COALESCEs?  AFAICS, C(C(x,y),z) == C(x,y,z)

> It doesn't see too important to get that right since the outer joins can't be
> reordered and the user could reasonably be expected to match the grouping of
> his joins with his coalesce expression.

Agreed, it's not real clear that nested full joins are something we need
to be overly tense about right now.

            regards, tom lane