Thread: Wrong width of UNION statement

Wrong width of UNION statement

From
Kenichiro Tanaka
Date:
Hello hackers

I think I found a bug about estimating width of table column when I
perform SQL with UNION statement.

I think table column width of  UNION statement should be equal one of UNION ALL.
But they don't match.This can be reproduce it on HEAD.

See following example.

--CREATE TEST TABLE
DROP TABLE union_test;DROP TABLE union_test2;
CREATE TABLE union_test  AS SELECT md5(g::text)::char(84) as data FROM
generate_series(1,1000) as g;
CREATE TABLE union_test2 AS SELECT md5(g::text)::char(84) as data FROM
generate_series(1,1000) as g;
ANALYZE union_test;
ANALYZE union_test2;

--width of union_test is 85.
SELECT * FROM union_test;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Seq Scan on union_test  (cost=0.00..25.00 rows=1000 width=85) (actual
time=0.591..1.166 rows=1000 loops=1)
 Planning Time: 10.559 ms
 Execution Time: 2.974 ms
(3 rows)

--width of UNION is 340(wrong)
EXPLAIN ANALYZE
SELECT * FROM union_test
UNION
SELECT * FROM union_test2;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=85.00..105.00 rows=2000 width=*340*) (actual
time=3.323..3.672 rows=1000 loops=1)
Group Key: union_test.data
Peak Memory Usage: 369 kB
->  Append  (cost=0.00..80.00 rows=2000 width=340) (actual
time=0.021..1.191 rows=2000 loops=1)
->  Seq Scan on union_test  (cost=0.00..25.00 rows=1000 width=85)
(actual time=0.019..0.393 rows=1000 loops=1)
->  Seq Scan on union_test2  (cost=0.00..25.00 rows=1000 width=85)
(actual time=0.027..0.302 rows=1000 loops=1)
Planning Time: 0.096 ms
Execution Time: 3.908 ms
(8 rows)

--width of UNION ALL is 85
EXPLAIN ANALYZE
SELECT * FROM union_test
UNION ALL
SELECT * FROM union_test2;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Append  (cost=0.00..60.00 rows=2000 width=85) (actual
time=0.017..1.187 rows=2000 loops=1)
->  Seq Scan on union_test  (cost=0.00..25.00 rows=1000 width=85)
(actual time=0.017..0.251 rows=1000 loops=1)
->  Seq Scan on union_test2  (cost=0.00..25.00 rows=1000 width=85)
(actual time=0.018..0.401 rows=1000 loops=1)
Planning Time: 0.213 ms
Execution Time: 1.444 ms
(5 rows)

I think this is bug, is it right?

Regards
Kenichiro Tanaka.



Re: Wrong width of UNION statement

From
Tom Lane
Date:
Kenichiro Tanaka <kenichirotanakapg@gmail.com> writes:
> I think table column width of  UNION statement should be equal one of UNION ALL.

I don't buy that argument, because there could be type coercions involved,
so that the result width isn't necessarily equal to any one of the inputs.

Having said that, the example you give shows that we make use of
pg_statistic.stawidth values when estimating the width of immediate
relation outputs, but that data isn't available by the time we get to
a UNION output.  So we fall back to get_typavgwidth, which in this
case is going to produce something involving the typmod times the
maximum encoding length.  (I suppose you're using UTF8 encoding...)

There's room for improvement there, but this is all bound up in the legacy
mess that we have in prepunion.c.  For example, because we don't have
RelOptInfo nodes associated with individual set-operation outputs, it's
difficult to figure out where we might store data about the widths of such
outputs.  Nor could we easily access the data if we had it, since the
associated Vars don't have valid RTE indexes.  So to my mind, that code
needs to be thrown away and rewritten, using actual relations to represent
the different setop results and Paths to represent possible computations.
In the meantime, it's hard to get excited about layering some additional
hacks on top of what's there now.

            regards, tom lane



Re: Wrong width of UNION statement

From
Kenichiro Tanaka
Date:
Hello,

Thank you for your quick response and sorry for my late reply.

> (I suppose you're using UTF8 encoding...)
It is right.
As you said, my encoding of database is UTF8.

>There's room for improvement there, but this is all bound up in the legacy
>mess that we have in prepunion.c.
At first,I think it is easy to fix it.
Because I think that it is easy to fix by calculating in the same way
as UNION ALL.
But ,now,I understand it is not so easy.

I'll report if I find some strong reason enough to throw away and
rewrite prepunion.c.

Thank you.

Regards
Kenichiro Tanaka.

2020年6月2日(火) 0:04 Tom Lane <tgl@sss.pgh.pa.us>:
>
> Kenichiro Tanaka <kenichirotanakapg@gmail.com> writes:
> > I think table column width of  UNION statement should be equal one of UNION ALL.
>
> I don't buy that argument, because there could be type coercions involved,
> so that the result width isn't necessarily equal to any one of the inputs.
>
> Having said that, the example you give shows that we make use of
> pg_statistic.stawidth values when estimating the width of immediate
> relation outputs, but that data isn't available by the time we get to
> a UNION output.  So we fall back to get_typavgwidth, which in this
> case is going to produce something involving the typmod times the
> maximum encoding length.  (I suppose you're using UTF8 encoding...)
>
> There's room for improvement there, but this is all bound up in the legacy
> mess that we have in prepunion.c.  For example, because we don't have
> RelOptInfo nodes associated with individual set-operation outputs, it's
> difficult to figure out where we might store data about the widths of such
> outputs.  Nor could we easily access the data if we had it, since the
> associated Vars don't have valid RTE indexes.  So to my mind, that code
> needs to be thrown away and rewritten, using actual relations to represent
> the different setop results and Paths to represent possible computations.
> In the meantime, it's hard to get excited about layering some additional
> hacks on top of what's there now.
>
>                         regards, tom lane



Re: Wrong width of UNION statement

From
Ashutosh Bapat
Date:
On Mon, Jun 1, 2020 at 8:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Kenichiro Tanaka <kenichirotanakapg@gmail.com> writes:
> > I think table column width of  UNION statement should be equal one of UNION ALL.
>
> I don't buy that argument, because there could be type coercions involved,
> so that the result width isn't necessarily equal to any one of the inputs.
>
> Having said that, the example you give shows that we make use of
> pg_statistic.stawidth values when estimating the width of immediate
> relation outputs, but that data isn't available by the time we get to
> a UNION output.  So we fall back to get_typavgwidth, which in this
> case is going to produce something involving the typmod times the
> maximum encoding length.  (I suppose you're using UTF8 encoding...)
>
> There's room for improvement there, but this is all bound up in the legacy
> mess that we have in prepunion.c.  For example, because we don't have
> RelOptInfo nodes associated with individual set-operation outputs,

We already have that infrastructure, IIUC through commit

commit c596fadbfe20ff50a8e5f4bc4b4ff5b7c302ecc0
Author: Robert Haas <rhaas@postgresql.org>
Date:   Mon Mar 19 11:55:38 2018 -0400

    Generate a separate upper relation for each stage of setop planning.

Can we use that to fix this bug?

-- 
Best Wishes,
Ashutosh Bapat



Re: Wrong width of UNION statement

From
Kenichiro Tanaka
Date:
Hello,Ashutosh
Thank you for your response.

>We already have that infrastructure, IIUC through commit
That's good!

>Can we use that to fix this bug?
I'll try it.
But this is my first hack,I think I need time.

Regards
Kenichiro Tanaka

2020年6月8日(月) 22:11 Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>:
>
> On Mon, Jun 1, 2020 at 8:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Kenichiro Tanaka <kenichirotanakapg@gmail.com> writes:
> > > I think table column width of  UNION statement should be equal one of UNION ALL.
> >
> > I don't buy that argument, because there could be type coercions involved,
> > so that the result width isn't necessarily equal to any one of the inputs.
> >
> > Having said that, the example you give shows that we make use of
> > pg_statistic.stawidth values when estimating the width of immediate
> > relation outputs, but that data isn't available by the time we get to
> > a UNION output.  So we fall back to get_typavgwidth, which in this
> > case is going to produce something involving the typmod times the
> > maximum encoding length.  (I suppose you're using UTF8 encoding...)
> >
> > There's room for improvement there, but this is all bound up in the legacy
> > mess that we have in prepunion.c.  For example, because we don't have
> > RelOptInfo nodes associated with individual set-operation outputs,
>
> We already have that infrastructure, IIUC through commit
>
> commit c596fadbfe20ff50a8e5f4bc4b4ff5b7c302ecc0
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   Mon Mar 19 11:55:38 2018 -0400
>
>     Generate a separate upper relation for each stage of setop planning.
>
> Can we use that to fix this bug?
>
> --
> Best Wishes,
> Ashutosh Bapat