Thread: Adding CORRESPONDING to Set Operations

Adding CORRESPONDING to Set Operations

From
Kerem Kat
Date:
Hello,

I am new to postgresql code, I would like to start implementing easyish TODO items. I have read most of the development guidelines, faqs, articles by Greg Smith (Hacking Postgres with UDFs, Adding WHEN to triggers).

The item I would like to implement is adding CORRESPONDING [BY (col1[,col2,...]])] to INTERSECT and EXCEPT operators.

Can anyone comment on how much effort this item needs?


regards, kerem kat.

Re: Adding CORRESPONDING to Set Operations

From
Kerem Kat
Date:
Is it feasible to implement the CORRESPONDING [BY (expr_list)] statement in set operations by the following changes:

i) In analyze.c:transformSetOperationStmt after parsing left and right queries as subnodes to a set operation tree,
    a) CORRESPONDING: Find matching column targets from both statements, eliminate unmatching targets and proceed.
    b) CORRESPONDING BY (expr_list): Verify expr_list columns exist in both select statements. Eliminate unmatched column names to expr_list and proceed.
ii) Instead of elimination set TargetEntry->resjunk = true for unwanted output columns.


Thank you for your attention,
Any comments are welcome.

Kerem KAT

On Sun, Sep 18, 2011 at 12:39, Kerem Kat <keremkat@gmail.com> wrote:
Hello,

I am new to postgresql code, I would like to start implementing easyish TODO items. I have read most of the development guidelines, faqs, articles by Greg Smith (Hacking Postgres with UDFs, Adding WHEN to triggers).

The item I would like to implement is adding CORRESPONDING [BY (col1[,col2,...]])] to INTERSECT and EXCEPT operators.

Can anyone comment on how much effort this item needs?


regards, kerem kat.

Re: Adding CORRESPONDING to Set Operations

From
Robert Haas
Date:
On Sun, Sep 18, 2011 at 5:39 AM, Kerem Kat <keremkat@gmail.com> wrote:
> I am new to postgresql code, I would like to start implementing easyish TODO
> items. I have read most of the development guidelines, faqs, articles by
> Greg Smith (Hacking Postgres with UDFs, Adding WHEN to triggers).
> The item I would like to implement is adding CORRESPONDING [BY
> (col1[,col2,...]])] to INTERSECT and EXCEPT operators.
> Can anyone comment on how much effort this item needs?

This seems reasonably tricky for a first project, but maybe not out of
reach if you are a skilled C hacker.  It's certainly more complicated
than my first patch:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a0b76dc662efde6e02921c2d16e06418483b7534

I guess the first question that needs to be answered here is ... what
exactly is this syntax supposed to do?  A little looking around
suggests that EXCEPT CORRESPONDING is supposed to make the
correspondence run by column names rather than by column positions,
and if you further add BY col1, ... then it restricts the comparison
to those columns.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Adding CORRESPONDING to Set Operations

From
Kerem Kat
Date:
I delved into the code without waiting for comments from the list just to learn something about postgresql internals. And I have finished the CORRESPONDING, now CORRESPONDING BY is being tested. I will also write documentation and regression tests.


Yes Robert, you are correct. Having used SQL 20nn standard draft as a guide, a brief explanation can be provided as such:

Shorter version: column name lists are intersected.
Short version: In the set operation queries, which are queries containing INTERSECT, EXCEPT or UNION, a CORRESPONDING clause can be used to project the resulting columns to only columns contained in both sides of the query. There is also and addition of BY(col1, col2, ...) to the clause which projects the columns to its own list. An example query would clarifiy.

SELECT 1 a, 2 b UNION CORRESPONDING SELECT 3 a;
a
--
1
3

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(a, c) SELECT 4 a, 5 c
a   c
------
1   3
4   5



On Thu, Sep 22, 2011 at 16:20, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Sep 18, 2011 at 5:39 AM, Kerem Kat <keremkat@gmail.com> wrote:
> I am new to postgresql code, I would like to start implementing easyish TODO
> items. I have read most of the development guidelines, faqs, articles by
> Greg Smith (Hacking Postgres with UDFs, Adding WHEN to triggers).
> The item I would like to implement is adding CORRESPONDING [BY
> (col1[,col2,...]])] to INTERSECT and EXCEPT operators.
> Can anyone comment on how much effort this item needs?

This seems reasonably tricky for a first project, but maybe not out of
reach if you are a skilled C hacker.  It's certainly more complicated
than my first patch:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a0b76dc662efde6e02921c2d16e06418483b7534

I guess the first question that needs to be answered here is ... what
exactly is this syntax supposed to do?  A little looking around
suggests that EXCEPT CORRESPONDING is supposed to make the
correspondence run by column names rather than by column positions,
and if you further add BY col1, ... then it restricts the comparison
to those columns.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Adding CORRESPONDING to Set Operations

From
Kerem Kat
Date:
While testing I noticed that ordering is incorrect in my implementation. At first I thought that removing mismatched entries from ltargetlist and rtargetlist would be enough, it didn't seem enough so I added rtargetlist sorting.

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING 4 b, 5 a, 6 c;
returns incorrectly:
a  b  c
1  2  3
4  5  6

Correct:
a  b  c
1  2  3
5  4  6

In the analyze.c:transfromSetOperationStmt, I tried to sort rtargetlist before the forboth(ltl, ltargetlist, rtl,rtargetlist) to no avail.
Sorted column names are in correct order in rtargetlist, but query is executed as if rtargetlist is never sorted.

Where the targetlist gets the column ordering? Apparently not while targetlist is being lappend'ed (?).


regards,

Kerem KAT



On Thu, Sep 22, 2011 at 17:03, Kerem Kat <keremkat@gmail.com> wrote:
I delved into the code without waiting for comments from the list just to learn something about postgresql internals. And I have finished the CORRESPONDING, now CORRESPONDING BY is being tested. I will also write documentation and regression tests.


Yes Robert, you are correct. Having used SQL 20nn standard draft as a guide, a brief explanation can be provided as such:

Shorter version: column name lists are intersected.
Short version: In the set operation queries, which are queries containing INTERSECT, EXCEPT or UNION, a CORRESPONDING clause can be used to project the resulting columns to only columns contained in both sides of the query. There is also and addition of BY(col1, col2, ...) to the clause which projects the columns to its own list. An example query would clarifiy.

SELECT 1 a, 2 b UNION CORRESPONDING SELECT 3 a;
a
--
1
3

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(a, c) SELECT 4 a, 5 c
a   c
------
1   3
4   5



On Thu, Sep 22, 2011 at 16:20, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Sep 18, 2011 at 5:39 AM, Kerem Kat <keremkat@gmail.com> wrote:
> I am new to postgresql code, I would like to start implementing easyish TODO
> items. I have read most of the development guidelines, faqs, articles by
> Greg Smith (Hacking Postgres with UDFs, Adding WHEN to triggers).
> The item I would like to implement is adding CORRESPONDING [BY
> (col1[,col2,...]])] to INTERSECT and EXCEPT operators.
> Can anyone comment on how much effort this item needs?

This seems reasonably tricky for a first project, but maybe not out of
reach if you are a skilled C hacker.  It's certainly more complicated
than my first patch:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a0b76dc662efde6e02921c2d16e06418483b7534

I guess the first question that needs to be answered here is ... what
exactly is this syntax supposed to do?  A little looking around
suggests that EXCEPT CORRESPONDING is supposed to make the
correspondence run by column names rather than by column positions,
and if you further add BY col1, ... then it restricts the comparison
to those columns.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Adding CORRESPONDING to Set Operations

From
Tom Lane
Date:
Kerem Kat <keremkat@gmail.com> writes:
> While testing I noticed that ordering is incorrect in my implementation. At
> first I thought that removing mismatched entries from ltargetlist and
> rtargetlist would be enough, it didn't seem enough so I added rtargetlist
> sorting.

I don't think you can get away with changing the targetlists of the
UNION subqueries; you could break their semantics.  Consider for
instance
select distinct a, b, c from t1union correspondingselect b, c from t2;

If you discard the A column from t1's output list then it will deliver a
different set of rows than it should, because the DISTINCT is
considering the wrong set of values.

One possible way to fix that is to introduce a level of sub-select,
as if the query had been written
select b, c from (select distinct a, b, c from t1) ss1unionselect b, c from (select b, c from t2) ss2;

However, the real problem with either type of hackery is that these
machinations will be visible in the parsed query, which means for
example that a view defined as
create view v1 asselect distinct a, b, c from t1union correspondingselect b, c from t2;

would come out looking like the transformed version rather than the
original when it's dumped, or even just examined with tools such as
psql's \d+.  I think this is bad style.  It's certainly ugly to expose
your implementation shortcuts to the user like that, and it also can
cause problems down the road: if in the future we think of some better
way to implement CORRESPONDING, we've lost the chance to do so for any
stored views that got transformed this way.  (There are several places
in Postgres now that take such shortcuts, and all of them were mistakes
that we need to clean up someday, IMO.)

So I think that as far as the parser is concerned, you just want to
store the CORRESPONDING clause more or less as-is, and not do too much
more than verify that it's valid.  The place to actually implement it is
in the planner (see prepunion.c).  Possibly the add-a-level-of-subselect
approach will work, but you want to do that querytree transformation at
plan time not parse time.
        regards, tom lane


Re: Adding CORRESPONDING to Set Operations

From
Kerem Kat
Date:
I am looking into perpunion.c and analyze.c

There is a catch inserting subqueries for corresponding in the planner. 
Parser expects to see equal number of columns in both sides of the 
UNION query. If there is corresponding however we cannot guarantee that.
Target columns, collations and types for the SetOperationStmt are
determined in the parser. If we pass the column number equality checks,
it is not clear that how one would proceed with the targetlist generation loop
which is a forboth for two table's columns.

One way would be filtering the columns in the parser anyway and inserting 
subqueries in the planner but it leads to the previous problem of column
ordering and view definition mess-up, and it would be too much bloat
methinks.

I can guess what needs to be done in prepunion.c, but I need a waypointer 
for the parser.

tom lane: Thanks for your description


regards

Kerem KAT

On Fri, Sep 23, 2011 at 07:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kerem Kat <keremkat@gmail.com> writes:
> While testing I noticed that ordering is incorrect in my implementation. At
> first I thought that removing mismatched entries from ltargetlist and
> rtargetlist would be enough, it didn't seem enough so I added rtargetlist
> sorting.

I don't think you can get away with changing the targetlists of the
UNION subqueries; you could break their semantics.  Consider for
instance

       select distinct a, b, c from t1
       union corresponding
       select b, c from t2;

If you discard the A column from t1's output list then it will deliver a
different set of rows than it should, because the DISTINCT is
considering the wrong set of values.

One possible way to fix that is to introduce a level of sub-select,
as if the query had been written

       select b, c from (select distinct a, b, c from t1) ss1
       union
       select b, c from (select b, c from t2) ss2;

However, the real problem with either type of hackery is that these
machinations will be visible in the parsed query, which means for
example that a view defined as

       create view v1 as
       select distinct a, b, c from t1
       union corresponding
       select b, c from t2;

would come out looking like the transformed version rather than the
original when it's dumped, or even just examined with tools such as
psql's \d+.  I think this is bad style.  It's certainly ugly to expose
your implementation shortcuts to the user like that, and it also can
cause problems down the road: if in the future we think of some better
way to implement CORRESPONDING, we've lost the chance to do so for any
stored views that got transformed this way.  (There are several places
in Postgres now that take such shortcuts, and all of them were mistakes
that we need to clean up someday, IMO.)

So I think that as far as the parser is concerned, you just want to
store the CORRESPONDING clause more or less as-is, and not do too much
more than verify that it's valid.  The place to actually implement it is
in the planner (see prepunion.c).  Possibly the add-a-level-of-subselect
approach will work, but you want to do that querytree transformation at
plan time not parse time.

                       regards, tom lane

Re: Adding CORRESPONDING to Set Operations

From
Tom Lane
Date:
Kerem Kat <keremkat@gmail.com> writes:
> There is a catch inserting subqueries for corresponding in the planner.
> Parser expects to see equal number of columns in both sides of the
> UNION query. If there is corresponding however we cannot guarantee that.

Well, you certainly need the parse analysis code to be aware of
CORRESPONDING's effects.  But I think you can confine the changes to
adjusting the computation of a SetOperationStmt's list of output column
types.  It might be a good idea to also add a list of output column
names to SetOperationStmt, and get rid of the logic that digs down into
the child queries when we need to know the output column names.

> Target columns, collations and types for the SetOperationStmt are
> determined in the parser. If we pass the column number equality checks,
> it is not clear that how one would proceed with the targetlist generation
> loop which is a forboth for two table's columns.

Obviously, that logic doesn't work at all for CORRESPONDING, so you'll
need to have a separate code path to deduce the output column list in
that case.
        regards, tom lane


Re: Adding CORRESPONDING to Set Operations

From
Kerem Kat
Date:
On Sat, Sep 24, 2011 at 18:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Kerem Kat <keremkat@gmail.com> writes:
> > There is a catch inserting subqueries for corresponding in the planner.
> > Parser expects to see equal number of columns in both sides of the
> > UNION query. If there is corresponding however we cannot guarantee that.
>
> Well, you certainly need the parse analysis code to be aware of
> CORRESPONDING's effects.  But I think you can confine the changes to
> adjusting the computation of a SetOperationStmt's list of output column
> types.  It might be a good idea to also add a list of output column
> names to SetOperationStmt, and get rid of the logic that digs down into
> the child queries when we need to know the output column names.
>

In the parser while analyzing SetOperationStmt, larg and rarg needs to be
transformed as subqueries. SetOperationStmt can have two fields representing
larg and rarg with projected columns according to corresponding:
larg_corresponding,
rarg_corresponding.

Planner uses _corresponding ones if query is a corresponding query,
view-definition-generator
uses larg and rarg which represent the query user entered.

Comments?


> > Target columns, collations and types for the SetOperationStmt are
> > determined in the parser. If we pass the column number equality checks,
> > it is not clear that how one would proceed with the targetlist generation
> > loop which is a forboth for two table's columns.
>
> Obviously, that logic doesn't work at all for CORRESPONDING, so you'll
> need to have a separate code path to deduce the output column list in
> that case.
>

If the output column list to be determined at that stage it needs to
be filtered and ordered.
In that case aren't we breaking the non-modification of user query argument?

note: I am new to this list, am I asking too much detail?

regards,

Kerem KAT


Re: Adding CORRESPONDING to Set Operations

From
Kerem Kat
Date:
On Sat, Sep 24, 2011 at 19:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kerem Kat <keremkat@gmail.com> writes:
>> In the parser while analyzing SetOperationStmt, larg and rarg needs to be
>> transformed as subqueries. SetOperationStmt can have two fields representing
>> larg and rarg with projected columns according to corresponding:
>> larg_corresponding,
>> rarg_corresponding.
>
> Why?  CORRESPONDING at a given set-operation level doesn't affect either
> sub-query, so I don't see why you'd need a different representation for
> the sub-queries.
>

In the planner to construct a subquery out of SetOperationStmt or
RangeTblRef, a new RangeTblRef is needed.
To create a RangeTableRef, parser state is needed and planner assumes
root->parse->rtable be not modified
after generating simple_rte_array.

SELECT a,b,c FROM t is larg
SELECT a,b FROM (SELECT a,b,c FROM t) is larg_corresponding
SELECT d,a,b FROM t is rarg
SELECT a,b FROM (SELECT d,a,b FROM t); is rarg_corresponding

In the planner choose _corresponding ones if the query has corresponding.

SELECT a,b FROM (SELECT a,b,c FROM t)
UNION
SELECT a,b FROM (SELECT d,a,b FROM t);



>>> Obviously, that logic doesn't work at all for CORRESPONDING, so you'll
>>> need to have a separate code path to deduce the output column list in
>>> that case.
>
>> If the output column list to be determined at that stage it needs to
>> be filtered and ordered.
>> In that case aren't we breaking the non-modification of user query argument?
>
> No.  All that you're doing is correctly computing the lists of the
> set-operation's output column types (and probably names too).  These are
> internal details that needn't be examined when printing the query, so
> they won't affect ruleutils.c.
>
>> note: I am new to this list, am I asking too much detail?
>
> Well, I am beginning to wonder if you should choose a smaller project
> for your first venture into patching Postgres.
>


regards,

Kerem KAT


Re: Adding CORRESPONDING to Set Operations

From
Tom Lane
Date:
Kerem Kat <keremkat@gmail.com> writes:
> On Sat, Sep 24, 2011 at 19:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Why?  CORRESPONDING at a given set-operation level doesn't affect either
>> sub-query, so I don't see why you'd need a different representation for
>> the sub-queries.

> In the planner to construct a subquery out of SetOperationStmt or
> RangeTblRef, a new RangeTblRef is needed.
> To create a RangeTableRef, parser state is needed and planner assumes
> root->parse->rtable be not modified
> after generating simple_rte_array.

Actually, after looking at the code again, I don't think you need any of
that, since there's already a SubqueryScan node being inserted into the
plan.  You just need to improve generate_setop_tlist so that it can deal
with cases where the mapping from subplan targetlist to the setop output
columns isn't one-to-one.
        regards, tom lane


Re: Adding CORRESPONDING to Set Operations

From
Tom Lane
Date:
Kerem Kat <keremkat@gmail.com> writes:
> In the parser while analyzing SetOperationStmt, larg and rarg needs to be
> transformed as subqueries. SetOperationStmt can have two fields representing
> larg and rarg with projected columns according to corresponding:
> larg_corresponding,
> rarg_corresponding.

Why?  CORRESPONDING at a given set-operation level doesn't affect either
sub-query, so I don't see why you'd need a different representation for
the sub-queries.

>> Obviously, that logic doesn't work at all for CORRESPONDING, so you'll
>> need to have a separate code path to deduce the output column list in
>> that case.

> If the output column list to be determined at that stage it needs to
> be filtered and ordered.
> In that case aren't we breaking the non-modification of user query argument?

No.  All that you're doing is correctly computing the lists of the
set-operation's output column types (and probably names too).  These are
internal details that needn't be examined when printing the query, so
they won't affect ruleutils.c.

> note: I am new to this list, am I asking too much detail?

Well, I am beginning to wonder if you should choose a smaller project
for your first venture into patching Postgres.
        regards, tom lane


Re: Adding CORRESPONDING to Set Operations

From
Kerem Kat
Date:
CORRESPONDING clause take 2

After realizing that modifying prepunion.c to include a custom subquery
is not easy(incomprehensible to me) as it sounds and turning into a
hassle after making several uninformed changes, I decided to go with
modifying analyze.c.

The incomprehensible part is constructing a custom subquery as a
SubqueryScan.

Anyway I managed to implement the clause as a Subquery in analyze.c.

In the method transformSetOperationTree, if the node is a setoperation and
contains a corresponding clause, i.e. CORRESPONDING, or CORRESPONDING
BY(columns...),
we determine the common column names. Column ordering in select statements
are not important to the CORRESPONDING. With the common column names
in hand, we create a RangeSubselect node accordingly and replace the original
statement op->larg with the new RangeSubselect. RangeSubselect in turn has the
original op->larg as a from clause. We do the same to op->rarg too.

There were no changes done in prepunion.c

There are documentation changes and one regression test in the patch.


Best Regards,

Kerem KAT

Attachment

Re: Adding CORRESPONDING to Set Operations

From
Robert Haas
Date:
On Sun, Oct 16, 2011 at 7:46 PM, Kerem Kat <keremkat@gmail.com> wrote:
> CORRESPONDING clause take 2

You should probably read this:

http://wiki.postgresql.org/wiki/Submitting_a_Patch

And add your patch here:

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company