Thread: Regression on PostgreSQL 10 ORDER/GROUP BY expression not found in targetlist

On June 26th, our PostGIS debian bot started failing on PostgreSQL 10 with
one of our regression tests for topology.

It started erroring out with 

ORDER/GROUP BY expression not found in targetlist

Related ticket here: https://trac.osgeo.org/postgis/ticket/4111

I think the exact query that is triggering it is this one though I don't
have PostgreSQL 10 head installed locally to confirm.

SELECT 't3280', 'L1b' || topology.TopoGeo_AddLinestring('bug3280', geom)
 FROM bug3280.edge where edge_id = 1
 ORDER BY 1;


This bot follows the dev, and stable git branches of PostgreSQL, and it was
on PostgreSQL 10 that it started failing.
PostgreSQL 11 was okay.

I'm assuming this is an unintended consequence of this commit 

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3566873f2
1909397561418fab22c98332fa8b72a

as that's when the issue started happening.
Just want to confirm that and not something in our code we've just been
doing incorrectly.  

If needed I'll try to put together an example that doesn't involve PostGIS
to exercise the problem.

-- more details from the commit and it seemed like lots of files were
changed

    Add PGTYPESchar_free() to avoid cross-module problems on Windows.
(details)

Commit 3566873f21909397561418fab22c98332fa8b72a by tmunro
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3566873f2
1909397561418fab22c98332fa8b72a

Add PGTYPESchar_free() to avoid cross-module problems on Windows.
On Windows, it is sometimes important for corresponding malloc() and 
free() calls to be made from the same DLL, since some build options can 
result in multiple allocators being active at the same time.  For that 
reason we already provided PQfreemem().  This commit adds a similar 
function for freeing string results allocated by the pgtypes library.
Author: Takayuki Tsunakawa Reviewed-by: Kyotaro Horiguchi Discussion:
https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8AD5D6%40G01JPEXMBYT05

Thanks,
Regina



On 29 June 2018 at 21:06, Regina Obe <lr@pcorp.us> wrote:
> I think the exact query that is triggering it is this one though I don't
> have PostgreSQL 10 head installed locally to confirm.
>
> SELECT 't3280', 'L1b' || topology.TopoGeo_AddLinestring('bug3280', geom)
>  FROM bug3280.edge where edge_id = 1
>  ORDER BY 1;

Can you try in REL_10_STABLE? There was a fix committed [1] for
something very similar just a week ago.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a4c95b0b80c70677c09c0d5c82a6fba875160288

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


> On 29 June 2018 at 21:06, Regina Obe <lr@pcorp.us> wrote:
> > I think the exact query that is triggering it is this one though I
> > don't have PostgreSQL 10 head installed locally to confirm.
> >
> > SELECT 't3280', 'L1b' || topology.TopoGeo_AddLinestring('bug3280',
> > geom)  FROM bug3280.edge where edge_id = 1  ORDER BY 1;
>
> Can you try in REL_10_STABLE? There was a fix committed [1] for something
> very similar just a week ago.
>
> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a4c95b0b80
> c70677c09c0d5c82a6fba875160288
>
> --
>  David Rowley                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

Thanks for this input.  I guess I was mistaken about the commit that caused this
and it's most likely this one since this one fails as well.

Here is a trivial example to exercise that doesn't involve PostGIS.

CREATE TABLE edge AS SELECT 1 AS edge_id;
SELECT 't3280', 'L1b' || generate_series(1,10)
FROM edge
ORDER BY 1;


Now that said, I think I should change our test since it's using questionable assumptions anyway.
My only concern with this issue is that it's behavior that was allowed before in 10 a stable branch
and it's changing in a micro release.

I doubt few others would be bitten by though, but probably still worthwhile to put a note about it in the upcoming
releasenotes 
if this new behavior is kept.

Thanks,
Regina Obe




"Regina Obe" <lr@pcorp.us> writes:
> Here is a trivial example to exercise that doesn't involve PostGIS.

> CREATE TABLE edge AS SELECT 1 AS edge_id;
> SELECT 't3280', 'L1b' || generate_series(1,10) 
> FROM edge
> ORDER BY 1;

That's odd.  It works fine in HEAD but not the v10 branch.  I suspect
that the fix David mentioned failed to cover some case related to the
issue we saw last week ... but why the cross-version difference?

Don't have time to investigate further right now.

            regards, tom lane


> "Regina Obe" <lr@pcorp.us> writes:
> > Here is a trivial example to exercise that doesn't involve PostGIS.
> 
> > CREATE TABLE edge AS SELECT 1 AS edge_id; SELECT 't3280', 'L1b' ||
> > generate_series(1,10) FROM edge ORDER BY 1;
> 
> That's odd.  It works fine in HEAD but not the v10 branch.  I suspect that
the
> fix David mentioned failed to cover some case related to the issue we saw
> last week ... but why the cross-version difference?
> 
> Don't have time to investigate further right now.
> 
>             regards, tom lane

Yes, that concerned me too because our PostgreSQL 11 head is the first to
run in our battery of tests
So I would have expected any change in 10 would be committed in 11 as well
and 11 would fail too, but it didn't.

Thanks,
Regina



I wrote:
> "Regina Obe" <lr@pcorp.us> writes:
>> Here is a trivial example to exercise that doesn't involve PostGIS.
>> CREATE TABLE edge AS SELECT 1 AS edge_id;
>> SELECT 't3280', 'L1b' || generate_series(1,10)
>> FROM edge
>> ORDER BY 1;

> That's odd.  It works fine in HEAD but not the v10 branch.  I suspect
> that the fix David mentioned failed to cover some case related to the
> issue we saw last week ... but why the cross-version difference?

So the proximate cause of this problem is that we're applying a pathtarget
that contains the Const 't3280', marked with sortgroupref 1, to a seqscan
path.  We later conclude that we don't really care about the Const at the
seqscan level (it'll get generated in ProjectSet instead), so during
createplan.c, use_physical_tlist returns TRUE for the seqscan node.
That's fine ... but then create_scan_plan does this:

            /*
             * Transfer any sortgroupref data to the replacement tlist, unless
             * we don't care because the gating Result will handle it.
             */
            if (!gating_clauses)
                apply_pathtarget_labeling_to_tlist(tlist, best_path->pathtarget);

and apply_pathtarget_labeling_to_tlist chokes because there is noplace
to transfer the Const's sortgroupref label to.  So this is just wrong;
it should read more like the similar code in create_projection_plan:

        if ((flags & CP_LABEL_TLIST) != 0)
            apply_pathtarget_labeling_to_tlist(tlist, best_path->pathtarget);

But that's been wrong since this code was introduced in 9.6.  Why isn't
HEAD failing similarly?  Apparently, the planner.c code restructuring in
11cf92f6e, or possibly its predecessor e2f1eb0ee, changed the path tree
that's produced for this case.  Since the commit log entries for those
patches make no mention of intention to change the behavior for non
parallel and non partitioned cases, I would imagine that that's a bug in
itself.

I also think that there's some horribly unsafe coding in
apply_scanjoin_target_to_paths: it clobbers a RelOptInfo's reltarget,
with no thought for whether that might affect things elsewhere,
and it also clobbers individual paths' PathTargets like so:

       if (tlist_same_exprs)
           subpath->pathtarget->sortgrouprefs =
               scanjoin_target->sortgrouprefs;

with similarly little regard for whether those PathTarget structures are
shared with anything else.

I've run out of time to poke at this for today, though, so I do not have
test cases that prove this code is broken.

            regards, tom lane


On Sat, Jun 30, 2018 at 5:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I also think that there's some horribly unsafe coding in
> apply_scanjoin_target_to_paths: it clobbers a RelOptInfo's reltarget,
> with no thought for whether that might affect things elsewhere,
> and it also clobbers individual paths' PathTargets like so:
>
>        if (tlist_same_exprs)
>            subpath->pathtarget->sortgrouprefs =
>                scanjoin_target->sortgrouprefs;
>
> with similarly little regard for whether those PathTarget structures are
> shared with anything else.

Regarding this point specifically, I definitely was not thrilled about
that hack, but it was tricky to avoid having the tlist_same_exprs case
regress in terms of performance, so I didn't feel I could turn down
performance optimizations without some evidence of an actual problem.
For this to be a problem, it would have to be the case that somebody
builds a path, returns it, and then later mutates the sortgrouprefs
array in place.  Does that happen?  Off-hand I would have thought that
you'd have to apply the correct sortgroupref labeling to a path before
using it as a building block in some other path, just like it's not
really OK to change the cost of a path once you've passed it to
add_path() or add_partial_path().  Decisions might have already been
made based on the old information.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company