Re: BUG #17364: Errors caused by views - Mailing list pgsql-bugs

From: Tom Lane
Subject: Re: BUG #17364: Errors caused by views
Msg-id: 3011206.1642010290@sss.pgh.pa.us
In response to: BUG #17364: Errors caused by views (PG Bug reporting form <noreply@postgresql.org>)
List: pgsql-bugs
PG Bug reporting form <noreply@postgresql.org> writes:
> I run the following SQL statements on postgres13.2. They are simple, but an
> error occurs.

> create table test1 (id int, c_data varchar);
> insert into test1 values (1, '20201101');
> insert into test1 values (2, 'error');

> create table test2(id int, status varchar);
> insert into test2 values (1, 'on');
> insert into test2 values (2, 'off');

> create view test_view as
> SELECT test1.c_data
> from test1 LEFT JOIN test2
> ON (test2.id = test1.id)
> WHERE (test2.status)::text = 'on'::text;

> select * from test_view;
> c_data
> ----------
> 20201101
> (1 row)

> select * from
> (select to_date(c_data, 'YYYYMMDD') as date
> from test_view) t1
> where t1.date >= '2020-10-01'::timestamp;
> ERROR: invalid value "erro" for "YYYY"
> DETAIL: Value must be an integer.

The reason this is failing is that the outer WHERE clause is being
pushed down to the scan of test1, so it's applied before the join
occurs, and then to_date spits up on the "error" value.  You can see
that in EXPLAIN:

=# explain select * from
(select to_date(c_data, 'YYYYMMDD') as date
from test_view) t1
where t1.date >= '2020-10-01'::timestamp;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=25.95..56.75 rows=13 width=4)
   Hash Cond: (test1.id = test2.id)
   ->  Seq Scan on test1  (cost=0.00..29.05 rows=423 width=36)
         Filter: (to_date((c_data)::text, 'YYYYMMDD'::text) >= '2020-10-01 00:00:00'::timestamp without time zone)
   ->  Hash  (cost=25.88..25.88 rows=6 width=4)
         ->  Seq Scan on test2  (cost=0.00..25.88 rows=6 width=4)
               Filter: ((status)::text = 'on'::text)
(7 rows)

That's not something we're going to consider a bug, because if we
prevented the push-down from happening it would be quite disastrous
for the performance of many queries.

There are a couple of things you can do to control it yourself,
though:

* Insert "OFFSET 0" as an optimization fence:

=# explain select * from
(select to_date(c_data, 'YYYYMMDD') as date
from test_view offset 0) t1
where t1.date >= '2020-10-01'::timestamp;
                                QUERY PLAN
---------------------------------------------------------------------------
 Subquery Scan on t1  (cost=25.95..54.36 rows=13 width=4)
   Filter: (t1.date >= '2020-10-01 00:00:00'::timestamp without time zone)
   ->  Hash Join  (cost=25.95..53.89 rows=38 width=4)
         Hash Cond: (test1.id = test2.id)
         ->  Seq Scan on test1  (cost=0.00..22.70 rows=1270 width=36)
         ->  Hash  (cost=25.88..25.88 rows=6 width=4)
               ->  Seq Scan on test2  (cost=0.00..25.88 rows=6 width=4)
                     Filter: ((status)::text = 'on'::text)
(8 rows)

OFFSET 0 in the view definition would serve too, I think,
but I didn't try it.
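
For the record, that untried variant would be along these lines
(a sketch only; the OFFSET 0 becomes part of the view itself, so
every query on the view pays the de-optimization price):

create or replace view test_view as
SELECT test1.c_data
from test1 LEFT JOIN test2
ON (test2.id = test1.id)
WHERE (test2.status)::text = 'on'::text
offset 0;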

* Wrap the possibly-error-causing function in a VOLATILE wrapper function:

=# create or replace function mytodate(varchar) returns date as $$
select to_date($1, 'YYYYMMDD')$$ language sql volatile;
CREATE FUNCTION
=# explain verbose select * from
(select mytodate(c_data) as date
from test_view) t1
where t1.date >= '2020-10-01'::timestamp;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Subquery Scan on t1  (cost=25.95..54.36 rows=13 width=4)
   Output: t1.date
   Filter: (t1.date >= '2020-10-01 00:00:00'::timestamp without time zone)
   ->  Hash Join  (cost=25.95..53.89 rows=38 width=4)
         Output: to_date((test1.c_data)::text, 'YYYYMMDD'::text)
         Hash Cond: (test1.id = test2.id)
         ->  Seq Scan on public.test1  (cost=0.00..22.70 rows=1270 width=36)
               Output: test1.id, test1.c_data
         ->  Hash  (cost=25.88..25.88 rows=6 width=4)
               Output: test2.id
               ->  Seq Scan on public.test2  (cost=0.00..25.88 rows=6 width=4)
                     Output: test2.id
                     Filter: ((test2.status)::text = 'on'::text)
(13 rows)

The presence of a volatile function in the subquery's output list
has about the same effect as OFFSET 0, preventing the WHERE
condition from being pushed down past it.

In both cases you're significantly de-optimizing your query, so
these answers might not be very satisfactory if you have a large
amount of data to work with.  I'd really suggest reconsidering
your data design, or making the query a bit less fragile, perhaps
with a sub-select expression along the lines of

case when c_data ~ '^\d+$' then to_date(c_data, 'YYYYMMDD') else null end
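
That is, something like (a sketch; tighten the regex to match
whatever your valid date strings actually look like):

select * from
(select case when c_data ~ '^\d+$'
             then to_date(c_data, 'YYYYMMDD')
             else null end as date
 from test_view) t1
where t1.date >= '2020-10-01'::timestamp;

That stays safe no matter where the planner evaluates the
expression, because the pattern test guards the to_date() call
within the same CASE: rows like 'error' yield NULL instead of
an error, and NULL fails the WHERE test.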

            regards, tom lane
