Thread: BUG #17812: LOCK TABLE IN ACCESS EXCLUSIVE MODE with a view returns an empty tuple set

BUG #17812: LOCK TABLE IN ACCESS EXCLUSIVE MODE with a view returns an empty tuple set

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17812
Logged by:          Zhenglong Li
Email address:      smartkeyerror@gmail.com
PostgreSQL version: 15.2
Operating system:   Ubuntu 20.04
Description:

Step to reproduce the behavior

We need 2 sessions to reproduce this bug. 

Firstly, we create a simple view that just fetches all the data from
table1.

And then we start a transaction and lock table1 with AccessExclusive Mode in
Read Committed Transaction Isolation Level.

After that, we try to use CTAS to create a temp table table2 using the data
from table1 in session2, and this will be blocked by AccessExclusive Lock.

Finally, we insert some data into table1 in session1 and commit it, session2
will continue, but there is no data in table2.

```sql
--------[ Sessions 1 ]--------

show transaction_isolation;
 transaction_isolation 
-----------------------
 read committed

DROP TABLE IF EXISTS table1 CASCADE;

CREATE TABLE table1 (
    zahl    integer,
    upd_dat timestamp without time zone
);

CREATE OR REPLACE VIEW view1 as select zahl,upd_dat from table1;

BEGIN;
LOCK TABLE table1 IN ACCESS EXCLUSIVE MODE;
TRUNCATE TABLE table1;

--------[ Sessions 2 ]--------

show transaction_isolation;
 transaction_isolation 
-----------------------
 read committed

DROP TABLE IF EXISTS table2 CASCADE;

CREATE TEMP TABLE table2 AS select zahl,upd_dat from view1;
-- this will hang now waiting for a lock form session1

--------[ Sessions 1 ]--------
SELECT clock_timestamp();
INSERT INTO table1 SELECT i zahl,clock_timestamp() upd_dat FROM
generate_series(1,10) a(i);
COMMIT;

--------[ Sessions 2 ]--------
SELECT * FROM table2 limit 10;
 zahl | upd_dat 
------+---------
(0 rows)
```

 But when we used a table, there will have data:

```sql
--------[ Sessions 1 ]--------

DROP TABLE IF EXISTS table1 CASCADE;

CREATE TABLE table1 (
    zahl    integer,
    upd_dat timestamp without time zone
);

CREATE OR REPLACE VIEW view1 as select zahl,upd_dat from table1;

BEGIN;
LOCK TABLE table1 IN ACCESS EXCLUSIVE MODE;
TRUNCATE TABLE table1;

--------[ Sessions 2 ]--------

DROP TABLE IF EXISTS table2 CASCADE;

CREATE TEMP TABLE table2 AS select zahl,upd_dat from table1;
-- this will hang now waiting for a lock form session1

--------[ Sessions 1 ]--------
SELECT clock_timestamp();
INSERT INTO table1 SELECT i zahl,clock_timestamp() upd_dat FROM
generate_series(1,10) a(i);
COMMIT;

--------[ Sessions 2 ]--------
SELECT * FROM table2 limit 10;
 zahl |          upd_dat           
------+----------------------------
    1 | 2023-02-28 03:10:15.743068
    2 | 2023-02-28 03:10:15.743196
    3 | 2023-02-28 03:10:15.743203
    4 | 2023-02-28 03:10:15.743206
    5 | 2023-02-28 03:10:15.743208
    6 | 2023-02-28 03:10:15.74321
    7 | 2023-02-28 03:10:15.743212
    8 | 2023-02-28 03:10:15.743214
    9 | 2023-02-28 03:10:15.743216
   10 | 2023-02-28 03:10:15.743218
(10 rows)
```


Hi,

On Tue, Feb 28, 2023 at 03:12:38AM +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17812
> Logged by:          Zhenglong Li
> Email address:      smartkeyerror@gmail.com
> PostgreSQL version: 15.2
> Operating system:   Ubuntu 20.04
> Description:
>
> Step to reproduce the behavior
>
> We need 2 sessions to reproduce this bug.
>
> Firstly, we create a simple view that just fetches all the data from
> table1.
>
> And then we start a transaction and lock table1 with AccessExclusive Mode in
> Read Committed Transaction Isolation Level.
>
> After that, we try to use CTAS to create a temp table table2 using the data
> from table1 in session2, and this will be blocked by AccessExclusive Lock.
>
> Finally, we insert some data into table1 in session1 and commit it, session2
> will continue, but there is no data in table2.
>
> ```sql
> [...]
> TRUNCATE TABLE table1;
> [...]

This is not a bug, this is a documented corner case.  In postgres TRUNCATE is
not fully MVCC, see
https://www.postgresql.org/docs/current/sql-truncate.html and
https://www.postgresql.org/docs/current/mvcc-caveats.html:

"TRUNCATE is not MVCC-safe. After truncation, the table will appear empty to
concurrent transactions, if they are using a snapshot taken before the
truncation occurred".

I guess the difference between referencing the table rather than the view is
that the query get stuck at execution time rather than planning time, meaning
that you do get a snapshot older than the INSERT in the session 1.  Change the
TRUNCATE with a DELETE and you will get the same behavior for both cases.



But when I removed the TRUNCATE STATEMENT in session1, session2 will still not have data:

```sql
--------[ Sessions1 ]--------
DROP TABLE IF EXISTS table1 CASCADE;

CREATE TABLE table1 (
        zahl    integer,
        upd_dat timestamp without time zone
);

CREATE OR REPLACE VIEW view1 as select zahl,upd_dat from table1;

BEGIN;
LOCK TABLE table1 IN ACCESS EXCLUSIVE MODE;

--------[ Sessions 2 ]--------
DROP TABLE IF EXISTS table2 CASCADE;

CREATE TEMP TABLE table2 AS select zahl,upd_dat from view1;
-- this will hang now waiting for a lock form session1

--------[ Sessions 1 ]--------
INSERT INTO table1 SELECT i zahl,clock_timestamp() upd_dat FROM generate_series(1,10) a(i);
COMMIT;

--------[ Sessions 2 ]--------
SELECT * FROM table2 limit 10;
 zahl | upd_dat
------+---------
(0 rows)
```

So I think we can not blame the TRUNCATE is not fully MVCC.

Julien Rouhaud <rjuju123@gmail.com> 于2023年2月28日周二 21:52写道:
Hi,

On Tue, Feb 28, 2023 at 03:12:38AM +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17812
> Logged by:          Zhenglong Li
> Email address:      smartkeyerror@gmail.com
> PostgreSQL version: 15.2
> Operating system:   Ubuntu 20.04
> Description:
>
> Step to reproduce the behavior
>
> We need 2 sessions to reproduce this bug.
>
> Firstly, we create a simple view that just fetches all the data from
> table1.
>
> And then we start a transaction and lock table1 with AccessExclusive Mode in
> Read Committed Transaction Isolation Level.
>
> After that, we try to use CTAS to create a temp table table2 using the data
> from table1 in session2, and this will be blocked by AccessExclusive Lock.
>
> Finally, we insert some data into table1 in session1 and commit it, session2
> will continue, but there is no data in table2.
>
> ```sql
> [...]
> TRUNCATE TABLE table1;
> [...]

This is not a bug, this is a documented corner case.  In postgres TRUNCATE is
not fully MVCC, see
https://www.postgresql.org/docs/current/sql-truncate.html and
https://www.postgresql.org/docs/current/mvcc-caveats.html:

"TRUNCATE is not MVCC-safe. After truncation, the table will appear empty to
concurrent transactions, if they are using a snapshot taken before the
truncation occurred".

I guess the difference between referencing the table rather than the view is
that the query get stuck at execution time rather than planning time, meaning
that you do get a snapshot older than the INSERT in the session 1.  Change the
TRUNCATE with a DELETE and you will get the same behavior for both cases.
On Wed, Mar 01, 2023 at 10:01:29AM +0800, Keyerror Smart wrote:
> But when I removed the TRUNCATE STATEMENT in session1, session2 will still
> not have data:
>
> ```sql
> --------[ Sessions1 ]--------
> DROP TABLE IF EXISTS table1 CASCADE;
>
> CREATE TABLE table1 (
>         zahl    integer,
>         upd_dat timestamp without time zone
> );
>
> CREATE OR REPLACE VIEW view1 as select zahl,upd_dat from table1;
>
> BEGIN;
> LOCK TABLE table1 IN ACCESS EXCLUSIVE MODE;
>
> --------[ Sessions 2 ]--------
> DROP TABLE IF EXISTS table2 CASCADE;
>
> CREATE TEMP TABLE table2 AS select zahl,upd_dat from view1;
> -- this will hang now waiting for a lock form session1
>
> --------[ Sessions 1 ]--------
> INSERT INTO table1 SELECT i zahl,clock_timestamp() upd_dat FROM
> generate_series(1,10) a(i);
> COMMIT;
>
> --------[ Sessions 2 ]--------
> SELECT * FROM table2 limit 10;
>  zahl | upd_dat
> ------+---------
> (0 rows)
> ```
>
> So I think we can not blame the TRUNCATE is not fully MVCC.

Right, I was testing with a plain SELECT instead of CREATE TABLE AS.

This is however again due to when the lock is actually conflicting, leading to
different snapshot acquisition time.

For CTAS referencing the view, the lock is conflicting during execution, so it
has a snapshot that sees the data as they were before that transaction, as
required by ACID rules.

The CTAS referencing the table conflicts during parse analysis, so when the
lock is finally acquired and the parse analysis is done, it then acquires a new
snapshot for execution, which now sees the transaction as committed and thus
the newly added rows.

I still don't think this is a bug, however I'm not sure if/where those details
are documented.



OK, I get it. Thanks for your explanation.
On Tue, Feb 28, 2023 at 8:02 PM Keyerror Smart <smartkeyerror@gmail.com> wrote:
OK, I get it. Thanks for your explanation.

You may find the following to be informative as well.


In short this seems to fall into "maybe we can do better in the future" - though it seems that future has not yet come to fruition.

To my knowledge this dynamic isn't called out anywhere though it seems almost obvious once pointed out.

There still seems to be something off here:

unfortunately it causes execution to use a snapshot that has been acquired before locking any of the tables mentioned in the query

I went down the path that the planner should be the one realizing that the view-using query has a lock it cannot grab - on the rewritten query's introduction of the new table - and it should have blocked in the planner, resulting in the acquisition of the execution-time snapshot in that case as well.

Apparently the planner gets to bypass a check (that blocks) that parse analysis has to perform.  Thus locking of tables does happen for the ones named in the query text itself?

David J.

"David G. Johnston" <david.g.johnston@gmail.com> writes:
> You may find the following to be informative as well.

A general comment on these examples is that we don't intend MVCC
guarantees to hold with respect to tables that are created/dropped
by other sessions midway through your transaction.  The reason can
be understood from this example:

Session 1:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM table1;  -- establishes session1's snapshot

Session 2:
DROP TABLE table2;
CREATE TABLE table2 ...

Session 1:
SELECT * FROM table2;

Strict MVCC semantics would require that session 1 now see
the old contents of table2, which implies basically that we can't
ever drop tables --- at the point of the DROP, there was absolutely
nothing indicating that session 1 would take any interest in table2,
but after the fact it wants to know about it.  That's not how it
works; if DROP is able to obtain exclusive lock on table2, that
table is toast, immediately.

(I think that if you use SERIALIZABLE level, you might get a serialization
failure from the "SELECT * FROM table2", or if you don't then perhaps
it'd be reasonable to fix it so you do.  But under no circumstances
are we going to sit on the data in table2 on the off chance that
some transaction might want it later.)

Like Julien, I'm not entirely sure how well this is documented.
There is plenty of detail in [1] though.

            regards, tom lane

[1] https://www.postgresql.org/docs/current/mvcc.html



On Tue, Feb 28, 2023 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> You may find the following to be informative as well.

A general comment on these examples is that we don't intend MVCC
guarantees to hold with respect to tables that are created/dropped
by other sessions midway through your transaction.

That isn't what is in question here though...but also my comment regarding planning seems incomplete...

table1 and view1 already exist, table1 is empty
view1 is select * from table1

Given that Session 1 does:
BEGIN;
LOCK TABLE table1 IN ACCESS EXCLUSIVE MODE;
-- sessions 2-5 now issue their commands
INSERT INTO table1 SELECT i zahl,clock_timestamp() upd_dat FROM generate_series(1,10) a(i);
COMMIT;

The command and result from Sessions 2-5 are below:

Session 2:
select * from view1;
10 Rows

Session 3:
select * from table1;
10 Rows

Session 4:
create table tbl2 as select * from view1;
SELECT 0

Session 5:
create table tbl3 as select * from table1;
SELECT 10

Why is it OK for session 4 to be different here?

The argument that it got blocked after getting an execution snapshot that doesn't include the insertion into table1, when all the other cases were apparently blocked somewhere before then, doesn't sit well.  In particular, the difference between it (Session 4) and Session 2.  That one sent the result rows to a newly created temporary table and the other to the client doesn't seem like it should be affecting/affected-by MVCC.

It is unclear to me whether you were instead talking about other sessions dropping tables as another way of saying "ACCESS EXCLUSIVE" in which case at what lock level should this anomaly go away, and does it? (I haven't checked).

David J.

Sessions 2 and 4:

postgres=# select * from view1;
 zahl |          upd_dat
------+----------------------------
    1 | 2023-03-01 04:33:28.628112
    2 | 2023-03-01 04:33:28.628227
    3 | 2023-03-01 04:33:28.628229
    4 | 2023-03-01 04:33:28.62823
    5 | 2023-03-01 04:33:28.62823
    6 | 2023-03-01 04:33:28.62823
    7 | 2023-03-01 04:33:28.628231
    8 | 2023-03-01 04:33:28.628231
    9 | 2023-03-01 04:33:28.628232
postgres=# create table tbl2 as select * from view1;
SELECT 0
postgres=# select * from tbl2;
 zahl | upd_dat
------+---------
(0 rows)

Sessions 3 and 5:

postgres=# select * from table1;
 zahl |          upd_dat
------+----------------------------
    1 | 2023-03-01 04:33:28.628112
    2 | 2023-03-01 04:33:28.628227
    3 | 2023-03-01 04:33:28.628229
    4 | 2023-03-01 04:33:28.62823
    5 | 2023-03-01 04:33:28.62823
    6 | 2023-03-01 04:33:28.62823
    7 | 2023-03-01 04:33:28.628231
    8 | 2023-03-01 04:33:28.628231
postgres=# create table tbl3 as select * from table1;
SELECT 10
postgres=# select * from tbl3;
 zahl |          upd_dat
------+----------------------------
    1 | 2023-03-01 04:36:00.762788
    2 | 2023-03-01 04:36:00.762911
    3 | 2023-03-01 04:36:00.762913
    4 | 2023-03-01 04:36:00.762914
    5 | 2023-03-01 04:36:00.762915
    6 | 2023-03-01 04:36:00.762915
    7 | 2023-03-01 04:36:00.762915
    8 | 2023-03-01 04:36:00.762916

"David G. Johnston" <david.g.johnston@gmail.com> writes:
> Session 4:
> create table tbl2 as select * from view1;
> SELECT 0

> Why is it OK for session 4 to be different here?

Maybe it isn't.  The code flow for CREATE TABLE AS is a bit weird
IIRC, and maybe it's missing a step where we should update the
active snapshot.

> It is unclear to me whether you were instead talking about other sessions
> dropping tables as another way of saying "ACCESS EXCLUSIVE" in which case
> at what lock level should this anomaly go away, and does it?

The originally-proposed tests seemed to all involve either TRUNCATE
or DROP TABLE, which are outside what I consider to be our MVCC
guarantees.  Your example here does seem a bit strange though.
Views generally ought not have different semantics from writing
out the view query in-line.

            regards, tom lane



On Wed, Mar 01, 2023 at 01:53:22AM -0500, Tom Lane wrote:
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
> > Session 4:
> > create table tbl2 as select * from view1;
> > SELECT 0
> 
> > Why is it OK for session 4 to be different here?
> 
> Maybe it isn't.  The code flow for CREATE TABLE AS is a bit weird
> IIRC, and maybe it's missing a step where we should update the
> active snapshot.

I think it comes from this chunk in ExecCreateTableAs():

        rewritten = QueryRewrite(query);

        /* SELECT should never rewrite to more or less than one SELECT query */
        if (list_length(rewritten) != 1)
            elog(ERROR, "unexpected rewrite result for %s",
                 is_matview ? "CREATE MATERIALIZED VIEW" :
                 "CREATE TABLE AS SELECT");
        query = linitial_node(Query, rewritten);
        Assert(query->commandType == CMD_SELECT);

        /* plan the query */
        plan = pg_plan_query(query, pstate->p_sourcetext,
                             CURSOR_OPT_PARALLEL_OK, params);

        /*
         * Use a snapshot with an updated command ID to ensure this query sees
         * results of any previously executed queries.  (This could only
         * matter if the planner executed an allegedly-stable function that
         * changed the database contents, but let's do it anyway to be
         * parallel to the EXPLAIN code path.)
         */
        PushCopiedSnapshot(GetActiveSnapshot());
        UpdateActiveSnapshotCommandId();

        /* Create a QueryDesc, redirecting output to our tuple receiver */
        queryDesc = CreateQueryDesc(plan, pstate->p_sourcetext,
                                    GetActiveSnapshot(), InvalidSnapshot,
                                    dest, params, queryEnv, 0);


It seems to clearly be in contradiction with our usual rule that planning and
execution should have a different snapshot.





On Tuesday, February 28, 2023, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:

> It is unclear to me whether you were instead talking about other sessions
> dropping tables as another way of saying "ACCESS EXCLUSIVE" in which case
> at what lock level should this anomaly go away, and does it?

The originally-proposed tests seemed to all involve either TRUNCATE
or DROP TABLE,

The only purpose of those was to produce a self-contained script that could be run repeatedly.  The commands were outside the transaction flow demonstrating the issue.

David J.
 
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Wed, Mar 01, 2023 at 01:53:22AM -0500, Tom Lane wrote:
>> Maybe it isn't.  The code flow for CREATE TABLE AS is a bit weird
>> IIRC, and maybe it's missing a step where we should update the
>> active snapshot.

> I think it comes from this chunk in ExecCreateTableAs():

Hm.  There are a lot of places that do this:

>         PushCopiedSnapshot(GetActiveSnapshot());
>         UpdateActiveSnapshotCommandId();

rather than

        PushActiveSnapshot(GetTransactionSnapshot());

which is what would have the effect of noticing changes from other
sessions.  Digging into the history of that, I found this commit:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL9_1_BR [c0b007603] 2011-02-28 23:28:06 -0500

    Rearrange snapshot handling to make rule expansion more consistent.

    With this patch, portals, SQL functions, and SPI all agree that there
    should be only a CommandCounterIncrement between the queries that are
    generated from a single SQL command by rule expansion.  Fetching a whole
    new snapshot now happens only between original queries.  This is equivalent
    to the existing behavior of EXPLAIN ANALYZE, and it was judged to be the
    best choice since it eliminates one source of concurrency hazards for
    rules.  The patch should also make things marginally faster by reducing the
    number of snapshot push/pop operations.

So I guess it's intended behavior that we don't notice other-session
changes between steps of a single command.  Whether that rule should
apply to CREATE TABLE AS is perhaps debatable --- but I see several
other places that are doing it exactly like this, so it's not like
CREATE TABLE AS is alone in its folly.

I'm pretty hesitant to change this without substantial thought.

            regards, tom lane



On Wed, Mar 1, 2023 at 8:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Wed, Mar 01, 2023 at 01:53:22AM -0500, Tom Lane wrote:
>> Maybe it isn't.  The code flow for CREATE TABLE AS is a bit weird
>> IIRC, and maybe it's missing a step where we should update the
>> active snapshot.

> I think it comes from this chunk in ExecCreateTableAs():

Hm.  There are a lot of places that do this:

>               PushCopiedSnapshot(GetActiveSnapshot());
>               UpdateActiveSnapshotCommandId();

rather than

                PushActiveSnapshot(GetTransactionSnapshot());

which is what would have the effect of noticing changes from other
sessions.  Digging into the history of that, I found this commit:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL9_1_BR [c0b007603] 2011-02-28 23:28:06 -0500

    Rearrange snapshot handling to make rule expansion more consistent.

    With this patch, portals, SQL functions, and SPI all agree that there
    should be only a CommandCounterIncrement between the queries that are
    generated from a single SQL command by rule expansion.
[...] 

So I guess it's intended behavior that we don't notice other-session
changes between steps of a single command.  Whether that rule should
apply to CREATE TABLE AS is perhaps debatable --- but I see several
other places that are doing it exactly like this, so it's not like
CREATE TABLE AS is alone in its folly.

I'm pretty hesitant to change this without substantial thought.


So both Sessions 4 (view) and 5 (table) make it into tcop/postgres.c and into parse analysis.

The view session gets past there because it is a utility command whose body is parsable.  postgres.c tcop invokes the executor (with a new snapshot but still of the empty table) for the utility command CTAS passing the parsable query along.  CTAS performs rewrite and blocks there while holding the empty table snapshot, and when it continues on to execution CTAS doesn't pull forward the snapshot.

The table session fails to get past parse analysis while the lock is held and once it does postgres.c tcop assigns a new snapshot which does contain the data in table1, then invokes the CTAS utility.

To get parity in behavior either CTAS would need to assign a new snapshot OR postgres.c tcop would need to perform rewriting on the query under the CTAS prior to beginning execution of CTAS, since that rewriting would then block the process before the new snapshot is assigned. (That is all in a mostly questioning, not authoritative, tone

The non-CTAS view query gets blocked on rewrite while the non-CTAS table query gets blocked in parse analysis.  When unblocked, postgres.c tcop assigns both a new snapshot just prior to execution.

I experimented with putting the create temp table into a function body then calling the function - same results.

I agree this does seem like a poor risk/reward on the fixing side, especially absent a concrete live use case problem.  I am curious what led to this discovery.

In terms of documentation; "query began" is the terminology we use but as demonstrated here, from an end-user's perspective, and in the presence of locking, the point at which the query began seems undefined.  If it was when I send the statement to the server then none of the four sessions should be producing results, instead I get three that seem to violate the read-committed rule.  Which moves expectations to "begins retrieving rows" in which case the CTAS+View combination, getting stuck in between, when the CTAS+table one doesn't, indeed seems odd and maybe should at least be addressed, even in general terms.

David J.