Re: An issue with missing rows - Mailing list pgsql-bugs

From Viktor Semykin
Subject Re: An issue with missing rows
Date
Msg-id CACaAQPUnj-aqm9aAFc+JXTRjxKw_bmMMbaE-0NnF28sq3iE1sg@mail.gmail.com
Whole thread Raw
In response to An issue with missing rows  (Viktor Semykin <thesame.ml@gmail.com>)
List pgsql-bugs
I'm sorry for the noise, my query just missing ORDER BY clause, my fault.

On Sat, Feb 13, 2021 at 6:01 PM Viktor Semykin <thesame.ml@gmail.com> wrote:
>
> I have faced a weird issue with PostgreSQL database in conjunction with
> nodejs+Sequelize. I saw this issue happening in 13.1 and 10.5. nodejs is 12.16.
> Please let me know if I have to re-format this email somehow for
> comfortable reading.
>
> I'll try to explain step by step what's going on. Sequelize
> library test suite includes a test that have been passing successfully until I
> broke it. After my changes, the following queries fail about once in 20 times.
>
> In a nutshell, my queries look like this (here and below all createdAt and
> updatedAt fields may be inaccurate):
>
> ```sql
>   CREATE SCHEMA IF NOT EXISTS archive;
>   DROP TABLE IF EXISTS "archive"."Tasks" CASCADE;
>   DROP TABLE IF EXISTS "archive"."Users" CASCADE;
>   CREATE TABLE IF NOT EXISTS "archive"."Users" ("id"   SERIAL ,
> "createdAt" TIMESTAMP WITH TIME ZONE NOT NULL, "updatedAt" TIMESTAMP
> WITH TIME ZONE NOT NULL, PRIMARY KEY ("id"));
>   CREATE TABLE IF NOT EXISTS "archive"."Tasks" ("id" INTEGER , "title"
> VARCHAR(255), "createdAt" TIMESTAMP WITH TIME ZONE NOT NULL,
> "updatedAt" TIMESTAMP WITH TIME ZONE NOT NULL, "UserId" INTEGER
> REFERENCES "archive"."Users" ("id") ON DELETE SET NULL ON UPDATE
> CASCADE, PRIMARY KEY ("id"));
>   INSERT INTO "archive"."Users" ("id","createdAt","updatedAt") VALUES
> (1,'...','...') RETURNING "id","createdAt","updatedAt";
>   INSERT INTO "archive"."Users" ("id","createdAt","updatedAt") VALUES
> (2,'...','...') RETURNING "id","createdAt","updatedAt";
>   INSERT INTO "archive"."Tasks"
> ("id","title","createdAt","updatedAt","UserId") VALUES
> (1,'b','...','...',1) RETURNING
> "id","title","createdAt","updatedAt","UserId";
>   INSERT INTO "archive"."Tasks"
> ("id","title","createdAt","updatedAt","UserId") VALUES
> (2,'d','...','...',1) RETURNING
> "id","title","createdAt","updatedAt","UserId";
>   INSERT INTO "archive"."Tasks"
> ("id","title","createdAt","updatedAt","UserId") VALUES
> (3,'c','...','...',1) RETURNING
> "id","title","createdAt","updatedAt","UserId";
>   INSERT INTO "archive"."Tasks"
> ("id","title","createdAt","updatedAt","UserId") VALUES
> (4,'a','...','...',1) RETURNING
> "id","title","createdAt","updatedAt","UserId";
>   INSERT INTO "archive"."Tasks"
> ("id","title","createdAt","updatedAt","UserId") VALUES
> (5,'a','...','...',2) RETURNING
> "id","title","createdAt","updatedAt","UserId";
>   INSERT INTO "archive"."Tasks"
> ("id","title","createdAt","updatedAt","UserId") VALUES
> (6,'c','...','...',2) RETURNING
> "id","title","createdAt","updatedAt","UserId";
>   INSERT INTO "archive"."Tasks"
> ("id","title","createdAt","updatedAt","UserId") VALUES
> (7,'b','...','...',2) RETURNING
> "id","title","createdAt","updatedAt","UserId";
>   SELECT "User"."id", "User"."createdAt", "User"."updatedAt" FROM
> "archive"."Users" AS "User" ORDER BY "User"."id" ASC;
>   SELECT "Task".* FROM (SELECT * FROM (SELECT "id", "title",
> "createdAt", "updatedAt", "UserId" FROM "archive"."Tasks" AS "Task"
> WHERE "Task"."UserId" = 1 ORDER BY "Task"."id" ASC LIMIT 2) AS sub
> UNION ALL SELECT * FROM (SELECT "id", "title", "createdAt",
> "updatedAt", "UserId" FROM "archive"."Tasks" AS "Task" WHERE
> "Task"."UserId" = 2 ORDER BY "Task"."id" ASC LIMIT 2) AS sub) AS
> "Task";
>   DROP SCHEMA IF EXISTS archive CASCADE;
> ```
>
> in other words, in instantiates a test data like this:
>
> ```
> sequelize_test=# select * from "archive"."Users";
>  id |           createdAt           |           updatedAt
> ----+-------------------------------+-------------------------------
>   1 | 2021-02-13 16:59:04.190643-03 | 2021-02-13 16:59:04.190643-03
>   2 | 2021-02-13 16:59:04.217696-03 | 2021-02-13 16:59:04.217696-03
> (2 rows)
>
> sequelize_test=# select * from "archive"."Tasks";
>  id | title |           createdAt           |           updatedAt
>      | UserId
> ----+-------+-------------------------------+-------------------------------+--------
>   1 | b     | 2021-02-13 16:59:04.222593-03 | 2021-02-13
> 16:59:04.222593-03 |      1
>   2 | d     | 2021-02-13 16:59:04.238252-03 | 2021-02-13
> 16:59:04.238252-03 |      1
>   3 | c     | 2021-02-13 16:59:04.24318-03  | 2021-02-13
> 16:59:04.24318-03  |      1
>   4 | a     | 2021-02-13 16:59:04.256812-03 | 2021-02-13
> 16:59:04.256812-03 |      1
>   5 | a     | 2021-02-13 16:59:04.262364-03 | 2021-02-13
> 16:59:04.262364-03 |      2
>   6 | c     | 2021-02-13 16:59:04.277525-03 | 2021-02-13
> 16:59:04.277525-03 |      2
>   7 | b     | 2021-02-13 16:59:04.281861-03 | 2021-02-13
> 16:59:04.281861-03 |      2
> ```
>
> Then I'm trying to fetch all Users, and for each one of them only two tasks
> ordered by "id". This is what my query (the Faulty Query) looks like:
>
> ```sql
> SELECT "User"."id",
>        "User"."createdAt",
>        "User"."updatedAt",
>        "tasks"."id" AS "tasks.id",
>        "tasks"."title" AS "tasks.title",
>        "tasks"."createdAt" AS "tasks.createdAt",
>        "tasks"."updatedAt" AS "tasks.updatedAt",
>        "tasks"."UserId" AS "tasks.UserId"
> FROM "archive"."Users" AS "User"
> LEFT OUTER JOIN LATERAL
>   (SELECT *
>    FROM "archive"."Tasks" AS "tasks"
>    WHERE "User"."id" = "tasks"."UserId"
>    LIMIT 2) AS "tasks" ON TRUE
> ORDER BY "User"."id" ASC;
> ```
>
> and most of the time it returns result as expected:
>
> ```
>  id |           createdAt           |           updatedAt           |
> tasks.id | tasks.title |        tasks.createdAt        |
> tasks.updatedAt        | tasks.UserId
>
----+-------------------------------+-------------------------------+----------+-------------+-------------------------------+-------------------------------+--------------
>   1 | 2021-02-13 16:59:04.190643-03 | 2021-02-13 16:59:04.190643-03 |
>       1 | b           | 2021-02-13 16:59:04.222593-03 | 2021-02-13
> 16:59:04.222593-03 | 1
>   1 | 2021-02-13 16:59:04.190643-03 | 2021-02-13 16:59:04.190643-03 |
>       2 | d           | 2021-02-13 16:59:04.238252-03 | 2021-02-13
> 16:59:04.238252-03 | 1
>   2 | 2021-02-13 16:59:04.217696-03 | 2021-02-13 16:59:04.217696-03 |
>       5 | a           | 2021-02-13 16:59:04.262364-03 | 2021-02-13
> 16:59:04.262364-03 | 2
>   2 | 2021-02-13 16:59:04.217696-03 | 2021-02-13 16:59:04.217696-03 |
>       6 | c           | 2021-02-13 16:59:04.277525-03 | 2021-02-13
> 16:59:04.277525-03 | 2
> ```
>
> note that Tasks ids are 1,2,5,6. But there are chances that it returns
> wrong tasks, for example 1,2,5,7. The problem is that whenever I add logging
> it works correctly 100% of the time.
>
> The good news I was lucky enough to catch the error with Wireshark. Sequelize
> uses connection pooling so the queries are spread across 5 separate
> tcp connections and
> Wireshark output is quite messy. I had to parse the output with an ad-hoc
> script that generates a visual graph of the underlying events. This graph is
> attached as graph.svg. It's handy to open it in the browser and zoom
> out.  Graphviz
> is not my strongest point, so let me briefly explain what this graph means.
> Each node is a pgsql-relevant network frame. Red edges denote TCP connections.
> Black arrows denote sequence order. The red nodes are what matters. Each frame
> begins with its sequence number and a direction (in/out). Body of each node
> shows the list of commands sent/received.
>
> Frames 1062-1084 are boring preludes. Then at 1086 comes the CREATE
> SCHEMA query.
> Then come some Sequelize noise like DROP TABLE IF EXISTS of tables that do not
> exist. Then in 1098 and 1105 it actually queries table creation.
> At 1109 it starts INSERTing data into "Users". Then the pool comes into play.
> At 1122 it spawns one more connection and at 1130 it starts INSERTing the second
> row into "Users". At 1142 and 1146 the Postgres reports both connections are
> "Ready for query". Then at 1148 it starts INSERTing a row into "Tasks" in the
> first connection, and at 1165 a row in the second connection.
> At 1180 the fun begins, it spawns three more connections and keeps INSERTing
> rows into "Tasks" using all the connections.
> Around 1263-1270 all the INSERT queries are executed and the connections
> reported "Ready for query".
>
> At 1273 it starts SELECTing the Faulty Query and 1275 returns the seem-to-be
> incorrect result. The Tasks rows in the response are
> `(1,'b'),(2,'d'),(5,'a'),(7,'b')`. It looks like (6,'c') haven't landed yet.
>
> So I come to the main question: did I find a bug or my assumptions about
> row visibility are just wrong?



pgsql-bugs by date:

Previous
From: Viktor Semykin
Date:
Subject: An issue with missing rows
Next
From: PG Bug reporting form
Date:
Subject: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term