Hello.
A few things I have found so far:
1) Ordering by random() is not required to reproduce the issue. It
can also be triggered with queries like:
BEGIN;
SELECT omg.*
FROM something_is_wrong_here AS omg
ORDER BY value -- change is here
LIMIT 1
FOR UPDATE
\gset
UPDATE something_is_wrong_here SET value = :value + 1 WHERE id = :id;
COMMIT;
But for some reason it is harder to reproduce without random() in my
case (I typically need to wait about a minute with 100 connections).
2) The problem is not tied to table creation time. The issue is reproducible
if vacuum_defer_cleanup_age is set after the table has been prepared.
3) To reproduce the issue, vacuum_defer_cleanup_age has to flip the xid
over zero (that is, be >= txid_current()).
The threshold is stable. For example, I am unable to reproduce with a
value of 733, but 734 gives an error every time.
Just a single additional txid_current() call (after the data is filled)
makes the crash go away. It looks like the first SELECT FOR UPDATE + UPDATE
silently poisons everything somehow.
You can use a psql script like this:
DROP TABLE IF EXISTS something_is_wrong_here;
CREATE TABLE something_is_wrong_here (
    id bigserial PRIMARY KEY,
    value numeric(15,4) DEFAULT 0 NOT NULL
);
INSERT INTO something_is_wrong_here (value)
    (SELECT 10000 FROM generate_series(0, 100));
-- Set vacuum_defer_cleanup_age to one more than the current xid.
SELECT txid_current() \gset
SELECT :txid_current + 1 AS txid \gset
ALTER SYSTEM SET vacuum_defer_cleanup_age TO :txid;
SELECT pg_reload_conf();
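To confirm that the condition from point 3 holds after the reload, a check along
these lines might help. It is only a sketch. It reads the next xid from a
snapshot instead of calling txid_current(), on the assumption that taking a
snapshot does not assign a new xid and so is less likely to disturb the
reproduction. I have not verified that it leaves the reproduction untouched.
The column names are made up for illustration, and the reloaded value may take
a moment to become visible in the session.

```sql
-- Sketch only: compare the configured vacuum_defer_cleanup_age with the next
-- xid to be assigned, without consuming an xid via txid_current().
SELECT current_setting('vacuum_defer_cleanup_age')::bigint AS defer_age,
       txid_snapshot_xmax(txid_current_snapshot())         AS next_xid,
       current_setting('vacuum_defer_cleanup_age')::bigint
           >= txid_snapshot_xmax(txid_current_snapshot())  AS flips_over_zero;
```

If flips_over_zero is true, the setting is at or above the next xid, which is
the situation described in point 3.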
I have attached some scripts in case someone wants to reproduce this.
Best regards,
Michail.