Re: AIO v2.5 - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: AIO v2.5
Date
Msg-id 6a6756a0-bad5-4ee5-a591-c763af289940@gmail.com
Whole thread Raw
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO v2.5
List pgsql-hackers
Hello Andres,

07.04.2025 19:20, Andres Freund wrote:
iteration 8: Sun Apr  6 19:22:39 UTC 2025
installchecks finished: Sun Apr  6 19:23:47 UTC 2025
2025-04-06 19:22:44.216 UTC [349525] LOG:  could not read blocks 0..0 in file "base/6179194/2606": Operation canceled
2025-04-06 19:22:44.216 UTC [349525] ERROR:  could not read blocks 0..0 in file "base/6179194/2606": Operation canceled
Thanks for the report, clearly something isn't right.

Thank you for your attention to it!
It's reproduced better on tmpfs for me; probably you would need to increase
NUM_INSTALLCHECKS/NUM_ITERATIONS for your machine.
I ran it for a while in a VM, it hasn't triggered yet. Neither on xfs nor on
tmpfs.

Before sharing the script I tested it on two my machines, but I had
anticipated that the error can be hard to reproduce. Will try to reduce
the reproducer...

server.log contains:
2025-04-06 19:22:44.215 UTC [38231] LOG:  checkpoint complete: wrote ...
2025-04-06 19:22:44.216 UTC [38231] LOG:  checkpoint starting: immediate force wait flush-all
2025-04-06 19:22:44.216 UTC [349525] LOG:  could not read blocks 0..0 in file "base/6179194/2606": Operation canceled
2025-04-06 19:22:44.216 UTC [349525] STATEMENT:  alter table parted_copytest
attach partition parted_copytest_a1 for values in(1);
2025-04-06 19:22:44.216 UTC [349525] ERROR:  could not read blocks 0..0 in file "base/6179194/2606": Operation canceled
2025-04-06 19:22:44.216 UTC [349525] STATEMENT:  alter table parted_copytest
attach partition parted_copytest_a1 for values in(1);
Hm. Does the failure vary between occurrences?
- is it always the same statement? Probably not?
- is it always 2606 (i.e. pg_constraint)?
- does the failure always happen around a checkpoint? If so, is it always  immediate?
- I do assume it's always ECANCELED?

I've got three such failures with three runs of of the script to
answer your questions:
...
iteration 6: Mon Apr  7 18:08:35 UTC 2025
installchecks finished: Mon Apr  7 18:09:45 UTC 2025
2025-04-07 18:08:45.369 UTC [1941260] LOG:  could not read blocks 7..7 in file "base/4448730/1255": Operation canceled
2025-04-07 18:08:45.369 UTC [1941260] ERROR:  could not read blocks 7..7 in file "base/4448730/1255": Operation canceled
server.log:
...
2025-04-07 18:08:44.120 UTC [1713945] LOG:  checkpoint starting: wal

2025-04-07 18:08:45.369 UTC [1941260] LOG:  could not read blocks 7..7 in file "base/4448730/1255": Operation canceled
2025-04-07 18:08:45.369 UTC [1941260] STATEMENT:  SELECT routine_name, ordinal_position, parameter_name, parameter_default
        FROM information_schema.parameters JOIN information_schema.routines USING (specific_schema, specific_name)
        WHERE routine_schema = 'temp_func_test' AND routine_name ~ '^functest_is_'
        ORDER BY 1, 2;
2025-04-07 18:08:45.369 UTC [1941260] ERROR:  could not read blocks 7..7 in file "base/4448730/1255": Operation canceled
2025-04-07 18:08:45.369 UTC [1941260] STATEMENT:  SELECT routine_name, ordinal_position, parameter_name, parameter_default
        FROM information_schema.parameters JOIN information_schema.routines USING (specific_schema, specific_name)
        WHERE routine_schema = 'temp_func_test' AND routine_name ~ '^functest_is_'
        ORDER BY 1, 2;
...
2025-04-07 18:08:51.836 UTC [1713945] LOG:  checkpoint complete: wrote 1558 buffers (9.5%), wrote 22 SLRU buffers; 0 WAL file(s) added, 0 removed, 33 recycled; write=7.544 s, sync=0.019 s, total=7.720 s; sync files=0, longest=0.000 s, average=0.000 s; distance=533837 kB, estimate=533837 kB; lsn=C/E532D478, redo lsn=C/C7054358
...

---
iteration 8: Mon Apr  7 18:26:47 UTC 2025
installchecks finished: Mon Apr  7 18:27:59 UTC 2025
2025-04-07 18:26:53.252 UTC [2359398] LOG:  could not read blocks 0..12 in file "base/4/1255": Operation canceled
2025-04-07 18:26:53.255 UTC [2359398] ERROR:  could not read blocks 0..12 in file "base/4/1255": Operation canceled
server.log:
...
2025-04-07 18:26:53.249 UTC [2048053] LOG:  checkpoint complete: wrote 3061 buffers (18.7%), wrote 3 SLRU buffers; 0 WAL file(s) added, 2 removed, 0 recycled; write=0.051 s, sync=0.002 s, total=0.068 s; sync files=0, longest=0.000 s, average=0.000 s; distance=18937 kB, estimate=120161 kB; lsn=11/6D42ED30, redo lsn=11/6C17AC30
2025-04-07 18:26:53.251 UTC [2359308] WARNING:  "most_common_elems" must be specified when "most_common_elem_freqs" is specified
2025-04-07 18:26:53.251 UTC [2359351] WARNING:  "elem_count_histogram" array cannot contain NULL values
2025-04-07 18:26:53.252 UTC [2359375] ERROR:  cannot modify statistics for relation "testseq"
2025-04-07 18:26:53.252 UTC [2359375] DETAIL:  This operation is not supported for sequences.
2025-04-07 18:26:53.252 UTC [2359375] STATEMENT:  SELECT pg_catalog.pg_restore_relation_stats(
            'schemaname', 'stats_import',
            'relname', 'testseq');
2025-04-07 18:26:53.252 UTC [2359375] ERROR:  cannot modify statistics for relation "testseq"
2025-04-07 18:26:53.252 UTC [2359375] DETAIL:  This operation is not supported for sequences.
2025-04-07 18:26:53.252 UTC [2359375] STATEMENT:  SELECT pg_catalog.pg_clear_relation_stats(schemaname => 'stats_import', relname => 'testseq');
2025-04-07 18:26:53.252 UTC [2359398] LOG:  could not read blocks 0..12 in file "base/4/1255": Operation canceled
2025-04-07 18:26:53.252 UTC [2359398] STATEMENT:  CREATE DATABASE regress020_tbd
        ENCODING utf8 LC_COLLATE "C" LC_CTYPE "C" TEMPLATE template0;
2025-04-07 18:26:53.255 UTC [2359449] ERROR:  invalid input syntax for type pg_snapshot: "31:12:" at character 8
2025-04-07 18:26:53.255 UTC [2359449] STATEMENT:  select '31:12:'::pg_snapshot;
2025-04-07 18:26:53.255 UTC [2359449] ERROR:  invalid input syntax for type pg_snapshot: "0:1:" at character 8
2025-04-07 18:26:53.255 UTC [2359449] STATEMENT:  select '0:1:'::pg_snapshot;
2025-04-07 18:26:53.255 UTC [2359449] ERROR:  invalid input syntax for type pg_snapshot: "12:13:0" at character 8
2025-04-07 18:26:53.255 UTC [2359449] STATEMENT:  select '12:13:0'::pg_snapshot;
2025-04-07 18:26:53.255 UTC [2359449] ERROR:  invalid input syntax for type pg_snapshot: "12:16:14,13" at character 8
2025-04-07 18:26:53.255 UTC [2359449] STATEMENT:  select '12:16:14,13'::pg_snapshot;
2025-04-07 18:26:53.255 UTC [2359398] ERROR:  could not read blocks 0..12 in file "base/4/1255": Operation canceled
2025-04-07 18:26:53.255 UTC [2359398] STATEMENT:  CREATE DATABASE regress020_tbd
        ENCODING utf8 LC_COLLATE "C" LC_CTYPE "C" TEMPLATE template0;
...
---
iteration 2: Mon Apr  7 19:02:57 UTC 2025
installchecks finished: Mon Apr  7 19:04:07 UTC 2025
2025-04-07 19:03:02.133 UTC [3537488] LOG:  could not read blocks 0..0 in file "base/902183/2607": Operation canceled
2025-04-07 19:03:02.133 UTC [3537488] ERROR:  could not read blocks 0..0 in file "base/902183/2607": Operation canceled
server.log:
2025-04-07 19:03:02.129 UTC [3487651] LOG:  checkpoint complete: wrote 337 buffers (2.1%), wrote 2 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.013 s, sync=0.001 s, total=0.024 s; sync files=0, longest=0.000 s, average=0.000 s; distance=2555 kB, estimate=293963 kB; lsn=2/9D3DECE8, redo lsn=2/9D2AAC00
2025-04-07 19:03:02.129 UTC [3537184] ERROR:  invalid regular expression: invalid backreference number
2025-04-07 19:03:02.129 UTC [3537184] STATEMENT:  select 'xyz' ~ 'x(\w)(?=(\1))';
2025-04-07 19:03:02.129 UTC [3537184] ERROR:  invalid regular expression: invalid escape \ sequence
2025-04-07 19:03:02.129 UTC [3537184] STATEMENT:  select 'a' ~ '\x7fffffff';
2025-04-07 19:03:02.130 UTC [3537651] ERROR:  operator does not exist: point = box at character 23
2025-04-07 19:03:02.130 UTC [3537651] HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
2025-04-07 19:03:02.130 UTC [3537651] STATEMENT:  select '(0,0)'::point in ('(0,0,0,0)'::box, point(0,0));
2025-04-07 19:03:02.133 UTC [3537362] ERROR:  invalid regular expression: invalid backreference number
2025-04-07 19:03:02.133 UTC [3537362] STATEMENT:  select 'xyz' ~ 'x(\w)(?=\1)';
2025-04-07 19:03:02.133 UTC [3537362] ERROR:  invalid regular expression: invalid backreference number
2025-04-07 19:03:02.133 UTC [3537362] STATEMENT:  select 'xyz' ~ 'x(\w)(?=(\1))';
2025-04-07 19:03:02.133 UTC [3537362] ERROR:  invalid regular expression: invalid escape \ sequence
2025-04-07 19:03:02.133 UTC [3537362] STATEMENT:  select 'a' ~ '\x7fffffff';
2025-04-07 19:03:02.133 UTC [3537488] LOG:  could not read blocks 0..0 in file "base/902183/2607": Operation canceled
2025-04-07 19:03:02.133 UTC [3537488] STATEMENT:  SELECT p1.oid, p1.proname
    FROM pg_proc as p1
    WHERE 'cstring'::regtype = ANY (p1.proargtypes)
        AND NOT EXISTS(SELECT 1 FROM pg_type WHERE typinput = p1.oid)
        AND NOT EXISTS(SELECT 1 FROM pg_conversion WHERE conproc = p1.oid)
        AND p1.oid != 'shell_in(cstring)'::regprocedure
    ORDER BY 1;
2025-04-07 19:03:02.133 UTC [3537488] ERROR:  could not read blocks 0..0 in file "base/902183/2607": Operation canceled
2025-04-07 19:03:02.133 UTC [3537488] STATEMENT:  SELECT p1.oid, p1.proname
    FROM pg_proc as p1
    WHERE 'cstring'::regtype = ANY (p1.proargtypes)
        AND NOT EXISTS(SELECT 1 FROM pg_type WHERE typinput = p1.oid)
        AND NOT EXISTS(SELECT 1 FROM pg_conversion WHERE conproc = p1.oid)
        AND p1.oid != 'shell_in(cstring)'::regprocedure
    ORDER BY 1;
2025-04-07 19:03:02.135 UTC [3537671] WARNING:  TIME(7) WITH TIME ZONE precision reduced to maximum allowed, 6
2025-04-07 19:03:02.135 UTC [3537671] WARNING:  TIMESTAMP(7) WITH TIME ZONE precision reduced to maximum allowed, 6
2025-04-07 19:03:02.135 UTC [3537671] WARNING:  TIME(7) precision reduced to maximum allowed, 6
2025-04-07 19:03:02.136 UTC [3487651] LOG:  checkpoint starting: immediate force wait
...
So I suspect checkpointer, but I'm not sure yet.

(Sometimes (2 out of 5 runs) 10 iterations pass without the error.)

That script  with s/grep 'could not read blocks '/grep 'was terminated '/
also discovers another anomaly (right now on the third run, with 10
iterations of 30 installchecks):
Core was generated by `postgres: postgres regress005 [local] SELECT                  '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000591086d5ccfa in pgaio_io_get_target_description (ioh=ioh@entry=0x7ffcf961bc20) at aio_target.c:85
#2  0x0000591086d5bfdd in pg_get_aios (fcinfo=<optimized out>) at aio_funcs.c:218
#3  0x0000591086bb64c1 in ExecMakeTableFunctionResult (setexpr=0x5910bddcbf48, econtext=0x5910bddcbde8, argContext=<optimized out>,
    expectedDesc=0x5910bddff918, randomAccess=false) at execSRF.c:234
#4  0x0000591086bc9673 in FunctionNext (node=node@entry=0x5910bddcbbd8) at nodeFunctionscan.c:94
#5  0x0000591086bb7093 in ExecScanFetch (recheckMtd=0x591086bc9522 <FunctionRecheck>, accessMtd=0x591086bc95a2 <FunctionNext>, epqstate=0x0,
    node=0x5910bddcbbd8) at ../../../src/include/executor/execScan.h:126
#6  ExecScanExtended (projInfo=0x0, qual=0x0, epqstate=0x0, recheckMtd=0x591086bc9522 <FunctionRecheck>, accessMtd=0x591086bc95a2 <FunctionNext>,
    node=0x5910bddcbbd8) at ../../../src/include/executor/execScan.h:170
#7  ExecScan (node=0x5910bddcbbd8, accessMtd=accessMtd@entry=0x591086bc95a2 <FunctionNext>, recheckMtd=recheckMtd@entry=0x591086bc9522 <FunctionRecheck>)
    at execScan.c:59
#8  0x0000591086bc9580 in ExecFunctionScan (pstate=<optimized out>) at nodeFunctionscan.c:269
#9  0x0000591086bb38f3 in ExecProcNodeFirst (node=0x5910bddcbbd8) at execProcnode.c:469
#10 0x0000591086bc0195 in ExecProcNode (node=0x5910bddcbbd8) at ../../../src/include/executor/executor.h:341
#11 0x0000591086bc021b in fetch_input_tuple (aggstate=aggstate@entry=0x5910bddcb480) at nodeAgg.c:563
#12 0x0000591086bc3825 in agg_retrieve_direct (aggstate=aggstate@entry=0x5910bddcb480) at nodeAgg.c:2450
#13 0x0000591086bc3a19 in ExecAgg (pstate=<optimized out>) at nodeAgg.c:2265
#14 0x0000591086bb38f3 in ExecProcNodeFirst (node=0x5910bddcb480) at execProcnode.c:469
#15 0x0000591086baac23 in ExecProcNode (node=node@entry=0x5910bddcb480) at ../../../src/include/executor/executor.h:341
#16 0x0000591086baacd0 in ExecutePlan (queryDesc=queryDesc@entry=0x5910bdde0e00, operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=true,
    numberTuples=numberTuples@entry=0, direction=direction@entry=ForwardScanDirection, dest=dest@entry=0x5910bddf6070) at execMain.c:1783
#17 0x0000591086bab79d in standard_ExecutorRun (queryDesc=0x5910bdde0e00, direction=ForwardScanDirection, count=0) at execMain.c:435
#18 0x0000591086bab7e0 in ExecutorRun (queryDesc=queryDesc@entry=0x5910bdde0e00, direction=direction@entry=ForwardScanDirection, count=count@entry=0)
    at execMain.c:371
#19 0x0000591086da9e9f in PortalRunSelect (portal=portal@entry=0x5910bdd72f10, forward=forward@entry=true, count=0, count@entry=9223372036854775807,
    dest=dest@entry=0x5910bddf6070) at pquery.c:953
#20 0x0000591086dab83b in PortalRun (portal=portal@entry=0x5910bdd72f10, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
    dest=dest@entry=0x5910bddf6070, altdest=altdest@entry=0x5910bddf6070, qc=qc@entry=0x7ffcf961c9f0) at pquery.c:797
#21 0x0000591086da74a4 in exec_simple_query (query_string=query_string@entry=0x5910bdcebe60 "SELECT COUNT(*) >= 0 AS ok FROM pg_aios;") at postgres.c:1274

But I'm yet to construct a more reliable reproducer for it. Hope I could
do this during the current week.

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

pgsql-hackers by date:

Previous
From: Mahendra Singh Thalor
Date:
Subject: Re: as per commit 643a1a61985bef2590496, move create/open dir code to function using switch case of pg_backup_directory.c file also
Next
From: Masahiko Sawada
Date:
Subject: Re: pg_recvlogical cannot create slots with failover=true