Re: AIO v2.5 - Mailing list pgsql-hackers
From: Alexander Lakhin
Subject: Re: AIO v2.5
Msg-id: 6a6756a0-bad5-4ee5-a591-c763af289940@gmail.com
In response to: Re: AIO v2.5 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hello Andres,
07.04.2025 19:20, Andres Freund wrote:
>> iteration 8: Sun Apr 6 19:22:39 UTC 2025
>> installchecks finished: Sun Apr 6 19:23:47 UTC 2025
>> 2025-04-06 19:22:44.216 UTC [349525] LOG: could not read blocks 0..0 in file "base/6179194/2606": Operation canceled
>> 2025-04-06 19:22:44.216 UTC [349525] ERROR: could not read blocks 0..0 in file "base/6179194/2606": Operation canceled
>
> Thanks for the report, clearly something isn't right.
Thank you for your attention to it!
>> It's reproduced better on tmpfs for me; probably you would need to increase NUM_INSTALLCHECKS/NUM_ITERATIONS for your machine.
>
> I ran it for a while in a VM, it hasn't triggered yet. Neither on xfs nor on tmpfs.
Before sharing the script I tested it on two of my machines, but I
anticipated that the error could be hard to reproduce. I will try to reduce
the reproducer...
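To give an idea of the load, the driver is essentially a loop of this shape
(only a rough sketch, not the script itself; NUM_ITERATIONS and
NUM_INSTALLCHECKS are the knobs mentioned above, while the concrete
installcheck invocation and file names here are my assumptions):

NUM_ITERATIONS=${NUM_ITERATIONS:-10}
NUM_INSTALLCHECKS=${NUM_INSTALLCHECKS:-30}

for i in $(seq 1 "$NUM_ITERATIONS"); do
    echo "iteration $i: $(date -u)"
    for j in $(seq 1 "$NUM_INSTALLCHECKS"); do
        # run several installchecks concurrently against one running server,
        # each in its own database (regress001, regress002, ...)
        make -s installcheck \
            EXTRA_REGRESS_OPTS="--dbname=regress$(printf '%03d' "$j")" \
            > "installcheck_$j.log" 2>&1 &
    done
    wait
    echo "installchecks finished: $(date -u)"
    # stop as soon as the AIO read error shows up in the server log
    grep 'could not read blocks ' server.log && break
done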
>> server.log contains:
>> 2025-04-06 19:22:44.215 UTC [38231] LOG: checkpoint complete: wrote ...
>> 2025-04-06 19:22:44.216 UTC [38231] LOG: checkpoint starting: immediate force wait flush-all
>> 2025-04-06 19:22:44.216 UTC [349525] LOG: could not read blocks 0..0 in file "base/6179194/2606": Operation canceled
>> 2025-04-06 19:22:44.216 UTC [349525] STATEMENT: alter table parted_copytest attach partition parted_copytest_a1 for values in(1);
>> 2025-04-06 19:22:44.216 UTC [349525] ERROR: could not read blocks 0..0 in file "base/6179194/2606": Operation canceled
>> 2025-04-06 19:22:44.216 UTC [349525] STATEMENT: alter table parted_copytest attach partition parted_copytest_a1 for values in(1);
>
> Hm. Does the failure vary between occurrences?
> - is it always the same statement? Probably not?
> - is it always 2606 (i.e. pg_constraint)?
> - does the failure always happen around a checkpoint? If so, is it always immediate?
> - I do assume it's always ECANCELED?
I've got three such failures with three runs of the script, to
answer your questions:
...
iteration 6: Mon Apr 7 18:08:35 UTC 2025
installchecks finished: Mon Apr 7 18:09:45 UTC 2025
2025-04-07 18:08:45.369 UTC [1941260] LOG: could not read blocks 7..7 in file "base/4448730/1255": Operation canceled
2025-04-07 18:08:45.369 UTC [1941260] ERROR: could not read blocks 7..7 in file "base/4448730/1255": Operation canceled
server.log:
...
2025-04-07 18:08:44.120 UTC [1713945] LOG: checkpoint starting: wal
2025-04-07 18:08:45.369 UTC [1941260] LOG: could not read blocks 7..7 in file "base/4448730/1255": Operation canceled
2025-04-07 18:08:45.369 UTC [1941260] STATEMENT: SELECT routine_name, ordinal_position, parameter_name, parameter_default
FROM information_schema.parameters JOIN information_schema.routines USING (specific_schema, specific_name)
WHERE routine_schema = 'temp_func_test' AND routine_name ~ '^functest_is_'
ORDER BY 1, 2;
2025-04-07 18:08:45.369 UTC [1941260] ERROR: could not read blocks 7..7 in file "base/4448730/1255": Operation canceled
2025-04-07 18:08:45.369 UTC [1941260] STATEMENT: SELECT routine_name, ordinal_position, parameter_name, parameter_default
FROM information_schema.parameters JOIN information_schema.routines USING (specific_schema, specific_name)
WHERE routine_schema = 'temp_func_test' AND routine_name ~ '^functest_is_'
ORDER BY 1, 2;
...
2025-04-07 18:08:51.836 UTC [1713945] LOG: checkpoint complete: wrote 1558 buffers (9.5%), wrote 22 SLRU buffers; 0 WAL file(s) added, 0 removed, 33 recycled; write=7.544 s, sync=0.019 s, total=7.720 s; sync files=0, longest=0.000 s, average=0.000 s; distance=533837 kB, estimate=533837 kB; lsn=C/E532D478, redo lsn=C/C7054358
...
---
iteration 8: Mon Apr 7 18:26:47 UTC 2025
installchecks finished: Mon Apr 7 18:27:59 UTC 2025
2025-04-07 18:26:53.252 UTC [2359398] LOG: could not read blocks 0..12 in file "base/4/1255": Operation canceled
2025-04-07 18:26:53.255 UTC [2359398] ERROR: could not read blocks 0..12 in file "base/4/1255": Operation canceled
server.log:
...
2025-04-07 18:26:53.249 UTC [2048053] LOG: checkpoint complete: wrote 3061 buffers (18.7%), wrote 3 SLRU buffers; 0 WAL file(s) added, 2 removed, 0 recycled; write=0.051 s, sync=0.002 s, total=0.068 s; sync files=0, longest=0.000 s, average=0.000 s; distance=18937 kB, estimate=120161 kB; lsn=11/6D42ED30, redo lsn=11/6C17AC30
2025-04-07 18:26:53.251 UTC [2359308] WARNING: "most_common_elems" must be specified when "most_common_elem_freqs" is specified
2025-04-07 18:26:53.251 UTC [2359351] WARNING: "elem_count_histogram" array cannot contain NULL values
2025-04-07 18:26:53.252 UTC [2359375] ERROR: cannot modify statistics for relation "testseq"
2025-04-07 18:26:53.252 UTC [2359375] DETAIL: This operation is not supported for sequences.
2025-04-07 18:26:53.252 UTC [2359375] STATEMENT: SELECT pg_catalog.pg_restore_relation_stats(
'schemaname', 'stats_import',
'relname', 'testseq');
2025-04-07 18:26:53.252 UTC [2359375] ERROR: cannot modify statistics for relation "testseq"
2025-04-07 18:26:53.252 UTC [2359375] DETAIL: This operation is not supported for sequences.
2025-04-07 18:26:53.252 UTC [2359375] STATEMENT: SELECT pg_catalog.pg_clear_relation_stats(schemaname => 'stats_import', relname => 'testseq');
2025-04-07 18:26:53.252 UTC [2359398] LOG: could not read blocks 0..12 in file "base/4/1255": Operation canceled
2025-04-07 18:26:53.252 UTC [2359398] STATEMENT: CREATE DATABASE regress020_tbd
ENCODING utf8 LC_COLLATE "C" LC_CTYPE "C" TEMPLATE template0;
2025-04-07 18:26:53.255 UTC [2359449] ERROR: invalid input syntax for type pg_snapshot: "31:12:" at character 8
2025-04-07 18:26:53.255 UTC [2359449] STATEMENT: select '31:12:'::pg_snapshot;
2025-04-07 18:26:53.255 UTC [2359449] ERROR: invalid input syntax for type pg_snapshot: "0:1:" at character 8
2025-04-07 18:26:53.255 UTC [2359449] STATEMENT: select '0:1:'::pg_snapshot;
2025-04-07 18:26:53.255 UTC [2359449] ERROR: invalid input syntax for type pg_snapshot: "12:13:0" at character 8
2025-04-07 18:26:53.255 UTC [2359449] STATEMENT: select '12:13:0'::pg_snapshot;
2025-04-07 18:26:53.255 UTC [2359449] ERROR: invalid input syntax for type pg_snapshot: "12:16:14,13" at character 8
2025-04-07 18:26:53.255 UTC [2359449] STATEMENT: select '12:16:14,13'::pg_snapshot;
2025-04-07 18:26:53.255 UTC [2359398] ERROR: could not read blocks 0..12 in file "base/4/1255": Operation canceled
2025-04-07 18:26:53.255 UTC [2359398] STATEMENT: CREATE DATABASE regress020_tbd
ENCODING utf8 LC_COLLATE "C" LC_CTYPE "C" TEMPLATE template0;
...
---
iteration 2: Mon Apr 7 19:02:57 UTC 2025
installchecks finished: Mon Apr 7 19:04:07 UTC 2025
2025-04-07 19:03:02.133 UTC [3537488] LOG: could not read blocks 0..0 in file "base/902183/2607": Operation canceled
2025-04-07 19:03:02.133 UTC [3537488] ERROR: could not read blocks 0..0 in file "base/902183/2607": Operation canceled
server.log:
2025-04-07 19:03:02.129 UTC [3487651] LOG: checkpoint complete: wrote 337 buffers (2.1%), wrote 2 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.013 s, sync=0.001 s, total=0.024 s; sync files=0, longest=0.000 s, average=0.000 s; distance=2555 kB, estimate=293963 kB; lsn=2/9D3DECE8, redo lsn=2/9D2AAC00
2025-04-07 19:03:02.129 UTC [3537184] ERROR: invalid regular expression: invalid backreference number
2025-04-07 19:03:02.129 UTC [3537184] STATEMENT: select 'xyz' ~ 'x(\w)(?=(\1))';
2025-04-07 19:03:02.129 UTC [3537184] ERROR: invalid regular expression: invalid escape \ sequence
2025-04-07 19:03:02.129 UTC [3537184] STATEMENT: select 'a' ~ '\x7fffffff';
2025-04-07 19:03:02.130 UTC [3537651] ERROR: operator does not exist: point = box at character 23
2025-04-07 19:03:02.130 UTC [3537651] HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
2025-04-07 19:03:02.130 UTC [3537651] STATEMENT: select '(0,0)'::point in ('(0,0,0,0)'::box, point(0,0));
2025-04-07 19:03:02.133 UTC [3537362] ERROR: invalid regular expression: invalid backreference number
2025-04-07 19:03:02.133 UTC [3537362] STATEMENT: select 'xyz' ~ 'x(\w)(?=\1)';
2025-04-07 19:03:02.133 UTC [3537362] ERROR: invalid regular expression: invalid backreference number
2025-04-07 19:03:02.133 UTC [3537362] STATEMENT: select 'xyz' ~ 'x(\w)(?=(\1))';
2025-04-07 19:03:02.133 UTC [3537362] ERROR: invalid regular expression: invalid escape \ sequence
2025-04-07 19:03:02.133 UTC [3537362] STATEMENT: select 'a' ~ '\x7fffffff';
2025-04-07 19:03:02.133 UTC [3537488] LOG: could not read blocks 0..0 in file "base/902183/2607": Operation canceled
2025-04-07 19:03:02.133 UTC [3537488] STATEMENT: SELECT p1.oid, p1.proname
FROM pg_proc as p1
WHERE 'cstring'::regtype = ANY (p1.proargtypes)
AND NOT EXISTS(SELECT 1 FROM pg_type WHERE typinput = p1.oid)
AND NOT EXISTS(SELECT 1 FROM pg_conversion WHERE conproc = p1.oid)
AND p1.oid != 'shell_in(cstring)'::regprocedure
ORDER BY 1;
2025-04-07 19:03:02.133 UTC [3537488] ERROR: could not read blocks 0..0 in file "base/902183/2607": Operation canceled
2025-04-07 19:03:02.133 UTC [3537488] STATEMENT: SELECT p1.oid, p1.proname
FROM pg_proc as p1
WHERE 'cstring'::regtype = ANY (p1.proargtypes)
AND NOT EXISTS(SELECT 1 FROM pg_type WHERE typinput = p1.oid)
AND NOT EXISTS(SELECT 1 FROM pg_conversion WHERE conproc = p1.oid)
AND p1.oid != 'shell_in(cstring)'::regprocedure
ORDER BY 1;
2025-04-07 19:03:02.135 UTC [3537671] WARNING: TIME(7) WITH TIME ZONE precision reduced to maximum allowed, 6
2025-04-07 19:03:02.135 UTC [3537671] WARNING: TIMESTAMP(7) WITH TIME ZONE precision reduced to maximum allowed, 6
2025-04-07 19:03:02.135 UTC [3537671] WARNING: TIME(7) precision reduced to maximum allowed, 6
2025-04-07 19:03:02.136 UTC [3487651] LOG: checkpoint starting: immediate force wait
...
So I suspect the checkpointer, but I'm not sure yet.
(Sometimes (2 out of 5 runs) 10 iterations pass without the error.)
The same script, with s/grep 'could not read blocks '/grep 'was terminated '/,
also discovers another anomaly (right now on the third run, with 10
iterations of 30 installchecks):
Core was generated by `postgres: postgres regress005 [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x0000591086d5ccfa in pgaio_io_get_target_description (ioh=ioh@entry=0x7ffcf961bc20) at aio_target.c:85
#2 0x0000591086d5bfdd in pg_get_aios (fcinfo=<optimized out>) at aio_funcs.c:218
#3 0x0000591086bb64c1 in ExecMakeTableFunctionResult (setexpr=0x5910bddcbf48, econtext=0x5910bddcbde8, argContext=<optimized out>,
expectedDesc=0x5910bddff918, randomAccess=false) at execSRF.c:234
#4 0x0000591086bc9673 in FunctionNext (node=node@entry=0x5910bddcbbd8) at nodeFunctionscan.c:94
#5 0x0000591086bb7093 in ExecScanFetch (recheckMtd=0x591086bc9522 <FunctionRecheck>, accessMtd=0x591086bc95a2 <FunctionNext>, epqstate=0x0,
node=0x5910bddcbbd8) at ../../../src/include/executor/execScan.h:126
#6 ExecScanExtended (projInfo=0x0, qual=0x0, epqstate=0x0, recheckMtd=0x591086bc9522 <FunctionRecheck>, accessMtd=0x591086bc95a2 <FunctionNext>,
node=0x5910bddcbbd8) at ../../../src/include/executor/execScan.h:170
#7 ExecScan (node=0x5910bddcbbd8, accessMtd=accessMtd@entry=0x591086bc95a2 <FunctionNext>, recheckMtd=recheckMtd@entry=0x591086bc9522 <FunctionRecheck>)
at execScan.c:59
#8 0x0000591086bc9580 in ExecFunctionScan (pstate=<optimized out>) at nodeFunctionscan.c:269
#9 0x0000591086bb38f3 in ExecProcNodeFirst (node=0x5910bddcbbd8) at execProcnode.c:469
#10 0x0000591086bc0195 in ExecProcNode (node=0x5910bddcbbd8) at ../../../src/include/executor/executor.h:341
#11 0x0000591086bc021b in fetch_input_tuple (aggstate=aggstate@entry=0x5910bddcb480) at nodeAgg.c:563
#12 0x0000591086bc3825 in agg_retrieve_direct (aggstate=aggstate@entry=0x5910bddcb480) at nodeAgg.c:2450
#13 0x0000591086bc3a19 in ExecAgg (pstate=<optimized out>) at nodeAgg.c:2265
#14 0x0000591086bb38f3 in ExecProcNodeFirst (node=0x5910bddcb480) at execProcnode.c:469
#15 0x0000591086baac23 in ExecProcNode (node=node@entry=0x5910bddcb480) at ../../../src/include/executor/executor.h:341
#16 0x0000591086baacd0 in ExecutePlan (queryDesc=queryDesc@entry=0x5910bdde0e00, operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=true,
numberTuples=numberTuples@entry=0, direction=direction@entry=ForwardScanDirection, dest=dest@entry=0x5910bddf6070) at execMain.c:1783
#17 0x0000591086bab79d in standard_ExecutorRun (queryDesc=0x5910bdde0e00, direction=ForwardScanDirection, count=0) at execMain.c:435
#18 0x0000591086bab7e0 in ExecutorRun (queryDesc=queryDesc@entry=0x5910bdde0e00, direction=direction@entry=ForwardScanDirection, count=count@entry=0)
at execMain.c:371
#19 0x0000591086da9e9f in PortalRunSelect (portal=portal@entry=0x5910bdd72f10, forward=forward@entry=true, count=0, count@entry=9223372036854775807,
dest=dest@entry=0x5910bddf6070) at pquery.c:953
#20 0x0000591086dab83b in PortalRun (portal=portal@entry=0x5910bdd72f10, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
dest=dest@entry=0x5910bddf6070, altdest=altdest@entry=0x5910bddf6070, qc=qc@entry=0x7ffcf961c9f0) at pquery.c:797
#21 0x0000591086da74a4 in exec_simple_query (query_string=query_string@entry=0x5910bdcebe60 "SELECT COUNT(*) >= 0 AS ok FROM pg_aios;") at postgres.c:1274
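The crashing backend was running a plain query against the pg_aios view
(frame #21), so perhaps simply hammering that view from an extra session
while the installchecks above run would make it easier to hit. A
hypothetical, untested variant:

while true; do
    # query text copied from frame #21; the crash happened in database
    # regress005, but pg_aios is a system view available in any database
    psql -X -q -d postgres -c "SELECT COUNT(*) >= 0 AS ok FROM pg_aios;" > /dev/null
done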
But I have yet to construct a more reliable reproducer for it. I hope to
do that during the current week.
Best regards,
Alexander Lakhin
Neon (https://neon.tech)