Thread: Random pg_upgrade test failure on drongo
Dear hackers,

While monitoring the buildfarm, I found that drongo failed the test pg_upgrade/003_logical_slots [1]. A strange point is that the test passed on the next run. I am not yet sure of the reason, but I will keep an eye on it and investigate if it happens again. Although the failing test is 003_logical_slots.pl, I do not think this failure is related to our logical-slots work; see the analysis below for details.

For further investigation, the server log written during the upgrade may be needed. It lives in the data directory, so the buildfarm system does not upload it. I may need additional information if the test fails again.

# Analysis of failure

According to the output, pg_upgrade seems to have failed while restoring objects to the new cluster [2]. At the code level, the pg_upgrade command failed in exec_prog(), where pg_restore was executed for the database "postgres". Below is a brief call stack. Note that pg_restore is not used for migrating logical replication slots; that is done by the pg_upgrade binary itself, and the migration happens after all objects are copied, not in create_new_objects().

```
exec_prog()
parallel_exec_prog("pg_restore ... ") <-- Since -j option is not specified, it is just a wrapper
create_new_objects()
main()
```

In exec_prog(), system() was called but returned a non-zero value. The documentation says that system() returns the value returned by the command interpreter when the input is not NULL [3]. Unfortunately, the current code does not output the return code, and the buildfarm system does not upload the data directory for failed tests, so I could not get more information for the investigation.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2023-11-07%2013%3A43%3A23&stg=pg_upgrade-check

[2]:
```
...
# No postmaster PID for node "oldpub"
# Running: pg_upgrade --no-sync -d C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_oldpub_data/pgdata -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata -b C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -B C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -s 127.0.0.1 -p 54813 -P 54814 --copy
Performing Consistency Checks
...
Setting frozenxid and minmxid counters in new cluster         ok
Restoring global objects in the new cluster                   ok
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of "C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_logical_slots/data/t_003_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231107T142224.580/log/pg_upgrade_dump_5.log" for
the probable cause of the failure.
Failure, exiting
[14:23:26.632](70.141s) not ok 10 - run of pg_upgrade of old cluster
[14:23:26.632](0.000s) #   Failed test 'run of pg_upgrade of old cluster'
#   at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 170.
### Starting node "newpub"
# Running: pg_ctl -w -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata -l C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\log/003_logical_slots_newpub.log -o --cluster-name=newpub start
waiting for server to start.... done
server started
# Postmaster PID for node "newpub" is 4604
[14:23:28.398](1.766s) not ok 11 - check the slot exists on new cluster
[14:23:28.398](0.001s) #   Failed test 'check the slot exists on new cluster'
#   at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 176.
[14:23:28.399](0.000s) #          got: ''
#     expected: 'regress_sub|t'
...
```

[3]: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/system-wsystem?view=msvc-170

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
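One incidental observation from the above: because the return value of system() is not reported, the only evidence left behind is the generic *failure* marker. Below is a hedged sketch of a command wrapper that also logs the return code; it is not the real exec_prog(), and the function name, log file name, and command line are made up purely for illustration.

```
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch: run a shell command and record both the command line and the raw
 * value returned by system(), so that a failed run leaves the exit status
 * behind in a log file.
 */
static int
run_command_logged(const char *cmd, const char *logfile)
{
	FILE	   *log = fopen(logfile, "a");
	int			rc;

	if (log)
		fprintf(log, "command: %s\n", cmd);

	rc = system(cmd);			/* returns the command interpreter's status */

	if (log)
	{
		fprintf(log, "system() returned %d\n", rc);
		fclose(log);
	}
	return rc;
}

int
main(void)
{
	/* "pg_restore --version" stands in for the real pg_restore invocation */
	return run_command_logged("pg_restore --version", "exec_prog_demo.log") != 0;
}
```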
Dear hackers,

> While monitoring the buildfarm, I found that drongo failed the test
> pg_upgrade/003_logical_slots [1]. A strange point is that the test passed
> on the next run. I am not yet sure of the reason, but I will keep an eye
> on it and investigate if it happens again.

This is just a status update. We found that fairywren also failed for the same reason [2]. The failure is intermittent, but there may be something wrong on Windows. My colleagues and I are now trying to reproduce it so we can analyze it in more detail. I am also working with Andrew to get the logs emitted during the upgrade. I will continue to keep an eye on this.

[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-11-08%2010%3A22%3A45

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Dear hackers,

Here is an update. The machine drongo failed the test again a week ago [1], and I finally got the log files. PSA files.

## Observed failure

pg_upgrade_server.log is the server log written during the pg_upgrade command. According to it, the TRUNCATE command failed with a "File exists" error:

```
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:  could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
	-- For binary upgrade, preserve pg_largeobject and index relfilenodes
	SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
	SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
	TRUNCATE pg_catalog.pg_largeobject;
...
```

## Analysis

I think this is caused by the STATUS_DELETE_PENDING problem and is not related to the recent pg_upgrade changes. The file "base/1/2683" is the index file for pg_largeobject_loid_pn_index, and the output means that creating this file failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange: before calling mdcreate(), we always unlink the file with the same name. Below is the call path down to the unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

Thomas mentioned [2] that pgunlink() sometimes cannot actually remove a file even though it returns OK; in that case the NTSTATUS is STATUS_DELETE_PENDING. A comment in pgwin32_open_handle() mentions the same thing:

```
	/*
	 * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
	 * gone (Windows NT status code is STATUS_DELETE_PENDING). In that
	 * case, we'd better ask for the NT status too so we can translate it
	 * to a more Unix-like error. We hope that nothing clobbers the NT
	 * status in between the internal NtCreateFile() call and CreateFile()
	 * returning.
	 *
```

The definition of STATUS_DELETE_PENDING can be seen in [3]. Based on that, open() can indeed fail with STATUS_DELETE_PENDING if the file is opened while its deletion is still pending. Another thread [4] deals with this issue in the rmtree->unlink path by retrying the removal when it fails with STATUS_DELETE_PENDING. So, should we retry the open() when it fails in the same way? In any case, such a fix seems out of scope for pg_upgrade. What do you think? Any thoughts are welcome.

## Acknowledgement

I want to thank Shlok and Vignesh for helping with the analysis.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-11-15%2006%3A16%3A15
[2]: https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com
[3]: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[4]: https://www.postgresql.org/message-id/flat/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de#6ae5e2ba3dd6e1fd680dcc34eab710d5

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
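To illustrate the "retry the open()" idea raised above, here is a hedged sketch (not PostgreSQL code) of an open() wrapper that retries for a bounded time when the failure looks like a pending deletion. How STATUS_DELETE_PENDING surfaces through errno, as well as the timeout values and the demo file name, are assumptions for illustration only.

```
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Sketch: retry open() for up to ~10 seconds when the error suggests that a
 * previously unlinked file with the same name is still pending deletion.
 * Treating EEXIST/EACCES as the visible errno values is an assumption, not
 * the exact mapping done by the real pgwin32_open().
 */
static int
open_with_delete_pending_retry(const char *path, int flags, mode_t mode)
{
	int			fd;
	int			loops = 0;

	while ((fd = open(path, flags, mode)) < 0 &&
		   (errno == EEXIST || errno == EACCES))
	{
		if (++loops > 100)		/* give up after ~10 seconds */
			return -1;
		usleep(100000);			/* 100 ms */
	}
	return fd;
}

int
main(void)
{
	/* "retry_demo.tmp" is an arbitrary file name used only for this demo */
	int			fd = open_with_delete_pending_retry("retry_demo.tmp",
													O_RDWR | O_CREAT, 0600);

	if (fd >= 0)
		close(fd);
	return (fd >= 0) ? 0 : 1;
}
```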
Dear hackers,

> Here is an update. The machine drongo failed the test again a week ago [1],
> and I finally got the log files. PSA files.

Oh, sorry. I forgot to attach the files. Please see pg_upgrade_server.log for now.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
Hello Kuroda-san,

21.11.2023 13:37, Hayato Kuroda (Fujitsu) wrote:
>> Here is an update. The machine drongo failed the test again a week ago [1],
>> and I finally got the log files. PSA files.
> Oh, sorry. I forgot to attach the files. Please see pg_upgrade_server.log for now.

I can easily reproduce this failure on my workstation by running 5 tests 003_logical_slots in parallel inside a Windows VM with its CPU resources limited to 50%, like so:
VBoxManage controlvm "Windows" cpuexecutioncap 50

set PGCTLTIMEOUT=180
python3 -c "NUMITERATIONS=20;NUMTESTS=5;import os;tsts='';exec('for i in range(1,NUMTESTS+1): tsts+=f\"pg_upgrade_{i}/003_logical_slots \"'); exec('for i in range(1,NUMITERATIONS+1):print(f\"iteration {i}\"); assert(os.system(f\"meson test --num-processes {NUMTESTS} {tsts}\") == 0)')"
...
iteration 2
ninja: Entering directory `C:\src\postgresql\build'
ninja: no work to do.
 1/5 postgresql:pg_upgrade_2 / pg_upgrade_2/003_logical_slots ERROR   60.30s   exit status 25
...
pg_restore: error: could not execute query: ERROR:  could not create file "base/1/2683": File exists
...

I agree with your analysis and would like to propose a PoC fix (see attached). With this patch applied, 20 iterations succeeded for me.

Best regards,
Alexander
Attachment
Dear Alexander,

> I can easily reproduce this failure on my workstation by running 5 tests
> 003_logical_slots in parallel inside a Windows VM with its CPU resources
> limited to 50%, like so:
> VBoxManage controlvm "Windows" cpuexecutioncap 50
>
> set PGCTLTIMEOUT=180
> python3 -c "NUMITERATIONS=20;NUMTESTS=5;import os;tsts='';exec('for i in
> range(1,NUMTESTS+1):
> tsts+=f\"pg_upgrade_{i}/003_logical_slots \"'); exec('for i in
> range(1,NUMITERATIONS+1):print(f\"iteration {i}\");
> assert(os.system(f\"meson test --num-processes {NUMTESTS} {tsts}\") == 0)')"
> ...
> iteration 2
> ninja: Entering directory `C:\src\postgresql\build'
> ninja: no work to do.
> 1/5 postgresql:pg_upgrade_2 / pg_upgrade_2/003_logical_slots ERROR 60.30s exit status 25
> ...
> pg_restore: error: could not execute query: ERROR: could not create file
> "base/1/2683": File exists
> ...

Great. I do not have such an environment, so I could not reproduce it myself. This seems to suggest that the failure occurs because the system is busy.

> I agree with your analysis and would like to propose a PoC fix (see
> attached). With this patch applied, 20 iterations succeeded for me.

Thanks; here are some comments. I'm not very familiar with Windows, so I may be wrong about some of this.

* I'm not sure why the file/directory name is changed before doing the unlink. Could you add a description?

* IIUC, the important point is the latter part, which waits until the status changes. Given that, can we remove the double rmtree() from cleanup_output_dirs()? It seems to have been added for a similar reason.

```
+	loops = 0;
+	while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
+	{
+		if (++loops > 100)		/* time out after 10 sec */
+			return -1;
+		pg_usleep(100000);		/* us */
+	}
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Hello Kuroda-san,

23.11.2023 15:15, Hayato Kuroda (Fujitsu) wrote:
>> I agree with your analysis and would like to propose a PoC fix (see
>> attached). With this patch applied, 20 iterations succeeded for me.
> Thanks; here are some comments. I'm not very familiar with Windows, so I may
> be wrong about some of this.
>
> * I'm not sure why the file/directory name is changed before doing the unlink.
> Could you add a description?

Please look at the simple test program attached. It demonstrates the failure for me when running in two sessions as follows:
unlink-open test 150 1000
unlink-open test2 150 1000
...
iteration 60
iteration 61
fopen() after unlink() failed (13)

Process Monitor shows:
...
9:16:55.9249792 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 138,412,032, Length: 1,048,576
### unlink() performed for the file "test":
9:16:55.9852903 AM unlink-open.exe 4968 CreateFile C:\src\test SUCCESS Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
9:16:55.9853637 AM unlink-open.exe 4968 QueryAttributeTagFile C:\src\test SUCCESS Attributes: A, ReparseTag: 0x0
### file "test" gets into state DELETE PENDING:
9:16:55.9853756 AM unlink-open.exe 4968 SetDispositionInformationFile C:\src\test SUCCESS Delete: True
9:16:55.9853888 AM unlink-open.exe 4968 CloseFile C:\src\test SUCCESS
### concurrent operations with file "test2":
9:16:55.9866362 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 139,460,608, Length: 1,048,576
...
9:16:55.9972373 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 157,286,400, Length: 1,048,576
9:16:55.9979040 AM unlink-open.exe 3232 CloseFile C:\src\test2 SUCCESS
### unlink() for "test2":
9:16:56.1029981 AM unlink-open.exe 3232 CreateFile C:\src\test2 SUCCESS Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
9:16:56.1030432 AM unlink-open.exe 3232 QueryAttributeTagFile C:\src\test2 SUCCESS Attributes: A, ReparseTag: 0x0
### file "test2" gets into state DELETE PENDING:
9:16:56.1030517 AM unlink-open.exe 3232 SetDispositionInformationFile C:\src\test2 SUCCESS Delete: True
9:16:56.1030625 AM unlink-open.exe 3232 CloseFile C:\src\test2 SUCCESS
### and then it opened successfully:
9:16:56.1189503 AM unlink-open.exe 3232 CreateFile C:\src\test2 SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
9:16:56.1192016 AM unlink-open.exe 3232 CloseFile C:\src\test2 SUCCESS
### operations with file "test2" continued:
9:16:56.1193394 AM unlink-open.exe 3232 CreateFile C:\src\test2 SUCCESS Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
9:16:56.1193895 AM unlink-open.exe 3232 QueryAttributeTagFile C:\src\test2 SUCCESS Attributes: A, ReparseTag: 0x0
9:16:56.1194042 AM unlink-open.exe 3232 SetDispositionInformationFile C:\src\test2 SUCCESS Delete: True
9:16:56.1194188 AM unlink-open.exe 3232 CloseFile C:\src\test2 SUCCESS
9:16:56.1198459 AM unlink-open.exe 3232 CreateFile C:\src\test2 SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
9:16:56.1200302 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 0, Length: 1,048,576, Priority: Normal
...
9:16:56.1275871 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 10,485,760, Length: 1,048,576
### at the same time, CreateFile() for file "test" failed:
9:16:56.1276453 AM unlink-open.exe 4968 CreateFile C:\src\test DELETE PENDING Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0
9:16:56.1279359 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 11,534,336, Length: 1,048,576
9:16:56.1283452 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 12,582,912, Length: 1,048,576
...

But with rename (MoveFileEx), I see:
unlink-open test 150 1000 rename
...
9:38:01.7035286 AM unlink-open.exe 10208 WriteFile C:\src\test SUCCESS Offset: 156,237,824, Length: 1,048,576
9:38:01.7075621 AM unlink-open.exe 10208 WriteFile C:\src\test SUCCESS Offset: 157,286,400, Length: 1,048,576
9:38:01.7101299 AM unlink-open.exe 10208 CloseFile C:\src\test SUCCESS
9:38:01.7130802 AM unlink-open.exe 10208 CreateFile C:\src\test SUCCESS Desired Access: Read Attributes, Delete, Synchronize, Disposition: Open, Options: Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
9:38:01.7132844 AM unlink-open.exe 10208 QueryAttributeTagFile C:\src\test SUCCESS Attributes: A, ReparseTag: 0x0
9:38:01.7133420 AM unlink-open.exe 10208 QueryBasicInformationFile C:\src\test SUCCESS CreationTime: 11/24/2023 9:38:01 AM, LastAccessTime: 11/24/2023 9:38:01 AM, LastWriteTime: 11/24/2023 9:38:01 AM, ChangeTime: 11/24/2023 9:38:01 AM, FileAttributes: A
9:38:01.7135191 AM unlink-open.exe 10208 CreateFile C:\src SUCCESS Desired Access: Write Data/Add File, Synchronize, Disposition: Open, Options: , Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
### file "test" renamed to "test.tmp", it doesn't get into state DELETE PENDING:
9:38:01.7136221 AM unlink-open.exe 10208 SetRenameInformationFile C:\src\test SUCCESS ReplaceIfExists: True, FileName: C:\src\test.tmp
9:38:01.8384110 AM unlink-open.exe 10208 CloseFile C:\src SUCCESS
9:38:01.8388203 AM unlink-open.exe 10208 CloseFile C:\src\test.tmp SUCCESS
### then file "test.tmp" is deleted as usual:
9:38:01.8394278 AM unlink-open.exe 10208 CreateFile C:\src\test.tmp SUCCESS Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
9:38:01.8396534 AM unlink-open.exe 10208 QueryAttributeTagFile C:\src\test.tmp SUCCESS Attributes: A, ReparseTag: 0x0
9:38:01.8396885 AM unlink-open.exe 10208 SetDispositionInformationFile C:\src\test.tmp SUCCESS Delete: True
9:38:01.8397312 AM unlink-open.exe 10208 CloseFile C:\src\test.tmp SUCCESS
9:38:01.9162566 AM unlink-open.exe 10208 CreateFile C:\src\test SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
9:38:01.9167628 AM unlink-open.exe 10208 CloseFile C:\src\test SUCCESS

So the same test run with MoveFileEx():
unlink-open test 150 1000 rename
unlink-open test2 150 1000 rename
successfully passes for me in the same environment (Windows VM slowed down to 50%).

That is, my idea was to try removing a file through renaming it as a fast path (thus avoiding the troublesome DELETE PENDING state), and if that fails, to perform the removal as before. Maybe the whole function could be simplified, but I'm not sure about the special cases yet.

> * IIUC, the important point is the latter part, which waits until the status
> changes. Given that, can we remove the double rmtree() from
> cleanup_output_dirs()? It seems to have been added for a similar reason.

I haven't yet reproduced the failure that motivated that doubling (IIUC, it was observed in [1]), even with c28911750 reverted, so I need more time to research that issue before answering this question.

[1] https://www.postgresql.org/message-id/20230131172806.GM22427%40telsasoft.com

Best regards,
Alexander
Attachment
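For readers following along, the fast path described above can be sketched roughly as follows. This is not the attached PoC patch itself, only a simplified stand-alone illustration; the ".tmp" suffix, the fallback behavior, and the demo file name are assumptions.

```
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Sketch: remove a file by renaming it out of the way first, so the original
 * name is freed synchronously, then unlink the renamed file.  If the rename
 * fails, fall back to a plain unlink() of the original name.
 */
static int
unlink_via_rename(const char *path)
{
	char		tmppath[1024];

	if (strlen(path) + 5 >= sizeof(tmppath))
		return -1;
	snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

	if (rename(path, tmppath) == 0)
		return unlink(tmppath);	/* fast path: the original name is already free */

	return unlink(path);		/* fall back to the plain removal */
}

int
main(void)
{
	FILE	   *f = fopen("rename_demo.dat", "w");	/* arbitrary demo file */

	if (f)
		fclose(f);
	return (unlink_via_rename("rename_demo.dat") == 0) ? 0 : 1;
}
```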
Dear Alexander,

> Please look at the simple test program attached. It demonstrates the
> failure for me when running in two sessions as follows:
> unlink-open test 150 1000
> unlink-open test2 150 1000

Thanks for attaching the program; it helps us understand the issue. I wanted to confirm your environment: this failure occurred on Windows Server XXXX, right?

> That is, my idea was to try removing a file through renaming it as a fast
> path (thus avoiding the troublesome DELETE PENDING state), and if that
> fails, to perform the removal as before. Maybe the whole function could be
> simplified, but I'm not sure about the special cases yet.

I think your results show that pgrename() is delayed much more rarely than unlink(). That is why a file with the original name no longer exists when the subsequent open() is called. As for the special cases, I would like more experienced developers to check them.

> > * IIUC, the important point is the latter part, which waits until the status
> > changes. Given that, can we remove the double rmtree() from
> > cleanup_output_dirs()? It seems to have been added for a similar reason.
>
> I haven't yet reproduced the failure that motivated that doubling (IIUC, it
> was observed in [1]), even with c28911750 reverted, so I need more time to
> research that issue before answering this question.

Yeah, in the first place, this failure seldom occurred....

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Hello Kuroda-san,

25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:
> Thanks for attaching the program; it helps us understand the issue.
> I wanted to confirm your environment: this failure occurred on Windows Server XXXX, right?

I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure depends not on the Windows edition, but rather on its build. For Windows Server 2019 the "good" build is somewhere between 17763.1 and 17763.592, and for Windows 10 it's between 14393.0 and 17763.1. (Maybe there was some change related to the FILE_DISPOSITION_POSIX_SEMANTICS/FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find information about that change.)

It would also be interesting to know the full version/build of the OS on drongo and fairywren.

>> That is, my idea was to try removing a file through renaming it as a fast
>> path (thus avoiding the troublesome DELETE PENDING state), and if that
>> fails, to perform the removal as before. Maybe the whole function could be
>> simplified, but I'm not sure about the special cases yet.
> I think your results show that pgrename() is delayed much more rarely than unlink().
> That is why a file with the original name no longer exists when the subsequent open() is called.

I think that's because unlink() is performed asynchronously on those old Windows versions, but rename() is always synchronous.

>>> * IIUC, the important point is the latter part, which waits until the status
>>> changes. Given that, can we remove the double rmtree() from
>>> cleanup_output_dirs()? It seems to have been added for a similar reason.
>> I haven't yet reproduced the failure that motivated that doubling (IIUC, it
>> was observed in [1]), even with c28911750 reverted, so I need more time to
>> research that issue before answering this question.
> Yeah, in the first place, this failure seldom occurred....

I've managed to reproduce that issue (or at least a situation that manifested similarly) with a sleep added in miscinit.c:
         ereport(IsPostmasterEnvironment ? LOG : NOTICE,
                 (errmsg("database system is shut down")));
+        pg_usleep(500000L);

With this change, I get the same warning as in [1] when running 10 tests 002_pg_upgrade in parallel with a minimal olddump (on iterations 33, 46, 8). And with my PoC patch applied, I could see the same warning as well (on iteration 6).

I believe that's because rename() can't rename a directory containing an open file, just as unlink() can't remove it.

In the light of the above, I think that the issue in question should be fixed in accordance with/as a supplement to [2].

[1] https://www.postgresql.org/message-id/20230131172806.GM22427%40telsasoft.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com

Best regards,
Alexander
On 2023-11-27 Mo 07:00, Alexander Lakhin wrote: > Hello Kuroda-san, > > 25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote: >> Thanks for attaching a program. This helps us to understand the issue. >> I wanted to confirm your env - this failure was occurred on windows >> server XXXX, right? > > I see that behavior on: > Windows 10 Version 1607 (OS Build 14393.0) > Windows Server 2016 Version 1607 (OS Build 14393.0) > Windows Server 2019 Version 1809 (OS Build 17763.1) > > But it's not reproduced on: > Windows 10 Version 1809 (OS Build 17763.1) (triple-checked) > Windows Server 2019 Version 1809 (OS Build 17763.592) > Windows 10 Version 22H2 (OS Build 19045.3693) > Windows 11 Version 21H2 (OS Build 22000.613) > > So it looks like the failure occurs depending not on Windows edition, but > rather on it's build. For Windows Server 2019 the "good" build is > somewhere between 17763.1 and 17763.592, but for Windows 10 it's between > 14393.0 and 17763.1. > (Maybe there was some change related to FILE_DISPOSITION_POSIX_SEMANTICS/ > FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find > information about that change.) > > It's also interesting, what is full version/build of OS on drongo and > fairywren. > > It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122 cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2023-11-27 Mo 07:39, Andrew Dunstan wrote: > > On 2023-11-27 Mo 07:00, Alexander Lakhin wrote: >> Hello Kuroda-san, >> >> 25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote: >>> Thanks for attaching a program. This helps us to understand the issue. >>> I wanted to confirm your env - this failure was occurred on windows >>> server XXXX, right? >> >> I see that behavior on: >> Windows 10 Version 1607 (OS Build 14393.0) >> Windows Server 2016 Version 1607 (OS Build 14393.0) >> Windows Server 2019 Version 1809 (OS Build 17763.1) >> >> But it's not reproduced on: >> Windows 10 Version 1809 (OS Build 17763.1) (triple-checked) >> Windows Server 2019 Version 1809 (OS Build 17763.592) >> Windows 10 Version 22H2 (OS Build 19045.3693) >> Windows 11 Version 21H2 (OS Build 22000.613) >> >> So it looks like the failure occurs depending not on Windows edition, >> but >> rather on it's build. For Windows Server 2019 the "good" build is >> somewhere between 17763.1 and 17763.592, but for Windows 10 it's between >> 14393.0 and 17763.1. >> (Maybe there was some change related to >> FILE_DISPOSITION_POSIX_SEMANTICS/ >> FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find >> information about that change.) >> >> It's also interesting, what is full version/build of OS on drongo and >> fairywren. >> >> > > It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122 > > > I've updated it to 17763.5122 now. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Dear Alexander, Andrew,

Thanks for your analysis!

> I see that behavior on:
> Windows 10 Version 1607 (OS Build 14393.0)
> Windows Server 2016 Version 1607 (OS Build 14393.0)
> Windows Server 2019 Version 1809 (OS Build 17763.1)
>
> But it's not reproduced on:
> Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
> Windows Server 2019 Version 1809 (OS Build 17763.592)
> Windows 10 Version 22H2 (OS Build 19045.3693)
> Windows 11 Version 21H2 (OS Build 22000.613)
>
> So it looks like the failure depends not on the Windows edition, but rather
> on its build. For Windows Server 2019 the "good" build is somewhere between
> 17763.1 and 17763.592, and for Windows 10 it's between 14393.0 and 17763.1.
> (Maybe there was some change related to the FILE_DISPOSITION_POSIX_SEMANTICS/
> FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
> information about that change.)
>
> It would also be interesting to know the full version/build of the OS on
> drongo and fairywren.

Thanks for your interest in the issue. I have been tracking the failure, but it has not occurred again. Your analysis suggests that the BF failures can be resolved by updating the OSes.

> I think that's because unlink() is performed asynchronously on those old
> Windows versions, but rename() is always synchronous.

OK. I could not find documentation describing this, but your experiment demonstrates it.

> I've managed to reproduce that issue (or at least a situation that
> manifested similarly) with a sleep added in miscinit.c:
>          ereport(IsPostmasterEnvironment ? LOG : NOTICE,
>                  (errmsg("database system is shut down")));
> +        pg_usleep(500000L);
>
> With this change, I get the same warning as in [1] when running 10 tests
> 002_pg_upgrade in parallel with a minimal olddump (on iterations 33, 46, 8).
> And with my PoC patch applied, I could see the same warning as well (on
> iteration 6).
>
> I believe that's because rename() can't rename a directory containing an
> open file, just as unlink() can't remove it.
>
> In the light of the above, I think that the issue in question should be
> fixed in accordance with/as a supplement to [2].

OK, I understand that more fixes are needed in this area. For now, though, we should stay focused on this failure. Your patch looks reasonable to me, but it needs more eyes from Windows-savvy developers. What do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Hello Andrew and Kuroda-san,

27.11.2023 16:58, Andrew Dunstan wrote:
>>> It's also interesting, what is full version/build of OS on drongo and
>>> fairywren.
>>
>> It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122
>
> I've updated it to 17763.5122 now.

Thank you for the information! It pushed me to upgrade my Server 2019 1809 17763.592 to 17763.4252. And then I discovered that I had difficulties reproducing the issue on all my VMs after a reboot (even on old versions/builds). It took me a while to understand what was going on and what affects reproduction of the issue. I was puzzled by the fact that I couldn't reproduce the issue with my unlink-open test program under seemingly the same conditions as before, until I realized that the issue is reproduced only when the target directory is open in Windows Explorer. I'm sorry for bringing more mystery into the topic and for the earlier misleading information.

So, the issue is reproduced only when something scans the working directory and opens the files in it. I added the same logic to my test program (see unlink-open-scandir attached) and now I see the failure on Windows Server 2019 (Version 10.0.17763.4252). A script like this:
start cmd /c "unlink-open-scandir test1 10 5000 >log1 2>&1"
...
start cmd /c "unlink-open-scandir test10 10 5000 >log10 2>&1"

results in:
C:\temp>find "failed" log*
---------- LOG1
---------- LOG10
fopen() after unlink() failed (13)
---------- LOG2
fopen() after unlink() failed (13)
---------- LOG3
fopen() after unlink() failed (13)
---------- LOG4
fopen() after unlink() failed (13)
---------- LOG5
fopen() after unlink() failed (13)
---------- LOG6
fopen() after unlink() failed (13)
---------- LOG7
fopen() after unlink() failed (13)
---------- LOG8
fopen() after unlink() failed (13)
---------- LOG9
fopen() after unlink() failed (13)

C:\temp>type log10
...
iteration 108
fopen() after unlink() failed (13)

The same is observed on:
Windows 10 Version 1809 (OS Build 17763.1)

But no failures on:
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So the behavior change really did take place, but my previous estimates were incorrect (my apologies).

BTW, the "rename" mode of the test program can produce rarer errors on rename:
---------- LOG3
MoveFileEx() failed (0)
but not on open.

30.11.2023 13:00, Hayato Kuroda (Fujitsu) wrote:
> Thanks for your interest in the issue. I have been tracking the failure, but it has not occurred again.
> Your analysis suggests that the BF failures can be resolved by updating the OSes.

Yes, but I don't think that leaving Server 2019 behind (I suppose Windows Server 2019 build 20348 would have the same behaviour as Windows 10 19045) is acceptable. (Though looking at Cirrus CI logs, I see that what is labeled "Windows Server 2019" there is in fact Windows Server 2022.)

>> I think that's because unlink() is performed asynchronously on those old
>> Windows versions, but rename() is always synchronous.
>
> OK. I could not find documentation describing this, but your experiment demonstrates it.

I don't know what this peculiarity is called, but it looks like when some other process holds the file handle, unlink() returns as if the file had been deleted completely, but the subsequent open() fails.

Best regards,
Alexander
Attachment
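Since the attached program is not shown inline, here is a rough, hedged reconstruction of what an unlink/reopen stress test with a directory scan might look like; the real unlink-open-scandir attachment may differ, and the block size, file names, and use of <dirent.h> (POSIX; available e.g. under MinGW on Windows) are assumptions.

```
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Briefly open every entry in the current directory, as a scanner would. */
static void
scan_directory(void)
{
	DIR		   *dir = opendir(".");
	struct dirent *de;

	if (dir == NULL)
		return;
	while ((de = readdir(dir)) != NULL)
	{
		FILE	   *f = fopen(de->d_name, "rb");

		if (f)
			fclose(f);
	}
	closedir(dir);
}

int
main(int argc, char **argv)
{
	const char *name = (argc > 1) ? argv[1] : "test";
	int			blocks = (argc > 2) ? atoi(argv[2]) : 10;
	int			iterations = (argc > 3) ? atoi(argv[3]) : 1000;
	static char buf[1024 * 1024];

	memset(buf, 'x', sizeof(buf));

	for (int i = 1; i <= iterations; i++)
	{
		FILE	   *f;

		printf("iteration %d\n", i);
		scan_directory();

		/* re-create the file; after the unlink() below this may fail */
		f = fopen(name, "wb");
		if (f == NULL)
		{
			printf("fopen() after unlink() failed (%d)\n", errno);
			return 1;
		}
		for (int b = 0; b < blocks; b++)
			fwrite(buf, 1, sizeof(buf), f);
		fclose(f);

		remove(name);			/* unlink(); may leave the name in DELETE_PENDING */
	}
	return 0;
}
```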
Dear Alexander,

> I agree with your analysis and would like to propose a PoC fix (see
> attached). With this patch applied, 20 iterations succeeded for me.

Since there are no other reviewers, I will review it again so that we can turn the PoC into a concrete patch. Note that I focused only on fixing the random failure; other parts are out of scope. Both of my comments are about updating the code comments accordingly.

01.
```
	/*
	 * This function might be called for a regular file or for a junction
	 * point (which we use to emulate symlinks).  The latter must be unlinked
	 * with rmdir() on Windows.  Before we worry about any of that, let's see
	 * if we can unlink directly, since that's expected to be the most common
	 * case.
	 */
	snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
	if (pgrename(path, tmppath) == 0)
	{
		if (unlink(tmppath) == 0)
			return 0;
		curpath = tmppath;
	}
```
A comment should be added atop this change because it is not trivial. Below is my draft:
```
 * XXX: we rename the target file to ".tmp" before calling unlink. Otherwise
 * the removal can be left in the STATUS_DELETE_PENDING state on Windows, and
 * creating a file with the same name would fail. This assumes that renaming
 * is a synchronous operation.
```

02.
```
	loops = 0;
	while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
	{
		if (++loops > 100)		/* time out after 10 sec */
			return -1;
		pg_usleep(100000);		/* us */
	}
```
A comment can be added atop this part as well. Below is my draft:
```
/*
 * Wait until the removal is really finished, to avoid errors when creating
 * a file with the same name in other functions.
 */
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Hello Kuroda-san,

28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander,
>
>> I agree with your analysis and would like to propose a PoC fix (see
>> attached). With this patch applied, 20 iterations succeeded for me.
> Since there are no other reviewers, I will review it again so that we can
> turn the PoC into a concrete patch. Note that I focused only on fixing the
> random failure; other parts are out of scope.

Thinking about that fix some more, I'm not satisfied with the proposed approach. First, it turns every unlink operation into two write operations (rename + unlink), not to mention the new risk of having stale .tmp files (perhaps that's OK for regular files, since MoveFileEx can overwrite existing files, but not for directories). Second, it does that on all Windows OS versions, including modern ones, which, as we know, are not affected by the issue.

So I started to think about another approach: to perform unlink as it's implemented now, but then wait until the DELETE_PENDING state is gone. And I was very surprised to see that this state is not transient in our case. Additional investigation showed that the test fails not because some outside process opens a file (concretely, {template1_id/postgres_id}/2683) that is being deleted, but because of an internal process that opens the file and holds a handle to it indefinitely. And the internal process is ... the background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and got 20 x 10 tests passing. Thus, if we want just to get rid of the test failure, maybe it's enough to add this to the test's config...

The other way to go is to find out whether the background writer process should react to the shared-inval message sent from smgrdounlinkall() and close that file's handle. Maybe we could also (after changing the bgwriter's behaviour) add a waiting loop to pgwin32_open_handle() to completely rule out transient open() failures due to some other process (such as Windows Explorer) opening a file being deleted, but I would not complicate things until we have a clear vision/plan for using modern APIs and relying on modern OS versions' behavior. I mean proceeding with something like:
https://commitfest.postgresql.org/40/3951/

Best regards,
Alexander
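A hedged sketch of the waiting idea mentioned above (poll after unlink() until the name really disappears) is shown below; it is not the PostgreSQL implementation, and the use of stat()/errno, the timeout, and the demo file name are assumptions. As noted above, such a wait cannot help when another process holds the handle indefinitely.

```
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sketch: unlink a file and then wait (bounded) until stat() reports that
 * the name is really gone, so that an immediate re-creation cannot collide
 * with a pending deletion.
 */
static int
unlink_and_wait(const char *path)
{
	struct stat st;
	int			loops = 0;

	if (unlink(path) != 0)
		return -1;

	while (stat(path, &st) == 0 || errno != ENOENT)
	{
		if (++loops > 100)		/* give up after ~10 seconds */
			return -1;
		usleep(100000);			/* 100 ms */
	}
	return 0;
}

int
main(void)
{
	FILE	   *f = fopen("wait_demo.dat", "w");	/* arbitrary demo file */

	if (f)
		fclose(f);
	return (unlink_and_wait("wait_demo.dat") == 0) ? 0 : 1;
}
```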
On Tue, Jan 2, 2024 at 10:30 AM Alexander Lakhin <exclusion@gmail.com> wrote: > > 28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote: > > Dear Alexander, > > > >> I agree with your analysis and would like to propose a PoC fix (see > >> attached). With this patch applied, 20 iterations succeeded for me. > > There are no reviewers so that I will review again. Let's move the PoC > > to the concrete patch. Note that I only focused on fixes of random failure, > > other parts are out-of-scope. > > Thinking about that fix more, I'm not satisfied with the approach proposed. > First, it turns every unlink operation into two write operations > (rename + unlink), not to say about new risks of having stale .tmp files > (perhaps, it's ok for regular files (MoveFileEx can overwrite existing > files), but not for directories) > Second, it does that on any Windows OS versions, including modern ones, > which are not affected by the issue, as we know. > > So I started to think about other approach: to perform unlink as it's > implemented now, but then wait until the DELETE_PENDING state is gone. > There is a comment in the code which suggests we shouldn't wait indefinitely. See "However, we won't wait indefinitely for someone else to close the file, as the caller might be holding locks and blocking other backends." > And I was very surprised to see that this state is not transient in our case. > Additional investigation showed that the test fails not because some aside > process opens a file (concretely, {template1_id/postgres_id}/2683), that is > being deleted, but because of an internal process that opens the file and > holds a handle to it indefinitely. > And the internal process is ... background writer (BgBufferSync()). > > So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and > got 20 x 10 tests passing. > > Thus, it we want just to get rid of the test failure, maybe it's enough to > add this to the test's config... > What about checkpoints? Can't it do the same while writing the buffers? -- With Regards, Amit Kapila.
Hello Amit,

03.01.2024 14:42, Amit Kapila wrote:
>> So I started to think about another approach: to perform unlink as it's
>> implemented now, but then wait until the DELETE_PENDING state is gone.
>
> There is a comment in the code which suggests we shouldn't wait
> indefinitely. See "However, we won't wait indefinitely for someone
> else to close the file, as the caller might be holding locks and
> blocking other backends."

Yes, I saw it, but initially I thought that we had a transient condition there, so waiting in open() (instead of failing immediately) seemed like a good idea at the time...

>> And the internal process is ... the background writer (BgBufferSync()).
>>
>> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
>> got 20 x 10 tests passing.
>>
>> Thus, if we want just to get rid of the test failure, maybe it's enough to
>> add this to the test's config...
>
> What about checkpoints? Can't it do the same while writing the buffers?

As we deal here with pg_upgrade/pg_restore, it may not be very easy to get the desired effect, but I think it's not impossible in principle. More details below.

What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
-- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
...

We get the effect discussed when the background writer process decides to flush a file buffer for pg_largeobject during stage 1. (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE, the result must be the same.) And another important factor is shared_buffers = 1MB (set during the test). With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running pg_upgrade in a loop (I got failures on iterations 22, 37, 17 with the default cluster). If the old cluster contains dozens of databases, this increases the failure probability significantly (with 10 additional databases I got failures on iterations 4, 1, 6). Please see the reproducing script attached.

Best regards,
Alexander
Attachment
On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote: > > 03.01.2024 14:42, Amit Kapila wrote: > > > > >> And the internal process is ... background writer (BgBufferSync()). > >> > >> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and > >> got 20 x 10 tests passing. > >> > >> Thus, it we want just to get rid of the test failure, maybe it's enough to > >> add this to the test's config... > >> > > What about checkpoints? Can't it do the same while writing the buffers? > > As we deal here with pg_upgrade/pg_restore, it must not be very easy to get > the desired effect, but I think it's not impossible in principle. > More details below. > What happens during the pg_upgrade execution is essentially: > 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...; > -- this command flushes file buffers as well > 2) ALTER DATABASE postgres OWNER TO ... > 3) COMMENT ON DATABASE "postgres" IS ... > 4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes > SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid); > SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid); > TRUNCATE pg_catalog.pg_largeobject; > -- ^^^ here we can get the error "could not create file "base/5/2683": File exists" > ... > > We get the effect discussed when the background writer process decides to > flush a file buffer for pg_largeobject during stage 1. > (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE, > the result must be the same.) > And another important factor is shared_buffers = 1MB (set during the test). > With the default setting of 128MB I couldn't see the failure. > > It can be reproduced easily (on old Windows versions) just by running > pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the > default cluster)). > If an old cluster contains dozen of databases, this increases the failure > probability significantly (with 10 additional databases I've got failures > on iterations 4, 1, 6). > I don't have an old Windows environment to test but I agree with your analysis and theory. The question is what should we do for these new random BF failures? I think we should set bgwriter_lru_maxpages to 0 and checkpoint_timeout to 1hr for these new tests. Doing some invasive fix as part of this doesn't sound reasonable because this is an existing problem and there seems to be another patch by Thomas that probably deals with the root cause of the existing problem [1] as pointed out by you. [1] - https://commitfest.postgresql.org/40/3951/ -- With Regards, Amit Kapila.
On 1/4/24 10:19 PM, Amit Kapila wrote: > On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote: >> >> 03.01.2024 14:42, Amit Kapila wrote: >>> >> >>>> And the internal process is ... background writer (BgBufferSync()). >>>> >>>> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and >>>> got 20 x 10 tests passing. >>>> >>>> Thus, it we want just to get rid of the test failure, maybe it's enough to >>>> add this to the test's config... >>>> >>> What about checkpoints? Can't it do the same while writing the buffers? >> >> As we deal here with pg_upgrade/pg_restore, it must not be very easy to get >> the desired effect, but I think it's not impossible in principle. >> More details below. >> What happens during the pg_upgrade execution is essentially: >> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...; >> -- this command flushes file buffers as well >> 2) ALTER DATABASE postgres OWNER TO ... >> 3) COMMENT ON DATABASE "postgres" IS ... >> 4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes >> SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid); >> SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid); >> TRUNCATE pg_catalog.pg_largeobject; >> -- ^^^ here we can get the error "could not create file "base/5/2683": File exists" >> ... >> >> We get the effect discussed when the background writer process decides to >> flush a file buffer for pg_largeobject during stage 1. >> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE, >> the result must be the same.) >> And another important factor is shared_buffers = 1MB (set during the test). >> With the default setting of 128MB I couldn't see the failure. >> >> It can be reproduced easily (on old Windows versions) just by running >> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the >> default cluster)). >> If an old cluster contains dozen of databases, this increases the failure >> probability significantly (with 10 additional databases I've got failures >> on iterations 4, 1, 6). >> > > I don't have an old Windows environment to test but I agree with your > analysis and theory. The question is what should we do for these new > random BF failures? I think we should set bgwriter_lru_maxpages to 0 > and checkpoint_timeout to 1hr for these new tests. Doing some invasive > fix as part of this doesn't sound reasonable because this is an > existing problem and there seems to be another patch by Thomas that > probably deals with the root cause of the existing problem [1] as > pointed out by you. > > [1] - https://commitfest.postgresql.org/40/3951/ Isn't this just sweeping the problem (non-POSIX behavior on SMB and ReFS) under the carpet? I realize that synthetic test workloads like pg_upgrade in a loop aren't themselves real-world scenarios, but what about other cases? Even if we're certain it's not possible for these issues to wedge a server, it's still not a good experience for users to get random, unexplained IO-related errors... -- Jim Nasby, Data Architect, Austin TX
On Mon, Jan 8, 2024 at 9:36 PM Jim Nasby <jim.nasby@gmail.com> wrote: > > On 1/4/24 10:19 PM, Amit Kapila wrote: > > On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote: > >> > >> 03.01.2024 14:42, Amit Kapila wrote: > >>> > >> > >>>> And the internal process is ... background writer (BgBufferSync()). > >>>> > >>>> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and > >>>> got 20 x 10 tests passing. > >>>> > >>>> Thus, it we want just to get rid of the test failure, maybe it's enough to > >>>> add this to the test's config... > >>>> > >>> What about checkpoints? Can't it do the same while writing the buffers? > >> > >> As we deal here with pg_upgrade/pg_restore, it must not be very easy to get > >> the desired effect, but I think it's not impossible in principle. > >> More details below. > >> What happens during the pg_upgrade execution is essentially: > >> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...; > >> -- this command flushes file buffers as well > >> 2) ALTER DATABASE postgres OWNER TO ... > >> 3) COMMENT ON DATABASE "postgres" IS ... > >> 4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes > >> SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid); > >> SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid); > >> TRUNCATE pg_catalog.pg_largeobject; > >> -- ^^^ here we can get the error "could not create file "base/5/2683": File exists" > >> ... > >> > >> We get the effect discussed when the background writer process decides to > >> flush a file buffer for pg_largeobject during stage 1. > >> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE, > >> the result must be the same.) > >> And another important factor is shared_buffers = 1MB (set during the test). > >> With the default setting of 128MB I couldn't see the failure. > >> > >> It can be reproduced easily (on old Windows versions) just by running > >> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the > >> default cluster)). > >> If an old cluster contains dozen of databases, this increases the failure > >> probability significantly (with 10 additional databases I've got failures > >> on iterations 4, 1, 6). > >> > > > > I don't have an old Windows environment to test but I agree with your > > analysis and theory. The question is what should we do for these new > > random BF failures? I think we should set bgwriter_lru_maxpages to 0 > > and checkpoint_timeout to 1hr for these new tests. Doing some invasive > > fix as part of this doesn't sound reasonable because this is an > > existing problem and there seems to be another patch by Thomas that > > probably deals with the root cause of the existing problem [1] as > > pointed out by you. > > > > [1] - https://commitfest.postgresql.org/40/3951/ > > Isn't this just sweeping the problem (non-POSIX behavior on SMB and > ReFS) under the carpet? I realize that synthetic test workloads like > pg_upgrade in a loop aren't themselves real-world scenarios, but what > about other cases? Even if we're certain it's not possible for these > issues to wedge a server, it's still not a good experience for users to > get random, unexplained IO-related errors... > The point is that this is an existing known Windows behavior and that too only in certain versions. The fix doesn't seem to be straightforward, so it seems advisable to avoid random BF failures by having an appropriate configuration. 
-- With Regards, Amit Kapila.
Dear Amit, Alexander,

>> We get the effect discussed when the background writer process decides to
>> flush a file buffer for pg_largeobject during stage 1. (Thus, if a
>> checkpoint somehow happened to occur during CREATE DATABASE, the result
>> must be the same.) And another important factor is shared_buffers = 1MB
>> (set during the test). With the default setting of 128MB I couldn't see
>> the failure.
>>
>> It can be reproduced easily (on old Windows versions) just by running
>> pg_upgrade in a loop (I got failures on iterations 22, 37, 17 with the
>> default cluster). If the old cluster contains dozens of databases, this
>> increases the failure probability significantly (with 10 additional
>> databases I got failures on iterations 4, 1, 6).
>
> I don't have an old Windows environment to test but I agree with your
> analysis and theory. The question is what should we do for these new
> random BF failures? I think we should set bgwriter_lru_maxpages to 0
> and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> fix as part of this doesn't sound reasonable because this is an
> existing problem and there seems to be another patch by Thomas that
> probably deals with the root cause of the existing problem [1] as
> pointed out by you.
>
> [1] - https://commitfest.postgresql.org/40/3951/

Based on Amit's suggestion, I have created a patch that takes the alternative approach: it only adjusts GUC settings. The reported failure is only for 003_logical_slots, but the patch also includes changes for the recently added test, 004_subscription; IIUC, there is a possibility that 004 could fail in the same way.

Per our understanding, this patch should stop the random failures. Alexander, could you test it to confirm?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
Hello Kuroda-san,

09.01.2024 08:49, Hayato Kuroda (Fujitsu) wrote:
> Based on Amit's suggestion, I have created a patch that takes the alternative
> approach: it only adjusts GUC settings. The reported failure is only for
> 003_logical_slots, but the patch also includes changes for the recently added
> test, 004_subscription; IIUC, there is a possibility that 004 could fail in
> the same way.
>
> Per our understanding, this patch should stop the random failures. Alexander,
> could you test it to confirm?

Yes, the patch fixes the issue for me (without the patch I observe failures on iterations 1-2, with 10 tests running in parallel, but with the patch 10 iterations succeeded).

But as far as I can see, 004_subscription is not affected by the issue, because it doesn't enable streaming for the nodes new_sub and new_sub1. As I noted before, I could see the failure only with shared_buffers = 1MB (which is set with allows_streaming => 'logical'). So I'm not sure whether we need to modify 004 (or any other test that runs pg_upgrade).

As to checkpoint_timeout, personally I would not increase it, because it seems unbelievable to me that pg_restore (with the cluster containing only two empty databases) could run for longer than 5 minutes. I'd rather investigate such a situation separately, in case we encounter it, but maybe it's only me. On the other hand, if a checkpoint could occur for some reason within a shorter time span, then increasing the timeout would not matter, I suppose.

(I've also tested the bgwriter_lru_maxpages-only modification of your patch and can confirm that it works as well.)

Best regards,
Alexander
On Tue, Jan 9, 2024 at 2:30 PM Alexander Lakhin <exclusion@gmail.com> wrote: > > 09.01.2024 08:49, Hayato Kuroda (Fujitsu) wrote: > > Based on the suggestion by Amit, I have created a patch with the alternative > > approach. This just does GUC settings. The reported failure is only for > > 003_logical_slots, but the patch also includes changes for the recently added > > test, 004_subscription. IIUC, there is a possibility that 004 would fail as well. > > > > Per our understanding, this patch can stop random failures. Alexander, can you > > test for the confirmation? > > > > Yes, the patch fixes the issue for me (without the patch I observe failures > on iterations 1-2, with 10 tests running in parallel, but with the patch > 10 iterations succeeded). > > But as far I can see, 004_subscription is not affected by the issue, > because it doesn't enable streaming for nodes new_sub, new_sub1. > As I noted before, I could see the failure only with > shared_buffers = 1MB (which is set with allows_streaming => 'logical'). > So I'm not sure, whether we need to modify 004 (or any other test that > runs pg_upgrade). > I see your point and the probable reason for failure with shared_buffers=1MB is that the probability of bgwriter holding the file handle for pg_largeobject increases. So, let's change it only for 003. > As to checkpoint_timeout, personally I would not increase it, because it > seems unbelievable to me that pg_restore (with the cluster containing only > two empty databases) can run for longer than 5 minutes. I'd rather > investigate such situation separately, in case we encounter it, but maybe > it's only me. > I feel it is okay to set a higher value of checkpoint_timeout due to the same reason though the probability is less. I feel here it is important to explain in the comments why we are using these settings in the new test. I have thought of something like: "During the upgrade, bgwriter or checkpointer could hold the file handle for some removed file. Now, during restore when we try to create the file with the same name, it errors out. This behavior is specific to only some specific Windows versions and the probability of seeing this behavior is higher in this test because we use wal_level as logical via allows_streaming => 'logical' which in turn sets shared_buffers as 1MB." Thoughts? -- With Regards, Amit Kapila.
Hello Amit, 09.01.2024 13:08, Amit Kapila wrote: > >> As to checkpoint_timeout, personally I would not increase it, because it >> seems unbelievable to me that pg_restore (with the cluster containing only >> two empty databases) can run for longer than 5 minutes. I'd rather >> investigate such situation separately, in case we encounter it, but maybe >> it's only me. >> > I feel it is okay to set a higher value of checkpoint_timeout due to > the same reason though the probability is less. I feel here it is > important to explain in the comments why we are using these settings > in the new test. I have thought of something like: "During the > upgrade, bgwriter or checkpointer could hold the file handle for some > removed file. Now, during restore when we try to create the file with > the same name, it errors out. This behavior is specific to only some > specific Windows versions and the probability of seeing this behavior > is higher in this test because we use wal_level as logical via > allows_streaming => 'logical' which in turn sets shared_buffers as > 1MB." > > Thoughts? I would describe that behavior as "During upgrade, when pg_restore performs CREATE DATABASE, bgwriter or checkpointer may flush buffers and hold a file handle for pg_largeobject, so later TRUNCATE pg_largeobject command will fail if OS (such as older Windows versions) doesn't remove an unlinked file completely till it's open. ..." Best regards, Alexander
On Tue, Jan 9, 2024 at 4:30 PM Alexander Lakhin <exclusion@gmail.com> wrote: > > 09.01.2024 13:08, Amit Kapila wrote: > > > >> As to checkpoint_timeout, personally I would not increase it, because it > >> seems unbelievable to me that pg_restore (with the cluster containing only > >> two empty databases) can run for longer than 5 minutes. I'd rather > >> investigate such situation separately, in case we encounter it, but maybe > >> it's only me. > >> > > I feel it is okay to set a higher value of checkpoint_timeout due to > > the same reason though the probability is less. I feel here it is > > important to explain in the comments why we are using these settings > > in the new test. I have thought of something like: "During the > > upgrade, bgwriter or checkpointer could hold the file handle for some > > removed file. Now, during restore when we try to create the file with > > the same name, it errors out. This behavior is specific to only some > > specific Windows versions and the probability of seeing this behavior > > is higher in this test because we use wal_level as logical via > > allows_streaming => 'logical' which in turn sets shared_buffers as > > 1MB." > > > > Thoughts? > > I would describe that behavior as "During upgrade, when pg_restore performs > CREATE DATABASE, bgwriter or checkpointer may flush buffers and hold a file > handle for pg_largeobject, so later TRUNCATE pg_largeobject command will > fail if OS (such as older Windows versions) doesn't remove an unlinked file > completely till it's open. ..." > I am slightly hesitant to add any particular system table name in the comments as this can happen for any other system table as well, so slightly adjusted the comments in the attached. However, I think it is okay to mention the particular system table name in the commit message. Let me know what do you think. -- With Regards, Amit Kapila.
Attachment
10.01.2024 12:31, Amit Kapila wrote:
> I am slightly hesitant to add any particular system table name in the
> comments as this can happen for any other system table as well, so
> slightly adjusted the comments in the attached. However, I think it is
> okay to mention the particular system table name in the commit
> message. Let me know what do you think.

Thank you, Amit!

I'd like to note that the culprit is exactly pg_largeobject, as coded in dumpDatabase():
	/*
	 * pg_largeobject comes from the old system intact, so set its
	 * relfrozenxids, relminmxids and relfilenode.
	 */
	if (dopt->binary_upgrade)
...
		appendPQExpBufferStr(loOutQry,
							 "TRUNCATE pg_catalog.pg_largeobject;\n");

I see no other TRUNCATEs (or similar logic) around, so I would specify the table name in the comments. Though maybe I'm missing something...

Best regards,
Alexander
On Wed, Jan 10, 2024 at 3:30 PM Alexander Lakhin <exclusion@gmail.com> wrote: > > 10.01.2024 12:31, Amit Kapila wrote: > > I am slightly hesitant to add any particular system table name in the > > comments as this can happen for any other system table as well, so > > slightly adjusted the comments in the attached. However, I think it is > > okay to mention the particular system table name in the commit > > message. Let me know what do you think. > > Thank you, Amit! > > I'd like to note that the culprit is exactly pg_largeobject as coded in > dumpDatabase(): > /* > * pg_largeobject comes from the old system intact, so set its > * relfrozenxids, relminmxids and relfilenode. > */ > if (dopt->binary_upgrade) > ... > appendPQExpBufferStr(loOutQry, > "TRUNCATE pg_catalog.pg_largeobject;\n"); > > I see no other TRUNCATEs (or similar logic) around, so I would specify the > table name in the comments. Though maybe I'm missing something... > But tomorrow it could be for other tables and if we change this TRUNCATE logic for pg_largeobject (of which chances are less) then there is always a chance that one misses changing this comment. I feel keeping it generic in this case would be better as the problem is generic but it is currently shown for pg_largeobject. -- With Regards, Amit Kapila.
10.01.2024 13:37, Amit Kapila wrote: > But tomorrow it could be for other tables and if we change this > TRUNCATE logic for pg_largeobject (of which chances are less) then > there is always a chance that one misses changing this comment. I feel > keeping it generic in this case would be better as the problem is > generic but it is currently shown for pg_largeobject. Yes, for sure. So let's keep it generic as you prefer. Thank you! Best regards, Alexander
Dear Alexander, Amit,

>> But tomorrow it could be for other tables and if we change this
>> TRUNCATE logic for pg_largeobject (of which chances are less) then
>> there is always a chance that one misses changing this comment. I feel
>> keeping it generic in this case would be better as the problem is
>> generic but it is currently shown for pg_largeobject.
>
> Yes, for sure. So let's keep it generic as you prefer.
>
> Thank you!

Thanks for working on the patch. I'm also fine with pushing Amit's fix.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Jan 11, 2024 at 8:15 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > > But tomorrow it could be for other tables and if we change this > > > TRUNCATE logic for pg_largeobject (of which chances are less) then > > > there is always a chance that one misses changing this comment. I feel > > > keeping it generic in this case would be better as the problem is > > > generic but it is currently shown for pg_largeobject. > > > > Yes, for sure. So let's keep it generic as you prefer. > > > > Thank you! > > Thanks for working the patch. I'm also OK to push the Amit's fix patch. > Thanks to both of you. I have pushed the patch. -- With Regards, Amit Kapila.
Dear hackers,

> Thanks to both of you. I have pushed the patch.

I have been tracking the BF animals these days, and this failure has not been seen anymore. I think we can close this topic. Again, thanks for all your efforts.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED