Thread: Random pg_upgrade test failure on drongo

Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

While monitoring the buildfarm, I found that drongo failed the test
pg_upgrade/003_logical_slots [1]. Strangely, the test passed on the next run.
I'm not yet sure of the reason, but I will keep an eye on it and investigate if
it happens again.

I think this failure is not related to our logical slots work, even though it
occurred in 003_logical_slots.pl. For more detail, please see the latter part of
this message.

Further investigation may require the server log written during the upgrade. It
is kept in the data directory, so the buildfarm system does not upload it. I may
need additional information if the test fails again.

# Analysis of failure

According to the output, pg_upgrade seems to have failed while restoring objects
to the new cluster [2].

At the code level, the pg_upgrade command failed in exec_prog(). In that
function, pg_restore was being executed for the database "postgres".
Below is a brief call stack. Note that pg_restore is not used for migrating
logical replication slots; that is done by the pg_upgrade binary itself. Also,
the slot migration happens after all objects have been copied, not in
create_new_objects().

```
exec_prog()
parallel_exec_prog("pg_restore ... ") <-- since the -j option is not specified, this is just a wrapper
create_new_objects()
main()
```

In exec_prog(), the system() call returned a non-zero value. The documentation
says that system() returns the value returned by the command interpreter when
its input is not NULL [3]. Unfortunately, the current code does not log this
return code, and the buildfarm system does not upload the data directory for
failed tests, so I could not get more information for the investigation.
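Just to illustrate, here is a minimal sketch (not the actual exec_prog() code;
the function name run_command() is made up) of how the value returned by
system() could be captured and reported so it would at least show up in the log:

```
/*
 * Hypothetical sketch: run a command via system() and report its return
 * value on failure instead of discarding it.
 */
#include <stdio.h>
#include <stdlib.h>

static int
run_command(const char *cmd)
{
	int		rc = system(cmd);

	if (rc != 0)
		fprintf(stderr, "command \"%s\" failed: system() returned %d\n",
				cmd, rc);
	return rc;
}
```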

[1]:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2023-11-07%2013%3A43%3A23&stg=pg_upgrade-check
[2]:
```
...
# No postmaster PID for node "oldpub"
# Running: pg_upgrade --no-sync -d
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_oldpub_data/pgdata
-D
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata
-bC:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -B
C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin-s 127.0.0.1 -p 54813 -P 54814 --copy 
Performing Consistency Checks
...
Setting frozenxid and minmxid counters in new cluster         ok
Restoring global objects in the new cluster                   ok
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_logical_slots/data/t_003_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231107T142224.580/log/pg_upgrade_dump_5.log"
for
the probable cause of the failure.
Failure, exiting
[14:23:26.632](70.141s) not ok 10 - run of pg_upgrade of old cluster
[14:23:26.632](0.000s) #   Failed test 'run of pg_upgrade of old cluster'
#   at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 170.
### Starting node "newpub"
# Running: pg_ctl -w -D
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata
-lC:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\log/003_logical_slots_newpub.log -o
--cluster-name=newpubstart 
waiting for server to start.... done
server started
# Postmaster PID for node "newpub" is 4604
[14:23:28.398](1.766s) not ok 11 - check the slot exists on new cluster
[14:23:28.398](0.001s) #   Failed test 'check the slot exists on new cluster'
#   at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 176.
[14:23:28.399](0.000s) #          got: ''
#     expected: 'regress_sub|t'
...
```
[3]: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/system-wsystem?view=msvc-170


Best Regards,
Hayato Kuroda
FUJITSU LIMITED




RE: Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

> While tracking a buildfarm, I found that drongo failed the test
> pg_upgrade/003_logical_slots [1].
> A strange point is that the test passed in the next iteration. Currently I'm not
> sure the reason, but I will keep my eye for it and will investigate if it
> happens again.

This is just a status update. We found that fairywren also failed for the same
reason [2]. The failure is intermittent, but there may be something wrong on
Windows. I'm now trying to reproduce it with my colleagues so we can analyze it
in more detail, and I'm also working with Andrew to obtain the logs emitted
during the upgrade. I will continue to keep an eye on it.

[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-11-08%2010%3A22%3A45

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




RE: Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

This is another update. The machine drongo failed the test again a week ago [1],
and I have finally obtained the log files. Please see the attached files.

## Observed failure

pg_upgrade_server.log is the server log written while the pg_upgrade command was
running. According to it, the TRUNCATE command failed due to a "File exists" error.

```
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:  could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
    -- For binary upgrade, preserve pg_largeobject and index relfilenodes
    SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
    SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
    TRUNCATE pg_catalog.pg_largeobject;
...
```

## Analysis

I think this was caused by the STATUS_DELETE_PENDING problem and is not related
to the recent pg_upgrade changes.

The file "base/1/2683" is an index file for pg_largeobject_loid_pn_index, and the
output meant that file creation was failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange: before calling mdcreate(), we definitely unlink the file
with the same name. Below is the call trace down to unlink().

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found that Thomas said [2] that pgunlink() sometimes cannot actually remove a
file even though it returns OK; in that case the NTSTATUS is
STATUS_DELETE_PENDING. A comment in pgwin32_open_handle() mentions the same thing:

```
        /*
         * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
         * gone (Windows NT status code is STATUS_DELETE_PENDING).  In that
         * case, we'd better ask for the NT status too so we can translate it
         * to a more Unix-like error.  We hope that nothing clobbers the NT
         * status in between the internal NtCreateFile() call and CreateFile()
         * returning.
         *
```

The definition of STATUS_DELETE_PENDING can be found in [3]. Based on that,
open() can indeed fail if the file is opened while its deletion is still pending.

Another thread [4] deals with this issue in the rmtree()->unlink() path, and it
retries the removal when it fails with STATUS_DELETE_PENDING. So, should we also
retry the open() when it fails this way? In any case, such a fix seems out of
scope for pg_upgrade.
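Just as a sketch for discussion (this is not the actual pgwin32_open_handle()
code; it assumes the pending deletion surfaces from open() as EEXIST or EACCES,
and it uses pg_usleep(), PostgreSQL's portability sleep wrapper), a bounded
retry could look like this:

```
/*
 * Hypothetical sketch: retry open() for a bounded time while a previous
 * unlink() of the same name is still pending on Windows.
 */
#include <errno.h>
#include <fcntl.h>

static int
open_with_retry(const char *path, int flags, int mode)
{
	for (int loops = 0; loops < 100; loops++)	/* give up after ~10 sec */
	{
		int		fd = open(path, flags, mode);

		if (fd >= 0 || (errno != EEXIST && errno != EACCES))
			return fd;
		pg_usleep(100000);	/* wait 100 ms, then try again */
	}
	return -1;			/* still failing; keep errno from the last attempt */
}
```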

What do you think? Do you have any thoughts about it?

## Acknowledgement

I would like to thank Sholk and Vingesh for helping with the analysis.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-11-15%2006%3A16%3A15
[2]: https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com
[3]: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[4]:
https://www.postgresql.org/message-id/flat/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de#6ae5e2ba3dd6e1fd680dcc34eab710d5

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




RE: Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

> This email tells an update. The machine drongo failed the test a week ago [1]
> and finally got logfiles. PSA files.

Oh, sorry, I forgot to attach the files. You can see pg_upgrade_server.log for now.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: Random pg_upgrade test failure on drongo

From
Alexander Lakhin
Date:
Hello Kuroda-san,

21.11.2023 13:37, Hayato Kuroda (Fujitsu) wrote:
>> This email tells an update. The machine drongo failed the test a week ago [1]
>> and finally got logfiles. PSA files.
> Oh, sorry. I missed to attach files. You can see pg_upgrade_server.log for now.
>

I can easily reproduce this failure on my workstation by running 5 instances of
the 003_logical_slots test in parallel inside a Windows VM with its CPU
resources limited to 50%, like so:
VBoxManage controlvm "Windows" cpuexecutioncap 50

set PGCTLTIMEOUT=180
python3 -c "NUMITERATIONS=20;NUMTESTS=5;import os;tsts='';exec('for i in range(1,NUMTESTS+1): 
tsts+=f\"pg_upgrade_{i}/003_logical_slots \"'); exec('for i in range(1,NUMITERATIONS+1):print(f\"iteration {i}\"); 
assert(os.system(f\"meson test --num-processes {NUMTESTS} {tsts}\") == 0)')"
...
iteration 2
ninja: Entering directory `C:\src\postgresql\build'
ninja: no work to do.
1/5 postgresql:pg_upgrade_2 / pg_upgrade_2/003_logical_slots ERROR            60.30s   exit status 25
...
pg_restore: error: could not execute query: ERROR:  could not create file "base/1/2683": File exists
...

I agree with your analysis and would like to propose a PoC fix (see
attached). With this patch applied, 20 iterations succeeded for me.

Best regards,
Alexander
Attachment

RE: Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander,

> 
> I can easily reproduce this failure on my workstation by running 5 tests
> 003_logical_slots in parallel inside Windows VM with it's CPU resources
> limited to 50%, like so:
> VBoxManage controlvm "Windows" cpuexecutioncap 50
> 
> set PGCTLTIMEOUT=180
> python3 -c "NUMITERATIONS=20;NUMTESTS=5;import os;tsts='';exec('for i in
> range(1,NUMTESTS+1):
> tsts+=f\"pg_upgrade_{i}/003_logical_slots \"'); exec('for i in
> range(1,NUMITERATIONS+1):print(f\"iteration {i}\");
> assert(os.system(f\"meson test --num-processes {NUMTESTS} {tsts}\") == 0)')"
> ...
> iteration 2
> ninja: Entering directory `C:\src\postgresql\build'
> ninja: no work to do.
> 1/5 postgresql:pg_upgrade_2 / pg_upgrade_2/003_logical_slots
> ERROR            60.30s   exit status 25
> ...
> pg_restore: error: could not execute query: ERROR:  could not create file
> "base/1/2683": File exists
> ...

Great. I do not have such an environment, so I could not reproduce it myself.
This seems to suggest that the failure occurred because the system was busy.

> I agree with your analysis and would like to propose a PoC fix (see
> attached). With this patch applied, 20 iterations succeeded for me.

Thanks; here are some comments. I'm not very familiar with Windows, so I may say
something wrong.

* I'm not sure why the file/directory is renamed before doing the unlink.
  Could you add a comment describing that?
* IIUC, the important point is the latter part, which waits until the status
  changes. Based on that, can we remove the double rmtree() from
  cleanup_output_dirs()? It seems to have been added for a similar reason.

```
+    loops = 0;
+    while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
+    {
+        if (++loops > 100)        /* time out after 10 sec */
+            return -1;
+        pg_usleep(100000);        /* us */
+    }
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Random pg_upgrade test failure on drongo

From
Alexander Lakhin
Date:
Hello Kuroda-san,

23.11.2023 15:15, Hayato Kuroda (Fujitsu) wrote:
>
>> I agree with your analysis and would like to propose a PoC fix (see
>> attached). With this patch applied, 20 iterations succeeded for me.
> Thanks, here are comments. I'm quite not sure for the windows, so I may say
> something wrong.
>
> * I'm not sure why the file/directory name was changed before doing a unlink.
>    Could you add descriptions?

Please look at the simple test program attached. It demonstrates the
failure for me when running in two sessions as follows:
unlink-open test 150 1000
                 unlink-open test2 150 1000
...
iteration 60
iteration 61
fopen() after unlink() failed (13)
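
For reference, a rough reconstruction of the core loop of the test program (the
attached source is authoritative; the argument meanings here are assumptions:
file name, size in MB, iteration count; the "rename" mode, which first moves the
file to a .tmp name with MoveFileEx() before deleting it, is omitted):

```
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <io.h>					/* unlink() on Windows */

int
main(int argc, char **argv)
{
	static char buf[1024 * 1024];
	const char *name;
	int			mbytes;
	int			iterations;

	if (argc < 4)
		return 1;
	name = argv[1];
	mbytes = atoi(argv[2]);
	iterations = atoi(argv[3]);

	for (int i = 1; i <= iterations; i++)
	{
		FILE	   *f = fopen(name, "wb");

		printf("iteration %d\n", i);
		if (f == NULL)
		{
			/* reopening right after unlink() fails on affected builds */
			printf("fopen() after unlink() failed (%d)\n", errno);
			return 1;
		}
		for (int j = 0; j < mbytes; j++)
			fwrite(buf, 1, sizeof(buf), f);
		fclose(f);
		if (unlink(name) != 0)
		{
			printf("unlink() failed (%d)\n", errno);
			return 1;
		}
	}
	return 0;
}
```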

Process Monitor shows:
...
         9:16:55.9249792 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 138,412,032, 
Length: 1,048,576
### unlink() performed for the file "test":
9:16:55.9852903 AM    unlink-open.exe    4968    CreateFile C:\src\test    SUCCESS    Desired Access: Read Attributes,

Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, 
Delete, AllocationSize: n/a, OpenResult: Opened
9:16:55.9853637 AM    unlink-open.exe    4968 QueryAttributeTagFile    C:\src\test    SUCCESS    Attributes: A, 
ReparseTag: 0x0
### file "test" gets into state DELETE PENDING:
9:16:55.9853756 AM    unlink-open.exe    4968 SetDispositionInformationFile    C:\src\test    SUCCESS    Delete: True
9:16:55.9853888 AM    unlink-open.exe    4968    CloseFile C:\src\test    SUCCESS
### concurrent operations with file "test2":
         9:16:55.9866362 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 139,460,608, 
Length: 1,048,576
...
         9:16:55.9972373 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 157,286,400, 
Length: 1,048,576
         9:16:55.9979040 AM    unlink-open.exe    3232 CloseFile    C:\src\test2    SUCCESS
### unlink() for "test2":
         9:16:56.1029981 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Read 
Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: 
Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
         9:16:56.1030432 AM    unlink-open.exe    3232 QueryAttributeTagFile    C:\src\test2    SUCCESS    Attributes:

A, ReparseTag: 0x0
### file "test2" gets into state DELETE PENDING:
         9:16:56.1030517 AM    unlink-open.exe    3232 SetDispositionInformationFile    C:\src\test2    SUCCESS    
Delete: True
         9:16:56.1030625 AM    unlink-open.exe    3232 CloseFile    C:\src\test2    SUCCESS
### and then it opened successfully:
         9:16:56.1189503 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Generic

Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N,

ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
         9:16:56.1192016 AM    unlink-open.exe    3232 CloseFile    C:\src\test2    SUCCESS
### operations with file "test2" continued:
         9:16:56.1193394 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Read 
Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: 
Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
         9:16:56.1193895 AM    unlink-open.exe    3232 QueryAttributeTagFile    C:\src\test2    SUCCESS    Attributes:

A, ReparseTag: 0x0
         9:16:56.1194042 AM    unlink-open.exe    3232 SetDispositionInformationFile    C:\src\test2    SUCCESS    
Delete: True
         9:16:56.1194188 AM    unlink-open.exe    3232 CloseFile    C:\src\test2    SUCCESS
         9:16:56.1198459 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Generic

Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N,

ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
         9:16:56.1200302 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 0, Length: 
1,048,576, Priority: Normal
...
         9:16:56.1275871 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 10,485,760, 
Length: 1,048,576

### at the same time, CreateFile() for file "test" failed:
9:16:56.1276453 AM    unlink-open.exe    4968    CreateFile C:\src\test    DELETE PENDING    Desired Access: Generic 
Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N,

ShareMode: Read, Write, AllocationSize: 0
9:16:56.1279359 AM    unlink-open.exe    3232    WriteFile C:\src\test2    SUCCESS    Offset: 11,534,336, Length:
1,048,576
9:16:56.1283452 AM    unlink-open.exe    3232    WriteFile C:\src\test2    SUCCESS    Offset: 12,582,912, Length:
1,048,576
...

But with rename(MoveFileEx), I see:
unlink-open test 150 1000 rename
...
9:38:01.7035286 AM    unlink-open.exe    10208    WriteFile C:\src\test    SUCCESS    Offset: 156,237,824, Length:
1,048,576
9:38:01.7075621 AM    unlink-open.exe    10208    WriteFile C:\src\test    SUCCESS    Offset: 157,286,400, Length:
1,048,576
9:38:01.7101299 AM    unlink-open.exe    10208    CloseFile C:\src\test    SUCCESS
9:38:01.7130802 AM    unlink-open.exe    10208    CreateFile C:\src\test    SUCCESS    Desired Access: Read Attributes,

Delete, Synchronize, Disposition: Open, Options: Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a, 
ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
9:38:01.7132844 AM    unlink-open.exe    10208 QueryAttributeTagFile    C:\src\test    SUCCESS    Attributes: A, 
ReparseTag: 0x0
9:38:01.7133420 AM    unlink-open.exe    10208 QueryBasicInformationFile    C:\src\test    SUCCESS    CreationTime: 
11/24/2023 9:38:01 AM, LastAccessTime: 11/24/2023 9:38:01 AM, LastWriteTime: 11/24/2023 9:38:01 AM, ChangeTime: 
11/24/2023 9:38:01 AM, FileAttributes: A
9:38:01.7135191 AM    unlink-open.exe    10208    CreateFile C:\src    SUCCESS    Desired Access: Write Data/Add File,

Synchronize, Disposition: Open, Options: , Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult:
Opened

### file "test" renamed to "test.tmp", it doesn't get into state DELETE PENDING
9:38:01.7136221 AM    unlink-open.exe    10208 SetRenameInformationFile    C:\src\test    SUCCESS ReplaceIfExists:
True,
 
FileName: C:\src\test.tmp
9:38:01.8384110 AM    unlink-open.exe    10208    CloseFile C:\src    SUCCESS
9:38:01.8388203 AM    unlink-open.exe    10208    CloseFile C:\src\test.tmp    SUCCESS

### then file "test.tmp" deleted as usual:
9:38:01.8394278 AM    unlink-open.exe    10208    CreateFile C:\src\test.tmp    SUCCESS    Desired Access: Read 
Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: 
Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
9:38:01.8396534 AM    unlink-open.exe    10208 QueryAttributeTagFile    C:\src\test.tmp    SUCCESS    Attributes: A, 
ReparseTag: 0x0
9:38:01.8396885 AM    unlink-open.exe    10208 SetDispositionInformationFile    C:\src\test.tmp    SUCCESS Delete:
True
9:38:01.8397312 AM    unlink-open.exe    10208    CloseFile C:\src\test.tmp    SUCCESS
9:38:01.9162566 AM    unlink-open.exe    10208    CreateFile C:\src\test    SUCCESS    Desired Access: Generic Write, 
Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, 
ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
9:38:01.9167628 AM    unlink-open.exe    10208    CloseFile C:\src\test    SUCCESS

So the same test run with MoveFileEx():
unlink-open test 150 1000 rename
                 unlink-open test2 150 1000 rename
successfully passes for me in the same environment (Windows VM slowed down to 50%).

That is, my idea was to try removing a file by renaming it first, as a fast
path (thus avoiding that troublesome DELETE PENDING state), and if that
fails, to perform the removal as before. Maybe the whole function could be
simplified, but I'm not sure about special cases yet.

> * IIUC, the important points is the latter part, which waits until the status is
>    changed. Based on that, can we remove a double rmtree() from cleanup_output_dirs()?
>    They seems to be add for the similar motivation.

I haven't yet been able to reproduce the failure that motivated that doubling
(IIUC, it was observed in [1]), even with c28911750 reverted, so I need more
time to research that issue before I can answer this question.

[1] https://www.postgresql.org/message-id/20230131172806.GM22427%40telsasoft.com

Best regards,
Alexander
Attachment

RE: Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander,

> 
> Please look at the simple test program attached. It demonstrates the
> failure for me when running in two sessions as follows:
> unlink-open test 150 1000
>                  unlink-open test2 150 1000

Thanks for attaching the program; it helps us understand the issue.
I wanted to confirm your environment: this failure occurred on Windows Server XXXX, right?

> 
> That is, my idea was to try removing a file through renaming it as a fast
> path (thus avoiding that troublesome state DELETE PENDING), and if that
> fails, to perform removal as before. May be the whole function might be
> simplified, but I'm not sure about special cases yet.

I think your result shows that pgrename() is delayed much more rarely than unlink().
That's why a file with the original name no longer exists when the subsequent open() is called.

As for the special cases, I'd like more senior developers to check them.

> > * IIUC, the important points is the latter part, which waits until the status is
> >    changed. Based on that, can we remove a double rmtree() from
> cleanup_output_dirs()?
> >    They seems to be add for the similar motivation.
> 
> I couldn't yet reproduce a failure, which motivated that doubling (IIUC, it
> was observed in [1]), with c28911750 reverted, so I need more time to
> research that issue to answer this question.

Yeah, in the first place, that failure seldom occurred...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Random pg_upgrade test failure on drongo

From
Alexander Lakhin
Date:
Hello Kuroda-san,

25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:
> Thanks for attaching a program. This helps us to understand the issue.
> I wanted to confirm your env - this failure was occurred on windows server XXXX, right?

I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure depends not on the Windows edition, but
rather on its build. For Windows Server 2019 the "good" build is
somewhere between 17763.1 and 17763.592, and for Windows 10 it's between
14393.0 and 17763.1.
(Maybe there was some change related to FILE_DISPOSITION_POSIX_SEMANTICS/
FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
information about that change.)

It would also be interesting to know the full OS version/build on drongo and
fairywren.

>> That is, my idea was to try removing a file through renaming it as a fast
>> path (thus avoiding that troublesome state DELETE PENDING), and if that
>> fails, to perform removal as before. May be the whole function might be
>> simplified, but I'm not sure about special cases yet.
> I felt that your result showed pgrename() would be more rarely delayed than unlink().
> That's why a file which has original name would not exist when subsequent open() was called.

I think that's because unlink() is performed asynchronously on those old
Windows versions, but rename() is always synchronous.

>>> * IIUC, the important points is the latter part, which waits until the status is
>>>     changed. Based on that, can we remove a double rmtree() from
>> cleanup_output_dirs()?
>>>     They seems to be add for the similar motivation.
>> I couldn't yet reproduce a failure, which motivated that doubling (IIUC, it
>> was observed in [1]), with c28911750 reverted, so I need more time to
>> research that issue to answer this question.
> Yeah, as the first place, this failure seldom occurred....

I've managed to reproduce that issue (or at least a situation that
manifested similarly) with a sleep added in miscinit.c:
         ereport(IsPostmasterEnvironment ? LOG : NOTICE,
                         (errmsg("database system is shut down")));
+       pg_usleep(500000L);

With this change, I get the same warning as in [1] when running 10 instances of
the 002_pg_upgrade test in parallel with a minimal olddump (on iterations
33, 46, 8). And with my PoC patch applied, I could see the same warning
as well (on iteration 6).

I believe that's because rename() can't rename a directory containing an
open file, just as unlink() can't remove it.

In the light of the above, I think that the issue in question should be
fixed in accordance with/as a supplement to [2].

[1] https://www.postgresql.org/message-id/20230131172806.GM22427%40telsasoft.com
[2]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com

Best regards,
Alexander



Re: Random pg_upgrade test failure on drongo

From
Andrew Dunstan
Date:
On 2023-11-27 Mo 07:00, Alexander Lakhin wrote:
> Hello Kuroda-san,
>
> 25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:
>> Thanks for attaching a program. This helps us to understand the issue.
>> I wanted to confirm your env - this failure was occurred on windows 
>> server XXXX, right?
>
> I see that behavior on:
> Windows 10 Version 1607 (OS Build 14393.0)
> Windows Server 2016 Version 1607 (OS Build 14393.0)
> Windows Server 2019 Version 1809 (OS Build 17763.1)
>
> But it's not reproduced on:
> Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
> Windows Server 2019 Version 1809 (OS Build 17763.592)
> Windows 10 Version 22H2 (OS Build 19045.3693)
> Windows 11 Version 21H2 (OS Build 22000.613)
>
> So it looks like the failure occurs depending not on Windows edition, but
> rather on it's build. For Windows Server 2019 the "good" build is
> somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
> 14393.0 and 17763.1.
> (Maybe there was some change related to FILE_DISPOSITION_POSIX_SEMANTICS/
> FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
> information about that change.)
>
> It's also interesting, what is full version/build of OS on drongo and
> fairywren.
>
>

It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Random pg_upgrade test failure on drongo

From
Andrew Dunstan
Date:
On 2023-11-27 Mo 07:39, Andrew Dunstan wrote:
>
> On 2023-11-27 Mo 07:00, Alexander Lakhin wrote:
>> Hello Kuroda-san,
>>
>> 25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:
>>> Thanks for attaching a program. This helps us to understand the issue.
>>> I wanted to confirm your env - this failure was occurred on windows 
>>> server XXXX, right?
>>
>> I see that behavior on:
>> Windows 10 Version 1607 (OS Build 14393.0)
>> Windows Server 2016 Version 1607 (OS Build 14393.0)
>> Windows Server 2019 Version 1809 (OS Build 17763.1)
>>
>> But it's not reproduced on:
>> Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
>> Windows Server 2019 Version 1809 (OS Build 17763.592)
>> Windows 10 Version 22H2 (OS Build 19045.3693)
>> Windows 11 Version 21H2 (OS Build 22000.613)
>>
>> So it looks like the failure occurs depending not on Windows edition, 
>> but
>> rather on it's build. For Windows Server 2019 the "good" build is
>> somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
>> 14393.0 and 17763.1.
>> (Maybe there was some change related to 
>> FILE_DISPOSITION_POSIX_SEMANTICS/
>> FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
>> information about that change.)
>>
>> It's also interesting, what is full version/build of OS on drongo and
>> fairywren.
>>
>>
>
> It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122
>
>
>

I've updated it to 17763.5122 now.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




RE: Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander, Andrew,
 
Thanks for your analysis!
 
> I see that behavior on:
> Windows 10 Version 1607 (OS Build 14393.0)
> Windows Server 2016 Version 1607 (OS Build 14393.0)
> Windows Server 2019 Version 1809 (OS Build 17763.1)
>
> But it's not reproduced on:
> Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
> Windows Server 2019 Version 1809 (OS Build 17763.592)
> Windows 10 Version 22H2 (OS Build 19045.3693)
> Windows 11 Version 21H2 (OS Build 22000.613)
>
> So it looks like the failure occurs depending not on Windows edition, but
> rather on it's build. For Windows Server 2019 the "good" build is
> somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
> 14393.0 and 17763.1.
> (Maybe there was some change related to
> FILE_DISPOSITION_POSIX_SEMANTICS/
> FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
> information about that change.)
>
> It's also interesting, what is full version/build of OS on drongo and
> fairywren.
 
Thanks for your interest in the issue. I have been tracking the failure, but it has not occurred again.
Your analysis suggests that the buildfarm failures could be resolved by updating the OSes.
 
> I think that's because unlink() is performed asynchronously on those old
> Windows versions, but rename() is always synchronous.
 
OK. I actually could not find any documentation about this, but your experiment demonstrates the facts.
 
> I've managed to reproduce that issue (or at least a situation that
> manifested similarly) with a sleep added in miscinit.c:
>          ereport(IsPostmasterEnvironment ? LOG : NOTICE,
>                         (errmsg("database system is shut down")));
> +       pg_usleep(500000L);
>
> With this change, I get the same warning as in [1] when running in
> parallel 10 tests 002_pg_upgrade with a minimal olddump (on iterations
> 33, 46, 8). And with my PoC patch applied, I could see the same warning
> as well (on iteration 6).
>
> I believe that's because rename() can't rename a directory containing an
> open file, just as unlink() can't remove it.
>
> In the light of the above, I think that the issue in question should be
> fixed in accordance with/as a supplement to [2].
 
OK, I understand that more fixes are needed in this area. For now, we should stay focused on the immediate issue.
 
Your patch seems good, but it needs more review from developers familiar with Windows.
What do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Random pg_upgrade test failure on drongo

From
Alexander Lakhin
Date:
Hello Andrew and Kuroda-san,

27.11.2023 16:58, Andrew Dunstan wrote:
>>> It's also interesting, what is full version/build of OS on drongo and
>>> fairywren.
>>>
>>
>> It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122
>>
>
> I've updated it to 17763.5122 now.
>

Thank you for the information! It prompted me to upgrade my Server
2019 1809 17763.592 to 17763.4252. Then I discovered that I had
difficulties reproducing the issue on all my VMs after a reboot (even
on old versions/builds). It took me a while to understand what was going on
and what affects reproduction of the issue.
I was puzzled by the fact that I couldn't reproduce the issue with my
unlink-open test program under seemingly the same conditions as before,
until I realized that the issue reproduces only when the target directory
is open in Windows Explorer.

I'm sorry for bringing more mystery into the topic and for the misleading
information.

So, the issue reproduces only when something scans the working directory
for files and opens them. I added the same logic to my test program (see
unlink-open-scandir attached), and now I see the failure on Windows Server
2019 (Version 10.0.17763.4252).
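Roughly, the added "scandir" logic looks like this (a simplified sketch, not the
exact attached code; it assumes the scanned directory is the current directory,
since FindFirstFileA() reports names relative to it). The idea is to briefly
open each file found, imitating what Windows Explorer or an antivirus scanner
does, which keeps freshly unlinked files in the DELETE_PENDING state:

```
#include <windows.h>

/* Briefly open every file matching the pattern, e.g. "*" in the cwd. */
static void
scan_directory(const char *pattern)
{
	WIN32_FIND_DATAA fd;
	HANDLE		hFind = FindFirstFileA(pattern, &fd);

	if (hFind == INVALID_HANDLE_VALUE)
		return;
	do
	{
		HANDLE		hFile = CreateFileA(fd.cFileName, GENERIC_READ,
										FILE_SHARE_READ | FILE_SHARE_WRITE |
										FILE_SHARE_DELETE,
										NULL, OPEN_EXISTING, 0, NULL);

		if (hFile != INVALID_HANDLE_VALUE)
			CloseHandle(hFile);		/* the handle is held only briefly */
	} while (FindNextFileA(hFind, &fd));
	FindClose(hFind);
}
```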
A script like this:
start cmd /c "unlink-open-scandir test1 10 5000 >log1 2>&1"
...
start cmd /c "unlink-open-scandir test10 10 5000 >log10 2>&1"

results in:
C:\temp>find "failed" log*
---------- LOG1
---------- LOG10
fopen() after unlink() failed (13)
---------- LOG2
fopen() after unlink() failed (13)
---------- LOG3
fopen() after unlink() failed (13)
---------- LOG4
fopen() after unlink() failed (13)
---------- LOG5
fopen() after unlink() failed (13)
---------- LOG6
fopen() after unlink() failed (13)
---------- LOG7
fopen() after unlink() failed (13)
---------- LOG8
fopen() after unlink() failed (13)
---------- LOG9
fopen() after unlink() failed (13)

C:\temp>type log10
...
iteration 108
fopen() after unlink() failed (13)

The same is observed on:
Windows 10 Version 1809 (OS Build 17763.1)

But no failures on:
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So the behavior change really took place, but my previous estimations were
incorrect (my apologies).

BTW, "rename" mode of the test program can produce more rare errors on
rename:
---------- LOG3
MoveFileEx() failed (0)

but not on open.

30.11.2023 13:00, Hayato Kuroda (Fujitsu) wrote:

> Thanks for your interest for the issue. I have been tracking the failure but been not occurred.
> Your analysis seems to solve BF failures, by updating OSes.

Yes, but I don't think we can afford to leave Server 2019 behind (I suppose
Windows Server 2019 build 20348 would behave the same as Windows 10 19045).
(Though looking at the Cirrus CI logs, I see that what is entitled
"Windows Server 2019" there is in fact Windows Server 2022.)

>> I think that's because unlink() is performed asynchronously on those old
>> Windows versions, but rename() is always synchronous.
>   
> OK. Actually I could not find descriptions about them, but your experiment showed facts.

I don't know what this peculiarity is called, but it looks like when some
other process holds a handle to the file, unlink() returns as if the file had
been deleted completely, but the subsequent open() fails.

Best regards,
Alexander
Attachment

RE: Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander, 

> I agree with your analysis and would like to propose a PoC fix (see
> attached). With this patch applied, 20 iterations succeeded for me.

Since there are no other reviewers, I will review it again. Let's turn the PoC
into a concrete patch. Note that I focused only on fixing the random failure;
other parts are out of scope.

Basically, the code comments should be updated accordingly.

01.

```
    /*
     * This function might be called for a regular file or for a junction
     * point (which we use to emulate symlinks).  The latter must be unlinked
     * with rmdir() on Windows.  Before we worry about any of that, let's see
     * if we can unlink directly, since that's expected to be the most common
     * case.
     */
    snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
    if (pgrename(path, tmppath) == 0)
    {
        if (unlink(tmppath) == 0)
            return 0;
        curpath = tmppath;
    }
```

You could update the comment above these changes, since they are not trivial.
Below is my draft:

```
     * XXX: we rename the target file to ".tmp" before calling unlink().  On
     * Windows the removal can linger in the STATUS_DELETE_PENDING state, in
     * which case creating a file with the same name would fail.  This assumes
     * that renaming is a synchronous operation.
```

02.

```
    loops = 0;
    while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
    {
        if (++loops > 100)        /* time out after 10 sec */
            return -1;
        pg_usleep(100000);        /* us */
    }
```

A comment could be added above this part. Below is my draft:

```
    /*
     * Wait until the removal has really finished, to avoid ERRORs when other
     * functions create a file with the same name.
     */
```


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Random pg_upgrade test failure on drongo

From
Alexander Lakhin
Date:
Hello Kuroda-san,

28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander,
>
>> I agree with your analysis and would like to propose a PoC fix (see
>> attached). With this patch applied, 20 iterations succeeded for me.
> There are no reviewers so that I will review again. Let's move the PoC
> to the concrete patch. Note that I only focused on fixes of random failure,
> other parts are out-of-scope.

Thinking about that fix more, I'm not satisfied with the proposed approach.
First, it turns every unlink operation into two write operations
(rename + unlink), not to mention the new risk of leaving stale .tmp files
behind (perhaps that's acceptable for regular files, since MoveFileEx can
overwrite existing files, but not for directories).
Second, it does that on all Windows versions, including modern ones,
which, as we know, are not affected by the issue.

So I started to think about another approach: perform the unlink as it's
implemented now, but then wait until the DELETE_PENDING state is gone.
And I was very surprised to see that this state is not transient in our case.
Additional investigation showed that the test fails not because some outside
process opens the file being deleted (concretely,
{template1_id/postgres_id}/2683), but because an internal process opens the
file and holds a handle to it indefinitely.
And that internal process is ... the background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we just want to get rid of the test failure, maybe it's enough to
add this setting to the test's config...

The other way to go is to find out whether the background writer process
should react to the shared-inval message sent from smgrdounlinkall() and
close its handle to that file.

Maybe we could also (after changing the bgwriter's behaviour) add a waiting
loop to pgwin32_open_handle() to completely rule out transient open()
failures caused by some other process (such as Windows Explorer) opening a
file that is being deleted, but I would not complicate things until we have a
clear vision/plan for using modern APIs and relying on modern OS versions'
behavior, i.e., proceeding with something like:
https://commitfest.postgresql.org/40/3951/

Best regards,
Alexander



Re: Random pg_upgrade test failure on drongo

From
Amit Kapila
Date:
On Tue, Jan 2, 2024 at 10:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:
> > Dear Alexander,
> >
> >> I agree with your analysis and would like to propose a PoC fix (see
> >> attached). With this patch applied, 20 iterations succeeded for me.
> > There are no reviewers so that I will review again. Let's move the PoC
> > to the concrete patch. Note that I only focused on fixes of random failure,
> > other parts are out-of-scope.
>
> Thinking about that fix more, I'm not satisfied with the approach proposed.
> First, it turns every unlink operation into two write operations
> (rename + unlink), not to say about new risks of having stale .tmp files
> (perhaps, it's ok for regular files (MoveFileEx can overwrite existing
> files), but not for directories)
> Second, it does that on any Windows OS versions, including modern ones,
> which are not affected by the issue, as we know.
>
> So I started to think about other approach: to perform unlink as it's
> implemented now, but then wait until the DELETE_PENDING state is gone.
>

There is a comment in the code which suggests we shouldn't wait
indefinitely. See "However, we won't wait indefinitely for someone
else to close the file, as the caller might be holding locks and
blocking other backends."

> And I was very surprised to see that this state is not transient in our case.
> Additional investigation showed that the test fails not because some aside
> process opens a file (concretely, {template1_id/postgres_id}/2683), that is
> being deleted, but because of an internal process that opens the file and
> holds a handle to it indefinitely.
> And the internal process is ... background writer (BgBufferSync()).
>
> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> got 20 x 10 tests passing.
>
> Thus, it we want just to get rid of the test failure, maybe it's enough to
> add this to the test's config...
>

What about checkpoints? Can't it do the same while writing the buffers?

--
With Regards,
Amit Kapila.



Re: Random pg_upgrade test failure on drongo

From
Alexander Lakhin
Date:
Hello Amit,

03.01.2024 14:42, Amit Kapila wrote:
>
>> So I started to think about other approach: to perform unlink as it's
>> implemented now, but then wait until the DELETE_PENDING state is gone.
>>
> There is a comment in the code which suggests we shouldn't wait
> indefinitely. See "However, we won't wait indefinitely for someone
> else to close the file, as the caller might be holding locks and
> blocking other backends."

Yes, I saw it, but initially I thought that we have a transient condition
there, so waiting in open() (instead of failing immediately) seemed like a
good idea then...

>> And the internal process is ... background writer (BgBufferSync()).
>>
>> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
>> got 20 x 10 tests passing.
>>
>> Thus, it we want just to get rid of the test failure, maybe it's enough to
>> add this to the test's config...
>>
> What about checkpoints? Can't it do the same while writing the buffers?

Since we are dealing here with pg_upgrade/pg_restore, it must not be very easy
to trigger that, but I think it's not impossible in principle.
More details below.
What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4)     -- For binary upgrade, preserve pg_largeobject and index relfilenodes
     SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
     SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
     TRUNCATE pg_catalog.pg_largeobject;
--  ^^^ here we can get the error "could not create file "base/5/2683": File exists"
...

We get the effect discussed when the background writer process decides to
flush a file buffer for pg_largeobject during stage 1.
(Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test).
With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running
pg_upgrade in a loop (I got failures on iterations 22, 37, 17 with the
default cluster).
If an old cluster contains a dozen databases, the failure probability
increases significantly (with 10 additional databases I got failures
on iterations 4, 1, 6).

Please see the reproducing script attached.

Best regards,
Alexander
Attachment

Re: Random pg_upgrade test failure on drongo

From
Amit Kapila
Date:
On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 03.01.2024 14:42, Amit Kapila wrote:
> >
>
> >> And the internal process is ... background writer (BgBufferSync()).
> >>
> >> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> >> got 20 x 10 tests passing.
> >>
> >> Thus, it we want just to get rid of the test failure, maybe it's enough to
> >> add this to the test's config...
> >>
> > What about checkpoints? Can't it do the same while writing the buffers?
>
> As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
> the desired effect, but I think it's not impossible in principle.
> More details below.
> What happens during the pg_upgrade execution is essentially:
> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
> -- this command flushes file buffers as well
> 2) ALTER DATABASE postgres OWNER TO ...
> 3) COMMENT ON DATABASE "postgres" IS ...
> 4)     -- For binary upgrade, preserve pg_largeobject and index relfilenodes
>      SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
>      SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
>      TRUNCATE pg_catalog.pg_largeobject;
> --  ^^^ here we can get the error "could not create file "base/5/2683": File exists"
> ...
>
> We get the effect discussed when the background writer process decides to
> flush a file buffer for pg_largeobject during stage 1.
> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> the result must be the same.)
> And another important factor is shared_buffers = 1MB (set during the test).
> With the default setting of 128MB I couldn't see the failure.
>
> It can be reproduced easily (on old Windows versions) just by running
> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> default cluster)).
> If an old cluster contains dozen of databases, this increases the failure
> probability significantly (with 10 additional databases I've got failures
> on iterations 4, 1, 6).
>

I don't have an old Windows environment to test but I agree with your
analysis and theory. The question is what should we do for these new
random BF failures? I think we should set bgwriter_lru_maxpages to 0
and checkpoint_timeout to 1hr for these new tests. Doing some invasive
fix as part of this doesn't sound reasonable because this is an
existing problem and there seems to be another patch by Thomas that
probably deals with the root cause of the existing problem [1] as
pointed out by you.

[1] - https://commitfest.postgresql.org/40/3951/

--
With Regards,
Amit Kapila.



Re: Random pg_upgrade test failure on drongo

From
Jim Nasby
Date:
On 1/4/24 10:19 PM, Amit Kapila wrote:
> On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>>
>> 03.01.2024 14:42, Amit Kapila wrote:
>>>
>>
>>>> And the internal process is ... background writer (BgBufferSync()).
>>>>
>>>> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
>>>> got 20 x 10 tests passing.
>>>>
>>>> Thus, it we want just to get rid of the test failure, maybe it's enough to
>>>> add this to the test's config...
>>>>
>>> What about checkpoints? Can't it do the same while writing the buffers?
>>
>> As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
>> the desired effect, but I think it's not impossible in principle.
>> More details below.
>> What happens during the pg_upgrade execution is essentially:
>> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
>> -- this command flushes file buffers as well
>> 2) ALTER DATABASE postgres OWNER TO ...
>> 3) COMMENT ON DATABASE "postgres" IS ...
>> 4)     -- For binary upgrade, preserve pg_largeobject and index relfilenodes
>>       SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
>>       SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
>>       TRUNCATE pg_catalog.pg_largeobject;
>> --  ^^^ here we can get the error "could not create file "base/5/2683": File exists"
>> ...
>>
>> We get the effect discussed when the background writer process decides to
>> flush a file buffer for pg_largeobject during stage 1.
>> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
>> the result must be the same.)
>> And another important factor is shared_buffers = 1MB (set during the test).
>> With the default setting of 128MB I couldn't see the failure.
>>
>> It can be reproduced easily (on old Windows versions) just by running
>> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
>> default cluster)).
>> If an old cluster contains dozen of databases, this increases the failure
>> probability significantly (with 10 additional databases I've got failures
>> on iterations 4, 1, 6).
>>
> 
> I don't have an old Windows environment to test but I agree with your
> analysis and theory. The question is what should we do for these new
> random BF failures? I think we should set bgwriter_lru_maxpages to 0
> and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> fix as part of this doesn't sound reasonable because this is an
> existing problem and there seems to be another patch by Thomas that
> probably deals with the root cause of the existing problem [1] as
> pointed out by you.
> 
> [1] - https://commitfest.postgresql.org/40/3951/

Isn't this just sweeping the problem (non-POSIX behavior on SMB and 
ReFS) under the carpet? I realize that synthetic test workloads like 
pg_upgrade in a loop aren't themselves real-world scenarios, but what 
about other cases? Even if we're certain it's not possible for these 
issues to wedge a server, it's still not a good experience for users to 
get random, unexplained IO-related errors...
-- 
Jim Nasby, Data Architect, Austin TX




Re: Random pg_upgrade test failure on drongo

From
Amit Kapila
Date:
On Mon, Jan 8, 2024 at 9:36 PM Jim Nasby <jim.nasby@gmail.com> wrote:
>
> On 1/4/24 10:19 PM, Amit Kapila wrote:
> > On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> >>
> >> 03.01.2024 14:42, Amit Kapila wrote:
> >>>
> >>
> >>>> And the internal process is ... background writer (BgBufferSync()).
> >>>>
> >>>> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> >>>> got 20 x 10 tests passing.
> >>>>
> >>>> Thus, it we want just to get rid of the test failure, maybe it's enough to
> >>>> add this to the test's config...
> >>>>
> >>> What about checkpoints? Can't it do the same while writing the buffers?
> >>
> >> As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
> >> the desired effect, but I think it's not impossible in principle.
> >> More details below.
> >> What happens during the pg_upgrade execution is essentially:
> >> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
> >> -- this command flushes file buffers as well
> >> 2) ALTER DATABASE postgres OWNER TO ...
> >> 3) COMMENT ON DATABASE "postgres" IS ...
> >> 4)     -- For binary upgrade, preserve pg_largeobject and index relfilenodes
> >>       SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
> >>       SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
> >>       TRUNCATE pg_catalog.pg_largeobject;
> >> --  ^^^ here we can get the error "could not create file "base/5/2683": File exists"
> >> ...
> >>
> >> We get the effect discussed when the background writer process decides to
> >> flush a file buffer for pg_largeobject during stage 1.
> >> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> >> the result must be the same.)
> >> And another important factor is shared_buffers = 1MB (set during the test).
> >> With the default setting of 128MB I couldn't see the failure.
> >>
> >> It can be reproduced easily (on old Windows versions) just by running
> >> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> >> default cluster)).
> >> If an old cluster contains dozen of databases, this increases the failure
> >> probability significantly (with 10 additional databases I've got failures
> >> on iterations 4, 1, 6).
> >>
> >
> > I don't have an old Windows environment to test but I agree with your
> > analysis and theory. The question is what should we do for these new
> > random BF failures? I think we should set bgwriter_lru_maxpages to 0
> > and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> > fix as part of this doesn't sound reasonable because this is an
> > existing problem and there seems to be another patch by Thomas that
> > probably deals with the root cause of the existing problem [1] as
> > pointed out by you.
> >
> > [1] - https://commitfest.postgresql.org/40/3951/
>
> Isn't this just sweeping the problem (non-POSIX behavior on SMB and
> ReFS) under the carpet? I realize that synthetic test workloads like
> pg_upgrade in a loop aren't themselves real-world scenarios, but what
> about other cases? Even if we're certain it's not possible for these
> issues to wedge a server, it's still not a good experience for users to
> get random, unexplained IO-related errors...
>

The point is that this is an existing known Windows behavior and that
too only in certain versions. The fix doesn't seem to be
straightforward, so it seems advisable to avoid random BF failures by
having an appropriate configuration.

--
With Regards,
Amit Kapila.



RE: Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit, Alexander,

> > We get the effect discussed when the background writer process decides to
> > flush a file buffer for pg_largeobject during stage 1.
> > (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> > the result must be the same.)
> > And another important factor is shared_buffers = 1MB (set during the test).
> > With the default setting of 128MB I couldn't see the failure.
> >
> > It can be reproduced easily (on old Windows versions) just by running
> > pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> > default cluster)).
> > If an old cluster contains dozen of databases, this increases the failure
> > probability significantly (with 10 additional databases I've got failures
> > on iterations 4, 1, 6).
> >
> 
> I don't have an old Windows environment to test but I agree with your
> analysis and theory. The question is what should we do for these new
> random BF failures? I think we should set bgwriter_lru_maxpages to 0
> and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> fix as part of this doesn't sound reasonable because this is an
> existing problem and there seems to be another patch by Thomas that
> probably deals with the root cause of the existing problem [1] as
> pointed out by you.
> 
> [1] - https://commitfest.postgresql.org/40/3951/

Based on Amit's suggestion, I have created a patch with the alternative
approach; it just adjusts GUC settings. The reported failure is only in
003_logical_slots, but the patch also includes changes for the recently added
test 004_subscription; IIUC, there is a possibility that 004 could fail as well.

In our understanding, this patch should stop the random failures. Alexander,
could you test it to confirm?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: Random pg_upgrade test failure on drongo

From
Alexander Lakhin
Date:
Hello Kuroda-san,

09.01.2024 08:49, Hayato Kuroda (Fujitsu) wrote:
> Based on the suggestion by Amit, I have created a patch with the alternative
> approach. This just does GUC settings. The reported failure is only for
> 003_logical_slots, but the patch also includes changes for the recently added
> test, 004_subscription. IIUC, there is a possibility that 004 would fail as well.
>
> Per our understanding, this patch can stop random failures. Alexander, can you
> test for the confirmation?
>

Yes, the patch fixes the issue for me (without the patch I observe failures
on iterations 1-2, with 10 tests running in parallel, but with the patch
10 iterations succeeded).

But as far as I can see, 004_subscription is not affected by the issue,
because it doesn't enable streaming for the nodes new_sub and new_sub1.
As I noted before, I could only see the failure with
shared_buffers = 1MB (which is set by allows_streaming => 'logical').
So I'm not sure whether we need to modify 004 (or any other test that
runs pg_upgrade).

As to checkpoint_timeout, personally I would not increase it, because it
seems unbelievable to me that pg_restore (with the cluster containing only
two empty databases) could run for longer than 5 minutes. I'd rather
investigate such a situation separately, if we ever encounter it, but maybe
that's just me.
On the other hand, if a checkpoint occurred for some reason within a
shorter time span, then increasing the timeout would not matter, I suppose.
(I've also tested the bgwriter_lru_maxpages-only modification of your patch
and can confirm that it works as well.)

Best regards,
Alexander



Re: Random pg_upgrade test failure on drongo

From
Amit Kapila
Date:
On Tue, Jan 9, 2024 at 2:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 09.01.2024 08:49, Hayato Kuroda (Fujitsu) wrote:
> > Based on the suggestion by Amit, I have created a patch with the alternative
> > approach. It just adjusts GUC settings. The reported failure is only for
> > 003_logical_slots, but the patch also includes changes for the recently added
> > test, 004_subscription. IIUC, 004 could fail in the same way.
> >
> > Per our understanding, this patch should stop the random failures. Alexander,
> > could you test it to confirm?
> >
>
> Yes, the patch fixes the issue for me (without the patch I observe failures
> on iterations 1-2, with 10 tests running in parallel, but with the patch
> 10 iterations succeeded).
>
> But as far as I can see, 004_subscription is not affected by the issue,
> because it doesn't enable streaming for nodes new_sub, new_sub1.
> As I noted before, I could see the failure only with
> shared_buffers = 1MB (which is set with allows_streaming => 'logical').
> So I'm not sure whether we need to modify 004 (or any other test that
> runs pg_upgrade).
>

I see your point. The probable reason for the failure with shared_buffers=1MB
is that it increases the probability of bgwriter holding the file handle for
pg_largeobject. So, let's change it only for 003.

> As to checkpoint_timeout, personally I would not increase it, because it
> seems unbelievable to me that pg_restore (with the cluster containing only
> two empty databases) can run for longer than 5 minutes. I'd rather
> investigate such a situation separately, in case we encounter it, but maybe
> it's only me.
>

I feel it is okay to set a higher value of checkpoint_timeout for the
same reason, though the probability is lower. It is important here to
explain in the comments why we are using these settings in the new
test. I have thought of something like: "During the upgrade, bgwriter
or checkpointer could hold the file handle for some removed file. Now,
during restore, when we try to create a file with the same name, it
errors out. This behavior is specific to some Windows versions, and
the probability of seeing it is higher in this test because we use
wal_level = logical via allows_streaming => 'logical', which in turn
sets shared_buffers to 1MB."

Thoughts?

--
With Regards,
Amit Kapila.



Re: Random pg_upgrade test failure on drongo

From
Alexander Lakhin
Date:
Hello Amit,

09.01.2024 13:08, Amit Kapila wrote:
>
>> As to checkpoint_timeout, personally I would not increase it, because it
>> seems unbelievable to me that pg_restore (with the cluster containing only
>> two empty databases) can run for longer than 5 minutes. I'd rather
>> investigate such a situation separately, in case we encounter it, but maybe
>> it's only me.
>>
> I feel it is okay to set a higher value of checkpoint_timeout for the
> same reason, though the probability is lower. It is important here to
> explain in the comments why we are using these settings in the new
> test. I have thought of something like: "During the upgrade, bgwriter
> or checkpointer could hold the file handle for some removed file. Now,
> during restore, when we try to create a file with the same name, it
> errors out. This behavior is specific to some Windows versions, and
> the probability of seeing it is higher in this test because we use
> wal_level = logical via allows_streaming => 'logical', which in turn
> sets shared_buffers to 1MB."
>
> Thoughts?

I would describe that behavior as "During upgrade, when pg_restore performs
CREATE DATABASE, bgwriter or checkpointer may flush buffers and hold a file
handle for pg_largeobject, so the later TRUNCATE pg_largeobject command will
fail if the OS (such as older Windows versions) doesn't remove an unlinked
file completely while it is still open. ..."

Best regards,
Alexander




Re: Random pg_upgrade test failure on drongo

From
Amit Kapila
Date:
On Tue, Jan 9, 2024 at 4:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 09.01.2024 13:08, Amit Kapila wrote:
> >
> >> As to checkpoint_timeout, personally I would not increase it, because it
> >> seems unbelievable to me that pg_restore (with the cluster containing only
> >> two empty databases) can run for longer than 5 minutes. I'd rather
> >> investigate such a situation separately, in case we encounter it, but maybe
> >> it's only me.
> >>
> > I feel it is okay to set a higher value of checkpoint_timeout for the
> > same reason, though the probability is lower. It is important here to
> > explain in the comments why we are using these settings in the new
> > test. I have thought of something like: "During the upgrade, bgwriter
> > or checkpointer could hold the file handle for some removed file. Now,
> > during restore, when we try to create a file with the same name, it
> > errors out. This behavior is specific to some Windows versions, and
> > the probability of seeing it is higher in this test because we use
> > wal_level = logical via allows_streaming => 'logical', which in turn
> > sets shared_buffers to 1MB."
> >
> > Thoughts?
>
> I would describe that behavior as "During upgrade, when pg_restore performs
> CREATE DATABASE, bgwriter or checkpointer may flush buffers and hold a file
> handle for pg_largeobject, so the later TRUNCATE pg_largeobject command will
> fail if the OS (such as older Windows versions) doesn't remove an unlinked
> file completely while it is still open. ..."
>

I am slightly hesitant to add any particular system table name in the
comments, as this can happen for any other system table as well, so I
have slightly adjusted the comments in the attached. However, I think
it is okay to mention the particular system table name in the commit
message. Let me know what you think.

--
With Regards,
Amit Kapila.

Attachment

Re: Random pg_upgrade test failure on drongo

From
Alexander Lakhin
Date:
10.01.2024 12:31, Amit Kapila wrote:
> I am slightly hesitant to add any particular system table name in the
> comments, as this can happen for any other system table as well, so I
> have slightly adjusted the comments in the attached. However, I think
> it is okay to mention the particular system table name in the commit
> message. Let me know what you think.

Thank you, Amit!

I'd like to note that the culprit is exactly pg_largeobject as coded in
dumpDatabase():
     /*
      * pg_largeobject comes from the old system intact, so set its
      * relfrozenxids, relminmxids and relfilenode.
      */
     if (dopt->binary_upgrade)
...
         appendPQExpBufferStr(loOutQry,
                              "TRUNCATE pg_catalog.pg_largeobject;\n");

I see no other TRUNCATEs (or similar logic) around, so I would specify the
table name in the comments. Though maybe I'm missing something...

Best regards,
Alexander



Re: Random pg_upgrade test failure on drongo

From
Amit Kapila
Date:
On Wed, Jan 10, 2024 at 3:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 10.01.2024 12:31, Amit Kapila wrote:
> > I am slightly hesitant to add any particular system table name in the
> > comments, as this can happen for any other system table as well, so I
> > have slightly adjusted the comments in the attached. However, I think
> > it is okay to mention the particular system table name in the commit
> > message. Let me know what you think.
>
> Thank you, Amit!
>
> I'd like to note that the culprit is exactly pg_largeobject as coded in
> dumpDatabase():
>      /*
>       * pg_largeobject comes from the old system intact, so set its
>       * relfrozenxids, relminmxids and relfilenode.
>       */
>      if (dopt->binary_upgrade)
> ...
>          appendPQExpBufferStr(loOutQry,
>                               "TRUNCATE pg_catalog.pg_largeobject;\n");
>
> I see no other TRUNCATEs (or similar logic) around, so I would specify the
> table name in the comments. Though maybe I'm missing something...
>

But tomorrow it could be some other table, and if we change this
TRUNCATE logic for pg_largeobject (the chances of which are low),
there is always a chance that someone misses updating this comment. I
feel keeping it generic would be better in this case, as the problem
is generic even though it currently shows up for pg_largeobject.

--
With Regards,
Amit Kapila.



Re: Random pg_upgrade test failure on drongo

From
Alexander Lakhin
Date:
10.01.2024 13:37, Amit Kapila wrote:
> But tomorrow it could be some other table, and if we change this
> TRUNCATE logic for pg_largeobject (the chances of which are low),
> there is always a chance that someone misses updating this comment. I
> feel keeping it generic would be better in this case, as the problem
> is generic even though it currently shows up for pg_largeobject.

Yes, for sure. So let's keep it generic as you prefer.

Thank you!

Best regards,
Alexander



RE: Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander, Amit,

> > But tomorrow it could be some other table, and if we change this
> > TRUNCATE logic for pg_largeobject (the chances of which are low),
> > there is always a chance that someone misses updating this comment. I
> > feel keeping it generic would be better in this case, as the problem
> > is generic even though it currently shows up for pg_largeobject.
> 
> Yes, for sure. So let's keep it generic as you prefer.
> 
> Thank you!

Thanks for working on the patch. I'm also OK with pushing Amit's fix patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Random pg_upgrade test failure on drongo

From
Amit Kapila
Date:
On Thu, Jan 11, 2024 at 8:15 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> > > But tomorrow it could be some other table, and if we change this
> > > TRUNCATE logic for pg_largeobject (the chances of which are low),
> > > there is always a chance that someone misses updating this comment. I
> > > feel keeping it generic would be better in this case, as the problem
> > > is generic even though it currently shows up for pg_largeobject.
> >
> > Yes, for sure. So let's keep it generic as you prefer.
> >
> > Thank you!
>
> Thanks for working on the patch. I'm also OK with pushing Amit's fix patch.
>

Thanks to both of you. I have pushed the patch.

--
With Regards,
Amit Kapila.



RE: Random pg_upgrade test failure on drongo

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

> 
> Thanks to both of you. I have pushed the patch.
>

I have been tracking the BF animals these days, and this failure has not been
seen anymore. I think we can close this topic. Again, thanks for all your efforts.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED