Random pg_upgrade test failure on drongo - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject Random pg_upgrade test failure on drongo
Msg-id TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A@TYAPR01MB5866.jpnprd01.prod.outlook.com
List pgsql-hackers
Dear hackers,

While monitoring the buildfarm, I found that drongo failed the test pg_upgrade/003_logical_slots [1].
Strangely, the test passed in the next run. I am not yet sure of the cause,
but I will keep an eye on it and investigate if it happens again.

I think this failure is not related to our logical slots work, even though it
occurred in 003_logical_slots.pl. See the analysis below for details.

A deeper investigation may need the server log from the upgrade. That log is
written to the data directory, so the buildfarm system does not upload it. I may
need additional information if the test fails again.

# Analysis of failure

According to the output, pg_upgrade appears to have failed while restoring
objects to the new cluster [2].

At the code level, the pg_upgrade command failed in exec_prog(), which was
trying to run pg_restore for the database "postgres".
Below is a brief call stack. Note that pg_restore is not used for migrating
logical replication slots; that is done by the pg_upgrade binary itself. Also,
slot migration happens after all objects are copied, not in create_new_objects().

```
exec_prog()
parallel_exec_prog("pg_restore ... ") <-- Since the -j option is not specified, this is just a wrapper
create_new_objects()
main()
```

In exec_prog(), the call to the system() library function returned a non-zero
value. According to the documentation, when the command argument is not NULL,
system() returns the value returned by the command interpreter [3].
Unfortunately, the current code does not print that return code, and the
buildfarm system does not upload the data directory for failed tests.
Therefore, I could not obtain more information for the investigation.

[1]:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2023-11-07%2013%3A43%3A23&stg=pg_upgrade-check
[2]:
```
...
# No postmaster PID for node "oldpub"
# Running: pg_upgrade --no-sync -d
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_oldpub_data/pgdata
-D
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata
-b C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -B
C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -s 127.0.0.1 -p 54813 -P 54814 --copy
Performing Consistency Checks
...
Setting frozenxid and minmxid counters in new cluster         ok
Restoring global objects in the new cluster                   ok
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_logical_slots/data/t_003_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231107T142224.580/log/pg_upgrade_dump_5.log"
for
the probable cause of the failure.
Failure, exiting
[14:23:26.632](70.141s) not ok 10 - run of pg_upgrade of old cluster
[14:23:26.632](0.000s) #   Failed test 'run of pg_upgrade of old cluster'
#   at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 170.
### Starting node "newpub"
# Running: pg_ctl -w -D
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata
-l C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\log/003_logical_slots_newpub.log -o
--cluster-name=newpub start
waiting for server to start.... done
server started
# Postmaster PID for node "newpub" is 4604
[14:23:28.398](1.766s) not ok 11 - check the slot exists on new cluster
[14:23:28.398](0.001s) #   Failed test 'check the slot exists on new cluster'
#   at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 176.
[14:23:28.399](0.000s) #          got: ''
#     expected: 'regress_sub|t'
...
```
[3]: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/system-wsystem?view=msvc-170


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
