Thread: Windows CFBot is broken because ecpg dec_test.c error
Since about 11 hours ago the ecpg test has been consistently failing on Windows with this error[1]:

> Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading

I took a quick look at possible causes but couldn't find a clear winner. My current guess is that there's some dependency rule missing in the meson file and, due to some infra changes, files now get compiled in the wrong order.

One recent suspicious commit seems to be:
7819a25cd101b574f5422edb00fe3376fbb646da
But there are a bunch of successful changes that include that commit, so it seems to be a red herring. (CC-ed Noah anyway)

[1]: https://cirrus-ci.com/task/6305422665056256?logs=check_world#L143
Hi,

On January 28, 2025 7:13:16 AM EST, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> Since about 11 hours ago the ecpg test has been consistently failing on Windows with this error[1]:
>
>> Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading
>
> I took a quick look at possible causes but couldn't find a clear winner. My current guess is that there's some dependency rule missing in the meson file and, due to some infra changes, files now get compiled in the wrong order.
>
> One recent suspicious commit seems to be:
> 7819a25cd101b574f5422edb00fe3376fbb646da
> But there are a bunch of successful changes that include that commit, so it seems to be a red herring. (CC-ed Noah anyway)

I think it's due to a new version of meson. Seems we underspecified test dependencies. I'll write up a patch.

Andres
Hi,

On Tue, 28 Jan 2025 at 17:02, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On January 28, 2025 7:13:16 AM EST, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > Since about 11 hours ago the ecpg test has been consistently failing on Windows with this error[1]:
> >
> >> Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading
> >
> > I took a quick look at possible causes but couldn't find a clear winner. My current guess is that there's some dependency rule missing in the meson file and, due to some infra changes, files now get compiled in the wrong order.
> >
> > One recent suspicious commit seems to be:
> > 7819a25cd101b574f5422edb00fe3376fbb646da
> > But there are a bunch of successful changes that include that commit, so it seems to be a red herring. (CC-ed Noah anyway)
>
> I think it's due to a new version of meson. Seems we underspecified test dependencies. I'll write up a patch.

The cause is that meson fixed a bug [1] in v1.7.0. Before meson v1.7.0, although --no-rebuild was used while running tests, meson was building all targets. This is fixed in v1.7.0.

The change below fixes the problem:

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 18e944ca89d..c7a94ff6471 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -17,7 +17,7 @@ env:
   CHECK: check-world PROVE_FLAGS=$PROVE_FLAGS
   CHECKFLAGS: -Otarget
   PROVE_FLAGS: --timer
-  MTEST_ARGS: --print-errorlogs --no-rebuild -C build
+  MTEST_ARGS: --print-errorlogs -C build
   PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
   TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
   PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance

And I think this is the correct approach. It builds all of the not-yet-built targets before running the tests. Another solution might be to manually build the ecpg target before running the tests, but I think the former approach is more suitable for the CI.
CI run with this change applied: https://cirrus-ci.com/build/6264369203380224

[1] https://mesonbuild.com/Release-notes-for-1-7-0.html#test-targets-no-longer-built-by-default

--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,

On 2025-01-29 18:24:45 +0300, Nazir Bilal Yavuz wrote:
> On Tue, 28 Jan 2025 at 17:02, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On January 28, 2025 7:13:16 AM EST, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > > Since about 11 hours ago the ecpg test has been consistently failing on Windows with this error[1]:
> > >
> > >> Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading
> > >
> > > I took a quick look at possible causes but couldn't find a clear winner. My current guess is that there's some dependency rule missing in the meson file and, due to some infra changes, files now get compiled in the wrong order.
> > >
> > > One recent suspicious commit seems to be:
> > > 7819a25cd101b574f5422edb00fe3376fbb646da
> > > But there are a bunch of successful changes that include that commit, so it seems to be a red herring. (CC-ed Noah anyway)
> >
> > I think it's due to a new version of meson. Seems we underspecified test dependencies. I'll write up a patch.

Sorry, got distracted with somewhat pressing matters.

> The cause is that meson fixed a bug [1] in v1.7.0. Before meson v1.7.0, although --no-rebuild was used while running tests, meson was building all targets. This is fixed in v1.7.0.
>
> [1] https://mesonbuild.com/Release-notes-for-1-7-0.html#test-targets-no-longer-built-by-default

That's not *quite* right: it wasn't that the targets were built when --no-rebuild was specified, it's that the default build target (for just 'ninja') built all test dependencies.
> The change below fixes the problem:
>
> diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> index 18e944ca89d..c7a94ff6471 100644
> --- a/.cirrus.tasks.yml
> +++ b/.cirrus.tasks.yml
> @@ -17,7 +17,7 @@ env:
>    CHECK: check-world PROVE_FLAGS=$PROVE_FLAGS
>    CHECKFLAGS: -Otarget
>    PROVE_FLAGS: --timer
> -  MTEST_ARGS: --print-errorlogs --no-rebuild -C build
> +  MTEST_ARGS: --print-errorlogs -C build
>    PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
>    TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
>    PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance
>
> And I think this is the correct approach. It builds all of the not-yet-built targets before running the tests. Another solution might be to manually build the ecpg target before running the tests, but I think the former approach is more suitable for the CI.
>
> CI run with this change applied: https://cirrus-ci.com/build/6264369203380224

I don't think that's the entirety of the issue.

Our dependencies aren't quite airtight enough. With a sufficiently modern meson, try doing e.g.

rm -rf tmp_install/ && ninja clean && meson test --suite setup --suite ecpg

It'll fail, because the dependencies of the tests are insufficient.

See the set of patches at
https://www.postgresql.org/message-id/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2%4044hxxvtwmgum

I think the only reason your patch on its own suffices is that the "all" target, which we ran separately beforehand, actually has sufficient dependencies to make things work.

The nice thing is that with this meson improvement we should be able to get rid of the "setup" test suite and instead generate the test install via dependencies. Obviously we either have to wait a fair bit or do it depending on the meson version...

Greetings,

Andres Freund
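To make the behaviour change concrete, here is a deliberately simplified Python model. It is an illustration only, not meson's implementation: the `Target` class, its `build_by_default`/`test_only` fields, and the exact (1, 7, 0) cutoff are assumptions drawn from the discussion above and the linked release note. It shows why a test-only generated file such as dec_test.c can be missing when tests run with --no-rebuild after a plain `ninja`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical build target in a toy model (not meson's API)."""
    name: str
    build_by_default: bool  # built by a plain `ninja` in every version of the model
    test_only: bool         # needed only by tests, e.g. a generated test source


def default_build(targets, meson_version):
    """Names of targets a plain `ninja` (the "all" target) builds in this model."""
    if meson_version >= (1, 7, 0):
        # Newer behaviour: test-only targets are no longer pulled in by default.
        return {t.name for t in targets if t.build_by_default}
    # Older behaviour: the default target also built every test dependency.
    return {t.name for t in targets if t.build_by_default or t.test_only}


def missing_with_no_rebuild(targets, meson_version, test_deps):
    """Test dependencies that do not exist when tests run with --no-rebuild."""
    return sorted(set(test_deps) - default_build(targets, meson_version))


targets = [
    Target("ecpg", build_by_default=True, test_only=False),
    Target("dec_test.c", build_by_default=False, test_only=True),
]
deps = ["ecpg", "dec_test.c"]
print(missing_with_no_rebuild(targets, (1, 6, 1), deps))  # []
print(missing_with_no_rebuild(targets, (1, 7, 0), deps))  # ['dec_test.c']
```

The model does not capture how `meson test` without --no-rebuild decides what to build; per the thread, that path also relies on the tests declaring their dependencies correctly, which is what Andres's patch set addresses.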
Hi,

On Wed, 29 Jan 2025 at 19:50, Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-01-29 18:24:45 +0300, Nazir Bilal Yavuz wrote:
>
> > The cause is that meson fixed a bug [1] in v1.7.0. Before meson v1.7.0, although --no-rebuild was used while running tests, meson was building all targets. This is fixed in v1.7.0.
> >
> > [1] https://mesonbuild.com/Release-notes-for-1-7-0.html#test-targets-no-longer-built-by-default
>
> That's not *quite* right: it wasn't that the targets were built when
> --no-rebuild was specified, it's that the default build target (for just
> 'ninja') built all test dependencies.
>
>
> > The change below fixes the problem:
> >
> > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> > index 18e944ca89d..c7a94ff6471 100644
> > --- a/.cirrus.tasks.yml
> > +++ b/.cirrus.tasks.yml
> > @@ -17,7 +17,7 @@ env:
> >    CHECK: check-world PROVE_FLAGS=$PROVE_FLAGS
> >    CHECKFLAGS: -Otarget
> >    PROVE_FLAGS: --timer
> > -  MTEST_ARGS: --print-errorlogs --no-rebuild -C build
> > +  MTEST_ARGS: --print-errorlogs -C build
> >    PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
> >    TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
> >    PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance
> >
> > And I think this is the correct approach. It builds all of the not-yet-built targets before running the tests. Another solution might be to manually build the ecpg target before running the tests, but I think the former approach is more suitable for the CI.
> >
> > CI run with this change applied: https://cirrus-ci.com/build/6264369203380224
>
> I don't think that's the entirety of the issue.
>
> Our dependencies aren't quite airtight enough. With a sufficiently modern
> meson, try doing e.g.
>
> rm -rf tmp_install/ && ninja clean && meson test --suite setup --suite ecpg
>
> It'll fail, because the dependencies of the tests are insufficient.
>
> See the set of patches at
> https://www.postgresql.org/message-id/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2%4044hxxvtwmgum
>
>
> I think the only reason your patch on its own suffices is that the "all"
> target, which we ran separately beforehand, actually has sufficient
> dependencies to make things work.

Yes, you are right. I agree that what you said is the correct solution and that it should be the ultimate goal. What I shared could be a band-aid fix to make the Windows CI task happy until the patches you shared get committed. Another solution might be to downgrade the meson version in the Windows images at the CI repository [1]; that would be better for the commit history.

[1] https://github.com/anarazel/pg-vm-images

--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,

On 2025-01-30 16:18:54 +0300, Nazir Bilal Yavuz wrote:
> On Wed, 29 Jan 2025 at 19:50, Andres Freund <andres@anarazel.de> wrote:
> > I don't think that's the entirety of the issue.
> >
> > Our dependencies aren't quite airtight enough. With a sufficiently modern
> > meson, try doing e.g.
> >
> > rm -rf tmp_install/ && ninja clean && meson test --suite setup --suite ecpg
> >
> > It'll fail, because the dependencies of the tests are insufficient.
> >
> > See the set of patches at
> > https://www.postgresql.org/message-id/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2%4044hxxvtwmgum
> >
> >
> > I think the only reason your patch on its own suffices is that the "all"
> > target, which we ran separately beforehand, actually has sufficient
> > dependencies to make things work.
>
> Yes, you are right. I agree that what you said is the correct solution
> and that it should be the ultimate goal.

I think we need to fix this properly across branches. This version of meson is going to become more common soon, and it'll be a problem for developers too, not just CI.

I'll start working on committing these fixes across the branches, unless somebody protests immediately.

> What I shared could be a band-aid fix to make the Windows CI task happy
> until the patches you shared get committed.

I think we'll still need something like what you propose, although I do think it'd be better if we continued building all targets in a dedicated _script: block, so that you can see all build failures in those steps.

Greetings,

Andres Freund
On Wed, 5 Feb 2025 at 00:22, Andres Freund <andres@anarazel.de> wrote:
> Pushed like that.
>
> I'll watch CI and BF over the next hours.

I guess you probably noticed, but in case you didn't: CI on Windows is still broken.
Hi,

On 2025-02-05 19:42:05 +0100, Jelte Fennema-Nio wrote:
> On Wed, 5 Feb 2025 at 00:22, Andres Freund <andres@anarazel.de> wrote:
> > Pushed like that.
> >
> > I'll watch CI and BF over the next hours.
>
> I guess you probably noticed, but in case you didn't: CI on Windows is
> still broken.

Huh. CI did pass on all platforms after my push:
https://cirrus-ci.com/github/postgres/postgres/

While there is a failure on master, it isn't due to this:
https://cirrus-ci.com/task/6185223693533184

[17:55:32.636] ------------------------------------- 8< -------------------------------------
[17:55:32.636] stderr:
[17:55:32.636] # Failed test 'can't connect to invalid database - error message'
[17:55:32.636] # at C:/cirrus/src/test/recovery/t/037_invalid_database.pl line 40.
[17:55:32.636] # 'psql: error: connection to server on socket "C:/Windows/TEMP/kqIhcyR2yC/.s.PGSQL.31868"failed: server closed the connection unexpectedly
[17:55:32.636] # This probably means the server terminated abnormally
[17:55:32.636] # before or while processing the request.'
[17:55:32.636] # doesn't match '(?^:FATAL:\s+cannot connect to invalid database "regression_invalid")'
[17:55:32.636] # Looks like you failed 1 test of 10.
[17:55:32.636]
[17:55:32.636] (test program exited with status code 1)
[17:55:32.636] ------------------------------------------------------------------------------

I think that may be due to

commit a14707da564e8c94bd123f0e3a75e194fd7ef56a (upstream/master)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: 2025-02-05 12:45:58 -0500

    Show more-intuitive titles for psql commands \dt, \di, etc.

I do see a lot of failures on cfbot, but afaict that's because for some reason there haven't been recent runs. Thomas?

E.g. the currently newest run is
https://cirrus-ci.com/build/6378368658046976
which is based on

commit 43493cceda2
Author: Peter Eisentraut <peter@eisentraut.org>
Date: 2025-01-24 22:58:13 +0100

    Add get_opfamily_name() function

Greetings,

Andres Freund
Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> I guess you probably noticed, but in case you didn't: CI on Windows is
> still broken.

Hard to tell, considering the cfbot has been completely wedged
since Sunday.

			regards, tom lane
Hi,

On 2025-02-05 14:09:02 -0500, Tom Lane wrote:
> Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> > I guess you probably noticed, but in case you didn't: CI on Windows is
> > still broken.
>
> Hard to tell, considering the cfbot has been completely wedged
> since Sunday.

It passed on the postgres repo just before this commit:
https://cirrus-ci.com/build/4733656549294080
and then failed with it:
https://cirrus-ci.com/build/5944955807465472

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2025-02-05 14:09:02 -0500, Tom Lane wrote:
>> Hard to tell, considering the cfbot has been completely wedged
>> since Sunday.

> It passed on the postgres repo just before this commit:
> https://cirrus-ci.com/build/4733656549294080
> and then failed with it:
> https://cirrus-ci.com/build/5944955807465472

Hmm, maybe it's only the cfbot's web server that's broken,
but none of the pages at http://cfbot.cputube.org
appear to be updating. What other mechanism are you using
to find the cirrus-ci.com logs?

			regards, tom lane
On Wed, 5 Feb 2025 at 20:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > On 2025-02-05 14:09:02 -0500, Tom Lane wrote:
> >> Hard to tell, considering the cfbot has been completely wedged
> >> since Sunday.
>
> > It passed on the postgres repo just before this commit:
> > https://cirrus-ci.com/build/4733656549294080
> > and then failed with it:
> > https://cirrus-ci.com/build/5944955807465472
>
> Hmm, maybe it's only the cfbot's web server that's broken,
> but none of the pages at http://cfbot.cputube.org
> appear to be updating. What other mechanism are you using
> to find the cirrus-ci.com logs?

Ugh, yes, cfbot isn't updating at all anymore. So Andres' commits might very well have fixed the issue, but the prod cfbot is not doing any builds at the moment...

I'll look into fixing that soonish. I took a quick look and it seems related to some unexpected response from the Cirrus API.
Hi,

On 2025-02-05 14:20:59 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2025-02-05 14:09:02 -0500, Tom Lane wrote:
> >> Hard to tell, considering the cfbot has been completely wedged
> >> since Sunday.
>
> > It passed on the postgres repo just before this commit:
> > https://cirrus-ci.com/build/4733656549294080
> > and then failed with it:
> > https://cirrus-ci.com/build/5944955807465472
>
> Hmm, maybe it's only the cfbot's web server that's broken,
> but none of the pages at http://cfbot.cputube.org
> appear to be updating.

It does look to me like cfbot isn't updating the relevant branches, i.e. it's neither just the website not updating, nor CI somehow not triggering after cfbot updates the relevant branches.

> What other mechanism are you using to find the cirrus-ci.com logs?

This isn't run via cfbot, but via postgres' GitHub mirror. Whenever the repo sync pushes a change, it also triggers CI. You can see all the runs of that on
https://cirrus-ci.com/github/postgres/postgres/

CI on Windows failed in ecpg for a few days; there were just two master runs that didn't fail in between that being fixed and the failure I linked to above. But recovery/037_invalid_database didn't fail at that time.

Greetings,

Andres Freund
On Wed, 5 Feb 2025 at 20:29, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> I'll look into fixing that soonish. I took a quick look and it seems
> related to some unexpected response from the Cirrus API.

Okay, I think I got it running again. It didn't like that there was no commitfest with number 54 yet, so I created one, and it's doing more than before now. I'll check after dinner whether it's still running correctly.
> On 5 Feb 2025, at 20:36, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> On Wed, 5 Feb 2025 at 20:29, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>> I'll look into fixing that soonish. I took a quick look and it seems
>> related to some unexpected response from the Cirrus API.
>
> Okay, I think I got it running again. It didn't like that there was no
> commitfest with number 54 yet, so I created one, and it's doing more
> than before now. I'll check after dinner whether it's still running
> correctly.

For reference, you meant 53, right? (There is no 54 in the system.) If the CFBot always needs one in "Future" state, we should document that to make sure we don't miss it going forward (and perhaps automate it to make sure we don't make manual work for ourselves).

--
Daniel Gustafsson
On Wed, 5 Feb 2025 at 21:05, Daniel Gustafsson <daniel@yesql.se> wrote:
> For reference, you meant 53, right?

Yes, I meant 53.

> If the CFBot always needs one in "Future" state, we should document that to make sure we
> don't miss it going forward (and perhaps automate it to make sure we don't
> make manual work for ourselves).

Afaict it doesn't need one in the "Future" state; instead, it needs one after the current[1] commitfest. I don't think it should rely on that, though, so I created an issue to fix that[2].

It does seem silly that we require people to manually create new commitfests, though, so I created an issue to track that[3].

[1]: https://commitfest.postgresql.org/current/
[2]: https://github.com/macdice/cfbot/issues/22
[3]: https://github.com/postgres/pgcommitfest/issues/25
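The failure mode described above (cfbot assuming a commitfest exists after the current one) can be avoided by treating a missing next commitfest as an ordinary case rather than an error. A minimal Python sketch, assuming commitfests are identified only by integer numbers; the function name `next_commitfest` and the numbers are hypothetical and illustrative, and this is not cfbot's actual code or data model:

```python
from typing import Iterable, Optional


def next_commitfest(existing: Iterable[int], current: int) -> Optional[int]:
    """Return the number of the first commitfest after `current`, or None.

    Returning None (instead of raising or indexing past the end) lets the
    caller decide what to do when nobody has created that commitfest yet.
    """
    later = [n for n in existing if n > current]
    return min(later) if later else None


# Illustrative numbers only: 52 is current, 53 has not been created yet.
print(next_commitfest([50, 51, 52], current=52))      # None
print(next_commitfest([50, 51, 52, 53], current=52))  # 53
```

How cfbot actually uses the following commitfest is not described in the thread; the point is only that the lookup can report "not there yet" so the caller can skip or log instead of wedging.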