Thread: Windows CFBot is broken because ecpg dec_test.c error

Windows CFBot is broken because ecpg dec_test.c error

From
Jelte Fennema-Nio
Date:
Since about 11 hours ago the ecpg test has been consistently failing on
Windows with this error[1]:

> Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading

I took a quick look at possible causes but couldn't find a clear
winner. My current guess is that there's some dependency rule missing
in the meson file and due to some infra changes files now get compiled
in the wrong order.

One recent suspicious commit seems to be:
7819a25cd101b574f5422edb00fe3376fbb646da
But there are a bunch of successful changes that include that commit,
so it seems to be a red herring. (CC-ed Noah anyway)

[1]: https://cirrus-ci.com/task/6305422665056256?logs=check_world#L143



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Andres Freund
Date:
Hi,

On January 28, 2025 7:13:16 AM EST, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>Since about 11 hours ago the ecpg test has been consistently failing on
>Windows with this error[1]:
>
>> Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading
>
>I took a quick look at possible causes but couldn't find a clear
>winner. My current guess is that there's some dependency rule missing
>in the meson file and due to some infra changes files now get compiled
>in the wrong order.
>
>One recent suspicious commit seems to be:
>7819a25cd101b574f5422edb00fe3376fbb646da
>But there are a bunch of successful changes that include that commit,
>so it seems to be a red herring. (CC-ed Noah anyway)

I think it's due to a new version of meson. It seems we underspecified the test dependencies. I'll write up a patch.

Andres




Re: Windows CFBot is broken because ecpg dec_test.c error

From
Nazir Bilal Yavuz
Date:
Hi,

On Tue, 28 Jan 2025 at 17:02, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On January 28, 2025 7:13:16 AM EST, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> >Since about 11 hours ago the ecpg test has been consistently failing on
> >Windows with this error[1]:
> >
> >> Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading
> >
> >I took a quick look at possible causes but couldn't find a clear
> >winner. My current guess is that there's some dependency rule missing
> >in the meson file and due to some infra changes files now get compiled
> >in the wrong order.
> >
> >One recent suspicious commit seems to be:
> >7819a25cd101b574f5422edb00fe3376fbb646da
> >But there are a bunch of successful changes that include that commit,
> >so it seems to be a red herring. (CC-ed Noah anyway)
>
> I think it's due to a new version of meson. It seems we underspecified the test dependencies. I'll write up a patch.

The cause is that meson fixed a bug [1] in v1.7.0. Before meson
v1.7.0, meson was building all targets even though --no-rebuild was
used while running tests. This is fixed in v1.7.0.

The change below fixes the problem:

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 18e944ca89d..c7a94ff6471 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -17,7 +17,7 @@ env:
   CHECK: check-world PROVE_FLAGS=$PROVE_FLAGS
   CHECKFLAGS: -Otarget
   PROVE_FLAGS: --timer
-  MTEST_ARGS: --print-errorlogs --no-rebuild -C build
+  MTEST_ARGS: --print-errorlogs -C build
   PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
   TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
   PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance

And I think this is the correct approach. It builds all of the
not-yet-built targets before running the tests. Another solution might
be to manually build the ecpg target before running the tests, but I
think the former approach is more suitable for the CI.

CI run after this change applied: https://cirrus-ci.com/build/6264369203380224

[1] https://mesonbuild.com/Release-notes-for-1-7-0.html#test-targets-no-longer-built-by-default

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Andres Freund
Date:
Hi,

On 2025-01-29 18:24:45 +0300, Nazir Bilal Yavuz wrote:
> On Tue, 28 Jan 2025 at 17:02, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On January 28, 2025 7:13:16 AM EST, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > >Since about 11 hours ago the ecpg test has been consistently failing on
> > >Windows with this error[1]:
> > >
> > >> Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading
> > >
> > >I took a quick look at possible causes but couldn't find a clear
> > >winner. My current guess is that there's some dependency rule missing
> > >in the meson file and due to some infra changes files now get compiled
> > >in the wrong order.
> > >
> > >One recent suspicious commit seems to be:
> > >7819a25cd101b574f5422edb00fe3376fbb646da
> > >But there are a bunch of successful changes that include that commit,
> > >so it seems to be a red herring. (CC-ed Noah anyway)
> >
> > I think it's due to a new version of meson. It seems we underspecified the test dependencies. I'll write up a patch.

Sorry, got distracted with somewhat pressing matters.


> The cause is that meson fixed a bug [1] in v1.7.0. Before meson
> v1.7.0, meson was building all targets even though --no-rebuild was
> used while running tests. This is fixed in v1.7.0.
>
> [1] https://mesonbuild.com/Release-notes-for-1-7-0.html#test-targets-no-longer-built-by-default

That's not *quite* right - it wasn't that the targets were built when
--no-rebuild was specified, it's that the default build target (for just
'ninja') built all test dependencies.


> The change below fixes the problem:
>
> diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> index 18e944ca89d..c7a94ff6471 100644
> --- a/.cirrus.tasks.yml
> +++ b/.cirrus.tasks.yml
> @@ -17,7 +17,7 @@ env:
>    CHECK: check-world PROVE_FLAGS=$PROVE_FLAGS
>    CHECKFLAGS: -Otarget
>    PROVE_FLAGS: --timer
> -  MTEST_ARGS: --print-errorlogs --no-rebuild -C build
> +  MTEST_ARGS: --print-errorlogs -C build
>    PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
>    TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
>    PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance
>
> And I think this is the correct approach. It builds all of the
> not-yet-built targets before running the tests. Another solution might
> be to manually build the ecpg target before running the tests, but I
> think the former approach is more suitable for the CI.
>
> CI run after this change applied: https://cirrus-ci.com/build/6264369203380224

I don't think that's the entirety of the issue.

Our dependencies aren't quite airtight enough. With a sufficiently modern
meson, try doing e.g.

  rm -rf tmp_install/ && ninja clean && meson test --suite setup --suite ecpg

It'll fail, because the dependencies of the tests are insufficient.

See the set of patches at
https://www.postgresql.org/message-id/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2%4044hxxvtwmgum


I think the only reason your patch suffices on its own is that the "all"
target, which we ran separately beforehand, actually has sufficient
dependencies to make things work.
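
To illustrate the interaction described above, here is a toy model in Python.
It is only a sketch: the target names (server, client_lib, test_prog) are
hypothetical, and the functions do not reflect meson's or ninja's real logic.
It assumes a test whose declared dependencies cover only part of what it
actually needs.

```python
# Toy model only; target names are hypothetical, not PostgreSQL's real targets.
ALL_TARGETS = {"server", "client_lib", "test_prog"}
TEST_ONLY = {"test_prog"}            # targets that exist only for tests
NEEDED_BY_TEST = {"server", "client_lib", "test_prog"}
DECLARED_TEST_DEPS = {"test_prog"}   # underspecified: omits server, client_lib


def default_build(test_deps_in_default):
    """Return the targets built by a plain 'ninja' run in this model."""
    if test_deps_in_default:
        return set(ALL_TARGETS)
    return ALL_TARGETS - TEST_ONLY


def run_test(built, rebuild):
    """Return True if everything the test needs exists when it runs."""
    if rebuild:
        # Without --no-rebuild, only the *declared* dependencies get built.
        built = built | DECLARED_TEST_DEPS
    return NEEDED_BY_TEST <= built


old_meson = run_test(default_build(True), rebuild=False)
new_no_rebuild = run_test(default_build(False), rebuild=False)
new_rebuild_after_all = run_test(default_build(False), rebuild=True)
new_rebuild_from_clean = run_test(set(), rebuild=True)

print(old_meson, new_no_rebuild, new_rebuild_after_all, new_rebuild_from_clean)
```

In this model the third case corresponds to dropping --no-rebuild after a
separate build of "all", and the fourth to running the tests from a clean
tree, where the underspecified dependencies show up.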



The nice thing is that with this meson improvement we should be able to get
rid of the "setup" test suite and instead generate the test install via
dependencies.  Obviously we either have to wait a fair bit or do it depending
on the meson version...


Greetings,

Andres Freund



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Nazir Bilal Yavuz
Date:
Hi,

On Wed, 29 Jan 2025 at 19:50, Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-01-29 18:24:45 +0300, Nazir Bilal Yavuz wrote:
>
> > The cause is that meson fixed a bug [1] in v1.7.0. Before meson
> > v1.7.0, meson was building all targets even though --no-rebuild was
> > used while running tests. This is fixed in v1.7.0.
> >
> > [1] https://mesonbuild.com/Release-notes-for-1-7-0.html#test-targets-no-longer-built-by-default
>
> That's not *quite* right - it wasn't that the targets were built when
> --no-rebuild was specified, it's that the default build target (for just
> 'ninja') built all test dependencies.
>
>
> > The change below fixes the problem:
> >
> > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> > index 18e944ca89d..c7a94ff6471 100644
> > --- a/.cirrus.tasks.yml
> > +++ b/.cirrus.tasks.yml
> > @@ -17,7 +17,7 @@ env:
> >    CHECK: check-world PROVE_FLAGS=$PROVE_FLAGS
> >    CHECKFLAGS: -Otarget
> >    PROVE_FLAGS: --timer
> > -  MTEST_ARGS: --print-errorlogs --no-rebuild -C build
> > +  MTEST_ARGS: --print-errorlogs -C build
> >    PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
> >    TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
> >    PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance
> >
> > And I think this is the correct approach. It builds all of the
> > not-yet-built targets before running the tests. Another solution might
> > be to manually build the ecpg target before running the tests, but I
> > think the former approach is more suitable for the CI.
> >
> > CI run after this change applied: https://cirrus-ci.com/build/6264369203380224
>
> I don't think that's the entirety of the issue.
>
> Our dependencies aren't quite airtight enough. With a sufficiently modern
> meson, try doing e.g.
>
>   rm -rf tmp_install/ && ninja clean && meson test --suite setup --suite ecpg
>
> It'll fail, because the dependencies of the tests are insufficient.
>
> See the set of patches at
> https://www.postgresql.org/message-id/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2%4044hxxvtwmgum
>
>
> I think the only reason your patch suffices on its own is that the "all"
> target, which we ran separately beforehand, actually has sufficient
> dependencies to make things work.

Yes, you are right. I agree that what you said is the correct solution
and that should be the ultimate goal. What I shared could be a
band-aid fix to make the Windows CI task happy until the patches you
shared get committed. Another solution might be to downgrade the meson
version in the Windows images at the CI repository [1]; that would be
better for the commit history.

[1] https://github.com/anarazel/pg-vm-images

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Andres Freund
Date:
Hi,

On 2025-01-30 16:18:54 +0300, Nazir Bilal Yavuz wrote:
> On Wed, 29 Jan 2025 at 19:50, Andres Freund <andres@anarazel.de> wrote:
> > I don't think that's the entirety of the issue.
> >
> > Our dependencies aren't quite airtight enough. With a sufficiently modern
> > meson, try doing e.g.
> >
> >   rm -rf tmp_install/ && ninja clean && meson test --suite setup --suite ecpg
> >
> > It'll fail, because the dependencies of the tests are insufficient.
> >
> > See the set of patches at
> > https://www.postgresql.org/message-id/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2%4044hxxvtwmgum
> >
> >
> > I think the only reason your patch suffices on its own is that the "all"
> > target, which we ran separately beforehand, actually has sufficient
> > dependencies to make things work.
> 
> Yes, you are right. I agree that what you said is the correct solution
> and that should be the ultimate goal.

I think we need to fix this properly across branches. This version of meson is
going to become more common soon. And it'll be a problem for developers too, not
just CI.  I'll start working on committing these fixes across the branches,
unless somebody protests immediately.


> What I shared could be a band-aid fix to make the Windows CI task happy
> until the patches you shared get committed.

I think we'll still need something like what you propose.  Although I do think
it'd be better if we continued building all targets in a dedicated _script:
block, so that you can see all build failures in those steps.

Greetings,

Andres Freund



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Jelte Fennema-Nio
Date:
On Wed, 5 Feb 2025 at 00:22, Andres Freund <andres@anarazel.de> wrote:
> Pushed like that.
>
> I'll watch CI and BF over the next hours.

I guess you probably noticed, but in case you didn't: CI on windows is
still broken.



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Andres Freund
Date:
Hi,

On 2025-02-05 19:42:05 +0100, Jelte Fennema-Nio wrote:
> On Wed, 5 Feb 2025 at 00:22, Andres Freund <andres@anarazel.de> wrote:
> > Pushed like that.
> >
> > I'll watch CI and BF over the next hours.
> 
> I guess you probably noticed, but in case you didn't: CI on windows is
> still broken.

Huh. CI did pass on all platforms after my push:
https://cirrus-ci.com/github/postgres/postgres/

While there is a failure on master, it isn't due to this:
https://cirrus-ci.com/task/6185223693533184
[17:55:32.636] ------------------------------------- 8< -------------------------------------
[17:55:32.636] stderr:
[17:55:32.636] #   Failed test 'can't connect to invalid database - error message'
[17:55:32.636] #   at C:/cirrus/src/test/recovery/t/037_invalid_database.pl line 40.
[17:55:32.636] #                   'psql: error: connection to server on socket "C:/Windows/TEMP/kqIhcyR2yC/.s.PGSQL.31868" failed: server closed the connection unexpectedly
[17:55:32.636] #     This probably means the server terminated abnormally
[17:55:32.636] #     before or while processing the request.'
[17:55:32.636] #     doesn't match '(?^:FATAL:\s+cannot connect to invalid database "regression_invalid")'
[17:55:32.636] # Looks like you failed 1 test of 10.
[17:55:32.636] 
[17:55:32.636] (test program exited with status code 1)
[17:55:32.636] ------------------------------------------------------------------------------

I think that may be due to

commit a14707da564e8c94bd123f0e3a75e194fd7ef56a (upstream/master)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2025-02-05 12:45:58 -0500
 
    Show more-intuitive titles for psql commands \dt, \di, etc.




I do see a lot of failures on cfbot - but afaict that's because for some
reason there haven't been recent runs. Thomas?

E.g. the currently newest run is https://cirrus-ci.com/build/6378368658046976
which is based on

commit 43493cceda2
Author: Peter Eisentraut <peter@eisentraut.org>
Date:   2025-01-24 22:58:13 +0100

    Add get_opfamily_name() function

Greetings,

Andres Freund



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Tom Lane
Date:
Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> I guess you probably noticed, but in case you didn't: CI on windows is
> still broken.

Hard to tell, considering the cfbot has been completely wedged
since Sunday.

            regards, tom lane



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Andres Freund
Date:
Hi,

On 2025-02-05 14:09:02 -0500, Tom Lane wrote:
> Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> > I guess you probably noticed, but in case you didn't: CI on windows is
> > still broken.
> 
> Hard to tell, considering the cfbot has been completely wedged
> since Sunday.

It passed on the postgres repo just before this commit:
  https://cirrus-ci.com/build/4733656549294080
and then failed with it:
  https://cirrus-ci.com/build/5944955807465472

Greetings,

Andres Freund



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2025-02-05 14:09:02 -0500, Tom Lane wrote:
>> Hard to tell, considering the cfbot has been completely wedged
>> since Sunday.

> It passed on the postgres repo just before this commit:
>   https://cirrus-ci.com/build/4733656549294080
> and then failed with it:
>   https://cirrus-ci.com/build/5944955807465472

Hmm, maybe it's only the cfbot's web server that's broken,
but none of the pages at http://cfbot.cputube.org
appear to be updating.  What other mechanism are you using
to find the cirrus-ci.com logs?

            regards, tom lane



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Jelte Fennema-Nio
Date:
On Wed, 5 Feb 2025 at 20:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > On 2025-02-05 14:09:02 -0500, Tom Lane wrote:
> >> Hard to tell, considering the cfbot has been completely wedged
> >> since Sunday.
>
> > It passed on the postgres repo just before this commit:
> >   https://cirrus-ci.com/build/4733656549294080
> > and then failed with it:
> >   https://cirrus-ci.com/build/5944955807465472
>
> Hmm, maybe it's only the cfbot's web server that's broken,
> but none of the pages at http://cfbot.cputube.org
> appear to be updating.  What other mechanism are you using
> to find the cirrus-ci.com logs?

Ugh yes, cfbot isn't updating at all anymore. So Andres' commits might
very well have fixed the issue, but the prod cfbot is not doing any
builds at the moment...

I'll look into fixing that soonish. I took a quick look and it seems
related to some unexpected response from the Cirrus API.



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Andres Freund
Date:
Hi,

On 2025-02-05 14:20:59 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2025-02-05 14:09:02 -0500, Tom Lane wrote:
> >> Hard to tell, considering the cfbot has been completely wedged
> >> since Sunday.
> 
> > It passed on the postgres repo just before this commit:
> >   https://cirrus-ci.com/build/4733656549294080
> > and then failed with it:
> >   https://cirrus-ci.com/build/5944955807465472
> 
> Hmm, maybe it's only the cfbot's web server that's broken,
> but none of the pages at http://cfbot.cputube.org
> appear to be updating.

It does look to me like cfbot isn't updating the relevant branches, i.e. it's
not just the website that's not updating, or CI somehow not triggering after
cfbot updates the relevant branches.


> What other mechanism are you using to find the cirrus-ci.com logs?

This isn't run via cfbot, but via postgres' github mirror. Whenever the repo
sync pushes a change it also triggers CI.

You can see all the runs of that on
https://cirrus-ci.com/github/postgres/postgres/


CI on Windows failed in ecpg for a few days; there were just two master runs
that didn't fail in between that being fixed and the failure I linked to
above. But recovery/037_invalid_database didn't fail at that time.

Greetings,

Andres Freund



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Jelte Fennema-Nio
Date:
On Wed, 5 Feb 2025 at 20:29, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> I'll look into fixing that soonish. I took a quick look and it seems
> related to some unexpected response from the Cirrus API.

Okay I think I got it running again. It didn't like that there was no
commitfest with number 54 yet. So I created one, and it's doing more
than before now. I'll check after dinner if it's still running
correctly then.



Re: Windows CFBot is broken because ecpg dec_test.c error

From
Daniel Gustafsson
Date:
> On 5 Feb 2025, at 20:36, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> On Wed, 5 Feb 2025 at 20:29, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>> I'll look into fixing that soonish. I took a quick look and it seems
>> related to some unexpected response from the Cirrus API.
>
> Okay I think I got it running again. It didn't like that there was no
> commitfest with number 54 yet. So I created one, and it's doing more
> than before now. I'll check after dinner if it's still running
> correctly then.

For reference, you meant 53, right?  (There is no 54 in the system.) If the
CFBot always needs one in "Future" state we should document that to make sure we
don't miss that going forward (and perhaps automate it to make sure we don't
make manual work for ourselves).

--
Daniel Gustafsson




Re: Windows CFBot is broken because ecpg dec_test.c error

From
Jelte Fennema-Nio
Date:
On Wed, 5 Feb 2025 at 21:05, Daniel Gustafsson <daniel@yesql.se> wrote:
> For reference, you meant 53 right?

Yes, I meant 53

> If the
> CFBot always needs one in "Future" state we should document that to make sure we
> don't miss that going forward (and perhaps automate it to make sure we don't
> make manual work for ourselves).

Afaict it doesn't need at least one in the "Future" state; instead it
needs one after the current[1] commitfest. I don't think it should
rely on that though. So I created an issue to fix that[2].
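
As a rough idea of what not relying on that could look like, here is a
hypothetical Python sketch. It is not cfbot's actual code or data model: the
function name is made up, the ids are illustrative, and it assumes commitfest
ids are consecutive integers.

```python
# Hypothetical sketch; not cfbot's actual code or data model.
def commitfests_to_process(existing_ids, current_id):
    """Return the current commitfest id, plus the next one only if it exists."""
    if current_id not in existing_ids:
        raise ValueError(f"current commitfest {current_id} does not exist")
    selected = [current_id]
    next_id = current_id + 1  # assumption: ids are consecutive integers
    if next_id in existing_ids:
        selected.append(next_id)
    return selected


# Illustrative ids only.
without_next = commitfests_to_process({51, 52}, 52)
with_next = commitfests_to_process({51, 52, 53}, 52)
print(without_next, with_next)
```

The point is only that a missing follow-up commitfest becomes a normal case
instead of an error.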

It does seem silly that we require people to manually create new
commitfests though, so I created an issue to track that[3].

[1]: https://commitfest.postgresql.org/current/
[2]: https://github.com/macdice/cfbot/issues/22
[3]: https://github.com/postgres/pgcommitfest/issues/25