Thread: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

As some of you might have seen when running CI, cirrus-ci is restricting how
much CI cycles everyone can use for free (announcement at [1]). This takes
effect September 1st.

This obviously has consequences both for individual users of CI as well as
cfbot.


The first thing I think we should do is to lower the cost of CI. One thing I
had not entirely realized previously, is that macos CI is by far the most
expensive CI to provide. That's not just the case with cirrus-ci, but also
with other providers.  See the series of patches described later in the email.


To me, the situation for cfbot is different than the one for individual
users.

IMO, for the individual user case it's important to use CI for "free", without
a whole lot of complexity. Which imo rules out approaches like providing
$cloud_provider compute accounts, that's too much setup work.  With the
improvements detailed below, cirrus' free CI would last about ~65 runs /
month.

For cfbot I hope we can find funding to pay for compute to use for CI. The, by
far, most expensive bit is macos. To a significant degree due to macos
licensing terms not allowing more than 2 VMs on a physical host :(.


The reasons we chose cirrus-ci were:

a) Ability to use full VMs, rather than a pre-selected set of VMs, which
   allows us to test a larger number

b) Ability to link to log files, without requiring an account. E.g. github
   actions doesn't allow viewing logs unless logged in.

c) Amount of compute available.


The set of free CI providers has shrunk since we chose cirrus, as have the
"free" resources provided. I started a wiki page, quite incomplete as of now,
at [4].


Potential paths forward for individual CI:

- migrate wholesale to another CI provider

- split CI tasks across different CI providers, rely on github et al
  displaying the CI status for different platforms

- give up


Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
  credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
  could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.


Opinions as to what to do?



The attached series of patches:

1) Makes startup of macos instances faster, using more efficient caching of
   the required packages. Also submitted as [2].

2) Introduces a template initdb that's reused during the tests. Also submitted
   as [3]

3) Remove use of -DRANDOMIZE_ALLOCATED_MEMORY from macos tasks. It's
   expensive. And CI also uses asan on linux, so I don't think it's really
   needed.

4) Switch tasks to use debugoptimized builds. Previously many tasks used -Og,
   to get decent backtraces etc. But the amount of CPU burned that way is too
   large. One issue with that is that use of ccache becomes much more crucial,
   uncached build times do significantly increase.

5) Move use of -Dsegsize_blocks=6 from macos to linux

   Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
   could stop covering both meson and autoconf segsize_blocks. It does affect
   runtime on linux as well.

6) Disable write cache flushes on windows

   It's a bit ugly to do this without using the UI... Shaves off about 30s
   from the tests.

7) pg_regress only checked once a second whether postgres started up, but it's
   usually much faster. Use pg_ctl's logic.  It might be worth replacing the
   use of psql with directly using libpq in pg_regress instead, looks like the
   overhead of repeatedly starting psql is noticeable.
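
To illustrate why the polling interval in item 7 matters, here is a minimal toy model. It is not pg_regress or pg_ctl code: the function name `detection_time`, the 0.25 s startup time and the 0.1 s interval are all assumptions chosen for illustration.

```python
import math


def detection_time(ready_at: float, poll_interval: float) -> float:
    """Return when a poller first notices a server that became ready at
    ready_at seconds, if it checks every poll_interval seconds starting
    at t = poll_interval."""
    polls_needed = max(1, math.ceil(ready_at / poll_interval))
    return polls_needed * poll_interval


# Hypothetical startup time of 0.25 s, chosen only for illustration.
READY_AT = 0.25
slow = detection_time(READY_AT, 1.0)  # checking once a second
fast = detection_time(READY_AT, 0.1)  # assumed finer-grained interval
print(f"1 s polling notices startup at {slow:.1f} s")
print(f"0.1 s polling notices startup at {fast:.1f} s")
```

In this model the once-a-second poller always waits a full second, however quickly the server comes up, and that wait is paid on every server start during a test run.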


FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
like the following (depends on caching etc):

task costs in credits
    linux-sanity: 0.01
    linux-compiler-warnings: 0.05
    linux-meson: 0.07
    freebsd   : 0.08
    linux-autoconf: 0.09
    windows   : 0.18
    macos     : 0.28
total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
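
As a rough check of the arithmetic above, a minimal sketch using the rounded per-task figures from the list. With rounded inputs the result comes out near 65.8 runs rather than the 66.10 quoted, presumably because the original calculation used unrounded costs (that is an assumption).

```python
# Per-task credit costs copied from the list above (already rounded).
task_credits = {
    "linux-sanity": 0.01,
    "linux-compiler-warnings": 0.05,
    "linux-meson": 0.07,
    "freebsd": 0.08,
    "linux-autoconf": 0.09,
    "windows": 0.18,
    "macos": 0.28,
}
MONTHLY_FREE_CREDITS = 50  # monthly free credits, from the message above

cost_per_run = sum(task_credits.values())
runs_per_month = MONTHLY_FREE_CREDITS / cost_per_run

print(f"cost per run: {cost_per_run:.2f} credits")
print(f"runs per month: {runs_per_month:.1f}")
```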


Greetings,

Andres Freund

[1] https://cirrus-ci.org/blog/2023/07/17/limiting-free-usage-of-cirrus-ci/
[2] https://www.postgresql.org/message-id/20230805202539.r3umyamsnctysdc7%40awork3.anarazel.de
[3] https://postgr.es/m/20220120021859.3zpsfqn4z7ob7afz@alap3.anarazel.de


Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:
> The set of free CI providers has shrunk since we chose cirrus, as have the
> "free" resources provided. I started, quite incomplete as of now, wiki page at
> [4].

Oops, as Thomas just noticed, I left off that link:

[4] https://wiki.postgresql.org/wiki/CI_Providers

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Heikki Linnakangas
Date:
On 08/08/2023 05:15, Andres Freund wrote:
> IMO, for the individual user case it's important to use CI for "free", without
> a whole lot of complexity. Which imo rules out approaches like providing
> $cloud_provider compute accounts, that's too much setup work.

+1

> With the improvements detailed below, cirrus' free CI would last
> about ~65 runs / month.
I think that's plenty.

> For cfbot I hope we can find funding to pay for compute to use for CI.

+1

> Potential paths forward for cfbot, in addition to the above:
> 
> - Pay for compute / ask the various cloud providers to grant us compute
>    credits. At least some of the cloud providers can be used via cirrus-ci.
> 
> - Host (some) CI runners ourselves. Particularly with macos and windows, that
>    could provide significant savings.
> 
> - Build our own system, using buildbot, jenkins or whatnot.
> 
> 
> Opinions as to what to do?

The resources for running our own system aren't free either. I'm sure we 
can get sponsors for the cirrus-ci credits, or use donations.

I have been quite happy with Cirrus CI overall.

> The attached series of patches:

All of this makes sense to me, although I don't use macos myself.

> 5) Move use of -Dsegsize_blocks=6 from macos to linux
> 
>     Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
>     could stop covering both meson and autoconf segsize_blocks. It does affect
>     runtime on linux as well.

Could we have a comment somewhere on why we use -Dsegsize_blocks on 
these particular CI runs? It seems pretty random. I guess the idea is to 
have one autoconf task and one meson task with that option, to check 
that the autoconf/meson option works?

> 6) Disable write cache flushes on windows
> 
>     It's a bit ugly to do this without using the UI... Shaves off about 30s
>     from the tests.

A brief comment would be nice: "We don't care about persistence over 
hard crashes in the CI, so disable write cache flushes to speed it up."

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Peter Eisentraut
Date:
On 08.08.23 04:15, Andres Freund wrote:
> Potential paths forward for individual CI:
> 
> - migrate wholesale to another CI provider
> 
> - split CI tasks across different CI providers, rely on github et al
>    displaying the CI status for different platforms
> 
> - give up

With the proposed optimizations, it seems you can still do a fair amount 
of work under the free plan.

> Potential paths forward for cfbot, in addition to the above:
> 
> - Pay for compute / ask the various cloud providers to grant us compute
>    credits. At least some of the cloud providers can be used via cirrus-ci.
> 
> - Host (some) CI runners ourselves. Particularly with macos and windows, that
>    could provide significant savings.
> 
> - Build our own system, using buildbot, jenkins or whatnot.

I think we should use the "compute credits" plan from Cirrus CI.  It 
should be possible to estimate the costs for that.  Money is available, 
I think.




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-08 16:28:49 +0200, Peter Eisentraut wrote:
> On 08.08.23 04:15, Andres Freund wrote:
> > Potential paths forward for cfbot, in addition to the above:
> >
> > - Pay for compute / ask the various cloud providers to grant us compute
> >    credits. At least some of the cloud providers can be used via cirrus-ci.
> >
> > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> >    could provide significant savings.
> >
> > - Build our own system, using buildbot, jenkins or whatnot.
>
> I think we should use the "compute credits" plan from Cirrus CI.  It should
> be possible to estimate the costs for that.  Money is available, I think.

Unfortunately just doing that seems like it would end up considerably on the
too expensive side. Here are the stats for last month's cfbot runtimes
(provided by Thomas):

                   task_name                    |    sum
------------------------------------------------+------------
 FreeBSD - 13 - Meson                           | 1017:56:09
 Windows - Server 2019, MinGW64 - Meson         | 00:00:00
 SanityCheck                                    | 76:48:41
 macOS - Ventura - Meson                        | 873:12:43
 Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
 Linux - Debian Bullseye - Autoconf             | 830:17:26
 Linux - Debian Bullseye - Meson                | 860:37:21
 CompilerWarnings                               | 935:30:35
(8 rows)

If I did the math right, that's about 7000 credits (and 1 credit costs 1 USD).

task costs in credits
    linux-sanity: 55.30
    linux-autoconf: 598.04
    linux-meson: 619.40
    linux-compiler-warnings: 674.28
    freebsd   : 732.24
    windows   : 1201.09
    macos     : 3143.52
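
The totals can be reproduced with a short sketch. The runtimes and credit figures are copied from the two listings above; the mapping between the long task names and the short ones (for example SanityCheck to linux-sanity) is an assumption, as is the helper name `to_hours`.

```python
def to_hours(hms: str) -> float:
    """Convert an HHH:MM:SS duration string into fractional hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


# Monthly runtimes from the table above, keyed by the short task names.
runtime = {
    "linux-sanity": "76:48:41",
    "linux-autoconf": "830:17:26",
    "linux-meson": "860:37:21",
    "linux-compiler-warnings": "935:30:35",
    "freebsd": "1017:56:09",
    "windows": "1251:08:06",  # VS 2019 task; the MinGW64 task shows 00:00:00
    "macos": "873:12:43",
}

# Credit costs from the list above.
credits = {
    "linux-sanity": 55.30,
    "linux-autoconf": 598.04,
    "linux-meson": 619.40,
    "linux-compiler-warnings": 674.28,
    "freebsd": 732.24,
    "windows": 1201.09,
    "macos": 3143.52,
}

total_hours = sum(to_hours(v) for v in runtime.values())
total_credits = sum(credits.values())
credits_per_hour = {t: credits[t] / to_hours(runtime[t]) for t in credits}

print(f"total runtime: {total_hours:.0f} hours")
print(f"total cost: {total_credits:.2f} credits")
for task, rate in sorted(credits_per_hour.items(), key=lambda kv: kv[1]):
    print(f"{task:24s} {rate:.2f} credits/hour")
```

The sum is a little over 7000 credits, in line with the estimate above. Per hour of runtime, macos works out to about 3.6 credits against roughly 0.72 for the Linux and FreeBSD tasks, which is why macos dominates the bill despite not having the longest runtime.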


Now, those times are before optimizing test runtime. And besides optimizing
the tasks, we can also optimize not running tests for docs patches etc. And
optimize cfbot to schedule a bit better.

But still, the costs don't look realistic to me.

If instead we were to use our own GCP account, it's a lot less. t2d-standard-4
instances, which are faster than what we use right now, cost $0.168984 / hour
as "normal" instances and $0.026764 as "spot" instances right now [1]. Windows
VMs are considerably more expensive due to licensing - 0.184$/h in addition.

Assuming spot instances, linux+freebsd tasks would cost ~100 USD / month (maybe
10-20% more in reality, due to a) spot instances getting terminated requiring
retries and b) disks).

Windows would be ~255 USD / month (same retries caveats).
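
A back-of-the-envelope sketch of these estimates, assuming one hour of last month's task runtime maps to one instance-hour at the quoted prices (retries, disks and any speedup from faster instances are ignored). The hour figures are rounded from the runtime table earlier in this message.

```python
SPOT_USD_PER_HOUR = 0.026764          # t2d-standard-4 spot price quoted above
WINDOWS_LICENSE_USD_PER_HOUR = 0.184  # Windows licensing surcharge quoted above

# SanityCheck, Autoconf, Meson, CompilerWarnings, FreeBSD (hours, rounded).
linux_freebsd_hours = 76.8 + 830.3 + 860.6 + 935.5 + 1017.9
windows_hours = 1251.1  # VS 2019 task only

linux_freebsd_usd = linux_freebsd_hours * SPOT_USD_PER_HOUR
windows_usd = windows_hours * (SPOT_USD_PER_HOUR + WINDOWS_LICENSE_USD_PER_HOUR)

print(f"linux+freebsd: ~{linux_freebsd_usd:.0f} USD / month")
print(f"windows:       ~{windows_usd:.0f} USD / month")
```

Under these assumptions linux+freebsd lands at about 100 USD, matching the figure above. Windows comes out near 264 USD, slightly above the ~255 quoted; the message does not state the exact inputs used, so the small gap is unexplained here.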

Given the cost of macos, it seems like it'd be by far the most affordable
to just buy 1-2 mac minis (2x ~660USD) and stick them on a shelf somewhere, as
persistent runners. Cirrus has builtin macos virtualization support - but can
only host two VMs on each mac, due to macos licensing restrictions. A single
mac mini would suffice to keep up with our unoptimized monthly runtime
(although there likely would be some overhead).

Greetings,

Andres Freund

[1] https://cloud.google.com/compute/all-pricing



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
"Tristan Partin"
Date:
On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> like the following (depends on caching etc):
>
> task costs in credits
>     linux-sanity: 0.01
>     linux-compiler-warnings: 0.05
>     linux-meson: 0.07
>     freebsd   : 0.08
>     linux-autoconf: 0.09
>     windows   : 0.18
>     macos     : 0.28
> total task runtime is 40.8
> cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month

I am not in the loop on the autotools vs meson stuff. How much longer do
we anticipate keeping autotools around? Seems like it could be a good
opportunity to reduce some CI usage if autotools were finally dropped,
but I know there are still outstanding tasks to complete.

Back of the napkin math says autotools is about 12% of the credit cost,
though I haven't looked to see if linux-meson and linux-autotools are
1:1.

--
Tristan Partin
Neon (https://neon.tech)



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:
> On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> > FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> > like the following (depends on caching etc):
> > 
> > task costs in credits
> >     linux-sanity: 0.01
> >     linux-compiler-warnings: 0.05
> >     linux-meson: 0.07
> >     freebsd   : 0.08
> >     linux-autoconf: 0.09
> >     windows   : 0.18
> >     macos     : 0.28
> > total task runtime is 40.8
> > cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
> 
> I am not in the loop on the autotools vs meson stuff. How much longer do we
> anticipate keeping autotools around?

I think it depends in what fashion. We've been talking about supporting
building out-of-tree modules with "pgxs" for at least a 5 year support
window. But the replacement isn't yet finished [1], so that clock hasn't yet
started ticking.


> Seems like it could be a good opportunity to reduce some CI usage if
> autotools were finally dropped, but I know there are still outstanding tasks
> to complete.
> 
> Back of the napkin math says autotools is about 12% of the credit cost,
> though I haven't looked to see if linux-meson and linux-autotools are 1:1.

The autoconf task is actually doing quite useful stuff right now, leaving the
use of configure aside, as it builds with address sanitizer. Without that it'd
be a lot faster. But we'd lose, imo quite important, coverage. The tests
would run a bit faster with meson, but it'd be overall a difference on the
margins.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/tree/meson-pkgconfig



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
"Tristan Partin"
Date:
On Tue Aug 8, 2023 at 10:38 AM CDT, Andres Freund wrote:
> Hi,
>
> On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:
> > On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> > > FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> > > like the following (depends on caching etc):
> > >
> > > task costs in credits
> > >     linux-sanity: 0.01
> > >     linux-compiler-warnings: 0.05
> > >     linux-meson: 0.07
> > >     freebsd   : 0.08
> > >     linux-autoconf: 0.09
> > >     windows   : 0.18
> > >     macos     : 0.28
> > > total task runtime is 40.8
> > > cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
> >
> > I am not in the loop on the autotools vs meson stuff. How much longer do we
> > anticipate keeping autotools around?
>
> I think it depends in what fashion. We've been talking about supporting
> building out-of-tree modules with "pgxs" for at least a 5 year support
> window. But the replacement isn't yet finished [1], so that clock hasn't yet
> started ticking.
>
>
> > Seems like it could be a good opportunity to reduce some CI usage if
> > autotools were finally dropped, but I know there are still outstanding tasks
> > to complete.
> >
> > Back of the napkin math says autotools is about 12% of the credit cost,
> > though I haven't looked to see if linux-meson and linux-autotools are 1:1.
>
> The autoconf task is actually doing quite useful stuff right now, leaving the
> use of configure aside, as it builds with address sanitizer. Without that it'd
> be a lot faster. But we'd lose, imo quite important, coverage. The tests
> would run a bit faster with meson, but it'd be overall a difference on the
> margins.
>
> [1] https://github.com/anarazel/postgres/tree/meson-pkgconfig

Makes sense. Please let me know if I can help you out in any way for the
v17 development cycle besides what we have already talked about.

--
Tristan Partin
Neon (https://neon.tech)



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Alvaro Herrera
Date:
On 2023-Aug-08, Andres Freund wrote:

> Given the cost of macos, it seems like it'd be by far the most affordable
> to just buy 1-2 mac minis (2x ~660USD) and stick them on a shelf somewhere, as
> persistent runners. Cirrus has builtin macos virtualization support - but can
> only host two VMs on each mac, due to macos licensing restrictions. A single
> mac mini would suffice to keep up with our unoptimized monthly runtime
> (although there likely would be some overhead).

If using persistent workers is an option, maybe we should explore that.
I think we could move all or some of the Linux - Debian builds to
hardware that we already have in shelves (depending on how much compute
power is really needed.)

I think using other OSes is more difficult, mostly because I doubt we
want to deal with licenses; but even FreeBSD might not be a realistic
option, at least not in the short term.

Still,

>                    task_name                    |    sum
> ------------------------------------------------+------------
>  FreeBSD - 13 - Meson                           | 1017:56:09
>  Windows - Server 2019, MinGW64 - Meson         | 00:00:00
>  SanityCheck                                    | 76:48:41
>  macOS - Ventura - Meson                        | 873:12:43
>  Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
>  Linux - Debian Bullseye - Autoconf             | 830:17:26
>  Linux - Debian Bullseye - Meson                | 860:37:21
>  CompilerWarnings                               | 935:30:35
> (8 rows)
>

moving just Debian, that might alleviate 76+830+860+935 hours from the
Cirrus infra, which is ~46%.  Not bad.
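
The ~46% figure can be checked with whole hours from the quoted table (a minimal sketch; treating SanityCheck and CompilerWarnings as Debian tasks follows the sum above):

```python
# Whole hours from the quoted table.
debian_hours = 76 + 830 + 860 + 935  # SanityCheck, Autoconf, Meson, CompilerWarnings
other_hours = 1017 + 873 + 1251      # FreeBSD, macOS, Windows (VS 2019)

debian_share = debian_hours / (debian_hours + other_hours)
print(f"Debian share of runtime: {debian_share:.0%}")
```

Note that this is a share of runtime, not of cost, since the per-hour price differs by platform.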


(How come Windows - Meson reports allballs?)

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso"
                             (Carlos Caszeli)



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-08 18:34:58 +0200, Alvaro Herrera wrote:
> On 2023-Aug-08, Andres Freund wrote:
> 
> > Given the cost of macos, it seems like it'd be by far the most affordable
> > to just buy 1-2 mac minis (2x ~660USD) and stick them on a shelf somewhere, as
> > persistent runners. Cirrus has builtin macos virtualization support - but can
> > only host two VMs on each mac, due to macos licensing restrictions. A single
> > mac mini would suffice to keep up with our unoptimized monthly runtime
> > (although there likely would be some overhead).
> 
> If using persistent workers is an option, maybe we should explore that.
> I think we could move all or some of the Linux - Debian builds to
> hardware that we already have in shelves (depending on how much compute
> power is really needed.)

(76+830+860+935)/((365/12)*24) = 3.7

3.7 instances with 4 "vcores" are busy 100% of the time. So we'd need at least
~16 cpu threads - I think cirrus sometimes uses instances that disable HT, so
it'd perhaps be 16 cores actually.
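
Spelled out as a small sketch (the 4 vCPUs per instance come from the text above; rounding up to whole instances is an assumption about how the ~16 figure was reached):

```python
import math

busy_hours = 76 + 830 + 860 + 935      # monthly Debian task hours
hours_per_month = (365 / 12) * 24      # 730 hours in an average month
VCPUS_PER_INSTANCE = 4

instances_busy = busy_hours / hours_per_month
threads_needed = math.ceil(instances_busy) * VCPUS_PER_INSTANCE

print(f"instances busy 100% of the time: {instances_busy:.1f}")
print(f"cpu threads needed: {threads_needed}")
```

This is a steady-state average; it says nothing about peak load when many patches are tested at once.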


> I think using other OSes is more difficult, mostly because I doubt we
> want to deal with licenses; but even FreeBSD might not be a realistic
> option, at least not in the short term.

They can be VMs, so that shouldn't be a big issue.

> >                    task_name                    |    sum
> > ------------------------------------------------+------------
> >  FreeBSD - 13 - Meson                           | 1017:56:09
> >  Windows - Server 2019, MinGW64 - Meson         | 00:00:00
> >  SanityCheck                                    | 76:48:41
> >  macOS - Ventura - Meson                        | 873:12:43
> >  Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
> >  Linux - Debian Bullseye - Autoconf             | 830:17:26
> >  Linux - Debian Bullseye - Meson                | 860:37:21
> >  CompilerWarnings                               | 935:30:35
> > (8 rows)
> >
> 
> moving just Debian, that might alleviate 76+830+860+935 hours from the
> Cirrus infra, which is ~46%.  Not bad.
> 
> 
> (How come Windows - Meson reports allballs?)

It's mingw64, which we've marked as "manual", because we didn't have the cpu
cycles to run it.

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:
> On 08/08/2023 05:15, Andres Freund wrote:
> > With the improvements detailed below, cirrus' free CI would last
> > about ~65 runs / month.
>
> I think that's plenty.

Not so sure, I would regularly exceed it, I think. But it definitely will
suffice for more casual contributors.


> > Potential paths forward for cfbot, in addition to the above:
> > 
> > - Pay for compute / ask the various cloud providers to grant us compute
> >    credits. At least some of the cloud providers can be used via cirrus-ci.
> > 
> > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> >    could provide significant savings.
> > 
> > - Build our own system, using buildbot, jenkins or whatnot.
> > 
> > 
> > Opinions as to what to do?
> 
> The resources for running our own system isn't free either. I'm sure we can
> get sponsors for the cirrus-ci credits, or use donations.

As outlined in my reply to Alvaro, just using credits likely is financially
not viable...


> > 5) Move use of -Dsegsize_blocks=6 from macos to linux
> > 
> >     Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
> >     could stop covering both meson and autoconf segsize_blocks. It does affect
> >     runtime on linux as well.
> 
> Could we have a comment somewhere on why we use -Dsegsize_blocks on these
> particular CI runs? It seems pretty random. I guess the idea is to have one
> autoconf task and one meson task with that option, to check that the
> autoconf/meson option works?

Hm, some of that was in the commit message, but I should have added it to
.cirrus.yml as well.

Normally, the "relation segment" code basically has no coverage in our tests,
because we (quite reasonably) don't generate tables large enough. We've had
plenty of bugs that we didn't notice due to the code not being exercised much.
So it seemed useful to add CI coverage, by making the segments very small.
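
A minimal sketch of the effect, assuming the default 8 kB block size and the default segment size of 1 GB (131072 blocks). The function `segment_count` is a simplified model for illustration that only counts main-fork data and ignores other relation forks.

```python
import math

BLOCK_SIZE = 8192                # default block size in bytes
DEFAULT_SEGSIZE_BLOCKS = 131072  # default segment size: 1 GB of 8 kB blocks
CI_SEGSIZE_BLOCKS = 6            # value used by the CI tasks discussed here


def segment_count(table_bytes: int, segsize_blocks: int) -> int:
    """Number of segment files needed for table_bytes of data (simplified)."""
    blocks = math.ceil(table_bytes / BLOCK_SIZE)
    return math.ceil(blocks / segsize_blocks)


one_mb = 1024 * 1024
print(segment_count(one_mb, DEFAULT_SEGSIZE_BLOCKS))  # a single segment
print(segment_count(one_mb, CI_SEGSIZE_BLOCKS))       # many 48 kB segments
```

With 6-block segments, even a 1 MB test table spans 22 segment files, so the multi-segment code paths get exercised by ordinary regression tests.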

I chose the tasks by looking at how long they took at the time, I
think. Adding them to the slower ones.


> > 6) Disable write cache flushes on windows
> > 
> >     It's a bit ugly to do this without using the UI... Shaves off about 30s
> >     from the tests.
> 
> A brief comment would be nice: "We don't care about persistence over hard
> crashes in the CI, so disable write cache flushes to speed it up."

Turns out that patch doesn't work on its own anyway, at least not
reliably... I tested it by interactively logging into a windows vm and testing
it there. It doesn't actually seem to suffice when run in isolation, because
the relevant registry key doesn't yet exist. I haven't yet figured out the
magic incantations for adding the missing "intermediary", but I'm getting
there...

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Robert Treat
Date:
On Tue, Aug 8, 2023 at 9:26 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:
> > On 08/08/2023 05:15, Andres Freund wrote:
> > > With the improvements detailed below, cirrus' free CI would last
> > > about ~65 runs / month.
> >
> > I think that's plenty.
>
> Not so sure, I would regularly exceed it, I think. But it definitely will
> suffice for more casual contributors.
>
>
> > > Potential paths forward for cfbot, in addition to the above:
> > >
> > > - Pay for compute / ask the various cloud providers to grant us compute
> > >    credits. At least some of the cloud providers can be used via cirrus-ci.
> > >
> > > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> > >    could provide significant savings.
> > >
> > > - Build our own system, using buildbot, jenkins or whatnot.
> > >
> > >
> > > Opinions as to what to do?
> >
> > The resources for running our own system isn't free either. I'm sure we can
> > get sponsors for the cirrus-ci credits, or use donations.
>
> As outlined in my reply to Alvaro, just using credits likely is financially
> not viable...
>
>

In case it's helpful, from an SPI oriented perspective, $7K/month is
probably an order of magnitude more than what we can sustain, so I
don't see a way to make that work without some kind of additional
magic that includes other non-profits and/or commercial companies
changing donation habits between now and September.

Purchasing a couple of mac minis (and/or similar gear) would be near
trivial though, just a matter of figuring out where/how to host it
(but I think infra can chime in on that if that's what gets decided).

The other likely option would be to seek out cloud credits from one of
the big three (or others); Amazon has continually said they would be
happy to donate more credits to us if we had a use, and I think some
of the other hosting providers have said similarly at times; so we'd
need to ask and hope it's not too bureaucratic.

Robert Treat
https://xzilla.net



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-08 22:29:50 -0400, Robert Treat wrote:
> In case it's helpful, from an SPI oriented perspective, $7K/month is
> probably an order of magnitude more than what we can sustain, so I
> don't see a way to make that work without some kind of additional
> magic that includes other non-profits and/or commercial companies
> changing donation habits between now and September.

Yea, I think that'd make no sense, even if we could afford it. I think the
patches I've written should drop it to 1/2 already. Thomas added some
throttling to push it down further.


> Purchasing a couple of mac minis (and/or similar gear) would be near
> trivial though, just a matter of figuring out where/how to host it
> (but I think infra can chime in on that if that's what gets decided).

Cool. Because of the limitation of running two VMs at a time on macos and the
comparatively low cost of mac minis, it seems they beat alternative models by
a fair bit.

Pginfra/sysadmin: ^


Based on being off by an order of magnitude, as you mention earlier, it seems
that:

1) reducing test runtime etc, as already in progress
2) getting 2 mac minis as runners
3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)

Would be viable for a month or three? I hope we can get some cloud providers
to chip in for 3), but I'd like to have something in place that doesn't depend
on that.

Given the cost of macos VMs at AWS, the only one of the big cloud providers to
have macos instances, I think we'd burn pointlessly quickly through credits if
we used VMs for that.

(*) I think we should be able to get below that, but ...


> The other likely option would be to seek out cloud credits from one of
> the big three (or others); Amazon has continually said they would be
> happy to donate more credits to us if we had a use, and I think some
> of the other hosting providers have said similarly at times; so we'd
> need to ask and hope it's not too bureaucratic.

Yep.

I tried to start that process within microsoft, fwiw.  Perhaps Joe and
Jonathan know how to start within AWS?  And perhaps Noah inside GCP?

It'd be the least work to get it up and running in GCP, as it's already
running there, but should be quite doable at the others as well.

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Juan José Santamaría Flecha
Date:

On Wed, Aug 9, 2023 at 3:26 AM Andres Freund <andres@anarazel.de> wrote:

> > > 6) Disable write cache flushes on windows
> > >
> > >     It's a bit ugly to do this without using the UI... Shaves off about 30s
> > >     from the tests.
> >
> > A brief comment would be nice: "We don't care about persistence over hard
> > crashes in the CI, so disable write cache flushes to speed it up."
>
> Turns out that patch doesn't work on its own anyway, at least not
> reliably... I tested it by interactively logging into a windows vm and testing
> it there. It doesn't actually seem to suffice when run in isolation, because
> the relevant registry key doesn't yet exist. I haven't yet figured out the
> magic incantations for adding the missing "intermediary", but I'm getting
> there...


You can find a good example on how to accomplish this in:


Regards,

Juan José Santamaría Flecha 

Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Alvaro Herrera
Date:
Hello

So pginfra had a little chat about this.  Firstly, there's consensus
that it makes sense for pginfra to help out with some persistent workers
in our existing VM system; however there are some aspects that need
some further discussion, to avoid destabilizing the rest of the
infrastructure.  We're looking into it and we'll let you know.

Hosting a couple of Mac Minis is definitely a possibility, if some
entity like SPI buys them.  Let's take this off-list to arrange the
details.

Regards

-- 
Álvaro Herrera



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Noah Misch
Date:
On Tue, Aug 08, 2023 at 07:59:55PM -0700, Andres Freund wrote:
> On 2023-08-08 22:29:50 -0400, Robert Treat wrote:

> 3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)

> > The other likely option would be to seek out cloud credits

> I tried to start that process within microsoft, fwiw.  Perhaps Joe and
> Jonathan know how to start within AWS?  And perhaps Noah inside GCP?
> 
> It'd be the least work to get it up and running in GCP, as it's already
> running there

I'm looking at this.  Thanks for bringing it to my attention.



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:
> As some of you might have seen when running CI, cirrus-ci is restricting how
> much CI cycles everyone can use for free (announcement at [1]). This takes
> effect September 1st.
>
> This obviously has consequences both for individual users of CI as well as
> cfbot.
>
> [...]

> Potential paths forward for individual CI:
>
> - migrate wholesale to another CI provider
>
> - split CI tasks across different CI providers, rely on github et al
>   displaying the CI status for different platforms
>
> - give up
>
>
> Potential paths forward for cfbot, in addition to the above:
>
> - Pay for compute / ask the various cloud providers to grant us compute
>   credits. At least some of the cloud providers can be used via cirrus-ci.
>
> - Host (some) CI runners ourselves. Particularly with macos and windows, that
>   could provide significant savings.

To make that possible, we need to make the compute resources for CI
configurable on a per-repository basis.  After experimenting with a bunch of
ways to do that, I got stuck on that for a while. But as of today we have
sufficient macos runners for cfbot available, so... I think the approach I
finally settled on is decent, although not great. It's described in the "main"
commit message:
    ci: Prepare to make compute resources for CI configurable

    cirrus-ci will soon restrict the amount of free resources every user gets (as
    have many other CI providers). For most users of CI that should not be an
    issue. But e.g. for cfbot it will be an issue.

    To allow configuring different resources on a per-repository basis, introduce
    infrastructure for overriding the task execution environment. Unfortunately
    this is not entirely trivial, as yaml anchors have to be defined before their
    use, and cirrus-ci only allows injecting additional contents at the end of
    .cirrus.yml.

    To deal with that, move the definition of the CI tasks to
    .cirrus.tasks.yml. The main .cirrus.yml is loaded first, then, if defined, the
    file referenced by the REPO_CI_CONFIG_GIT_URL variable, will be added,
    followed by the contents of .cirrus.tasks.yml. That allows
    REPO_CI_CONFIG_GIT_URL to override the yaml anchors defined in .cirrus.yml.

    Unfortunately git's default merge / rebase strategy does not handle copied
    files, just renamed ones. To avoid painful rebasing over this change, this
    commit just renames .cirrus.yml to .cirrus.tasks.yml, without adding a new
    .cirrus.yml. That's done in the followup commit, which moves the relevant
    portion of .cirrus.tasks.yml to .cirrus.yml.  Until that is done,
    REPO_CI_CONFIG_GIT_URL does not fully work.

    The subsequent commit adds documentation for how to configure custom compute
    resources to src/tools/ci/README

    Discussion: https://postgr.es/m/20230808021541.7lbzdefvma7qmn3w@awork3.anarazel.de
    Backpatch: 15-, where CI support was added


I don't love moving most of the contents of .cirrus.yml into a new file, but I
don't see another way. I did implement it without that as well (see [1]), but
that ends up considerably harder to understand, and hardcodes what cfbot
needs.  Splitting the commit, as explained above, at least makes git rebase
fairly painless. FWIW, I did merge the changes into 15, with only reasonable
conflicts (due to new tasks, autoconf->meson).
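
The composition order from the commit message can be sketched as below. This is an illustrative Python model, not the actual .cirrus.star code: compose_ci_config, the anchor name "vm" and the image names are made up for the example. It relies on a YAML alias resolving to the most recent preceding anchor of that name, which is why the override has to land between .cirrus.yml and .cirrus.tasks.yml.

```python
from typing import Optional


def compose_ci_config(main_yml: str, repo_override: Optional[str], tasks_yml: str) -> str:
    """Concatenate the pieces in the order the commit message describes."""
    parts = [main_yml]
    if repo_override is not None:
        # Loaded after .cirrus.yml, so it can redefine anchors set there ...
        parts.append(repo_override)
    else:
        parts.append("# REPO_CI_CONFIG_GIT_URL was not set\n")
    # ... and before the tasks, which only reference the anchors.
    parts.append(tasks_yml)
    return "\n".join(parts)


# Hypothetical contents: the default anchor, a per-repository override
# reusing the same anchor name, and a task that refers to the anchor.
main_yml = "default_vm: &vm\n  image: default-image\n"
override = "custom_vm: &vm\n  image: cfbot-image\n"
tasks_yml = "task:\n  compute: *vm\n"

combined = compose_ci_config(main_yml, override, tasks_yml)
print(combined)
```

With the override present, the alias in the tasks section follows the second definition of the anchor; without it, the tasks fall back to the default from .cirrus.yml.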


A prerequisite commit converts "SanityCheck" and "CompilerWarnings" to use a
full VM instead of a container - that way providing custom compute resources
doesn't have to deal with containers in addition to VMs. It also looks like
the increased startup overhead is outweighed by the reduction in runtime
overhead.


I'm hoping to push this fairly soon, as I'll be on vacation the last week of
August. I'll be online intermittently though, if there are issues, I can react
(very limited connectivity for midday Aug 29th - midday Aug 31st though). I'd
appreciate a quick review or two.


Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/commit/b95fd302161b951f1dc14d586162ed3d85564bfc

Attachment

Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Daniel Gustafsson
Date:
> On 23 Aug 2023, at 08:58, Andres Freund <andres@anarazel.de> wrote:

> I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> August. I'll be online intermittently though, if there are issues, I can react
> (very limited connectivity for midday Aug 29th - midday Aug 31st though). I'd
> appreciate a quick review or two.

I've been reading over these and the thread, and while not within my area of
expertise, nothing really sticks out.

I'll do another pass, but below are a few small comments so far.

I don't know Windows well enough to know the implications, but should the below file have
some sort of warning about not doing that for production/shared systems, only
for dedicated test instances?

+++ b/src/tools/ci/windows_write_cache.ps1
@@ -0,0 +1,20 @@
+# Define the write cache to be power protected. This reduces the rate of cache
+# flushes, which seems to help metadata heavy workloads on NTFS. We're just
+# testing here anyway, so ...
+#
+# Let's do so for all disks, this could be useful beyond cirrus-ci.

One thing in 0010 caught my eye, and while not introduced in this patchset it
might be of interest here.  In the below hunks we loop X ticks around
system(psql), with the loop assuming the server can come up really quickly and
sleeping if it doesn't.  On my systems I always reach the pg_usleep after
failing the check, but if I reverse the check such that it first sleeps and then
checks I only need to check once instead of twice.

@@ -2499,7 +2502,7 @@ regression_main(int argc, char *argv[],
         else
             wait_seconds = 60;

-        for (i = 0; i < wait_seconds; i++)
+        for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
         {
             /* Done if psql succeeds */
             fflush(NULL);
@@ -2519,7 +2522,7 @@ regression_main(int argc, char *argv[],
                      outputdir);
             }

-            pg_usleep(1000000L);
+            pg_usleep(1000000L / WAITS_PER_SEC);
         }
         if (i >= wait_seconds)
         {

It's a micro-optimization, but if we're changing things here to chase cycles it
might perhaps be worth doing?
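
To make the two orderings concrete, here is a small simulation in Python (pg_regress itself is C; this only models the loop). The function name, the simulated clock and the 30 ms startup time are assumptions for illustration, not taken from the patch.

```python
def wait_for_server(ready_at_ms, interval_ms, timeout_ms, sleep_first):
    """Simulate the startup wait loop.

    Returns (probes, elapsed_ms), or (probes, None) if the timeout is hit.
    """
    now = 0
    probes = 0
    for _ in range(timeout_ms // interval_ms):
        if sleep_first:
            now += interval_ms      # sleep, then probe
        probes += 1
        if now >= ready_at_ms:      # the probe succeeds once the server is up
            return probes, now
        if not sleep_first:
            now += interval_ms      # probe failed, sleep before retrying
    return probes, None


# Hypothetical server that needs 30 ms before it accepts connections.
print(wait_for_server(30, 1000, 60_000, sleep_first=False))  # current loop, 1s sleep
print(wait_for_server(30, 100, 60_000, sleep_first=False))   # finer granularity
print(wait_for_server(30, 100, 60_000, sleep_first=True))    # sleep before probing
```

In this model the shorter sleep cuts the wait from 1000 ms to 100 ms, and sleeping first saves one failed probe on top of that.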

--
Daniel Gustafsson




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-23 14:48:26 +0200, Daniel Gustafsson wrote:
> > On 23 Aug 2023, at 08:58, Andres Freund <andres@anarazel.de> wrote:
>
> > I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> > August. I'll be online intermittently though, if there are issues, I can react
> > (very limited connectivity for midday Aug 29th - midday Aug 31st though). I'd
> > appreciate a quick review or two.
>
> I've been reading over these and the thread, and while not within my area of
> expertise, nothing really sticks out.

Thanks!


> I'll do another pass, but below are a few small comments so far.
>
> I don't know Windows well enough to know the implications, but should the below file have
> some sort of warning about not doing that for production/shared systems, only
> for dedicated test instances?

Ah, I should have explained that: I'm not planning to apply
- regress: Check for postgres startup completion more often
- ci: windows: Disabling write cache flushing during test
right now. Compared to the other patches the wins are much smaller and/or more
work is needed to make them good.

I think it might be worth going for
- ci: switch tasks to debugoptimized build
because that provides a fair bit of gain. But it might be more hurtful than
helpful due to costing more when ccache doesn't work...


> +++ b/src/tools/ci/windows_write_cache.ps1
> @@ -0,0 +1,20 @@
> +# Define the write cache to be power protected. This reduces the rate of cache
> +# flushes, which seems to help metadata heavy workloads on NTFS. We're just
> +# testing here anyway, so ...
> +#
> +# Let's do so for all disks, this could be useful beyond cirrus-ci.
>
> One thing in 0010 caught my eye, and while not introduced in this patchset it
> might be of interest here.  In the below hunks we loop X ticks around
> system(psql), with the loop assuming the server can come up really quickly and
> sleeping if it doesn't.  On my systems I always reach the pg_usleep after
> failing the check, but if I reverse the check such that it first sleeps and then
> checks I only need to check once instead of twice.

I think there are more effective ways to make this cheaper. The basic thing
would be to use libpq instead of forking psql to make a connection
check. Medium term, I think we should invent a way for pg_ctl and other
tooling (including pg_regress) to wait for the service to come up. E.g. having
a named pipe that postmaster opens once the server is up, which should allow
multiple clients to use select/epoll/... to wait for it without looping.

ISTM making pg_regress use libpq w/ PQping() should be a pretty simple patch?
The non-polling approach obviously is even better, but also requires more
thought (and documentation and ...).
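
The non-polling idea can be illustrated roughly as follows. This is only a sketch of the waiting side: it uses a socketpair as a stand-in for the named pipe, and fake_postmaster and wait_until_readable are hypothetical names. Nothing like this exists in postmaster today.

```python
import select
import socket
import threading
import time


def wait_until_readable(sock, timeout_s):
    """Block in select() until sock is readable or the timeout expires."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return bool(readable)


def fake_postmaster(notify_sock, startup_s):
    """Pretend to start up, then signal readiness by writing one byte."""
    time.sleep(startup_s)
    notify_sock.sendall(b"1")


# A waiter that gets notified: no retry loop, no fixed sleep interval.
waiter, notifier = socket.socketpair()
server = threading.Thread(target=fake_postmaster, args=(notifier, 0.05))
server.start()
ready = wait_until_readable(waiter, timeout_s=5.0)
server.join()
print("server ready:", ready)

# A waiter whose server never signals simply runs into the timeout.
idle_waiter, idle_notifier = socket.socketpair()
timed_out = wait_until_readable(idle_waiter, timeout_s=0.01)
print("idle server ready:", timed_out)

for s in (waiter, notifier, idle_waiter, idle_notifier):
    s.close()
```

The waiter wakes up as soon as the notification arrives, so the latency is bounded by the server's startup time rather than by a polling interval.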


> @@ -2499,7 +2502,7 @@ regression_main(int argc, char *argv[],
>          else
>              wait_seconds = 60;
>
> -        for (i = 0; i < wait_seconds; i++)
> +        for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
>          {
>              /* Done if psql succeeds */
>              fflush(NULL);
> @@ -2519,7 +2522,7 @@ regression_main(int argc, char *argv[],
>                       outputdir);
>              }
>
> -            pg_usleep(1000000L);
> +            pg_usleep(1000000L / WAITS_PER_SEC);
>          }
>          if (i >= wait_seconds)
>          {
>
> It's a micro-optimization, but if we're changing things here to chase cycles it
> might perhaps be worth doing?

I wouldn't quite call not waiting for 1s for the server to start, when it does
so within a few ms, chasing cycles ;). For short tests it's a substantial
fraction of the overall runtime...

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Daniel Gustafsson
Date:
> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-23 14:48:26 +0200, Daniel Gustafsson wrote:

>> I'll do another pass, but below are a few small comments so far.
>>
>> I don't know Windows well enough to know the implications, but should the below file have
>> some sort of warning about not doing that for production/shared systems, only
>> for dedicated test instances?
>
> Ah, I should have explained that: I'm not planning to apply
> - regress: Check for postgres startup completion more often
> - ci: windows: Disabling write cache flushing during test
> right now. Compared to the other patches the wins are much smaller and/or more
> work is needed to make them good.
>
> I think it might be worth going for
> - ci: switch tasks to debugoptimized build
> because that provides a fair bit of gain. But it might be more hurtful than
> helpful due to costing more when ccache doesn't work...

Gotcha.

>> +++ b/src/tools/ci/windows_write_cache.ps1
>> @@ -0,0 +1,20 @@
>> +# Define the write cache to be power protected. This reduces the rate of cache
>> +# flushes, which seems to help metadata heavy workloads on NTFS. We're just
>> +# testing here anyway, so ...
>> +#
>> +# Let's do so for all disks, this could be useful beyond cirrus-ci.
>>
>> One thing in 0010 caught my eye, and while not introduced in this patchset it
>> might be of interest here.  In the below hunks we loop X ticks around
>> system(psql), with the loop assuming the server can come up really quickly and
>> sleeping if it doesn't.  On my systems I always reach the pg_usleep after
>> failing the check, but if I reverse the check such that it first sleeps and then
>> checks I only need to check once instead of twice.
>
> I think there are more effective ways to make this cheaper. The basic thing
> would be to use libpq instead of forking psql to make a connection
> check.

I had it in my head that not using libpq in pg_regress was a deliberate choice,
but I fail to find a reference to it in the archives.

>> It's a micro-optimization, but if we're changing things here to chase cycles it
>> might perhaps be worth doing?
>
> I wouldn't quite call not waiting for 1s for the server to start, when it does
> so within a few ms, chasing cycles ;). For short tests it's a substantial
> fraction of the overall runtime...

Absolutely, I was referring to shifting the sleep before the test to avoid the
extra test, not the reduction of the pg_usleep.  Reducing the sleep is a clear
win.

--
Daniel Gustafsson




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Nazir Bilal Yavuz
Date:
Hi,

Thanks for the patch!

On Wed, 23 Aug 2023 at 09:58, Andres Freund <andres@anarazel.de> wrote:
> I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> August. I'll be online intermittently though, if there are issues, I can react
> (very limited connectivity for midday Aug 29th - midday Aug 31st though). I'd
> appreciate a quick review or two.

Patch looks good to me besides some minor points.

v3-0004-ci-Prepare-to-make-compute-resources-for-CI-confi.patch:
diff --git a/.cirrus.star b/.cirrus.star
+    """The main function is executed by cirrus-ci after loading .cirrus.yml and can
+    extend the CI definition further.
+
+    As documented in .cirrus.yml, the final CI configuration is composed of
+
+    1) the contents of this file

Instead of the comment '1) the contents of this file', '1) the contents
of the .cirrus.yml file' could be better, since this comment appears in
the .cirrus.star file.

+    if repo_config_url != None:
+        print("loading additional configuration from \"{}\"".format(repo_config_url))
+        output += config_from(repo_config_url)
+    else:
+        output += "n# REPO_CI_CONFIG_URL was not set\n"

Possible typo at output += "n# REPO_CI_CONFIG_URL was not set\n".

v3-0008-ci-switch-tasks-to-debugoptimized-build.patch:
While thinking about possible optimizations, I wondered whether we could
create something like 'buildtype: xxx' to override the default buildtype
using .cirrus.star. This could be better for PG developers. That could
certainly be the subject of another patch.

Regards,
Nazir Bilal Yavuz
Microsoft



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Tom Lane
Date:
Daniel Gustafsson <daniel@yesql.se> writes:
> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
>> I think there are more effective ways to make this cheaper. The basic thing
>> would be to use libpq instead of forking psql to make a connection
>> check.

> I had it in my head that not using libpq in pg_regress was a deliberate choice,
> but I fail to find a reference to it in the archives.

I have a vague feeling that you are right about that.  Perhaps the
concern was that under "make installcheck", pg_regress might be
using a build-tree copy of libpq rather than the one from the
system under test.  As long as we're just trying to ping the server,
that shouldn't matter too much I think ... unless we hit problems
with, say, a different default port number or socket path compiled into
one copy vs. the other?  That seems like it's probably a "so don't
do that" case, though.

            regards, tom lane



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Daniel Gustafsson
Date:
> On 23 Aug 2023, at 23:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
>>> I think there are more effective ways to make this cheaper. The basic thing
>>> would be to use libpq instead of forking psql to make a connection
>>> check.
>
>> I had it in my head that not using libpq in pg_regress was a deliberate choice,
>> but I fail to find a reference to it in the archives.
>
> I have a vague feeling that you are right about that.  Perhaps the
> concern was that under "make installcheck", pg_regress might be
> using a build-tree copy of libpq rather than the one from the
> system under test.  As long as we're just trying to ping the server,
> that shouldn't matter too much I think ... unless we hit problems
> with, say, a different default port number or socket path compiled into
> one copy vs. the other?  That seems like it's probably a "so don't
> do that" case, though.

Ah yes, that does ring a familiar bell.  I agree that using it for pinging the
server should be safe either way, but we should document the use-with-caution
in pg_regress.c if/when we go down that path.  I'll take a stab at changing the
psql retry loop for pinging tomorrow to see what it would look like.

--
Daniel Gustafsson




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-23 17:02:51 -0400, Tom Lane wrote:
> Daniel Gustafsson <daniel@yesql.se> writes:
> > On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
> >> I think there are more effective ways to make this cheaper. The basic thing
> >> would be to use libpq instead of forking psql to make a connection
> >> check.
>
> > I had it in my head that not using libpq in pg_regress was a deliberate choice,
> > but I fail to find a reference to it in the archives.
>
> I have a vague feeling that you are right about that.  Perhaps the
> concern was that under "make installcheck", pg_regress might be
> using a build-tree copy of libpq rather than the one from the
> system under test.  As long as we're just trying to ping the server,
> that shouldn't matter too much I think

Or perhaps the opposite? That an installcheck pg_regress run might use the
system libpq, which doesn't have the symbols, or such?

Either way, with a function like PQping(), which exists in branches well
beyond the supported ones, that shouldn't be an issue?


> ... unless we hit problems with, say, a different default port number or
> socket path compiled into one copy vs. the other?  That seems like it's
> probably a "so don't do that" case, though.

If we were to find such a case, it seems we could just add whatever missing
parameter to the connection string? I think we would likely already hit such
problems though, the psql started by an installcheck pg_regress might use the
system libpq, I think?
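
As a rough illustration of "adding the missing parameter to the connection string": spelling out host and port explicitly means the compiled-in defaults of whichever libpq gets loaded no longer matter for those settings. The helper names, socket directory and port below are made up; the quoting follows the libpq keyword/value rules (single-quote values that are empty or contain whitespace, backslash-escape quotes and backslashes).

```python
def quote_conninfo_value(value):
    """Quote a value for a libpq keyword=value connection string."""
    text = str(value)
    if text and not any(ch.isspace() or ch in "'\\" for ch in text):
        return text  # plain values need no quoting
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_conninfo(**params):
    """Build a connection string that names every parameter explicitly."""
    return " ".join(f"{key}={quote_conninfo_value(val)}" for key, val in params.items())


# Hypothetical values for a temporary test instance.
conninfo = build_conninfo(host="/tmp/pg_regress-sock", port=58312, dbname="postgres")
print(conninfo)
```

This only covers parameters the caller knows to set, so it does not address Tom's concern about not knowing an installed libpq's defaults in the installcheck case.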

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-23 23:55:15 +0300, Nazir Bilal Yavuz wrote:
> On Wed, 23 Aug 2023 at 09:58, Andres Freund <andres@anarazel.de> wrote:
> > I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> > August. I'll be online intermittently though, if there are issues, I can react
> > (very limited connectivity for midday Aug 29th - midday Aug 31st though). I'd
> > appreciate a quick review or two.
> 
> Patch looks good to me besides some minor points.

Thanks for looking!


> v3-0004-ci-Prepare-to-make-compute-resources-for-CI-confi.patch:
> diff --git a/.cirrus.star b/.cirrus.star
> +    """The main function is executed by cirrus-ci after loading .cirrus.yml and can
> +    extend the CI definition further.
> +
> +    As documented in .cirrus.yml, the final CI configuration is composed of
> +
> +    1) the contents of this file
> 
> Instead of '1) the contents of this file' comment,  '1) the contents
> of .cirrus.yml file' could be better since this comment appears in
> .cirrus.star file.

Good catch.

> +    if repo_config_url != None:
> +        print("loading additional configuration from \"{}\"".format(repo_config_url))
> +        output += config_from(repo_config_url)
> +    else:
> +        output += "n# REPO_CI_CONFIG_URL was not set\n"
> 
> Possible typo at output += "n# REPO_CI_CONFIG_URL was not set\n".

Fixed.


> v3-0008-ci-switch-tasks-to-debugoptimized-build.patch:
> Just thinking of possible optimizations and thought can't we create
> something like 'buildtype: xxx' to override default buildtype using
> .cirrus.star? This could be better for PG developers. For sure that
> could be the subject of another patch.

We could, but I'm not sure what the use would be?

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2023-08-23 17:02:51 -0400, Tom Lane wrote:
>> ... unless we hit problems with, say, a different default port number or
>> socket path compiled into one copy vs. the other?  That seems like it's
>> probably a "so don't do that" case, though.

> If we were to find such a case, it seems we could just add whatever missing
> parameter to the connection string? I think we would likely already hit such
> problems though, the psql started by an installcheck pg_regress might use the
> system libpq, I think?

The trouble with that approach is that in "make installcheck", we
don't really want to assume we know what the installed libpq's default
connection parameters are.  So we don't explicitly know where that
libpq will connect.

As I said, we might be able to start treating installed-libpq-not-
compatible-with-build as a "don't do it" case.  Another idea is to try
to ensure that pg_regress uses the same libpq that the psql-under-test
does; but I'm not sure how to implement that.

            regards, tom lane



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-23 17:55:53 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-08-23 17:02:51 -0400, Tom Lane wrote:
> >> ... unless we hit problems with, say, a different default port number or
> >> socket path compiled into one copy vs. the other?  That seems like it's
> >> probably a "so don't do that" case, though.
>
> > If we were to find such a case, it seems we could just add whatever missing
> > parameter to the connection string? I think we would likely already hit such
> > problems though, the psql started by an installcheck pg_regress might use the
> > system libpq, I think?
>
> The trouble with that approach is that in "make installcheck", we
> don't really want to assume we know what the installed libpq's default
> connection parameters are.  So we don't explicitly know where that
> libpq will connect.

Stepping back: I don't think installcheck matters for the concrete use of
libpq we're discussing - the only time we wait for server startup is the
non-installcheck case.

There are other potential uses for libpq in pg_regress though - I'd e.g. like
to have a "monitoring" session open, which we could use to detect that the
server crashed (by waiting for the FD to become invalid). Where the
connection default issue could matter more?

I was wondering if we could create an unambiguous connection info, but that
seems like it'd be hard to do, without creating cross version hazards.


> As I said, we might be able to start treating installed-libpq-not-
> compatible-with-build as a "don't do it" case.  Another idea is to try
> to ensure that pg_regress uses the same libpq that the psql-under-test
> does; but I'm not sure how to implement that.

I don't think that's likely to work, psql could use a libpq with a different
soversion. We could dlopen() libpq, etc, but that seems way too complicated.


What's the reason we don't force psql to come from the same build as
pg_regress?

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2023-08-23 17:55:53 -0400, Tom Lane wrote:
>> The trouble with that approach is that in "make installcheck", we
>> don't really want to assume we know what the installed libpq's default
>> connection parameters are.  So we don't explicitly know where that
>> libpq will connect.

> Stepping back: I don't think installcheck matters for the concrete use of
> libpq we're discussing - the only time we wait for server startup is the
> non-installcheck case.

Oh, that's an excellent point.  So for the immediately proposed use-case,
there's no issue.  (We don't have a mode where we try to start a server
using already-installed executables.)

> There are other potential uses for libpq in pg_regress though - I'd e.g. like
> to have a "monitoring" session open, which we could use to detect that the
> server crashed (by waiting for the FD to become invalid). Where the
> connection default issue could matter more?

Meh.  I don't find that idea compelling enough to justify adding
restrictions on what test scenarios will work.  It's seldom hard to
tell from the test output whether the server crashed.

> I was wondering if we could create an unambiguous connection info, but that
> seems like it'd be hard to do, without creating cross version hazards.

Hmm, we don't expect the regression test suite to work against other
server versions, so maybe that could be made to work --- that is, we
could run the psql under test and get a full set of connection
parameters out of it?  But I'm still not finding this worth the
trouble.

> What's the reason we don't force psql to come from the same build as
> pg_regress?

Because the point of installcheck is to check the installed binaries
--- including the installed psql and libpq.

(Thinks for a bit...)  Maybe we should add pg_regress to the installed
fileset, and use that copy not the in-tree copy for installcheck?
Then we could assume it's using the same libpq as psql.  IIRC there
have already been suggestions to do that for the benefit of PGXS
testing.

            regards, tom lane



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-23 18:32:26 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > There are other potential uses for libpq in pg_regress though - I'd e.g. like
> > to have a "monitoring" session open, which we could use to detect that the
> > server crashed (by waiting for the FD to become invalid). Where the
> > connection default issue could matter more?
> 
> Meh.  I don't find that idea compelling enough to justify adding
> restrictions on what test scenarios will work.  It's seldom hard to
> tell from the test output whether the server crashed.

I find it pretty painful to wade through a several-megabyte regression.diffs
to find the cause of a crash. I think we ought to use
restart_after_crash=false, since after a crash there's no hope for the tests
to succeed, but even in that case, we end up with a lot of pointless contents
in regression.diffs. If we instead realized that we shouldn't start further
tests, we'd limit that by a fair bit.
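
A toy model of that behaviour, in Python for brevity: once the server is detected as down, the remaining tests are not started, so they add nothing to regression.diffs. run_schedule, server_alive and the fake crash in test "b" are all hypothetical; in pg_regress the liveness check would come from something like the monitoring connection mentioned earlier.

```python
def run_schedule(tests, run_test, server_alive):
    """Run tests in order, but stop starting new ones once the server is gone."""
    results = {}
    for name in tests:
        if not server_alive():
            results[name] = "not run (server down)"
            continue
        results[name] = "ok" if run_test(name) else "failed"
    return results


# Fake server that crashes while running test "b" and does not restart
# (as with restart_after_crash=false).
state = {"alive": True}


def fake_run_test(name):
    if name == "b":
        state["alive"] = False
        return False
    return True


results = run_schedule(["a", "b", "c", "d"], fake_run_test, lambda: state["alive"])
for name, outcome in results.items():
    print(name, outcome)
```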

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Peter Eisentraut
Date:
On 24.08.23 00:56, Andres Freund wrote:
> Hi,
> 
> On 2023-08-23 18:32:26 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> There are other potential uses for libpq in pg_regress though - I'd e.g. like
>>> to have a "monitoring" session open, which we could use to detect that the
>>> server crashed (by waiting for the FD to become invalid). Where the
>>> connection default issue could matter more?
>>
>> Meh.  I don't find that idea compelling enough to justify adding
>> restrictions on what test scenarios will work.  It's seldom hard to
>> tell from the test output whether the server crashed.
> 
> I find it pretty painful to wade through a several-megabyte regression.diffs
> to find the cause of a crash. I think we ought to use
> restart_after_crash=false, since after a crash there's no hope for the tests
> to succeed, but even in that case, we end up with a lot of pointless contents
> in regression.diffs. If we instead realized that we shouldn't start further
> tests, we'd limit that by a fair bit.

I once coded it up so that if the server crashes during a test, it would 
wait until it recovers before running the next test.  I found that 
useful.  I agree the current behavior is not useful in any case.




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Nazir Bilal Yavuz
Date:
Hi,

On Thu, 24 Aug 2023 at 00:48, Andres Freund <andres@anarazel.de> wrote:
> > v3-0008-ci-switch-tasks-to-debugoptimized-build.patch:
> > Just thinking of possible optimizations and thought can't we create
> > something like 'buildtype: xxx' to override default buildtype using
> > .cirrus.star? This could be better for PG developers. For sure that
> > could be the subject of another patch.
>
> We could, but I'm not sure what the use would be?

My main idea behind this was that PG developers could choose
'buildtype: debug' while working on their patches and that
optimization makes it easier to choose the buildtype.

Regards,
Nazir Bilal Yavuz
Microsoft



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-22 23:58:33 -0700, Andres Freund wrote:
> To make that possible, we need to make the compute resources for CI
> configurable on a per-repository basis.  After experimenting with a bunch of
> ways to do that, I got stuck on that for a while. But since today we have
> sufficient macos runners for cfbot available, so... I think the approach I
> finally settled on is decent, although not great. It's described in the "main"
> commit message:
> [...]
>     ci: Prepare to make compute resources for CI configurable
> I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> August. I'll be online intermittently though, if there are issues, I can react
> (very limited connectivity for midday Aug 29th - midday Aug 31st though). I'd
> appreciate a quick review or two.

I pushed this yesterday.

And then utilized it to make cfbot use
1) macos persistent workers, hosted by two community members
2) our own GCP account for all the other operating systems

There were a few issues initially (needed to change how to run multiple jobs
on a single mac, and looks like there were some issues with macos going to
sleep while processing jobs...). But it now seems to be chugging along ok.

One of the nice things is that with our own compute we also control how much
storage can be used, making things like generating docs or code coverage as
part of cfbot more realistic. And we could enable mingw by default when run as
part of cfbot...

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Daniel Gustafsson
Date:
> On 23 Aug 2023, at 23:12, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 23 Aug 2023, at 23:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Daniel Gustafsson <daniel@yesql.se> writes:
>>> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
>>>> I think there are more effective ways to make this cheaper. The basic thing
>>>> would be to use libpq instead of forking psql to make a connection
>>>> check.
>>
>>> I had it in my head that not using libpq in pg_regress was a deliberate choice,
>>> but I fail to find a reference to it in the archives.
>>
>> I have a vague feeling that you are right about that.  Perhaps the
>> concern was that under "make installcheck", pg_regress might be
>> using a build-tree copy of libpq rather than the one from the
>> system under test.  As long as we're just trying to ping the server,
>> that shouldn't matter too much I think ... unless we hit problems
>> with, say, a different default port number or socket path compiled into
>> one copy vs. the other?  That seems like it's probably a "so don't
>> do that" case, though.
>
> Ah yes, that does ring a familiar bell.  I agree that using it for pinging the
> server should be safe either way, but we should document the use-with-caution
> in pg_regress.c if/when we go down that path.  I'll take a stab at changing the
> psql retry loop for pinging tomorrow to see what it would look like.

Attached is a patch with a quick PoC for using PQping instead of using psql for
connection checks in pg_regress.  In order to see performance it also includes
a diag output for "Time to first test" which contains all setup costs.  This
might not make it into a commit but it was quite helpful in hacking so I left
it in for now.

The patch incorporates Andres' idea for finer granularity of checks by checking
TICKS times per second rather than once per second; it also shifts the
pg_usleep around to require just one ping in most cases compared to two today.

On my relatively tired laptop this speeds up pg_regress setup by 100+ms with
much bigger wins on Windows in the CI.  While it does add a dependency on
libpq, I think it's a fairly decent price to pay for running tests faster.
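
For readers without the attachment, the shape of such a loop can be approximated in Python. This is not the patch: wait_for_startup, load_pqping and the waits_per_sec default are assumptions. PQping() and the PGPing result codes are real libpq API, reached here through ctypes, and the loop takes the probe as a parameter so it can be exercised without a server.

```python
import ctypes
import ctypes.util
import time

# PGPing result codes, in the order libpq-fe.h declares them.
PQPING_OK, PQPING_REJECT, PQPING_NO_RESPONSE, PQPING_NO_ATTEMPT = range(4)


def load_pqping():
    """Return a callable wrapping libpq's PQping(), or None if libpq is unavailable."""
    path = ctypes.util.find_library("pq")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None
    lib.PQping.argtypes = [ctypes.c_char_p]
    lib.PQping.restype = ctypes.c_int
    return lambda conninfo: lib.PQping(conninfo.encode())


def wait_for_startup(ping, conninfo, wait_seconds=60, waits_per_sec=20, sleep=time.sleep):
    """Sleep first, then probe, several times per second, until the server answers."""
    for _ in range(wait_seconds * waits_per_sec):
        sleep(1.0 / waits_per_sec)
        if ping(conninfo) == PQPING_OK:
            return True
    return False


# Demo with a fake probe: two failed pings, then success. No real sleeping.
responses = iter([PQPING_NO_RESPONSE, PQPING_NO_RESPONSE, PQPING_OK])
started = wait_for_startup(lambda conninfo: next(responses), "port=5432",
                           sleep=lambda seconds: None)
print("server started:", started)
```

Against a real server one would pass the result of load_pqping() as the ping argument, after checking that it is not None.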

--
Daniel Gustafsson



Attachment

Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Daniel Gustafsson
Date:
> On 28 Aug 2023, at 14:32, Daniel Gustafsson <daniel@yesql.se> wrote:

> Attached is a patch with a quick PoC for using PQping instead of using psql for
> connection checks in pg_regress.

The attached v2 fixes a silly mistake which led to a compiler warning.

--
Daniel Gustafsson


Attachment

Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Andres Freund
Date:
Hi,

On 2023-08-30 10:57:10 +0200, Daniel Gustafsson wrote:
> > On 28 Aug 2023, at 14:32, Daniel Gustafsson <daniel@yesql.se> wrote:
> 
> > Attached is a patch with a quick PoC for using PQping instead of using psql for
> > connection checks in pg_regress.
> 
> The attached v2 fixes a silly mistake which led to a compiler warning.

Still seems like a good idea to me. To see what impact it has, I measured the
time running the pg_regress tests that take less than 6s on my machine - I
excluded the slower ones (like the main regression tests) because they'd hide
any overall difference.

ninja && m test --suite setup --no-rebuild && \
  tests=$(m test --no-rebuild --list | grep -E '/regress' | \
    grep -vE '(regress|postgres_fdw|test_integerset|intarray|amcheck|test_decoding)/regress' | \
    cut -d' ' -f 3) && \
  time m test --no-rebuild $tests
 

Time for:


master:

cassert:
real    0m5.265s
user    0m8.422s
sys    0m8.381s

optimized:
real    0m4.926s
user    0m6.356s
sys    0m8.263s


my patch (probing every 100ms with psql):

cassert:
real    0m3.465s
user    0m8.827s
sys    0m8.579s

optimized:
real    0m2.932s
user    0m6.596s
sys    0m8.458s


Daniel's (probing every 50ms with PQping()):

cassert:
real    0m3.347s
user    0m8.373s
sys    0m8.354s

optimized:
real    0m2.527s
user    0m6.156s
sys    0m8.315s


My patch increased user/sys time a bit (likely due to a higher number of
futile psql forks), but Daniel's doesn't. And it does show a nice overall wall
clock time saving.
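
To put the wall clock numbers above in relative terms, the saving can be
computed as (baseline - patched) / baseline.  The helper below is only an
illustrative sketch; the name percent_saved is made up for this example.

```c
/*
 * Percentage of wall clock time saved when going from baseline_s seconds
 * to patched_s seconds.
 */
static double
percent_saved(double baseline_s, double patched_s)
{
    return 100.0 * (baseline_s - patched_s) / baseline_s;
}
```

Applied to the "real" times listed above, this gives roughly 34% (cassert) and
40% (optimized) for probing every 100ms with psql, and roughly 36% (cassert)
and 49% (optimized) for probing every 50ms with PQping().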

Greetings,

Andres Freund



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Daniel Gustafsson
Date:
> On 13 Sep 2023, at 01:49, Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-30 10:57:10 +0200, Daniel Gustafsson wrote:
>>> On 28 Aug 2023, at 14:32, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>>> Attached is a patch with a quick PoC for using PQping instead of using psql for
>>> connection checks in pg_regress.
>>
>> The attached v2 fixes a silly mistake which led to a compiler warning.
>
> Still seems like a good idea to me. To see what impact it has, I measured the
> time running the pg_regress tests that take less than 6s on my machine - I
> excluded the slower ones (like the main regression tests) because they'd hide
> any overall difference.

> My patch increased user/sys time a bit (likely due to a higher number of
> futile psql forks), but Daniel's doesn't. And it does show a nice overall wall
> clock time saving.

While it does add a lib dependency I think it's worth doing, so I propose we go
ahead with this for master.

--
Daniel Gustafsson




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Daniel Gustafsson
Date:
> On 13 Sep 2023, at 01:49, Andres Freund <andres@anarazel.de> wrote:

> My patch increased user/sys time a bit (likely due to a higher number of
> futile psql forks), but Daniel's doesn't. And it does show a nice overall wall
> clock time saving.

I went ahead and applied this on master, thanks for review!  Now to see if
there will be any noticeable difference in resource usage.

--
Daniel Gustafsson




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Tom Lane
Date:
Daniel Gustafsson <daniel@yesql.se> writes:
> I went ahead and applied this on master, thanks for review!  Now to see if
> there will be any noticeable difference in resource usage.

I think that tools like Coverity are likely to whine about your
use of sprintf instead of snprintf.  Sure, it's perfectly safe,
but that won't stop the no-sprintf-ever crowd from complaining.

            regards, tom lane



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

From
Daniel Gustafsson
Date:
> On 24 Oct 2023, at 22:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> I went ahead and applied this on master, thanks for review!  Now to see if
>> there will be any noticeable difference in resource usage.
>
> I think that tools like Coverity are likely to whine about your
> use of sprintf instead of snprintf.  Sure, it's perfectly safe,
> but that won't stop the no-sprintf-ever crowd from complaining.

Fair point, that's probably quite likely to happen.  I can apply an snprintf()
conversion change like this in the two places introduced by this:

-        sprintf(s, "%d", port);
+        snprintf(s, sizeof(s), "%d", port);
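
As a standalone illustration of the bounded formatting discussed above, here is
a minimal sketch using plain C99 snprintf().  The helper name format_port is
hypothetical and not part of pg_regress; note also that sizeof(s) only yields
the buffer size when s is an array in scope, not a pointer, which is why the
sketch passes the length explicitly.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format a port number into buf, never writing more than buflen bytes.
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if formatting failed or the output did not fit.
 */
static int
format_port(char *buf, size_t buflen, int port)
{
    int         n = snprintf(buf, buflen, "%d", port);

    /* snprintf returns the length it wanted to write; detect truncation. */
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return n;
}
```

With a 16-byte array s, format_port(s, sizeof(s), 5432) stores "5432" and
returns 4, while a 3-byte buffer is too small for 65535 and yields -1.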

--
Daniel Gustafsson