Thread: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
Hi,

As some of you might have seen when running CI, cirrus-ci is restricting how much CI cycles everyone can use for free (announcement at [1]). This takes effect September 1st.

This obviously has consequences both for individual users of CI as well as cfbot.

The first thing I think we should do is to lower the cost of CI. One thing I had not entirely realized previously is that macos CI is by far the most expensive CI to provide. That's not just the case with cirrus-ci, but also with other providers. See the series of patches described later in the email.

To me, the situation for cfbot is different than the one for individual users.

IMO, for the individual user case it's important to use CI for "free", without a whole lot of complexity. Which imo rules out approaches like providing $cloud_provider compute accounts, that's too much setup work. With the improvements detailed below, cirrus' free CI would last about ~65 runs / month.

For cfbot I hope we can find funding to pay for compute to use for CI. The by far most expensive bit is macos, to a significant degree due to macos licensing terms not allowing more than 2 VMs on a physical host :(.

The reasons we chose cirrus-ci were:

a) Ability to use full VMs, rather than a pre-selected set of VMs, which allows us to test a larger number

b) Ability to link to log files, without requiring an account. E.g. github actions doesn't allow viewing logs unless logged in.

c) Amount of compute available.

The set of free CI providers has shrunk since we chose cirrus, as have the "free" resources provided. I started a, quite incomplete as of now, wiki page at [4].

Potential paths forward for individual CI:

- migrate wholesale to another CI provider

- split CI tasks across different CI providers, rely on github et al displaying the CI status for different platforms

- give up

Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.

Opinions as to what to do?

The attached series of patches:

1) Makes startup of macos instances faster, using more efficient caching of the required packages. Also submitted as [2].

2) Introduces a template initdb that's reused during the tests. Also submitted as [3].

3) Remove use of -DRANDOMIZE_ALLOCATED_MEMORY from macos tasks. It's expensive, and CI also uses asan on linux, so I don't think it's really needed.

4) Switch tasks to use debugoptimized builds. Previously many tasks used -Og, to get decent backtraces etc., but the amount of CPU burned that way is too large. One issue with that is that use of ccache becomes much more crucial; uncached build times do significantly increase.

5) Move use of -Dsegsize_blocks=6 from macos to linux. Macos is expensive, and -Dsegsize_blocks=6 slows things down. Alternatively we could stop covering segsize_blocks for both meson and autoconf. It does affect runtime on linux as well.

6) Disable write cache flushes on windows. It's a bit ugly to do this without using the UI... Shaves off about 30s from the tests.

7) pg_regress only checked once a second whether postgres started up, but it's usually much faster. Use pg_ctl's logic. It might be worth replacing the use of psql with directly using libpq in pg_regress instead; it looks like the overhead of repeatedly starting psql is noticeable.
FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly like the following (depends on caching etc):

task                       costs in credits
linux-sanity:              0.01
linux-compiler-warnings:   0.05
linux-meson:               0.07
freebsd:                   0.08
linux-autoconf:            0.09
windows:                   0.18
macos:                     0.28

total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month

Greetings,

Andres Freund

[1] https://cirrus-ci.org/blog/2023/07/17/limiting-free-usage-of-cirrus-ci/
[2] https://www.postgresql.org/message-id/20230805202539.r3umyamsnctysdc7%40awork3.anarazel.de
[3] https://postgr.es/m/20220120021859.3zpsfqn4z7ob7afz@alap3.anarazel.de
Attachment
- v1-0001-ci-macos-used-cached-macports-install.patch
- v1-0002-Use-template-initdb-in-tests.patch
- v1-0003-ci-macos-Remove-use-of-DRANDOMIZE_ALLOCATED_MEMOR.patch
- v1-0004-ci-switch-tasks-to-debugoptimized-build.patch
- v1-0005-ci-Move-use-of-Dsegsize_blocks-6-from-macos-to-li.patch
- v1-0006-ci-windows-Disabling-write-cache-flushing-during-.patch
- v1-0007-regress-Check-for-postgres-startup-completion-mor.patch
- v1-0008-ci-Don-t-specify-amount-of-memory.patch
Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:
> The set of free CI providers has shrunk since we chose cirrus, as have the
> "free" resources provided. I started, quite incomplete as of now, wiki page at
> [4].

Oops, as Thomas just noticed, I left off that link:

[4] https://wiki.postgresql.org/wiki/CI_Providers

Greetings,

Andres Freund
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
From: Heikki Linnakangas
On 08/08/2023 05:15, Andres Freund wrote:
> IMO, for the individual user case it's important to use CI for "free", without
> a whole lot of complexity. Which imo rules approaches like providing
> $cloud_provider compute accounts, that's too much setup work.

+1

> With the improvements detailed below, cirrus' free CI would last
> about ~65 runs / month.

I think that's plenty.

> For cfbot I hope we can find funding to pay for compute to use for CI.

+1

> Potential paths forward for cfbot, in addition to the above:
>
> - Pay for compute / ask the various cloud providers to grant us compute
>   credits. At least some of the cloud providers can be used via cirrus-ci.
>
> - Host (some) CI runners ourselves. Particularly with macos and windows, that
>   could provide significant savings.
>
> - Build our own system, using buildbot, jenkins or whatnot.
>
> Opinions as to what to do?

The resources for running our own system aren't free either. I'm sure we can get sponsors for the cirrus-ci credits, or use donations. I have been quite happy with Cirrus CI overall.

> The attached series of patches:

All of this makes sense to me, although I don't use macos myself.

> 5) Move use of -Dsegsize_blocks=6 from macos to linux
>
>    Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
>    could stop covering both meson and autoconf segsize_blocks. It does affect
>    runtime on linux as well.

Could we have a comment somewhere on why we use -Dsegsize_blocks on these particular CI runs? It seems pretty random. I guess the idea is to have one autoconf task and one meson task with that option, to check that the autoconf/meson option works?

> 6) Disable write cache flushes on windows
>
>    It's a bit ugly to do this without using the UI... Shaves off about 30s
>    from the tests.

A brief comment would be nice: "We don't care about persistence over hard crashes in the CI, so disable write cache flushes to speed it up."

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On 08.08.23 04:15, Andres Freund wrote:
> Potential paths forward for individual CI:
>
> - migrate wholesale to another CI provider
>
> - split CI tasks across different CI providers, rely on github et al
>   displaying the CI status for different platforms
>
> - give up

With the proposed optimizations, it seems you can still do a fair amount of work under the free plan.

> Potential paths forward for cfbot, in addition to the above:
>
> - Pay for compute / ask the various cloud providers to grant us compute
>   credits. At least some of the cloud providers can be used via cirrus-ci.
>
> - Host (some) CI runners ourselves. Particularly with macos and windows, that
>   could provide significant savings.
>
> - Build our own system, using buildbot, jenkins or whatnot.

I think we should use the "compute credits" plan from Cirrus CI. It should be possible to estimate the costs for that. Money is available, I think.
Hi,

On 2023-08-08 16:28:49 +0200, Peter Eisentraut wrote:
> On 08.08.23 04:15, Andres Freund wrote:
> > Potential paths forward for cfbot, in addition to the above:
> >
> > - Pay for compute / ask the various cloud providers to grant us compute
> >   credits. At least some of the cloud providers can be used via cirrus-ci.
> >
> > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> >   could provide significant savings.
> >
> > - Build our own system, using buildbot, jenkins or whatnot.
>
> I think we should use the "compute credits" plan from Cirrus CI. It should
> be possible to estimate the costs for that. Money is available, I think.

Unfortunately just doing that seems like it would end up considerably on the too expensive side. Here are the stats for last month's cfbot runtimes (provided by Thomas):

                   task_name                    |    sum
------------------------------------------------+------------
 FreeBSD - 13 - Meson                           | 1017:56:09
 Windows - Server 2019, MinGW64 - Meson         | 00:00:00
 SanityCheck                                    | 76:48:41
 macOS - Ventura - Meson                        | 873:12:43
 Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
 Linux - Debian Bullseye - Autoconf             | 830:17:26
 Linux - Debian Bullseye - Meson                | 860:37:21
 CompilerWarnings                               | 935:30:35
(8 rows)

If I did the math right, that's about 7000 credits (and 1 credit costs 1 USD).

task                       costs in credits
linux-sanity:              55.30
linux-autoconf:            598.04
linux-meson:               619.40
linux-compiler-warnings:   674.28
freebsd:                   732.24
windows:                   1201.09
macos:                     3143.52

Now, those times are before optimizing test runtime. And besides optimizing the tasks, we can also avoid running tests for docs patches etc., and optimize cfbot to schedule a bit better. But still, the costs don't look realistic to me.

If instead we were to use our own GCP account, it's a lot less. t2d-standard-4 instances, which are faster than what we use right now, cost $0.168984 / hour as "normal" instances and $0.026764 as "spot" instances right now [1]. Windows VMs are considerably more expensive due to licensing - 0.184$/h in addition.

Assuming spot instances, linux+freebsd tasks would cost ~100 USD / month (maybe 10-20% more in reality, due to a) spot instances getting terminated requiring retries and b) disks). Windows would be ~255 USD / month (same retries caveats).

Given the cost of macos, it seems like it'd be by far the most affordable to just buy 1-2 mac minis (2x ~660 USD) and stick them in a shelf somewhere, as persistent runners. Cirrus has builtin macos virtualization support - but can only host two VMs on each mac, due to macos licensing restrictions. A single mac mini would suffice to keep up with our unoptimized monthly runtime (although there likely would be some overhead).

Greetings,

Andres Freund

[1] https://cloud.google.com/compute/all-pricing
On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> like the following (depends on caching etc):
>
> task                       costs in credits
> linux-sanity:              0.01
> linux-compiler-warnings:   0.05
> linux-meson:               0.07
> freebsd:                   0.08
> linux-autoconf:            0.09
> windows:                   0.18
> macos:                     0.28
> total task runtime is 40.8
> cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month

I am not in the loop on the autotools vs meson stuff. How much longer do we anticipate keeping autotools around? Seems like it could be a good opportunity to reduce some CI usage if autotools were finally dropped, but I know there are still outstanding tasks to complete.

Back of the napkin math says autotools is about 12% of the credit cost, though I haven't looked to see if linux-meson and linux-autotools are 1:1.

-- 
Tristan Partin
Neon (https://neon.tech)
Hi,

On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:
> On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> > FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> > like the following (depends on caching etc):
> >
> > task                       costs in credits
> > linux-sanity:              0.01
> > linux-compiler-warnings:   0.05
> > linux-meson:               0.07
> > freebsd:                   0.08
> > linux-autoconf:            0.09
> > windows:                   0.18
> > macos:                     0.28
> > total task runtime is 40.8
> > cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
>
> I am not in the loop on the autotools vs meson stuff. How much longer do we
> anticipate keeping autotools around?

I think it depends in what fashion. We've been talking about supporting building out-of-tree modules with "pgxs" for at least a 5 year support window. But the replacement isn't yet finished [1], so that clock hasn't yet started ticking.

> Seems like it could be a good opportunity to reduce some CI usage if
> autotools were finally dropped, but I know there are still outstanding tasks
> to complete.
>
> Back of the napkin math says autotools is about 12% of the credit cost,
> though I haven't looked to see if linux-meson and linux-autotools are 1:1.

The autoconf task is actually doing quite useful stuff right now, leaving the use of configure aside, as it builds with address sanitizer. Without that it'd be a lot faster. But we'd lose, imo quite important, coverage. The tests would run a bit faster with meson, but it'd be overall a difference on the margins.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/tree/meson-pkgconfig
On Tue Aug 8, 2023 at 10:38 AM CDT, Andres Freund wrote:
> Hi,
>
> On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:
> > On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> > > FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> > > like the following (depends on caching etc):
> > >
> > > task                       costs in credits
> > > linux-sanity:              0.01
> > > linux-compiler-warnings:   0.05
> > > linux-meson:               0.07
> > > freebsd:                   0.08
> > > linux-autoconf:            0.09
> > > windows:                   0.18
> > > macos:                     0.28
> > > total task runtime is 40.8
> > > cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
> >
> > I am not in the loop on the autotools vs meson stuff. How much longer do we
> > anticipate keeping autotools around?
>
> I think it depends in what fashion. We've been talking about supporting
> building out-of-tree modules with "pgxs" for at least a 5 year support
> window. But the replacement isn't yet finished [1], so that clock hasn't yet
> started ticking.
>
> > Seems like it could be a good opportunity to reduce some CI usage if
> > autotools were finally dropped, but I know there are still outstanding tasks
> > to complete.
> >
> > Back of the napkin math says autotools is about 12% of the credit cost,
> > though I haven't looked to see if linux-meson and linux-autotools are 1:1.
>
> The autoconf task is actually doing quite useful stuff right now, leaving the
> use of configure aside, as it builds with address sanitizer. Without that it'd
> be a lot faster. But we'd loose, imo quite important, coverage. The tests
> would run a bit faster with meson, but it'd be overall a difference on the
> margins.
>
> [1] https://github.com/anarazel/postgres/tree/meson-pkgconfig

Makes sense. Please let me know if I can help you out in any way for the v17 development cycle besides what we have already talked about.

-- 
Tristan Partin
Neon (https://neon.tech)
On 2023-Aug-08, Andres Freund wrote:

> Given the cost of macos, it seems like it'd be by far the most of affordable
> to just buy 1-2 mac minis (2x ~660USD) and stick them in a shelf somewhere, as
> persistent runners. Cirrus has builtin macos virtualization support - but can
> only host two VMs on each mac, due to macos licensing restrictions. A single
> mac mini would suffice to keep up with our unoptimized monthly runtime
> (although there likely would be some overhead).

If using persistent workers is an option, maybe we should explore that. I think we could move all or some of the Linux - Debian builds to hardware that we already have in shelves (depending on how much compute power is really needed.)

I think using other OSes is more difficult, mostly because I doubt we want to deal with licenses; but even FreeBSD might not be a realistic option, at least not in the short term. Still,

>                    task_name                    |    sum
> ------------------------------------------------+------------
>  FreeBSD - 13 - Meson                           | 1017:56:09
>  Windows - Server 2019, MinGW64 - Meson         | 00:00:00
>  SanityCheck                                    | 76:48:41
>  macOS - Ventura - Meson                        | 873:12:43
>  Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
>  Linux - Debian Bullseye - Autoconf             | 830:17:26
>  Linux - Debian Bullseye - Meson                | 860:37:21
>  CompilerWarnings                               | 935:30:35
> (8 rows)
>

moving just Debian, that might alleviate 76+830+860+935 hours from the Cirrus infra, which is ~46%. Not bad.

(How come Windows - Meson reports allballs?)

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso" (Carlos Caszeli)
Hi,

On 2023-08-08 18:34:58 +0200, Alvaro Herrera wrote:
> On 2023-Aug-08, Andres Freund wrote:
>
> > Given the cost of macos, it seems like it'd be by far the most of affordable
> > to just buy 1-2 mac minis (2x ~660USD) and stick them in a shelf somewhere, as
> > persistent runners. Cirrus has builtin macos virtualization support - but can
> > only host two VMs on each mac, due to macos licensing restrictions. A single
> > mac mini would suffice to keep up with our unoptimized monthly runtime
> > (although there likely would be some overhead).
>
> If using persistent workers is an option, maybe we should explore that.
> I think we could move all or some of the Linux - Debian builds to
> hardware that we already have in shelves (depending on how much compute
> power is really needed.)

(76+830+860+935)/((365/12)*24) = 3.7

3.7 instances with 4 "vcores" are busy 100% of the time. So we'd need at least ~16 cpu threads - I think cirrus sometimes uses instances that disable HT, so it'd perhaps be 16 cores actually.

> I think using other OSes is more difficult, mostly because I doubt we
> want to deal with licenses; but even FreeBSD might not be a realistic
> option, at least not in the short term.

They can be VMs, so that shouldn't be a big issue.

> >                    task_name                    |    sum
> > ------------------------------------------------+------------
> >  FreeBSD - 13 - Meson                           | 1017:56:09
> >  Windows - Server 2019, MinGW64 - Meson         | 00:00:00
> >  SanityCheck                                    | 76:48:41
> >  macOS - Ventura - Meson                        | 873:12:43
> >  Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
> >  Linux - Debian Bullseye - Autoconf             | 830:17:26
> >  Linux - Debian Bullseye - Meson                | 860:37:21
> >  CompilerWarnings                               | 935:30:35
> > (8 rows)
>
> moving just Debian, that might alleviate 76+830+860+935 hours from the
> Cirrus infra, which is ~46%. Not bad.
>
> (How come Windows - Meson reports allballs?)

It's mingw64, which we've marked as "manual", because we didn't have the cpu cycles to run it.

Greetings,

Andres Freund
Hi,

On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:
> On 08/08/2023 05:15, Andres Freund wrote:
> > With the improvements detailed below, cirrus' free CI would last
> > about ~65 runs / month.
>
> I think that's plenty.

Not so sure, I would regularly exceed it, I think. But it definitely will suffice for more casual contributors.

> > Potential paths forward for cfbot, in addition to the above:
> >
> > - Pay for compute / ask the various cloud providers to grant us compute
> >   credits. At least some of the cloud providers can be used via cirrus-ci.
> >
> > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> >   could provide significant savings.
> >
> > - Build our own system, using buildbot, jenkins or whatnot.
> >
> > Opinions as to what to do?
>
> The resources for running our own system isn't free either. I'm sure we can
> get sponsors for the cirrus-ci credits, or use donations.

As outlined in my reply to Alvaro, just using credits likely is financially not viable...

> > 5) Move use of -Dsegsize_blocks=6 from macos to linux
> >
> >    Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
> >    could stop covering both meson and autoconf segsize_blocks. It does affect
> >    runtime on linux as well.
>
> Could we have a comment somewhere on why we use -Dsegsize_blocks on these
> particular CI runs? It seems pretty random. I guess the idea is to have one
> autoconf task and one meson task with that option, to check that the
> autoconf/meson option works?

Hm, some of that was in the commit message, but I should have added it to .cirrus.yml as well.

Normally, the "relation segment" code basically has no coverage in our tests, because we (quite reasonably) don't generate tables large enough. We've had plenty of bugs that we didn't notice due to the code not being exercised much. So it seemed useful to add CI coverage, by making the segments very small. I chose the tasks by looking at how long they took at the time, I think, adding the option to the slower ones.

> > 6) Disable write cache flushes on windows
> >
> >    It's a bit ugly to do this without using the UI... Shaves off about 30s
> >    from the tests.
>
> A brief comment would be nice: "We don't care about persistence over hard
> crashes in the CI, so disable write cache flushes to speed it up."

Turns out that patch doesn't work on its own anyway, at least not reliably... I tested it by interactively logging into a windows vm and testing it there. It doesn't actually seem to suffice when run in isolation, because the relevant registry key doesn't yet exist. I haven't yet figured out the magic incantations for adding the missing "intermediary", but I'm getting there...

Greetings,

Andres Freund
On Tue, Aug 8, 2023 at 9:26 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:
> > On 08/08/2023 05:15, Andres Freund wrote:
> > > With the improvements detailed below, cirrus' free CI would last
> > > about ~65 runs / month.
> >
> > I think that's plenty.
>
> Not so sure, I would regularly exceed it, I think. But it definitely will
> suffice for more casual contributors.
>
> > > Potential paths forward for cfbot, in addition to the above:
> > >
> > > - Pay for compute / ask the various cloud providers to grant us compute
> > >   credits. At least some of the cloud providers can be used via cirrus-ci.
> > >
> > > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> > >   could provide significant savings.
> > >
> > > - Build our own system, using buildbot, jenkins or whatnot.
> > >
> > > Opinions as to what to do?
> >
> > The resources for running our own system isn't free either. I'm sure we can
> > get sponsors for the cirrus-ci credits, or use donations.
>
> As outlined in my reply to Alvaro, just using credits likely is financially
> not viable...
>

In case it's helpful, from an SPI oriented perspective, $7K/month is probably an order of magnitude more than what we can sustain, so I don't see a way to make that work without some kind of additional magic that includes other non-profits and/or commercial companies changing donation habits between now and September.

Purchasing a couple of mac minis (and/or similar gear) would be near trivial though, just a matter of figuring out where/how to host them (but I think infra can chime in on that if that's what gets decided).

The other likely option would be to seek out cloud credits from one of the big three (or others); Amazon has continually said they would be happy to donate more credits to us if we had a use, and I think some of the other hosting providers have said similarly at times; so we'd need to ask and hope it's not too bureaucratic.

Robert Treat
https://xzilla.net
Hi,

On 2023-08-08 22:29:50 -0400, Robert Treat wrote:
> In case it's helpful, from an SPI oriented perspective, $7K/month is
> probably an order of magnitude more than what we can sustain, so I
> don't see a way to make that work without some kind of additional
> magic that includes other non-profits and/or commercial companies
> changing donation habits between now and September.

Yea, I think that'd make no sense, even if we could afford it. I think the patches I've written should drop it to 1/2 already. Thomas added some throttling to push it down further.

> Purchasing a couple of mac-mini's (and/or similar gear) would be near
> trivial though, just a matter of figuring out where/how to host it
> (but I think infra can chime in on that if that's what get's decided).

Cool. Because of the limitation of running two VMs at a time on macos and the comparatively low cost of mac minis, it seems they beat alternative models by a fair bit.

Pginfra/sysadmin: ^

Based on being off by an order of magnitude, as you mention earlier, it seems that:

1) reducing test runtime etc., as already in progress
2) getting 2 mac minis as runners
3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)

would be viable for a month or three? I hope we can get some cloud providers to chip in for 3), but I'd like to have something in place that doesn't depend on that. Given the cost of macos VMs at AWS, the only one of the big cloud providers to have macos instances, I think we'd burn pointlessly quickly through credits if we used VMs for that.

(*) I think we should be able to get below that, but ...

> The other likely option would be to seek out cloud credits from one of
> the big three (or others); Amazon has continually said they would be
> happy to donate more credits to us if we had a use, and I think some
> of the other hosting providers have said similarly at times; so we'd
> need to ask and hope it's not too bureaucratic.

Yep. I tried to start that process within microsoft, fwiw. Perhaps Joe and Jonathan know how to start within AWS? And perhaps Noah inside GCP?

It'd be the least work to get it up and running in GCP, as it's already running there, but should be quite doable at the others as well.

Greetings,

Andres Freund
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
From: Juan José Santamaría Flecha
On Wed, Aug 9, 2023 at 3:26 AM Andres Freund <andres@anarazel.de> wrote:
> > 6) Disable write cache flushes on windows
> >
> > It's a bit ugly to do this without using the UI... Shaves off about 30s
> > from the tests.
>
> A brief comment would be nice: "We don't care about persistence over hard
> crashes in the CI, so disable write cache flushes to speed it up."
Turns out that patch doesn't work on its own anyway, at least not
reliably... I tested it by interactively logging into a windows vm and testing
it there. It doesn't actually seem to suffice when run in isolation, because
the relevant registry key doesn't yet exist. I haven't yet figured out the
magic incantations for adding the missing "intermediary", but I'm getting
there...
You can find a good example on how to accomplish this in:
Regards,
Juan José Santamaría Flecha
Hello

So pginfra had a little chat about this.

Firstly, there's consensus that it makes sense for pginfra to help out with some persistent workers in our existing VM system; however, there are some aspects that need some further discussion, to avoid destabilizing the rest of the infrastructure. We're looking into it and we'll let you know.

Hosting a couple of Mac Minis is definitely a possibility, if some entity like SPI buys them. Let's take this off-list to arrange the details.

Regards

-- 
Álvaro Herrera
On Tue, Aug 08, 2023 at 07:59:55PM -0700, Andres Freund wrote:
> On 2023-08-08 22:29:50 -0400, Robert Treat wrote:
> 3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)
>
> The other likely option would be to seek out cloud credits
> I tried to start that progress within microsoft, fwiw. Perhaps Joe and
> Jonathan know how to start within AWS? And perhaps Noah inside GCP?
>
> It'd be the least work to get it up and running in GCP, as it's already
> running there

I'm looking at this. Thanks for bringing it to my attention.
Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:
> As some of you might have seen when running CI, cirrus-ci is restricting how
> much CI cycles everyone can use for free (announcement at [1]). This takes
> effect September 1st.
>
> This obviously has consequences both for individual users of CI as well as
> cfbot.
>
> [...]
> Potential paths forward for individual CI:
>
> - migrate wholesale to another CI provider
>
> - split CI tasks across different CI providers, rely on github et al
>   displaying the CI status for different platforms
>
> - give up
>
> Potential paths forward for cfbot, in addition to the above:
>
> - Pay for compute / ask the various cloud providers to grant us compute
>   credits. At least some of the cloud providers can be used via cirrus-ci.
>
> - Host (some) CI runners ourselves. Particularly with macos and windows, that
>   could provide significant savings.

To make that possible, we need to make the compute resources for CI configurable on a per-repository basis. After experimenting with a bunch of ways to do that, I got stuck on that for a while. But since today we have sufficient macos runners for cfbot available, so...

I think the approach I finally settled on is decent, although not great. It's described in the "main" commit message:

    ci: Prepare to make compute resources for CI configurable

    cirrus-ci will soon restrict the amount of free resources every user gets (as
    have many other CI providers). For most users of CI that should not be an
    issue. But e.g. for cfbot it will be an issue.

    To allow configuring different resources on a per-repository basis, introduce
    infrastructure for overriding the task execution environment. Unfortunately
    this is not entirely trivial, as yaml anchors have to be defined before their
    use, and cirrus-ci only allows injecting additional contents at the end of
    .cirrus.yml.

    To deal with that, move the definition of the CI tasks to .cirrus.tasks.yml.
    The main .cirrus.yml is loaded first, then, if defined, the file referenced
    by the REPO_CI_CONFIG_GIT_URL variable will be added, followed by the
    contents of .cirrus.tasks.yml. That allows REPO_CI_CONFIG_GIT_URL to override
    the yaml anchors defined in .cirrus.yml.

    Unfortunately git's default merge / rebase strategy does not handle copied
    files, just renamed ones. To avoid painful rebasing over this change, this
    commit just renames .cirrus.yml to .cirrus.tasks.yml, without adding a new
    .cirrus.yml. That's done in the followup commit, which moves the relevant
    portion of .cirrus.tasks.yml to .cirrus.yml. Until that is done,
    REPO_CI_CONFIG_GIT_URL does not fully work.

    The subsequent commit adds documentation for how to configure custom compute
    resources to src/tools/ci/README

    Discussion: https://postgr.es/m/20230808021541.7lbzdefvma7qmn3w@awork3.anarazel.de
    Backpatch: 15-, where CI support was added

I don't love moving most of the contents of .cirrus.yml into a new file, but I don't see another way. I did implement it without that as well (see [1]), but that ends up considerably harder to understand, and hardcodes what cfbot needs. Splitting the commit, as explained above, at least makes git rebase fairly painless.

FWIW, I did merge the changes into 15, with only reasonable conflicts (due to new tasks, autoconf->meson).

A prerequisite commit converts "SanityCheck" and "CompilerWarnings" to use a full VM instead of a container - that way providing custom compute resources doesn't have to deal with containers in addition to VMs.
It also looks like the increased startup overhead is outweighed by the reduction in runtime overhead.

I'm hoping to push this fairly soon, as I'll be on vacation the last week of August. I'll be online intermittently though; if there are issues, I can react (very limited connectivity from midday Aug 29th to midday Aug 31st though). I'd appreciate a quick review or two.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/commit/b95fd302161b951f1dc14d586162ed3d85564bfc
Attachment
- v3-0001-ci-Don-t-specify-amount-of-memory.patch
- v3-0002-ci-Move-execution-method-of-tasks-into-yaml-templ.patch
- v3-0003-ci-Use-VMs-for-SanityCheck-and-CompilerWarnings.patch
- v3-0004-ci-Prepare-to-make-compute-resources-for-CI-confi.patch
- v3-0005-ci-Make-compute-resources-for-CI-configurable.patch
- v3-0006-ci-dontmerge-Example-custom-CI-configuration.patch
- v3-0007-Use-template-initdb-in-tests.patch
- v3-0008-ci-switch-tasks-to-debugoptimized-build.patch
- v3-0009-ci-windows-Disabling-write-cache-flushing-during-.patch
- v3-0010-regress-Check-for-postgres-startup-completion-mor.patch
> On 23 Aug 2023, at 08:58, Andres Freund <andres@anarazel.de> wrote:

> I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> August. I'll be online intermittently though, if there are issues, I can react
> (very limited connectivity for middday Aug 29th - midday Aug 31th though). I'd
> appreciate a quick review or two.

I've been reading over these and the thread, and while not within my area of expertise, nothing really sticks out. I'll do another pass, but below are a few small comments so far.

I don't know Windows well enough to know the implications, but should the below file have some sort of warning about not doing that for production/shared systems, only for dedicated test instances?

+++ b/src/tools/ci/windows_write_cache.ps1
@@ -0,0 +1,20 @@
+# Define the write cache to be power protected. This reduces the rate of cache
+# flushes, which seems to help metadata heavy workloads on NTFS. We're just
+# testing here anyway, so ...
+#
+# Let's do so for all disks, this could be useful beyond cirrus-ci.

One thing in 0010 caught my eye, and while not introduced in this patchset it might be of interest here. In the below hunks we loop X ticks around system(psql), with the loop assuming the server can come up really quickly and sleeping if it doesn't. On my systems I always reach the pg_usleep after failing the check, but if I reverse the check such that it first sleeps and then checks, I only need to check once instead of twice.

@@ -2499,7 +2502,7 @@ regression_main(int argc, char *argv[],
 	else
 		wait_seconds = 60;
 
-	for (i = 0; i < wait_seconds; i++)
+	for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
 	{
 		/* Done if psql succeeds */
 		fflush(NULL);
@@ -2519,7 +2522,7 @@ regression_main(int argc, char *argv[],
 					 outputdir);
 		}
 
-		pg_usleep(1000000L);
+		pg_usleep(1000000L / WAITS_PER_SEC);
 	}
 	if (i >= wait_seconds)
 	{

It's a micro-optimization, but if we're changing things here to chase cycles it might perhaps be worth doing?

-- 
Daniel Gustafsson
Hi,

On 2023-08-23 14:48:26 +0200, Daniel Gustafsson wrote:
> > On 23 Aug 2023, at 08:58, Andres Freund <andres@anarazel.de> wrote:
> > I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> > August. I'll be online intermittently though, if there are issues, I can react
> > (very limited connectivity for middday Aug 29th - midday Aug 31th though). I'd
> > appreciate a quick review or two.
>
> I've been reading over these and the thread, and while not within my area of
> expertise, nothing really sticks out.

Thanks!

> I'll do another pass, but below are a few small comments so far.
>
> I don't know Windows to know the implications, but should the below file have
> some sort of warning about not doing that for production/shared systems, only
> for dedicated test instances?

Ah, I should have explained that: I'm not planning to apply
- regress: Check for postgres startup completion more often
- ci: windows: Disabling write cache flushing during test
right now. Compared to the other patches the wins are much smaller and/or more work is needed to make them good.

I think it might be worth going for
- ci: switch tasks to debugoptimized build
because that provides a fair bit of gain. But it might be more hurtful than helpful due to costing more when ccache doesn't work...

> +++ b/src/tools/ci/windows_write_cache.ps1
> @@ -0,0 +1,20 @@
> +# Define the write cache to be power protected. This reduces the rate of cache
> +# flushes, which seems to help metadata heavy workloads on NTFS. We're just
> +# testing here anyway, so ...
> +#
> +# Let's do so for all disks, this could be useful beyond cirrus-ci.
>
> One thing in 0010 caught my eye, and while not introduced in this patchset it
> might be of interest here. In the below hunks we loop X ticks around
> system(psql), with the loop assuming the server can come up really quickly and
> sleeping if it doesn't. On my systems I always reach the pg_usleep after
> failing the check, but if I reverse the check such it first sleeps and then
> checks I only need to check once instead of twice.

I think there's more effective ways to make this cheaper. The basic thing would be to use libpq instead of forking off psql to make a connection check.

Medium term, I think we should invent a way for pg_ctl and other tooling (including pg_regress) to wait for the service to come up. E.g. having a named pipe that postmaster opens once the server is up, which should allow multiple clients to use select/epoll/... to wait for it without looping.

ISTM making pg_regress use libpq w/ PQping() should be a pretty simple patch? The non-polling approach obviously is even better, but also requires more thought (and documentation and ...).

> @@ -2499,7 +2502,7 @@ regression_main(int argc, char *argv[],
>  	else
>  		wait_seconds = 60;
> 
> -	for (i = 0; i < wait_seconds; i++)
> +	for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
>  	{
>  		/* Done if psql succeeds */
>  		fflush(NULL);
> @@ -2519,7 +2522,7 @@ regression_main(int argc, char *argv[],
>  					 outputdir);
>  		}
> 
> -		pg_usleep(1000000L);
> +		pg_usleep(1000000L / WAITS_PER_SEC);
>  	}
>  	if (i >= wait_seconds)
>  	{
>
> It's a micro-optimization, but if we're changing things here to chase cycles it
> might perhaps be worth doing?

I wouldn't quite call not waiting for 1s for the server to start, when it does so within a few ms, chasing cycles ;). For short tests it's a substantial fraction of the overall runtime...

Greetings,

Andres Freund
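[For illustration only: a minimal sketch of the PQping()-based wait loop being discussed above, not the patch that was later posted. The helper name, the WAITS_PER_SEC value, and the way the connection string is passed are assumptions made for this sketch.]

#include <stdbool.h>
#include <unistd.h>

#include <libpq-fe.h>

#define WAITS_PER_SEC	20		/* assumed tick rate: check 20x per second */

/*
 * Poll the server with PQping() until it accepts connections or the
 * timeout expires.  PQping() only asks whether the server is ready; it
 * needs no credentials and no full connection setup, unlike forking psql.
 */
static bool
wait_for_postmaster(const char *conninfo, int wait_seconds)
{
	for (int i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
	{
		if (PQping(conninfo) == PQPING_OK)
			return true;

		/* sleep 1/WAITS_PER_SEC of a second, then retry */
		usleep(1000000L / WAITS_PER_SEC);
	}
	return false;
}

With an assumed conninfo such as "host=/tmp port=5432 dbname=postgres", this checks many times per second rather than once per second, which is the same granularity change the WAITS_PER_SEC hunks quoted above make for the psql-based check.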
> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-23 14:48:26 +0200, Daniel Gustafsson wrote:
>> I'll do another pass, but below are a few small comments so far.
>>
>> I don't know Windows to know the implications, but should the below file have
>> some sort of warning about not doing that for production/shared systems, only
>> for dedicated test instances?
>
> Ah, I should have explained that: I'm not planning to apply
> - regress: Check for postgres startup completion more often
> - ci: windows: Disabling write cache flushing during test
> right now. Compared to the other patches the wins are much smaller and/or more
> work is needed to make them good.
>
> I think it might be worth going for
> - ci: switch tasks to debugoptimized build
> because that provides a fair bit of gain. But it might be more hurtful than
> helpful due to costing more when ccache doesn't work...

Gotcha.

>> +++ b/src/tools/ci/windows_write_cache.ps1
>> @@ -0,0 +1,20 @@
>> +# Define the write cache to be power protected. This reduces the rate of cache
>> +# flushes, which seems to help metadata heavy workloads on NTFS. We're just
>> +# testing here anyway, so ...
>> +#
>> +# Let's do so for all disks, this could be useful beyond cirrus-ci.
>>
>> One thing in 0010 caught my eye, and while not introduced in this patchset it
>> might be of interest here. In the below hunks we loop X ticks around
>> system(psql), with the loop assuming the server can come up really quickly and
>> sleeping if it doesn't. On my systems I always reach the pg_usleep after
>> failing the check, but if I reverse the check such it first sleeps and then
>> checks I only need to check once instead of twice.
>
> I think there's more effective ways to make this cheaper. The basic thing
> would be to use libpq instead of forking of psql to make a connection
> check.

I had it in my head that not using libpq in pg_regress was a deliberate choice, but I fail to find a reference to it in the archives.

>> It's a micro-optimization, but if we're changing things here to chase cycles it
>> might perhaps be worth doing?
>
> I wouldn't quite call not waiting for 1s for the server to start, when it does
> so within a few ms, chasing cycles ;). For short tests it's a substantial
> fraction of the overall runtime...

Absolutely, I was referring to shifting the sleep before the test to avoid the extra test, not the reduction of the pg_usleep. Reducing the sleep is a clear win.

-- 
Daniel Gustafsson
Hi,

Thanks for the patch!

On Wed, 23 Aug 2023 at 09:58, Andres Freund <andres@anarazel.de> wrote:
> I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> August. I'll be online intermittently though, if there are issues, I can react
> (very limited connectivity for middday Aug 29th - midday Aug 31th though). I'd
> appreciate a quick review or two.

Patch looks good to me besides some minor points.

v3-0004-ci-Prepare-to-make-compute-resources-for-CI-confi.patch:

diff --git a/.cirrus.star b/.cirrus.star

+    """The main function is executed by cirrus-ci after loading .cirrus.yml and can
+    extend the CI definition further.
+
+    As documented in .cirrus.yml, the final CI configuration is composed of
+
+    1) the contents of this file

Instead of the '1) the contents of this file' comment, '1) the contents of the .cirrus.yml file' could be better, since this comment appears in the .cirrus.star file.

+    if repo_config_url != None:
+        print("loading additional configuration from \"{}\"".format(repo_config_url))
+        output += config_from(repo_config_url)
+    else:
+        output += "n# REPO_CI_CONFIG_URL was not set\n"

Possible typo at output += "n# REPO_CI_CONFIG_URL was not set\n".

v3-0008-ci-switch-tasks-to-debugoptimized-build.patch:

Just thinking of possible optimizations: couldn't we create something like 'buildtype: xxx' to override the default buildtype using .cirrus.star? This could be better for PG developers. For sure that could be the subject of another patch.

Regards,
Nazir Bilal Yavuz
Microsoft
Daniel Gustafsson <daniel@yesql.se> writes:
> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
>> I think there's more effective ways to make this cheaper. The basic thing
>> would be to use libpq instead of forking of psql to make a connection
>> check.

> I had it in my head that not using libpq in pg_regress was a deliberate choice,
> but I fail to find a reference to it in the archives.

I have a vague feeling that you are right about that. Perhaps the concern was that under "make installcheck", pg_regress might be using a build-tree copy of libpq rather than the one from the system under test. As long as we're just trying to ping the server, that shouldn't matter too much I think ... unless we hit problems with, say, a different default port number or socket path compiled into one copy vs. the other? That seems like it's probably a "so don't do that" case, though.

			regards, tom lane
> On 23 Aug 2023, at 23:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
>>> I think there's more effective ways to make this cheaper. The basic thing
>>> would be to use libpq instead of forking of psql to make a connection
>>> check.
>
>> I had it in my head that not using libpq in pg_regress was a deliberate choice,
>> but I fail to find a reference to it in the archives.
>
> I have a vague feeling that you are right about that. Perhaps the
> concern was that under "make installcheck", pg_regress might be
> using a build-tree copy of libpq rather than the one from the
> system under test. As long as we're just trying to ping the server,
> that shouldn't matter too much I think ... unless we hit problems
> with, say, a different default port number or socket path compiled into
> one copy vs. the other? That seems like it's probably a "so don't
> do that" case, though.

Ah yes, that does ring a familiar bell. I agree that using it for pinging the server should be safe either way, but we should document the use-with-caution in pg_regress.c if/when we go down that path. I'll take a stab at changing the psql retry loop for pinging tomorrow to see what it would look like.

-- 
Daniel Gustafsson
Hi,

On 2023-08-23 17:02:51 -0400, Tom Lane wrote:
> Daniel Gustafsson <daniel@yesql.se> writes:
> > On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
> >> I think there's more effective ways to make this cheaper. The basic thing
> >> would be to use libpq instead of forking of psql to make a connection
> >> check.
>
> > I had it in my head that not using libpq in pg_regress was a deliberate choice,
> > but I fail to find a reference to it in the archives.
>
> I have a vague feeling that you are right about that. Perhaps the
> concern was that under "make installcheck", pg_regress might be
> using a build-tree copy of libpq rather than the one from the
> system under test. As long as we're just trying to ping the server,
> that shouldn't matter too much I think

Or perhaps the opposite? That an installcheck pg_regress run might use the system libpq, which doesn't have the symbols, or such? Either way, with a function like PQping(), which has existed for well beyond the supported branches, that shouldn't be an issue?

> ... unless we hit problems with, say, a different default port number or
> socket path compiled into one copy vs. the other? That seems like it's
> probably a "so don't do that" case, though.

If we were to find such a case, it seems we could just add whatever missing parameter to the connection string? I think we would likely already hit such problems though, the psql started by an installcheck pg_regress might use the system libpq, I think?

Greetings,

Andres Freund
Hi,

On 2023-08-23 23:55:15 +0300, Nazir Bilal Yavuz wrote:
> On Wed, 23 Aug 2023 at 09:58, Andres Freund <andres@anarazel.de> wrote:
> > I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> > August. I'll be online intermittently though, if there are issues, I can react
> > (very limited connectivity for middday Aug 29th - midday Aug 31th though). I'd
> > appreciate a quick review or two.
>
> Patch looks good to me besides some minor points.

Thanks for looking!

> v3-0004-ci-Prepare-to-make-compute-resources-for-CI-confi.patch:
> diff --git a/.cirrus.star b/.cirrus.star
> +    """The main function is executed by cirrus-ci after loading .cirrus.yml and can
> +    extend the CI definition further.
> +
> +    As documented in .cirrus.yml, the final CI configuration is composed of
> +
> +    1) the contents of this file
>
> Instead of '1) the contents of this file' comment, '1) the contents
> of .cirrus.yml file' could be better since this comment appears in
> .cirrus.star file.

Good catch.

> +    if repo_config_url != None:
> +        print("loading additional configuration from \"{}\"".format(repo_config_url))
> +        output += config_from(repo_config_url)
> +    else:
> +        output += "n# REPO_CI_CONFIG_URL was not set\n"
>
> Possible typo at output += "n# REPO_CI_CONFIG_URL was not set\n".

Fixed.

> v3-0008-ci-switch-tasks-to-debugoptimized-build.patch:
> Just thinking of possible optimizations and thought can't we create
> something like 'buildtype: xxx' to override default buildtype using
> .cirrus.star? This could be better for PG developers. For sure that
> could be the subject of another patch.

We could, but I'm not sure what the use would be?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-08-23 17:02:51 -0400, Tom Lane wrote:
>> ... unless we hit problems with, say, a different default port number or
>> socket path compiled into one copy vs. the other? That seems like it's
>> probably a "so don't do that" case, though.

> If we were to find such a case, it seems we could just add whatever missing
> parameter to the connection string? I think we would likely already hit such
> problems though, the psql started by an installcheck pg_regress might use the
> system libpq, I think?

The trouble with that approach is that in "make installcheck", we don't really want to assume we know what the installed libpq's default connection parameters are. So we don't explicitly know where that libpq will connect.

As I said, we might be able to start treating installed-libpq-not-compatible-with-build as a "don't do it" case. Another idea is to try to ensure that pg_regress uses the same libpq that the psql-under-test does; but I'm not sure how to implement that.

			regards, tom lane
Hi,

On 2023-08-23 17:55:53 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-08-23 17:02:51 -0400, Tom Lane wrote:
> >> ... unless we hit problems with, say, a different default port number or
> >> socket path compiled into one copy vs. the other? That seems like it's
> >> probably a "so don't do that" case, though.
>
> > If we were to find such a case, it seems we could just add whatever missing
> > parameter to the connection string? I think we would likely already hit such
> > problems though, the psql started by an installcheck pg_regress might use the
> > system libpq, I think?
>
> The trouble with that approach is that in "make installcheck", we
> don't really want to assume we know what the installed libpq's default
> connection parameters are. So we don't explicitly know where that
> libpq will connect.

Stepping back: I don't think installcheck matters for the concrete use of libpq we're discussing - the only time we wait for server startup is the non-installcheck case.

There are other potential uses for libpq in pg_regress though - I'd e.g. like to have a "monitoring" session open, which we could use to detect that the server crashed (by waiting for the FD to become invalid). Where the connection default issue could matter more?

I was wondering if we could create an unambiguous connection info, but that seems like it'd be hard to do, without creating cross version hazards.

> As I said, we might be able to start treating installed-libpq-not-
> compatible-with-build as a "don't do it" case. Another idea is to try
> to ensure that pg_regress uses the same libpq that the psql-under-test
> does; but I'm not sure how to implement that.

I don't think that's likely to work, psql could use a libpq with a different soversion. We could dlopen() libpq, etc, but that seems way too complicated.

What's the reason we don't force psql to come from the same build as pg_regress?

Greetings,

Andres Freund
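[For illustration only: a minimal, hedged sketch of the "monitoring session" idea described above, built from plain libpq calls (PQsocket(), select(), PQconsumeInput()). The function name and structure are assumptions for the sketch, not part of any posted patch.]

#include <stdbool.h>
#include <sys/select.h>

#include <libpq-fe.h>

/*
 * Block until the monitored connection is lost, e.g. because the server
 * crashed.  When the backend goes away, the socket becomes readable and
 * PQconsumeInput() fails / the connection status turns CONNECTION_BAD.
 */
static bool
wait_for_server_exit(PGconn *monitor_conn)
{
	int			sock = PQsocket(monitor_conn);

	if (sock < 0)
		return true;			/* connection already gone */

	for (;;)
	{
		fd_set		rfds;

		FD_ZERO(&rfds);
		FD_SET(sock, &rfds);

		/* wait until something happens on the monitoring socket */
		if (select(sock + 1, &rfds, NULL, NULL, NULL) < 0)
			continue;			/* e.g. EINTR; just retry */

		/* data (or EOF) is available; let libpq inspect it */
		if (!PQconsumeInput(monitor_conn) ||
			PQstatus(monitor_conn) == CONNECTION_BAD)
			return true;
	}
}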
Andres Freund <andres@anarazel.de> writes:
> On 2023-08-23 17:55:53 -0400, Tom Lane wrote:
>> The trouble with that approach is that in "make installcheck", we
>> don't really want to assume we know what the installed libpq's default
>> connection parameters are. So we don't explicitly know where that
>> libpq will connect.

> Stepping back: I don't think installcheck matters for the concrete use of
> libpq we're discussing - the only time we wait for server startup is the
> non-installcheck case.

Oh, that's an excellent point. So for the immediately proposed use-case, there's no issue. (We don't have a mode where we try to start a server using already-installed executables.)

> There are other potential uses for libpq in pg_regress though - I'd e.g. like
> to have a "monitoring" session open, which we could use to detect that the
> server crashed (by waiting for the FD to become invalid). Where the
> connection default issue could matter more?

Meh. I don't find that idea compelling enough to justify adding restrictions on what test scenarios will work. It's seldom hard to tell from the test output whether the server crashed.

> I was wondering if we could create an unambiguous connection info, but that
> seems like it'd be hard to do, without creating cross version hazards.

Hmm, we don't expect the regression test suite to work against other server versions, so maybe that could be made to work --- that is, we could run the psql under test and get a full set of connection parameters out of it? But I'm still not finding this worth the trouble.

> What's the reason we don't force psql to come from the same build as
> pg_regress?

Because the point of installcheck is to check the installed binaries --- including the installed psql and libpq.

(Thinks for a bit...) Maybe we should add pg_regress to the installed fileset, and use that copy not the in-tree copy for installcheck? Then we could assume it's using the same libpq as psql. IIRC there have already been suggestions to do that for the benefit of PGXS testing.

			regards, tom lane
Hi,

On 2023-08-23 18:32:26 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > There are other potential uses for libpq in pg_regress though - I'd e.g. like
> > to have a "monitoring" session open, which we could use to detect that the
> > server crashed (by waiting for the FD to be become invalid). Where the
> > connection default issue could matter more?
>
> Meh. I don't find that idea compelling enough to justify adding
> restrictions on what test scenarios will work. It's seldom hard to
> tell from the test output whether the server crashed.

I find it pretty painful to wade through a several-megabyte regression.diffs to find the cause of a crash. I think we ought to use restart_after_crash=false, since after a crash there's no hope for the tests to succeed, but even in that case, we end up with a lot of pointless contents in regression.diffs. If we instead realized that we shouldn't start further tests, we'd limit that by a fair bit.

Greetings,

Andres Freund
On 24.08.23 00:56, Andres Freund wrote:
> Hi,
>
> On 2023-08-23 18:32:26 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> There are other potential uses for libpq in pg_regress though - I'd e.g. like
>>> to have a "monitoring" session open, which we could use to detect that the
>>> server crashed (by waiting for the FD to be become invalid). Where the
>>> connection default issue could matter more?
>>
>> Meh. I don't find that idea compelling enough to justify adding
>> restrictions on what test scenarios will work. It's seldom hard to
>> tell from the test output whether the server crashed.
>
> I find it pretty painful to wade through a several-megabyte regression.diffs
> to find the cause of a crash. I think we ought to use
> restart_after_crash=false, since after a crash there's no hope for the tests
> to succeed, but even in that case, we end up with a lot of pointless contents
> in regression.diffs. If we instead realized that we shouldn't start further
> tests, we'd limit that by a fair bit.

I once coded it up so that if the server crashes during a test, it would wait until it recovers before running the next test. I found that useful. I agree the current behavior is not useful in any case.
Hi,

On Thu, 24 Aug 2023 at 00:48, Andres Freund <andres@anarazel.de> wrote:
> > v3-0008-ci-switch-tasks-to-debugoptimized-build.patch:
> > Just thinking of possible optimizations and thought can't we create
> > something like 'buildtype: xxx' to override default buildtype using
> > .cirrus.star? This could be better for PG developers. For sure that
> > could be the subject of another patch.
>
> We could, but I'm not sure what the use would be?

My main idea behind this was that PG developers could choose 'buildtype: debug' while working on their patches, and that optimization makes it easier to choose the buildtype.

Regards,
Nazir Bilal Yavuz
Microsoft
Hi,

On 2023-08-22 23:58:33 -0700, Andres Freund wrote:
> To make that possible, we need to make the compute resources for CI
> configurable on a per-repository basis. After experimenting with a bunch of
> ways to do that, I got stuck on that for a while. But since today we have
> sufficient macos runners for cfbot available, so... I think the approach I
> finally settled on is decent, although not great. It's described in the "main"
> commit message:
> [...]
>     ci: Prepare to make compute resources for CI configurable
> I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> August. I'll be online intermittently though, if there are issues, I can react
> (very limited connectivity for middday Aug 29th - midday Aug 31th though). I'd
> appreciate a quick review or two.

I pushed this yesterday, and then utilized it to make cfbot use

1) macos persistent workers, hosted by two community members
2) our own GCP account for all the other operating systems

There were a few issues initially (needed to change how to run multiple jobs on a single mac, and it looks like there were some issues with macos going to sleep while processing jobs...). But it now seems to be chugging along ok.

One of the nice things is that with our own compute we also control how much storage can be used, making things like generating docs or code coverage as part of cfbot more realistic. And we could enable mingw by default when run as part of cfbot...

Greetings,

Andres Freund
> On 23 Aug 2023, at 23:12, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 23 Aug 2023, at 23:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Daniel Gustafsson <daniel@yesql.se> writes:
>>> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
>>>> I think there's more effective ways to make this cheaper. The basic thing
>>>> would be to use libpq instead of forking of psql to make a connection
>>>> check.
>>
>>> I had it in my head that not using libpq in pg_regress was a deliberate choice,
>>> but I fail to find a reference to it in the archives.
>>
>> I have a vague feeling that you are right about that. Perhaps the
>> concern was that under "make installcheck", pg_regress might be
>> using a build-tree copy of libpq rather than the one from the
>> system under test. As long as we're just trying to ping the server,
>> that shouldn't matter too much I think ... unless we hit problems
>> with, say, a different default port number or socket path compiled into
>> one copy vs. the other? That seems like it's probably a "so don't
>> do that" case, though.
>
> Ah yes, that does ring a familiar bell. I agree that using it for pinging the
> server should be safe either way, but we should document the use-with-caution
> in pg_regress.c if/when we go down that path. I'll take a stab at changing the
> psql retry loop for pinging tomorrow to see what it would look like.

Attached is a patch with a quick PoC for using PQping instead of using psql for connection checks in pg_regress. In order to see performance it also includes a diag output for "Time to first test" which contains all setup costs. This might not make it into a commit but it was quite helpful in hacking so I left it in for now.

The patch incorporates Andres' idea for finer granularity of checks by checking TICKS times per second rather than once per second; it also shifts the pg_usleep around to require just one ping in most cases compared to two today. On my relatively tired laptop this speeds up pg_regress setup by 100+ms, with much bigger wins on Windows in the CI.

While it does add a dependency on libpq, I think it's a fairly decent price to pay for running tests faster.

-- 
Daniel Gustafsson
Attachment
> On 28 Aug 2023, at 14:32, Daniel Gustafsson <daniel@yesql.se> wrote:
> Attached is a patch with a quick PoC for using PQping instead of using psql for
> connection checks in pg_regress.

The attached v2 fixes a silly mistake which led to a compiler warning.

-- 
Daniel Gustafsson
Attachment
Hi,

On 2023-08-30 10:57:10 +0200, Daniel Gustafsson wrote:
> > On 28 Aug 2023, at 14:32, Daniel Gustafsson <daniel@yesql.se> wrote:
> > Attached is a patch with a quick PoC for using PQping instead of using psql for
> > connection checks in pg_regress.
>
> The attached v2 fixes a silly mistake which led to a compiler warning.

Still seems like a good idea to me. To see what impact it has, I measured the time running the pg_regress tests that take less than 6s on my machine - I excluded the slower ones (like the main regression tests) because they'd hide any overall difference.

ninja && m test --suite setup --no-rebuild && tests=$(m test --no-rebuild --list | grep -E '/regress' | grep -vE '(regress|postgres_fdw|test_integerset|intarray|amcheck|test_decoding)/regress' | cut -d' ' -f 3) && time m test --no-rebuild $tests

Time for:

master:
  cassert:    real 0m5.265s   user 0m8.422s   sys 0m8.381s
  optimized:  real 0m4.926s   user 0m6.356s   sys 0m8.263s

my patch (probing every 100ms with psql):
  cassert:    real 0m3.465s   user 0m8.827s   sys 0m8.579s
  optimized:  real 0m2.932s   user 0m6.596s   sys 0m8.458s

Daniel's (probing every 50ms with PQping()):
  cassert:    real 0m3.347s   user 0m8.373s   sys 0m8.354s
  optimized:  real 0m2.527s   user 0m6.156s   sys 0m8.315s

My patch increased user/sys time a bit (likely due to a higher number of futile psql forks), but Daniel's doesn't. And it does show a nice overall wall clock time saving.

Greetings,

Andres Freund
> On 13 Sep 2023, at 01:49, Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-30 10:57:10 +0200, Daniel Gustafsson wrote:
>>> On 28 Aug 2023, at 14:32, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>>> Attached is a patch with a quick PoC for using PQPing instead of using psql for
>>> connection checks in pg_regress.
>>
>> The attached v2 fixes a silly mistake which led to a compiler warning.
>
> Still seems like a good idea to me. To see what impact it has, I measured the
> time running the pg_regress tests that take less than 6s on my machine - I
> excluded the slower ones (like the main regression tests) because they'd hide
> any overall difference.
>
> My patch increased user/sys time a bit (likely due to a higher number of
> futile psql forks), but Daniel's doesn't. And it does show a nice overall wall
> clock time saving.

While it does add a lib dependency I think it's worth doing, so I propose we go ahead with this for master.

-- 
Daniel Gustafsson
> On 13 Sep 2023, at 01:49, Andres Freund <andres@anarazel.de> wrote:
> My patch increased user/sys time a bit (likely due to a higher number of
> futile psql forks), but Daniel's doesn't. And it does show a nice overall wall
> clock time saving.

I went ahead and applied this on master, thanks for review! Now to see if there will be any noticeable difference in resource usage.

-- 
Daniel Gustafsson
Daniel Gustafsson <daniel@yesql.se> writes:
> I went ahead and applied this on master, thanks for review! Now to see if
> there will be any noticeable difference in resource usage.

I think that tools like Coverity are likely to whine about your use of sprintf instead of snprintf. Sure, it's perfectly safe, but that won't stop the no-sprintf-ever crowd from complaining.

			regards, tom lane
> On 24 Oct 2023, at 22:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> I went ahead and applied this on master, thanks for review! Now to see if
>> there will be any noticeable difference in resource usage.
>
> I think that tools like Coverity are likely to whine about your
> use of sprintf instead of snprintf. Sure, it's perfectly safe,
> but that won't stop the no-sprintf-ever crowd from complaining.

Fair point, that's probably quite likely to happen. I can apply an snprintf() conversion change like this in the two places introduced by this:

-   sprintf(s, "%d", port);
+   snprintf(s, sizeof(s), "%d", port);

-- 
Daniel Gustafsson