Thread: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
Hi,

As some of you might have seen when running CI, cirrus-ci is restricting how much CI cycles everyone can use for free (announcement at [1]). This takes effect September 1st.

This obviously has consequences both for individual users of CI as well as cfbot.

The first thing I think we should do is to lower the cost of CI. One thing I had not entirely realized previously is that macos CI is by far the most expensive CI to provide. That's not just the case with cirrus-ci, but also with other providers. See the series of patches described later in the email.

To me, the situation for cfbot is different than the one for individual users.

IMO, for the individual user case it's important to use CI for "free", without a whole lot of complexity. Which imo rules out approaches like providing $cloud_provider compute accounts, that's too much setup work. With the improvements detailed below, cirrus' free CI would last about ~65 runs / month.

For cfbot I hope we can find funding to pay for compute to use for CI. The by far most expensive bit is macos, to a significant degree due to macos licensing terms not allowing more than 2 VMs on a physical host :(.

The reasons we chose cirrus-ci were:

a) Ability to use full VMs, rather than a pre-selected set of VMs, which allows us to test a larger number

b) Ability to link to log files, without requiring an account. E.g. github actions doesn't allow viewing logs unless logged in.

c) Amount of compute available.

The set of free CI providers has shrunk since we chose cirrus, as have the "free" resources provided. I started a, quite incomplete as of now, wiki page at [4].

Potential paths forward for individual CI:

- migrate wholesale to another CI provider

- split CI tasks across different CI providers, rely on github et al displaying the CI status for different platforms

- give up

Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.

Opinions as to what to do?

The attached series of patches:

1) Makes startup of macos instances faster, using more efficient caching of the required packages. Also submitted as [2].

2) Introduces a template initdb that's reused during the tests. Also submitted as [3].

3) Remove use of -DRANDOMIZE_ALLOCATED_MEMORY from macos tasks. It's expensive, and CI also uses asan on linux, so I don't think it's really needed.

4) Switch tasks to use debugoptimized builds. Previously many tasks used -Og, to get decent backtraces etc., but the amount of CPU burned that way is too large. One issue with that is that use of ccache becomes much more crucial; uncached build times do significantly increase.

5) Move use of -Dsegsize_blocks=6 from macos to linux. Macos is expensive, and -Dsegsize_blocks=6 slows things down. Alternatively we could stop covering segsize_blocks for both meson and autoconf. It does affect runtime on linux as well.

6) Disable write cache flushes on windows. It's a bit ugly to do this without using the UI... Shaves off about 30s from the tests.

7) pg_regress only checked once a second whether postgres started up, but it's usually much faster. Use pg_ctl's logic. It might be worth replacing the use of psql with directly using libpq in pg_regress instead; it looks like the overhead of repeatedly starting psql is noticeable.
FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly like the following (depends on caching etc):

task                       costs in credits
linux-sanity:              0.01
linux-compiler-warnings:   0.05
linux-meson:               0.07
freebsd:                   0.08
linux-autoconf:            0.09
windows:                   0.18
macos:                     0.28

total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month

Greetings,

Andres Freund

[1] https://cirrus-ci.org/blog/2023/07/17/limiting-free-usage-of-cirrus-ci/
[2] https://www.postgresql.org/message-id/20230805202539.r3umyamsnctysdc7%40awork3.anarazel.de
[3] https://postgr.es/m/20220120021859.3zpsfqn4z7ob7afz@alap3.anarazel.de
Attachment
- v1-0001-ci-macos-used-cached-macports-install.patch
- v1-0002-Use-template-initdb-in-tests.patch
- v1-0003-ci-macos-Remove-use-of-DRANDOMIZE_ALLOCATED_MEMOR.patch
- v1-0004-ci-switch-tasks-to-debugoptimized-build.patch
- v1-0005-ci-Move-use-of-Dsegsize_blocks-6-from-macos-to-li.patch
- v1-0006-ci-windows-Disabling-write-cache-flushing-during-.patch
- v1-0007-regress-Check-for-postgres-startup-completion-mor.patch
- v1-0008-ci-Don-t-specify-amount-of-memory.patch
Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:
> The set of free CI providers has shrunk since we chose cirrus, as have the
> "free" resources provided. I started, quite incomplete as of now, wiki page at
> [4].

Oops, as Thomas just noticed, I left off that link:

[4] https://wiki.postgresql.org/wiki/CI_Providers

Greetings,

Andres Freund
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
From: Heikki Linnakangas
On 08/08/2023 05:15, Andres Freund wrote:
> IMO, for the individual user case it's important to use CI for "free", without
> a whole lot of complexity. Which imo rules approaches like providing
> $cloud_provider compute accounts, that's too much setup work.

+1

> With the improvements detailed below, cirrus' free CI would last
> about ~65 runs / month.

I think that's plenty.

> For cfbot I hope we can find funding to pay for compute to use for CI.

+1

> Potential paths forward for cfbot, in addition to the above:
>
> - Pay for compute / ask the various cloud providers to grant us compute
>   credits. At least some of the cloud providers can be used via cirrus-ci.
>
> - Host (some) CI runners ourselves. Particularly with macos and windows, that
>   could provide significant savings.
>
> - Build our own system, using buildbot, jenkins or whatnot.
>
> Opinions as to what to do?

The resources for running our own system aren't free either. I'm sure we can get sponsors for the cirrus-ci credits, or use donations. I have been quite happy with Cirrus CI overall.

> The attached series of patches:

All of this makes sense to me, although I don't use macos myself.

> 5) Move use of -Dsegsize_blocks=6 from macos to linux
>
>    Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
>    could stop covering both meson and autoconf segsize_blocks. It does affect
>    runtime on linux as well.

Could we have a comment somewhere on why we use -Dsegsize_blocks on these particular CI runs? It seems pretty random. I guess the idea is to have one autoconf task and one meson task with that option, to check that the autoconf/meson option works?

> 6) Disable write cache flushes on windows
>
>    It's a bit ugly to do this without using the UI... Shaves off about 30s
>    from the tests.

A brief comment would be nice: "We don't care about persistence over hard crashes in the CI, so disable write cache flushes to speed it up."

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On 08.08.23 04:15, Andres Freund wrote:
> Potential paths forward for individual CI:
>
> - migrate wholesale to another CI provider
>
> - split CI tasks across different CI providers, rely on github et al
>   displaying the CI status for different platforms
>
> - give up

With the proposed optimizations, it seems you can still do a fair amount of work under the free plan.

> Potential paths forward for cfbot, in addition to the above:
>
> - Pay for compute / ask the various cloud providers to grant us compute
>   credits. At least some of the cloud providers can be used via cirrus-ci.
>
> - Host (some) CI runners ourselves. Particularly with macos and windows, that
>   could provide significant savings.
>
> - Build our own system, using buildbot, jenkins or whatnot.

I think we should use the "compute credits" plan from Cirrus CI. It should be possible to estimate the costs for that. Money is available, I think.
Hi,

On 2023-08-08 16:28:49 +0200, Peter Eisentraut wrote:
> On 08.08.23 04:15, Andres Freund wrote:
> > Potential paths forward for cfbot, in addition to the above:
> >
> > - Pay for compute / ask the various cloud providers to grant us compute
> >   credits. At least some of the cloud providers can be used via cirrus-ci.
> >
> > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> >   could provide significant savings.
> >
> > - Build our own system, using buildbot, jenkins or whatnot.
>
> I think we should use the "compute credits" plan from Cirrus CI. It should
> be possible to estimate the costs for that. Money is available, I think.

Unfortunately just doing that seems like it would end up considerably on the too expensive side. Here are the stats for last month's cfbot runtimes (provided by Thomas):

                   task_name                    |    sum
------------------------------------------------+------------
 FreeBSD - 13 - Meson                           | 1017:56:09
 Windows - Server 2019, MinGW64 - Meson         | 00:00:00
 SanityCheck                                    | 76:48:41
 macOS - Ventura - Meson                        | 873:12:43
 Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
 Linux - Debian Bullseye - Autoconf             | 830:17:26
 Linux - Debian Bullseye - Meson                | 860:37:21
 CompilerWarnings                               | 935:30:35
(8 rows)

If I did the math right, that's about 7000 credits (and 1 credit costs 1 USD).

task                       costs in credits
linux-sanity:              55.30
linux-autoconf:            598.04
linux-meson:               619.40
linux-compiler-warnings:   674.28
freebsd:                   732.24
windows:                   1201.09
macos:                     3143.52

Now, those times are before optimizing test runtime. And besides optimizing the tasks, we can also avoid running tests for docs patches etc., and optimize cfbot to schedule a bit better. But still, the costs don't look realistic to me.

If instead we were to use our own GCP account, it's a lot less. t2d-standard-4 instances, which are faster than what we use right now, cost $0.168984 / hour as "normal" instances and $0.026764 as "spot" instances right now [1]. Windows VMs are considerably more expensive due to licensing - 0.184$/h in addition.

Assuming spot instances, linux+freebsd tasks would cost ~100 USD / month (maybe 10-20% more in reality, due to a) spot instances getting terminated requiring retries and b) disks). Windows would be ~255 USD / month (same retries caveats).

Given the cost of macos, it seems like it'd be by far the most affordable to just buy 1-2 mac minis (2x ~660 USD) and stick them in a shelf somewhere, as persistent runners. Cirrus has builtin macos virtualization support - but can only host two VMs on each mac, due to macos licensing restrictions. A single mac mini would suffice to keep up with our unoptimized monthly runtime (although there likely would be some overhead).

Greetings,

Andres Freund

[1] https://cloud.google.com/compute/all-pricing
On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> like the following (depends on caching etc):
>
> task                       costs in credits
> linux-sanity:              0.01
> linux-compiler-warnings:   0.05
> linux-meson:               0.07
> freebsd:                   0.08
> linux-autoconf:            0.09
> windows:                   0.18
> macos:                     0.28
> total task runtime is 40.8
> cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month

I am not in the loop on the autotools vs meson stuff. How much longer do we anticipate keeping autotools around? Seems like it could be a good opportunity to reduce some CI usage if autotools were finally dropped, but I know there are still outstanding tasks to complete.

Back of the napkin math says autotools is about 12% of the credit cost, though I haven't looked to see if linux-meson and linux-autotools are 1:1.

-- 
Tristan Partin
Neon (https://neon.tech)
Hi,

On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:
> On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> > FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> > like the following (depends on caching etc):
> >
> > task                       costs in credits
> > linux-sanity:              0.01
> > linux-compiler-warnings:   0.05
> > linux-meson:               0.07
> > freebsd:                   0.08
> > linux-autoconf:            0.09
> > windows:                   0.18
> > macos:                     0.28
> > total task runtime is 40.8
> > cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
>
> I am not in the loop on the autotools vs meson stuff. How much longer do we
> anticipate keeping autotools around?

I think it depends in what fashion. We've been talking about supporting building out-of-tree modules with "pgxs" for at least a 5 year support window. But the replacement isn't yet finished [1], so that clock hasn't yet started ticking.

> Seems like it could be a good opportunity to reduce some CI usage if
> autotools were finally dropped, but I know there are still outstanding tasks
> to complete.
>
> Back of the napkin math says autotools is about 12% of the credit cost,
> though I haven't looked to see if linux-meson and linux-autotools are 1:1.

The autoconf task is actually doing quite useful stuff right now, leaving the use of configure aside, as it builds with address sanitizer. Without that it'd be a lot faster. But we'd lose, imo quite important, coverage. The tests would run a bit faster with meson, but it'd be overall a difference on the margins.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/tree/meson-pkgconfig
On Tue Aug 8, 2023 at 10:38 AM CDT, Andres Freund wrote:
> Hi,
>
> On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:
> > On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> > > FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> > > like the following (depends on caching etc):
> > >
> > > task                       costs in credits
> > > linux-sanity:              0.01
> > > linux-compiler-warnings:   0.05
> > > linux-meson:               0.07
> > > freebsd:                   0.08
> > > linux-autoconf:            0.09
> > > windows:                   0.18
> > > macos:                     0.28
> > > total task runtime is 40.8
> > > cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
> >
> > I am not in the loop on the autotools vs meson stuff. How much longer do we
> > anticipate keeping autotools around?
>
> I think it depends in what fashion. We've been talking about supporting
> building out-of-tree modules with "pgxs" for at least a 5 year support
> window. But the replacement isn't yet finished [1], so that clock hasn't yet
> started ticking.
>
> > Seems like it could be a good opportunity to reduce some CI usage if
> > autotools were finally dropped, but I know there are still outstanding tasks
> > to complete.
> >
> > Back of the napkin math says autotools is about 12% of the credit cost,
> > though I haven't looked to see if linux-meson and linux-autotools are 1:1.
>
> The autoconf task is actually doing quite useful stuff right now, leaving the
> use of configure aside, as it builds with address sanitizer. Without that it'd
> be a lot faster. But we'd loose, imo quite important, coverage. The tests
> would run a bit faster with meson, but it'd be overall a difference on the
> margins.
>
> [1] https://github.com/anarazel/postgres/tree/meson-pkgconfig

Makes sense. Please let me know if I can help you out in any way for the v17 development cycle besides what we have already talked about.

-- 
Tristan Partin
Neon (https://neon.tech)
On 2023-Aug-08, Andres Freund wrote:

> Given the cost of macos, it seems like it'd be by far the most of affordable
> to just buy 1-2 mac minis (2x ~660USD) and stick them in a shelf somewhere, as
> persistent runners. Cirrus has builtin macos virtualization support - but can
> only host two VMs on each mac, due to macos licensing restrictions. A single
> mac mini would suffice to keep up with our unoptimized monthly runtime
> (although there likely would be some overhead).

If using persistent workers is an option, maybe we should explore that. I think we could move all or some of the Linux - Debian builds to hardware that we already have in shelves (depending on how much compute power is really needed.)

I think using other OSes is more difficult, mostly because I doubt we want to deal with licenses; but even FreeBSD might not be a realistic option, at least not in the short term. Still,

>                    task_name                    |    sum
> ------------------------------------------------+------------
>  FreeBSD - 13 - Meson                           | 1017:56:09
>  Windows - Server 2019, MinGW64 - Meson         | 00:00:00
>  SanityCheck                                    | 76:48:41
>  macOS - Ventura - Meson                        | 873:12:43
>  Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
>  Linux - Debian Bullseye - Autoconf             | 830:17:26
>  Linux - Debian Bullseye - Meson                | 860:37:21
>  CompilerWarnings                               | 935:30:35
> (8 rows)
>

moving just Debian, that might alleviate 76+830+860+935 hours from the Cirrus infra, which is ~46%. Not bad.

(How come Windows - Meson reports allballs?)

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso" (Carlos Caszeli)
Hi,

On 2023-08-08 18:34:58 +0200, Alvaro Herrera wrote:
> On 2023-Aug-08, Andres Freund wrote:
>
> > Given the cost of macos, it seems like it'd be by far the most of affordable
> > to just buy 1-2 mac minis (2x ~660USD) and stick them in a shelf somewhere, as
> > persistent runners. Cirrus has builtin macos virtualization support - but can
> > only host two VMs on each mac, due to macos licensing restrictions. A single
> > mac mini would suffice to keep up with our unoptimized monthly runtime
> > (although there likely would be some overhead).
>
> If using persistent workers is an option, maybe we should explore that.
> I think we could move all or some of the Linux - Debian builds to
> hardware that we already have in shelves (depending on how much compute
> power is really needed.)

(76+830+860+935)/((365/12)*24) = 3.7

3.7 instances with 4 "vcores" are busy 100% of the time. So we'd need at least ~16 cpu threads - I think cirrus sometimes uses instances that disable HT, so it'd perhaps be 16 cores actually.

> I think using other OSes is more difficult, mostly because I doubt we
> want to deal with licenses; but even FreeBSD might not be a realistic
> option, at least not in the short term.

They can be VMs, so that shouldn't be a big issue.

> >                    task_name                    |    sum
> > ------------------------------------------------+------------
> >  FreeBSD - 13 - Meson                           | 1017:56:09
> >  Windows - Server 2019, MinGW64 - Meson         | 00:00:00
> >  SanityCheck                                    | 76:48:41
> >  macOS - Ventura - Meson                        | 873:12:43
> >  Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
> >  Linux - Debian Bullseye - Autoconf             | 830:17:26
> >  Linux - Debian Bullseye - Meson                | 860:37:21
> >  CompilerWarnings                               | 935:30:35
> > (8 rows)
>
> moving just Debian, that might alleviate 76+830+860+935 hours from the
> Cirrus infra, which is ~46%. Not bad.
>
> (How come Windows - Meson reports allballs?)

It's mingw64, which we've marked as "manual", because we didn't have the cpu cycles to run it.

Greetings,

Andres Freund
Hi,

On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:
> On 08/08/2023 05:15, Andres Freund wrote:
> > With the improvements detailed below, cirrus' free CI would last
> > about ~65 runs / month.
>
> I think that's plenty.

Not so sure, I would regularly exceed it, I think. But it definitely will suffice for more casual contributors.

> > Potential paths forward for cfbot, in addition to the above:
> >
> > - Pay for compute / ask the various cloud providers to grant us compute
> >   credits. At least some of the cloud providers can be used via cirrus-ci.
> >
> > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> >   could provide significant savings.
> >
> > - Build our own system, using buildbot, jenkins or whatnot.
> >
> > Opinions as to what to do?
>
> The resources for running our own system isn't free either. I'm sure we can
> get sponsors for the cirrus-ci credits, or use donations.

As outlined in my reply to Alvaro, just using credits likely is financially not viable...

> > 5) Move use of -Dsegsize_blocks=6 from macos to linux
> >
> >    Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
> >    could stop covering both meson and autoconf segsize_blocks. It does affect
> >    runtime on linux as well.
>
> Could we have a comment somewhere on why we use -Dsegsize_blocks on these
> particular CI runs? It seems pretty random. I guess the idea is to have one
> autoconf task and one meson task with that option, to check that the
> autoconf/meson option works?

Hm, some of that was in the commit message, but I should have added it to .cirrus.yml as well.

Normally, the "relation segment" code basically has no coverage in our tests, because we (quite reasonably) don't generate tables large enough. We've had plenty of bugs that we didn't notice due to the code not being exercised much. So it seemed useful to add CI coverage, by making the segments very small. I chose the tasks by looking at how long they took at the time, I think, adding the option to the slower ones.

> > 6) Disable write cache flushes on windows
> >
> >    It's a bit ugly to do this without using the UI... Shaves off about 30s
> >    from the tests.
>
> A brief comment would be nice: "We don't care about persistence over hard
> crashes in the CI, so disable write cache flushes to speed it up."

Turns out that patch doesn't work on its own anyway, at least not reliably... I tested it by interactively logging into a windows vm and testing it there. It doesn't actually seem to suffice when run in isolation, because the relevant registry key doesn't yet exist. I haven't yet figured out the magic incantations for adding the missing "intermediary", but I'm getting there...

Greetings,

Andres Freund
On Tue, Aug 8, 2023 at 9:26 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:
> > On 08/08/2023 05:15, Andres Freund wrote:
> > > With the improvements detailed below, cirrus' free CI would last
> > > about ~65 runs / month.
> >
> > I think that's plenty.
>
> Not so sure, I would regularly exceed it, I think. But it definitely will
> suffice for more casual contributors.
>
> > > Potential paths forward for cfbot, in addition to the above:
> > >
> > > - Pay for compute / ask the various cloud providers to grant us compute
> > >   credits. At least some of the cloud providers can be used via cirrus-ci.
> > >
> > > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> > >   could provide significant savings.
> > >
> > > - Build our own system, using buildbot, jenkins or whatnot.
> > >
> > > Opinions as to what to do?
> >
> > The resources for running our own system isn't free either. I'm sure we can
> > get sponsors for the cirrus-ci credits, or use donations.
>
> As outlined in my reply to Alvaro, just using credits likely is financially
> not viable...
>

In case it's helpful, from an SPI oriented perspective, $7K/month is probably an order of magnitude more than what we can sustain, so I don't see a way to make that work without some kind of additional magic that includes other non-profits and/or commercial companies changing donation habits between now and September.

Purchasing a couple of mac minis (and/or similar gear) would be near trivial though, just a matter of figuring out where/how to host them (but I think infra can chime in on that if that's what gets decided).

The other likely option would be to seek out cloud credits from one of the big three (or others); Amazon has continually said they would be happy to donate more credits to us if we had a use, and I think some of the other hosting providers have said similarly at times; so we'd need to ask and hope it's not too bureaucratic.

Robert Treat
https://xzilla.net
Hi,

On 2023-08-08 22:29:50 -0400, Robert Treat wrote:
> In case it's helpful, from an SPI oriented perspective, $7K/month is
> probably an order of magnitude more than what we can sustain, so I
> don't see a way to make that work without some kind of additional
> magic that includes other non-profits and/or commercial companies
> changing donation habits between now and September.

Yea, I think that'd make no sense, even if we could afford it. I think the patches I've written should drop it to 1/2 already. Thomas added some throttling to push it down further.

> Purchasing a couple of mac-mini's (and/or similar gear) would be near
> trivial though, just a matter of figuring out where/how to host it
> (but I think infra can chime in on that if that's what get's decided).

Cool. Because of the limitation of running two VMs at a time on macos and the comparatively low cost of mac minis, it seems they beat alternative models by a fair bit.

Pginfra/sysadmin: ^

Based on being off by an order of magnitude, as you mention earlier, it seems that:

1) reducing test runtime etc., as already in progress
2) getting 2 mac minis as runners
3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)

would be viable for a month or three? I hope we can get some cloud providers to chip in for 3), but I'd like to have something in place that doesn't depend on that. Given the cost of macos VMs at AWS, the only one of the big cloud providers to have macos instances, I think we'd burn pointlessly quickly through credits if we used VMs for that.

(*) I think we should be able to get below that, but ...

> The other likely option would be to seek out cloud credits from one of
> the big three (or others); Amazon has continually said they would be
> happy to donate more credits to us if we had a use, and I think some
> of the other hosting providers have said similarly at times; so we'd
> need to ask and hope it's not too bureaucratic.

Yep. I tried to start that process within microsoft, fwiw. Perhaps Joe and Jonathan know how to start within AWS? And perhaps Noah inside GCP?

It'd be the least work to get it up and running in GCP, as it's already running there, but should be quite doable at the others as well.

Greetings,

Andres Freund
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
From: Juan José Santamaría Flecha
On Wed, Aug 9, 2023 at 3:26 AM Andres Freund <andres@anarazel.de> wrote:
> > 6) Disable write cache flushes on windows
> >
> > It's a bit ugly to do this without using the UI... Shaves off about 30s
> > from the tests.
>
> A brief comment would be nice: "We don't care about persistence over hard
> crashes in the CI, so disable write cache flushes to speed it up."
Turns out that patch doesn't work on its own anyway, at least not
reliably... I tested it by interactively logging into a windows vm and testing
it there. It doesn't actually seem to suffice when run in isolation, because
the relevant registry key doesn't yet exist. I haven't yet figured out the
magic incantations for adding the missing "intermediary", but I'm getting
there...
You can find a good example on how to accomplish this in:
Regards,
Juan José Santamaría Flecha
Hello

So pginfra had a little chat about this.

Firstly, there's consensus that it makes sense for pginfra to help out with some persistent workers in our existing VM system; however, there are some aspects that need some further discussion, to avoid destabilizing the rest of the infrastructure. We're looking into it and we'll let you know.

Hosting a couple of Mac Minis is definitely a possibility, if some entity like SPI buys them. Let's take this off-list to arrange the details.

Regards

-- 
Álvaro Herrera
On Tue, Aug 08, 2023 at 07:59:55PM -0700, Andres Freund wrote:
> On 2023-08-08 22:29:50 -0400, Robert Treat wrote:
> 3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)
>
> The other likely option would be to seek out cloud credits
> I tried to start that progress within microsoft, fwiw. Perhaps Joe and
> Jonathan know how to start within AWS? And perhaps Noah inside GCP?
>
> It'd be the least work to get it up and running in GCP, as it's already
> running there

I'm looking at this. Thanks for bringing it to my attention.
Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:
> As some of you might have seen when running CI, cirrus-ci is restricting how
> much CI cycles everyone can use for free (announcement at [1]). This takes
> effect September 1st.
>
> This obviously has consequences both for individual users of CI as well as
> cfbot.
>
> [...]
> Potential paths forward for individual CI:
>
> - migrate wholesale to another CI provider
>
> - split CI tasks across different CI providers, rely on github et al
>   displaying the CI status for different platforms
>
> - give up
>
> Potential paths forward for cfbot, in addition to the above:
>
> - Pay for compute / ask the various cloud providers to grant us compute
>   credits. At least some of the cloud providers can be used via cirrus-ci.
>
> - Host (some) CI runners ourselves. Particularly with macos and windows, that
>   could provide significant savings.

To make that possible, we need to make the compute resources for CI configurable on a per-repository basis. After experimenting with a bunch of ways to do that, I got stuck on that for a while. But since today we have sufficient macos runners for cfbot available, so...

I think the approach I finally settled on is decent, although not great. It's described in the "main" commit message:

    ci: Prepare to make compute resources for CI configurable

    cirrus-ci will soon restrict the amount of free resources every user gets (as
    have many other CI providers). For most users of CI that should not be an
    issue. But e.g. for cfbot it will be an issue.

    To allow configuring different resources on a per-repository basis, introduce
    infrastructure for overriding the task execution environment. Unfortunately
    this is not entirely trivial, as yaml anchors have to be defined before their
    use, and cirrus-ci only allows injecting additional contents at the end of
    .cirrus.yml.

    To deal with that, move the definition of the CI tasks to .cirrus.tasks.yml.
    The main .cirrus.yml is loaded first, then, if defined, the file referenced
    by the REPO_CI_CONFIG_GIT_URL variable will be added, followed by the
    contents of .cirrus.tasks.yml. That allows REPO_CI_CONFIG_GIT_URL to override
    the yaml anchors defined in .cirrus.yml.

    Unfortunately git's default merge / rebase strategy does not handle copied
    files, just renamed ones. To avoid painful rebasing over this change, this
    commit just renames .cirrus.yml to .cirrus.tasks.yml, without adding a new
    .cirrus.yml. That's done in the followup commit, which moves the relevant
    portion of .cirrus.tasks.yml to .cirrus.yml. Until that is done,
    REPO_CI_CONFIG_GIT_URL does not fully work.

    The subsequent commit adds documentation for how to configure custom compute
    resources to src/tools/ci/README

    Discussion: https://postgr.es/m/20230808021541.7lbzdefvma7qmn3w@awork3.anarazel.de
    Backpatch: 15-, where CI support was added

I don't love moving most of the contents of .cirrus.yml into a new file, but I don't see another way. I did implement it without that as well (see [1]), but that ends up considerably harder to understand, and hardcodes what cfbot needs. Splitting the commit, as explained above, at least makes git rebase fairly painless.

FWIW, I did merge the changes into 15, with only reasonable conflicts (due to new tasks, autoconf->meson).

A prerequisite commit converts "SanityCheck" and "CompilerWarnings" to use a full VM instead of a container - that way providing custom compute resources doesn't have to deal with containers in addition to VMs.
It also looks like the increased startup overhead is outweighed by the reduction in runtime overhead.

I'm hoping to push this fairly soon, as I'll be on vacation the last week of August. I'll be online intermittently though; if there are issues, I can react (very limited connectivity from midday Aug 29th to midday Aug 31st though). I'd appreciate a quick review or two.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/commit/b95fd302161b951f1dc14d586162ed3d85564bfc
Attachment
- v3-0001-ci-Don-t-specify-amount-of-memory.patch
- v3-0002-ci-Move-execution-method-of-tasks-into-yaml-templ.patch
- v3-0003-ci-Use-VMs-for-SanityCheck-and-CompilerWarnings.patch
- v3-0004-ci-Prepare-to-make-compute-resources-for-CI-confi.patch
- v3-0005-ci-Make-compute-resources-for-CI-configurable.patch
- v3-0006-ci-dontmerge-Example-custom-CI-configuration.patch
- v3-0007-Use-template-initdb-in-tests.patch
- v3-0008-ci-switch-tasks-to-debugoptimized-build.patch
- v3-0009-ci-windows-Disabling-write-cache-flushing-during-.patch
- v3-0010-regress-Check-for-postgres-startup-completion-mor.patch
> On 23 Aug 2023, at 08:58, Andres Freund <andres@anarazel.de> wrote:

> I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> August. I'll be online intermittently though, if there are issues, I can react
> (very limited connectivity for middday Aug 29th - midday Aug 31th though). I'd
> appreciate a quick review or two.

I've been reading over these and the thread, and while not within my area of expertise, nothing really sticks out. I'll do another pass, but below are a few small comments so far.

I don't know Windows well enough to know the implications, but should the below file have some sort of warning about not doing that for production/shared systems, only for dedicated test instances?

+++ b/src/tools/ci/windows_write_cache.ps1
@@ -0,0 +1,20 @@
+# Define the write cache to be power protected. This reduces the rate of cache
+# flushes, which seems to help metadata heavy workloads on NTFS. We're just
+# testing here anyway, so ...
+#
+# Let's do so for all disks, this could be useful beyond cirrus-ci.

One thing in 0010 caught my eye, and while not introduced in this patchset it might be of interest here. In the below hunks we loop X ticks around system(psql), with the loop assuming the server can come up really quickly and sleeping if it doesn't. On my systems I always reach the pg_usleep after failing the check, but if I reverse the check such that it first sleeps and then checks, I only need to check once instead of twice.

@@ -2499,7 +2502,7 @@ regression_main(int argc, char *argv[],
 	else
 		wait_seconds = 60;
 
-	for (i = 0; i < wait_seconds; i++)
+	for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
 	{
 		/* Done if psql succeeds */
 		fflush(NULL);
@@ -2519,7 +2522,7 @@ regression_main(int argc, char *argv[],
 					 outputdir);
 		}
 
-		pg_usleep(1000000L);
+		pg_usleep(1000000L / WAITS_PER_SEC);
 	}
 	if (i >= wait_seconds)
 	{

It's a micro-optimization, but if we're changing things here to chase cycles it might perhaps be worth doing?

-- 
Daniel Gustafsson
Hi,

On 2023-08-23 14:48:26 +0200, Daniel Gustafsson wrote:
> > On 23 Aug 2023, at 08:58, Andres Freund <andres@anarazel.de> wrote:
> > I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> > August. I'll be online intermittently though, if there are issues, I can react
> > (very limited connectivity for middday Aug 29th - midday Aug 31th though). I'd
> > appreciate a quick review or two.
>
> I've been reading over these and the thread, and while not within my area of
> expertise, nothing really sticks out.

Thanks!

> I'll do another pass, but below are a few small comments so far.
>
> I don't know Windows to know the implications, but should the below file have
> some sort of warning about not doing that for production/shared systems, only
> for dedicated test instances?

Ah, I should have explained that: I'm not planning to apply
- regress: Check for postgres startup completion more often
- ci: windows: Disabling write cache flushing during test
right now. Compared to the other patches the wins are much smaller and/or more work is needed to make them good.

I think it might be worth going for
- ci: switch tasks to debugoptimized build
because that provides a fair bit of gain. But it might be more hurtful than helpful due to costing more when ccache doesn't work...

> +++ b/src/tools/ci/windows_write_cache.ps1
> @@ -0,0 +1,20 @@
> +# Define the write cache to be power protected. This reduces the rate of cache
> +# flushes, which seems to help metadata heavy workloads on NTFS. We're just
> +# testing here anyway, so ...
> +#
> +# Let's do so for all disks, this could be useful beyond cirrus-ci.
>
> One thing in 0010 caught my eye, and while not introduced in this patchset it
> might be of interest here. In the below hunks we loop X ticks around
> system(psql), with the loop assuming the server can come up really quickly and
> sleeping if it doesn't. On my systems I always reach the pg_usleep after
> failing the check, but if I reverse the check such it first sleeps and then
> checks I only need to check once instead of twice.

I think there's more effective ways to make this cheaper. The basic thing would be to use libpq instead of forking off psql to make a connection check.

Medium term, I think we should invent a way for pg_ctl and other tooling (including pg_regress) to wait for the service to come up. E.g. having a named pipe that postmaster opens once the server is up, which should allow multiple clients to use select/epoll/... to wait for it without looping.

ISTM making pg_regress use libpq w/ PQping() should be a pretty simple patch? The non-polling approach obviously is even better, but also requires more thought (and documentation and ...).

> @@ -2499,7 +2502,7 @@ regression_main(int argc, char *argv[],
>  	else
>  		wait_seconds = 60;
> 
> -	for (i = 0; i < wait_seconds; i++)
> +	for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
>  	{
>  		/* Done if psql succeeds */
>  		fflush(NULL);
> @@ -2519,7 +2522,7 @@ regression_main(int argc, char *argv[],
>  					 outputdir);
>  		}
> 
> -		pg_usleep(1000000L);
> +		pg_usleep(1000000L / WAITS_PER_SEC);
>  	}
>  	if (i >= wait_seconds)
>  	{
>
> It's a micro-optimization, but if we're changing things here to chase cycles it
> might perhaps be worth doing?

I wouldn't quite call not waiting for 1s for the server to start, when it does so within a few ms, chasing cycles ;). For short tests it's a substantial fraction of the overall runtime...

Greetings,

Andres Freund
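[For illustration only: a minimal sketch of the PQping()-based wait loop being discussed above, not the patch that was later posted. The helper name, the WAITS_PER_SEC value, and the way the connection string is passed are assumptions made for this sketch.]

#include <stdbool.h>
#include <unistd.h>

#include <libpq-fe.h>

#define WAITS_PER_SEC	20		/* assumed tick rate: check 20x per second */

/*
 * Poll the server with PQping() until it accepts connections or the
 * timeout expires.  PQping() only asks whether the server is ready; it
 * needs no credentials and no full connection setup, unlike forking psql.
 */
static bool
wait_for_postmaster(const char *conninfo, int wait_seconds)
{
	for (int i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
	{
		if (PQping(conninfo) == PQPING_OK)
			return true;

		/* sleep 1/WAITS_PER_SEC of a second, then retry */
		usleep(1000000L / WAITS_PER_SEC);
	}
	return false;
}

With an assumed conninfo such as "host=/tmp port=5432 dbname=postgres", this checks many times per second rather than once per second, which is the same granularity change the WAITS_PER_SEC hunks quoted above make for the psql-based check.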
> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-23 14:48:26 +0200, Daniel Gustafsson wrote:
>> I'll do another pass, but below are a few small comments so far.
>>
>> I don't know Windows to know the implications, but should the below file have
>> some sort of warning about not doing that for production/shared systems, only
>> for dedicated test instances?
>
> Ah, I should have explained that: I'm not planning to apply
> - regress: Check for postgres startup completion more often
> - ci: windows: Disabling write cache flushing during test
> right now. Compared to the other patches the wins are much smaller and/or more
> work is needed to make them good.
>
> I think it might be worth going for
> - ci: switch tasks to debugoptimized build
> because that provides a fair bit of gain. But it might be more hurtful than
> helpful due to costing more when ccache doesn't work...

Gotcha.

>> +++ b/src/tools/ci/windows_write_cache.ps1
>> @@ -0,0 +1,20 @@
>> +# Define the write cache to be power protected. This reduces the rate of cache
>> +# flushes, which seems to help metadata heavy workloads on NTFS. We're just
>> +# testing here anyway, so ...
>> +#
>> +# Let's do so for all disks, this could be useful beyond cirrus-ci.
>>
>> One thing in 0010 caught my eye, and while not introduced in this patchset it
>> might be of interest here. In the below hunks we loop X ticks around
>> system(psql), with the loop assuming the server can come up really quickly and
>> sleeping if it doesn't. On my systems I always reach the pg_usleep after
>> failing the check, but if I reverse the check such it first sleeps and then
>> checks I only need to check once instead of twice.
>
> I think there's more effective ways to make this cheaper. The basic thing
> would be to use libpq instead of forking of psql to make a connection
> check.

I had it in my head that not using libpq in pg_regress was a deliberate choice, but I fail to find a reference to it in the archives.

>> It's a micro-optimization, but if we're changing things here to chase cycles it
>> might perhaps be worth doing?
>
> I wouldn't quite call not waiting for 1s for the server to start, when it does
> so within a few ms, chasing cycles ;). For short tests it's a substantial
> fraction of the overall runtime...

Absolutely, I was referring to shifting the sleep before the test to avoid the extra test, not the reduction of the pg_usleep. Reducing the sleep is a clear win.

-- 
Daniel Gustafsson
Hi,

Thanks for the patch!

On Wed, 23 Aug 2023 at 09:58, Andres Freund <andres@anarazel.de> wrote:
> I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> August. I'll be online intermittently though, if there are issues, I can react
> (very limited connectivity for middday Aug 29th - midday Aug 31th though). I'd
> appreciate a quick review or two.

Patch looks good to me besides some minor points.

v3-0004-ci-Prepare-to-make-compute-resources-for-CI-confi.patch:

diff --git a/.cirrus.star b/.cirrus.star

+    """The main function is executed by cirrus-ci after loading .cirrus.yml and can
+    extend the CI definition further.
+
+    As documented in .cirrus.yml, the final CI configuration is composed of
+
+    1) the contents of this file

Instead of the '1) the contents of this file' comment, '1) the contents of the .cirrus.yml file' could be better, since this comment appears in the .cirrus.star file.

+    if repo_config_url != None:
+        print("loading additional configuration from \"{}\"".format(repo_config_url))
+        output += config_from(repo_config_url)
+    else:
+        output += "n# REPO_CI_CONFIG_URL was not set\n"

Possible typo at output += "n# REPO_CI_CONFIG_URL was not set\n".

v3-0008-ci-switch-tasks-to-debugoptimized-build.patch:

Just thinking of possible optimizations: couldn't we create something like 'buildtype: xxx' to override the default buildtype using .cirrus.star? This could be better for PG developers. For sure that could be the subject of another patch.

Regards,
Nazir Bilal Yavuz
Microsoft
Daniel Gustafsson <daniel@yesql.se> writes:
> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
>> I think there's more effective ways to make this cheaper. The basic thing
>> would be to use libpq instead of forking of psql to make a connection
>> check.

> I had it in my head that not using libpq in pg_regress was a deliberate choice,
> but I fail to find a reference to it in the archives.

I have a vague feeling that you are right about that. Perhaps the concern was that under "make installcheck", pg_regress might be using a build-tree copy of libpq rather than the one from the system under test. As long as we're just trying to ping the server, that shouldn't matter too much I think ... unless we hit problems with, say, a different default port number or socket path compiled into one copy vs. the other? That seems like it's probably a "so don't do that" case, though.

			regards, tom lane
> On 23 Aug 2023, at 23:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
>>> I think there's more effective ways to make this cheaper. The basic thing
>>> would be to use libpq instead of forking of psql to make a connection
>>> check.
>
>> I had it in my head that not using libpq in pg_regress was a deliberate choice,
>> but I fail to find a reference to it in the archives.
>
> I have a vague feeling that you are right about that. Perhaps the
> concern was that under "make installcheck", pg_regress might be
> using a build-tree copy of libpq rather than the one from the
> system under test. As long as we're just trying to ping the server,
> that shouldn't matter too much I think ... unless we hit problems
> with, say, a different default port number or socket path compiled into
> one copy vs. the other? That seems like it's probably a "so don't
> do that" case, though.

Ah yes, that does ring a familiar bell. I agree that using it for pinging the server should be safe either way, but we should document the use-with-caution in pg_regress.c if/when we go down that path. I'll take a stab at changing the psql retry loop for pinging tomorrow to see what it would look like.

-- 
Daniel Gustafsson
Hi,

On 2023-08-23 17:02:51 -0400, Tom Lane wrote:
> Daniel Gustafsson <daniel@yesql.se> writes:
> > On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
> >> I think there's more effective ways to make this cheaper. The basic thing
> >> would be to use libpq instead of forking of psql to make a connection
> >> check.
>
> > I had it in my head that not using libpq in pg_regress was a deliberate choice,
> > but I fail to find a reference to it in the archives.
>
> I have a vague feeling that you are right about that. Perhaps the
> concern was that under "make installcheck", pg_regress might be
> using a build-tree copy of libpq rather than the one from the
> system under test. As long as we're just trying to ping the server,
> that shouldn't matter too much I think

Or perhaps the opposite? That an installcheck pg_regress run might use the system libpq, which doesn't have the symbols, or such? Either way, with a function like PQping(), which has existed for well beyond the supported branches, that shouldn't be an issue?

> ... unless we hit problems with, say, a different default port number or
> socket path compiled into one copy vs. the other? That seems like it's
> probably a "so don't do that" case, though.

If we were to find such a case, it seems we could just add whatever missing parameter to the connection string? I think we would likely already hit such problems though, the psql started by an installcheck pg_regress might use the system libpq, I think?

Greetings,

Andres Freund
Hi,

On 2023-08-23 23:55:15 +0300, Nazir Bilal Yavuz wrote:
> On Wed, 23 Aug 2023 at 09:58, Andres Freund <andres@anarazel.de> wrote:
> > I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> > August. I'll be online intermittently though, if there are issues, I can react
> > (very limited connectivity for middday Aug 29th - midday Aug 31th though). I'd
> > appreciate a quick review or two.
>
> Patch looks good to me besides some minor points.

Thanks for looking!

> v3-0004-ci-Prepare-to-make-compute-resources-for-CI-confi.patch:
> diff --git a/.cirrus.star b/.cirrus.star
> +    """The main function is executed by cirrus-ci after loading .cirrus.yml and can
> +    extend the CI definition further.
> +
> +    As documented in .cirrus.yml, the final CI configuration is composed of
> +
> +    1) the contents of this file
>
> Instead of '1) the contents of this file' comment, '1) the contents
> of .cirrus.yml file' could be better since this comment appears in
> .cirrus.star file.

Good catch.

> +    if repo_config_url != None:
> +        print("loading additional configuration from \"{}\"".format(repo_config_url))
> +        output += config_from(repo_config_url)
> +    else:
> +        output += "n# REPO_CI_CONFIG_URL was not set\n"
>
> Possible typo at output += "n# REPO_CI_CONFIG_URL was not set\n".

Fixed.

> v3-0008-ci-switch-tasks-to-debugoptimized-build.patch:
> Just thinking of possible optimizations and thought can't we create
> something like 'buildtype: xxx' to override default buildtype using
> .cirrus.star? This could be better for PG developers. For sure that
> could be the subject of another patch.

We could, but I'm not sure what the use would be?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-08-23 17:02:51 -0400, Tom Lane wrote:
>> ... unless we hit problems with, say, a different default port number or
>> socket path compiled into one copy vs. the other? That seems like it's
>> probably a "so don't do that" case, though.

> If we were to find such a case, it seems we could just add whatever missing
> parameter to the connection string? I think we would likely already hit such
> problems though, the psql started by an installcheck pg_regress might use the
> system libpq, I think?

The trouble with that approach is that in "make installcheck", we don't really want to assume we know what the installed libpq's default connection parameters are. So we don't explicitly know where that libpq will connect.

As I said, we might be able to start treating installed-libpq-not-compatible-with-build as a "don't do it" case. Another idea is to try to ensure that pg_regress uses the same libpq that the psql-under-test does; but I'm not sure how to implement that.

			regards, tom lane
Hi,

On 2023-08-23 17:55:53 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-08-23 17:02:51 -0400, Tom Lane wrote:
> >> ... unless we hit problems with, say, a different default port number or
> >> socket path compiled into one copy vs. the other? That seems like it's
> >> probably a "so don't do that" case, though.
>
> > If we were to find such a case, it seems we could just add whatever missing
> > parameter to the connection string? I think we would likely already hit such
> > problems though, the psql started by an installcheck pg_regress might use the
> > system libpq, I think?
>
> The trouble with that approach is that in "make installcheck", we
> don't really want to assume we know what the installed libpq's default
> connection parameters are. So we don't explicitly know where that
> libpq will connect.

Stepping back: I don't think installcheck matters for the concrete use of libpq we're discussing - the only time we wait for server startup is the non-installcheck case.

There are other potential uses for libpq in pg_regress though - I'd e.g. like to have a "monitoring" session open, which we could use to detect that the server crashed (by waiting for the FD to become invalid). Where the connection default issue could matter more?

I was wondering if we could create an unambiguous connection info, but that seems like it'd be hard to do, without creating cross version hazards.

> As I said, we might be able to start treating installed-libpq-not-
> compatible-with-build as a "don't do it" case. Another idea is to try
> to ensure that pg_regress uses the same libpq that the psql-under-test
> does; but I'm not sure how to implement that.

I don't think that's likely to work, psql could use a libpq with a different soversion. We could dlopen() libpq, etc, but that seems way too complicated.

What's the reason we don't force psql to come from the same build as pg_regress?

Greetings,

Andres Freund
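[For illustration only: a minimal, hedged sketch of the "monitoring session" idea described above, built from plain libpq calls (PQsocket(), select(), PQconsumeInput()). The function name and structure are assumptions for the sketch, not part of any posted patch.]

#include <stdbool.h>
#include <sys/select.h>

#include <libpq-fe.h>

/*
 * Block until the monitored connection is lost, e.g. because the server
 * crashed.  When the backend goes away, the socket becomes readable and
 * PQconsumeInput() fails / the connection status turns CONNECTION_BAD.
 */
static bool
wait_for_server_exit(PGconn *monitor_conn)
{
	int			sock = PQsocket(monitor_conn);

	if (sock < 0)
		return true;			/* connection already gone */

	for (;;)
	{
		fd_set		rfds;

		FD_ZERO(&rfds);
		FD_SET(sock, &rfds);

		/* wait until something happens on the monitoring socket */
		if (select(sock + 1, &rfds, NULL, NULL, NULL) < 0)
			continue;			/* e.g. EINTR; just retry */

		/* data (or EOF) is available; let libpq inspect it */
		if (!PQconsumeInput(monitor_conn) ||
			PQstatus(monitor_conn) == CONNECTION_BAD)
			return true;
	}
}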
Andres Freund <andres@anarazel.de> writes:
> On 2023-08-23 17:55:53 -0400, Tom Lane wrote:
>> The trouble with that approach is that in "make installcheck", we
>> don't really want to assume we know what the installed libpq's default
>> connection parameters are. So we don't explicitly know where that
>> libpq will connect.

> Stepping back: I don't think installcheck matters for the concrete use of
> libpq we're discussing - the only time we wait for server startup is the
> non-installcheck case.

Oh, that's an excellent point. So for the immediately proposed use-case, there's no issue. (We don't have a mode where we try to start a server using already-installed executables.)

> There are other potential uses for libpq in pg_regress though - I'd e.g. like
> to have a "monitoring" session open, which we could use to detect that the
> server crashed (by waiting for the FD to become invalid). Where the
> connection default issue could matter more?

Meh. I don't find that idea compelling enough to justify adding restrictions on what test scenarios will work. It's seldom hard to tell from the test output whether the server crashed.

> I was wondering if we could create an unambiguous connection info, but that
> seems like it'd be hard to do, without creating cross version hazards.

Hmm, we don't expect the regression test suite to work against other server versions, so maybe that could be made to work --- that is, we could run the psql under test and get a full set of connection parameters out of it? But I'm still not finding this worth the trouble.

> What's the reason we don't force psql to come from the same build as
> pg_regress?

Because the point of installcheck is to check the installed binaries --- including the installed psql and libpq.

(Thinks for a bit...) Maybe we should add pg_regress to the installed fileset, and use that copy not the in-tree copy for installcheck? Then we could assume it's using the same libpq as psql. IIRC there have already been suggestions to do that for the benefit of PGXS testing.

			regards, tom lane
Hi,

On 2023-08-23 18:32:26 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > There are other potential uses for libpq in pg_regress though - I'd e.g. like
> > to have a "monitoring" session open, which we could use to detect that the
> > server crashed (by waiting for the FD to be become invalid). Where the
> > connection default issue could matter more?
>
> Meh. I don't find that idea compelling enough to justify adding
> restrictions on what test scenarios will work. It's seldom hard to
> tell from the test output whether the server crashed.

I find it pretty painful to wade through a several-megabyte regression.diffs to find the cause of a crash. I think we ought to use restart_after_crash=false, since after a crash there's no hope for the tests to succeed, but even in that case, we end up with a lot of pointless contents in regression.diffs. If we instead realized that we shouldn't start further tests, we'd limit that by a fair bit.

Greetings,

Andres Freund
On 24.08.23 00:56, Andres Freund wrote:
> Hi,
>
> On 2023-08-23 18:32:26 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> There are other potential uses for libpq in pg_regress though - I'd e.g. like
>>> to have a "monitoring" session open, which we could use to detect that the
>>> server crashed (by waiting for the FD to be become invalid). Where the
>>> connection default issue could matter more?
>>
>> Meh. I don't find that idea compelling enough to justify adding
>> restrictions on what test scenarios will work. It's seldom hard to
>> tell from the test output whether the server crashed.
>
> I find it pretty painful to wade through a several-megabyte regression.diffs
> to find the cause of a crash. I think we ought to use
> restart_after_crash=false, since after a crash there's no hope for the tests
> to succeed, but even in that case, we end up with a lot of pointless contents
> in regression.diffs. If we instead realized that we shouldn't start further
> tests, we'd limit that by a fair bit.

I once coded it up so that if the server crashes during a test, it would wait until it recovers before running the next test. I found that useful. I agree the current behavior is not useful in any case.
Hi,

On Thu, 24 Aug 2023 at 00:48, Andres Freund <andres@anarazel.de> wrote:
> > v3-0008-ci-switch-tasks-to-debugoptimized-build.patch:
> > Just thinking of possible optimizations and thought can't we create
> > something like 'buildtype: xxx' to override default buildtype using
> > .cirrus.star? This could be better for PG developers. For sure that
> > could be the subject of another patch.
>
> We could, but I'm not sure what the use would be?

My main idea behind this was that PG developers could choose 'buildtype: debug' while working on their patches, and that optimization makes it easier to choose the buildtype.

Regards,
Nazir Bilal Yavuz
Microsoft
Hi,

On 2023-08-22 23:58:33 -0700, Andres Freund wrote:
> To make that possible, we need to make the compute resources for CI
> configurable on a per-repository basis. After experimenting with a bunch of
> ways to do that, I got stuck on that for a while. But since today we have
> sufficient macos runners for cfbot available, so... I think the approach I
> finally settled on is decent, although not great. It's described in the "main"
> commit message:
> [...]
>     ci: Prepare to make compute resources for CI configurable
> I'm hoping to push this fairly soon, as I'll be on vacation the last week of
> August. I'll be online intermittently though, if there are issues, I can react
> (very limited connectivity for middday Aug 29th - midday Aug 31th though). I'd
> appreciate a quick review or two.

I pushed this yesterday, and then utilized it to make cfbot use

1) macos persistent workers, hosted by two community members
2) our own GCP account for all the other operating systems

There were a few issues initially (needed to change how to run multiple jobs on a single mac, and it looks like there were some issues with macos going to sleep while processing jobs...). But it now seems to be chugging along ok.

One of the nice things is that with our own compute we also control how much storage can be used, making things like generating docs or code coverage as part of cfbot more realistic. And we could enable mingw by default when run as part of cfbot...

Greetings,

Andres Freund
> On 23 Aug 2023, at 23:12, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 23 Aug 2023, at 23:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Daniel Gustafsson <daniel@yesql.se> writes:
>>> On 23 Aug 2023, at 21:22, Andres Freund <andres@anarazel.de> wrote:
>>>> I think there's more effective ways to make this cheaper. The basic thing
>>>> would be to use libpq instead of forking of psql to make a connection
>>>> check.
>>
>>> I had it in my head that not using libpq in pg_regress was a deliberate choice,
>>> but I fail to find a reference to it in the archives.
>>
>> I have a vague feeling that you are right about that. Perhaps the
>> concern was that under "make installcheck", pg_regress might be
>> using a build-tree copy of libpq rather than the one from the
>> system under test. As long as we're just trying to ping the server,
>> that shouldn't matter too much I think ... unless we hit problems
>> with, say, a different default port number or socket path compiled into
>> one copy vs. the other? That seems like it's probably a "so don't
>> do that" case, though.
>
> Ah yes, that does ring a familiar bell. I agree that using it for pinging the
> server should be safe either way, but we should document the use-with-caution
> in pg_regress.c if/when we go down that path. I'll take a stab at changing the
> psql retry loop for pinging tomorrow to see what it would look like.

Attached is a patch with a quick PoC for using PQping instead of using psql for connection checks in pg_regress. In order to see performance it also includes a diag output for "Time to first test" which contains all setup costs. This might not make it into a commit but it was quite helpful in hacking so I left it in for now.

The patch incorporates Andres' idea for finer granularity of checks by checking TICKS times per second rather than once per second; it also shifts the pg_usleep around to require just one ping in most cases compared to two today. On my relatively tired laptop this speeds up pg_regress setup by 100+ms, with much bigger wins on Windows in the CI.

While it does add a dependency on libpq, I think it's a fairly decent price to pay for running tests faster.

-- 
Daniel Gustafsson
Attachment
> On 28 Aug 2023, at 14:32, Daniel Gustafsson <daniel@yesql.se> wrote:
> Attached is a patch with a quick PoC for using PQping instead of using psql for
> connection checks in pg_regress.

The attached v2 fixes a silly mistake which led to a compiler warning.

-- 
Daniel Gustafsson
Attachment
Hi,

On 2023-08-30 10:57:10 +0200, Daniel Gustafsson wrote:
> > On 28 Aug 2023, at 14:32, Daniel Gustafsson <daniel@yesql.se> wrote:
> > Attached is a patch with a quick PoC for using PQping instead of using psql for
> > connection checks in pg_regress.
>
> The attached v2 fixes a silly mistake which led to a compiler warning.

Still seems like a good idea to me. To see what impact it has, I measured the time running the pg_regress tests that take less than 6s on my machine - I excluded the slower ones (like the main regression tests) because they'd hide any overall difference.

ninja && m test --suite setup --no-rebuild && tests=$(m test --no-rebuild --list | grep -E '/regress' | grep -vE '(regress|postgres_fdw|test_integerset|intarray|amcheck|test_decoding)/regress' | cut -d' ' -f 3) && time m test --no-rebuild $tests

Time for:

master:
  cassert:    real 0m5.265s   user 0m8.422s   sys 0m8.381s
  optimized:  real 0m4.926s   user 0m6.356s   sys 0m8.263s

my patch (probing every 100ms with psql):
  cassert:    real 0m3.465s   user 0m8.827s   sys 0m8.579s
  optimized:  real 0m2.932s   user 0m6.596s   sys 0m8.458s

Daniel's (probing every 50ms with PQping()):
  cassert:    real 0m3.347s   user 0m8.373s   sys 0m8.354s
  optimized:  real 0m2.527s   user 0m6.156s   sys 0m8.315s

My patch increased user/sys time a bit (likely due to a higher number of futile psql forks), but Daniel's doesn't. And it does show a nice overall wall clock time saving.

Greetings,

Andres Freund
> On 13 Sep 2023, at 01:49, Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-30 10:57:10 +0200, Daniel Gustafsson wrote:
>>> On 28 Aug 2023, at 14:32, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>>> Attached is a patch with a quick PoC for using PQPing instead of using psql for
>>> connection checks in pg_regress.
>>
>> The attached v2 fixes a silly mistake which led to a compiler warning.
>
> Still seems like a good idea to me. To see what impact it has, I measured the
> time running the pg_regress tests that take less than 6s on my machine - I
> excluded the slower ones (like the main regression tests) because they'd hide
> any overall difference.
>
> My patch increased user/sys time a bit (likely due to a higher number of
> futile psql forks), but Daniel's doesn't. And it does show a nice overall wall
> clock time saving.

While it does add a lib dependency I think it's worth doing, so I propose we go ahead with this for master.

-- 
Daniel Gustafsson
> On 13 Sep 2023, at 01:49, Andres Freund <andres@anarazel.de> wrote:
> My patch increased user/sys time a bit (likely due to a higher number of
> futile psql forks), but Daniel's doesn't. And it does show a nice overall wall
> clock time saving.

I went ahead and applied this on master, thanks for review! Now to see if there will be any noticeable difference in resource usage.

-- 
Daniel Gustafsson
Daniel Gustafsson <daniel@yesql.se> writes:
> I went ahead and applied this on master, thanks for review! Now to see if
> there will be any noticeable difference in resource usage.

I think that tools like Coverity are likely to whine about your use of sprintf instead of snprintf. Sure, it's perfectly safe, but that won't stop the no-sprintf-ever crowd from complaining.

			regards, tom lane
> On 24 Oct 2023, at 22:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> I went ahead and applied this on master, thanks for review! Now to see if
>> there will be any noticeable difference in resource usage.
>
> I think that tools like Coverity are likely to whine about your
> use of sprintf instead of snprintf. Sure, it's perfectly safe,
> but that won't stop the no-sprintf-ever crowd from complaining.

Fair point, that's probably quite likely to happen. I can apply an snprintf() conversion change like this in the two places introduced by this:

-   sprintf(s, "%d", port);
+   snprintf(s, sizeof(s), "%d", port);

-- 
Daniel Gustafsson