Thread: Use windows VMs instead of windows containers on the CI
Hi,
I propose using windows VMs instead of containers, the patch is attached. Currently, windows containers are used on the CI, but these container images are needs to get pulled on every CI run, also they are slow to run.
These VM images are created in the same way how container images are created [1].
The comparison between VMs and containers are (based on d952373a98 and with same numbers of CPU and memory):
Scheduling step:
VS 2019 | MinGW64 | |
VM [2] | 00:17m | 00:16m |
Container [3] | 03:51m | 04:28m |
Execution step:
VS 2019 | MinGW64 | |
VM [2] | 12:16m | 07.55m |
Container [3] | 26:02m | 16:34m |
There is more than 2x speed gain when VMs are used.
[1] https://github.com/anarazel/pg-vm-images/blob/main/packer/windows.pkr.hcl
[2] https://cirrus-ci.com/build/4720774045499392
[3] https://cirrus-ci.com/build/5468256027279360
Regards,
Nazir Bilal Yavuz
Microsoft
Attachment
Hi,
Tables didn't seem nice on web interface. Re-sending with correct formatting.
Scheduling step:
| VS 2019 | MinGW64
--------------------------------------------------------------
VM | 00:17m | 00:16m
--------------------------------------------------------------
Container | 03:51m | 04:28m
Execution step:
| VS 2019 | MinGW64
--------------------------------------------------------------
VM | 12:16m | 07:55m
--------------------------------------------------------------
Container | 26:02m | 16:34m
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
It didn't work again. Sending numbers until I figure out how to solve this.
Scheduling Step:
VM + VS 2019: 00.17m
Container + VS 2019: 03.51m
VM + MinGW64: 00.16m
Container + MinGW64: 04.28m
Execution step:
VM + VS 2019: 12.16m
Container + VS 2019: 26.02m
VM + MinGW64: 07.55m
Container + MinGW64: 16.34m
Sorry for the multiple mails.
Regards,
Nazir Bilal Yavuz
Microsoft
On Tue, Jan 10, 2023 at 03:20:18PM +0300, Nazir Bilal Yavuz wrote: > Hi, > > I propose using windows VMs instead of containers, the patch is attached. > Currently, windows containers are used on the CI, but these container images > are needs to get pulled on every CI run, also they are slow to run. +1 > There is more than 2x speed gain when VMs are used. One consideration is that if windows runs twice as fast, we'll suddenly start using twice as many resources at cirrus/google/amazon - the windows task has been throttling everything else. Not sure if we should to do anything beyond the limits that cfbot already uses. -- Justin
Hi, On 2023-01-10 09:22:12 -0600, Justin Pryzby wrote: > > There is more than 2x speed gain when VMs are used. > > One consideration is that if windows runs twice as fast, we'll suddenly > start using twice as many resources at cirrus/google/amazon - the > windows task has been throttling everything else. Not sure if we should > to do anything beyond the limits that cfbot already uses. I'm not sure we would. cfbot has a time based limit for how often it tries to rebuild entries, and I think we were just about keeping up with that. In which case we shouldn't, on average, schedule more jobs than we currently do. Although peak "job throughput" would be higher. Thomas? Greetings, Andres Freund
On Wed, Jan 11, 2023 at 8:20 AM Andres Freund <andres@anarazel.de> wrote: > On 2023-01-10 09:22:12 -0600, Justin Pryzby wrote: > > > There is more than 2x speed gain when VMs are used. > > > > One consideration is that if windows runs twice as fast, we'll suddenly > > start using twice as many resources at cirrus/google/amazon - the > > windows task has been throttling everything else. Not sure if we should > > to do anything beyond the limits that cfbot already uses. > > I'm not sure we would. cfbot has a time based limit for how often it tries to > rebuild entries, and I think we were just about keeping up with that. In which > case we shouldn't, on average, schedule more jobs than we currently > do. Although peak "job throughput" would be higher. > > Thomas? It currently tries to re-test each patch every 24 hours, but doesn't achieve that. It looks like it's currently re-testing every ~30 hours. Justin's right, we'll consume more non-Windows resources if Windows speeds up, but not 2x, more like 1.25x when cfbot's own throttling kicks in. Or I could change the cycle target to 36 or 48 hours, to spread the work out more. Back-of-a-napkin maths: * there are currently 240 entries in a testable status * it takes ~0.5 hours to test (because that's the slow Windows time) * therefore it takes ~120 hours to test them all * but we can do 4 at a time, so that's ~30 hours to get through them all and start again * that matches what we see: cfbot=> select created - lag(created) over (order by created) from branch where submission_id = 4068; ?column? ----------------------- 1 day 06:30:00.265047 1 day 05:43:59.978949 1 day 04:13:59.754048 1 day 05:28:59.811916 1 day 07:00:00.651655 (6 rows) If, with this change, we can test in only ~0.25 hours, then we'll only need 60 hours of Cirrus time to test them all. With a target of re-testing every 24 hours, it should now only have to run ~2.5 jobs at all times. Having free slots would be kind to Cirrus, and also lower the latency when a new patch is posted (which currently has to wait for a free slot before it can begin). Great news.
On Tue, Jan 10, 2023 at 03:20:18PM +0300, Nazir Bilal Yavuz wrote: > Hi, > > I propose using windows VMs instead of containers, the patch is attached. > Currently, windows containers are used on the CI, but these container images > are needs to get pulled on every CI run, also they are slow to run. > @@ -589,8 +591,10 @@ task: > # otherwise it'll be sorted before other tasks > depends_on: SanityCheck > > - windows_container: > - image: $CONTAINER_REPO/windows_ci_mingw64:latest > + compute_engine_instance: > + image_project: $IMAGE_PROJECT > + image: family/pg-ci-windows-ci-mingw64 > + platform: windows > cpu: $CPUS > memory: 4G It looks like MinGW currently doesn't have the necessary perl modules: [19:58:46.356] Message: Can't locate IPC/Run.pm in @INC (you may need to install the IPC::Run module) (@INC contains: C:/msys64/ucrt64/lib/perl5/site_perl/5.32.1C:/msys64/ucrt64/lib/perl5/site_perl/5.32.1 C:/msys64/ucrt64/lib/perl5/site_perlC:/msys64/ucrt64/lib/perl5/vendor_perl C:/msys64/ucrt64/lib/perl5/core_perl) at config/check_modules.plline 11. [19:58:46.356] BEGIN failed--compilation aborted at config/check_modules.pl line 11. [19:58:46.356] meson.build:1337: WARNING: Additional Perl modules are required to run TAP tests. That could be caused by a transient failure combined with bad error handling - if there's an error while building the image, it shouldn't be uploaded. -- Justin
Hi, On 2023-01-11 17:21:21 -0600, Justin Pryzby wrote: > On Tue, Jan 10, 2023 at 03:20:18PM +0300, Nazir Bilal Yavuz wrote: > > Hi, > > > > I propose using windows VMs instead of containers, the patch is attached. > > Currently, windows containers are used on the CI, but these container images > > are needs to get pulled on every CI run, also they are slow to run. > > > @@ -589,8 +591,10 @@ task: > > # otherwise it'll be sorted before other tasks > > depends_on: SanityCheck > > > > - windows_container: > > - image: $CONTAINER_REPO/windows_ci_mingw64:latest > > + compute_engine_instance: > > + image_project: $IMAGE_PROJECT > > + image: family/pg-ci-windows-ci-mingw64 > > + platform: windows > > cpu: $CPUS > > memory: 4G > > It looks like MinGW currently doesn't have the necessary perl modules: > > [19:58:46.356] Message: Can't locate IPC/Run.pm in @INC (you may need to install the IPC::Run module) (@INC contains: C:/msys64/ucrt64/lib/perl5/site_perl/5.32.1C:/msys64/ucrt64/lib/perl5/site_perl/5.32.1 C:/msys64/ucrt64/lib/perl5/site_perlC:/msys64/ucrt64/lib/perl5/vendor_perl C:/msys64/ucrt64/lib/perl5/core_perl) at config/check_modules.plline 11. > [19:58:46.356] BEGIN failed--compilation aborted at config/check_modules.pl line 11. > [19:58:46.356] meson.build:1337: WARNING: Additional Perl modules are required to run TAP tests. > > That could be caused by a transient failure combined with bad error > handling - if there's an error while building the image, it shouldn't be > uploaded. Yea, there's a problem where packer on windows doesn't seem to abort after a powershell script error out. The reason isn't yet quiete clear. I think Bilal is working on a workaround. Greetings, Andres Freund
Hi, On 1/12/2023 3:30 AM, Andres Freund wrote: > Yea, there's a problem where packer on windows doesn't seem to abort after a > powershell script error out. The reason isn't yet quiete clear. I think Bilal > is working on a workaround. That should be fixed now. Also, adding a patch for PG15. There were conflicts while applying current patch to the REL_15_STABLE branch. Regards, Nazir Bilal Yavuz Microsoft
Attachment
Hi, On 2023-02-02 17:47:37 +0300, Nazir Bilal Yavuz wrote: > On 1/12/2023 3:30 AM, Andres Freund wrote: > > Yea, there's a problem where packer on windows doesn't seem to abort after a > > powershell script error out. The reason isn't yet quiete clear. I think Bilal > > is working on a workaround. > > > That should be fixed now. Also, adding a patch for PG15. There were > conflicts while applying current patch to the REL_15_STABLE branch. And pushed! I think an improvement in CI times of this degree is pretty awesome. Unfortunately I also noticed that the tap tests on mingw don't run anymore, due to IPC::Run not being available. But it's independent of this change. I don't know when that broke. Could you check it out? Greetings, Andres Freund
On Fri, Feb 3, 2023 at 6:57 PM Andres Freund <andres@anarazel.de> wrote: > And pushed! I think an improvement in CI times of this degree is pretty > awesome. +1 A lot of CI compute time is saved. The Cirrus account[1] was previously hitting the 4 job limit all day long, and now it's often running 1 or 2 jobs when I look, and it has space capacity to start a new job immediately if someone posts a new patch. I'll monitor it over the next few days but it looks great. [1] https://cirrus-ci.com/github/postgresql-cfbot/postgresql
On Fri, Feb 3, 2023 at 3:27 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Feb 3, 2023 at 6:57 PM Andres Freund <andres@anarazel.de> wrote: > > And pushed! I think an improvement in CI times of this degree is pretty > > awesome. > > +1 > > A lot of CI compute time is saved. The Cirrus account[1] was > previously hitting the 4 job limit all day long, and now it's often > running 1 or 2 jobs when I look, and it has space capacity to start a > new job immediately if someone posts a new patch. I'll monitor it > over the next few days but it looks great. > > [1] https://cirrus-ci.com/github/postgresql-cfbot/postgresql Oh, wow! This commit drastically improved testing time on Windows. It was Windows tests that were always behind in my github repo's CI, now I can see it got much faster. Thanks for working on this. Windows tests were taking around 27min (https://cirrus-ci.com/build/6060448332644352) before the patch, but it came down to 13min (https://cirrus-ci.com/build/6528980879147008) after the patch - Yay! 2X improvement :). -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com