Thread: Should we add debug_parallel_query=regress to CI?

Should we add debug_parallel_query=regress to CI?

From
Andres Freund
Date:
Hi,

In https://postgr.es/m/CAH2-WzkYXSnM60ZNo-vQLxFoGzHLHFD0x%3DiPHF6VGxiZmWUuwQ%40mail.gmail.com
Peter wrote:

On 2025-03-05 09:37:05 -0500, Peter Geoghegan wrote:
> Committed just now. Thanks again.

But since had to revert, due to BF issues:

commit d00107cd63e780753aa25563fa37603369997d0c
Author: Peter Geoghegan <pg@bowt.ie>
Date:   2025-03-05 10:27:31 -0500

    Revert "Show index search count in EXPLAIN ANALYZE."

    This reverts commit 5ead85fbc81162ab1594f656b036a22e814f96b3.

    This commit shows test failures with debug_parallel_query=regress.  The
    underlying issue needs to be debugged, so revert for now.


Post-commit issues due to debug_parallel_query=regress seem rather common,
surely not helped by CI/cfbot not flagging them. I wonder if we ought to make
one of the CI tasks use debug_parallel_query=regress, to avoid that problem?

Greetings,

Andres



Re: Should we add debug_parallel_query=regress to CI?

From
Peter Geoghegan
Date:
On Wed, Mar 5, 2025 at 11:06 AM Andres Freund <andres@anarazel.de> wrote:
> Post-commit issues due to debug_parallel_query=regress seem rather common,
> surely not helped by CI/cfbot not flagging them. I wonder if we ought to make
> one of the CI tasks use debug_parallel_query=regress, to avoid that problem?

That would certainly be nice.

--
Peter Geoghegan



Re: Should we add debug_parallel_query=regress to CI?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Post-commit issues due to debug_parallel_query=regress seem rather common,
> surely not helped by CI/cfbot not flagging them. I wonder if we ought to make
> one of the CI tasks use debug_parallel_query=regress, to avoid that problem?

Yeah, it certainly seems like a test coverage gap.  However, we seem to
be moving towards a situation where each type of CI run is a special
snowflake that differs in multiple dimensions from other types.
That might make it difficult to figure out which dimension is
responsible for a particular failure.

(OTOH, the same can be said of the buildfarm, and we've survived
regardless.  So maybe I'm worried over nothing.)

            regards, tom lane



Re: Should we add debug_parallel_query=regress to CI?

From
Andres Freund
Date:
Hi,

On 2025-03-05 11:19:46 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Post-commit issues due to debug_parallel_query=regress seem rather common,
> > surely not helped by CI/cfbot not flagging them. I wonder if we ought to make
> > one of the CI tasks use debug_parallel_query=regress, to avoid that problem?
> 
> Yeah, it certainly seems like a test coverage gap.

I decided to use freebsd, as it's a relatively fast task.  Additionally I
thought it might be interesting to do this testing on the test run that also
does debug_write_read_parse_plan_trees etc.

I tested it by intentionally not including the revert, and it indeed finds the
problem (not that that was really in doubt, but it seemed worth verifying).

https://cirrus-ci.com/task/5782413399293952?logs=test_world#L214



> However, we seem to be moving towards a situation where each type of CI run
> is a special snowflake that differs in multiple dimensions from other types.
> That might make it difficult to figure out which dimension is responsible
> for a particular failure.

True, but I don't really see an alternative. Having dedicated tasks for
testing just debug_parallel_query=regress (and about half a dozen other
things) on their own seems like a *lot* of resource usage for the gain.


> (OTOH, the same can be said of the buildfarm, and we've survived
> regardless.  So maybe I'm worried over nothing.)

The alternative seems to be to figure out the problem after commit, with
similar issues, as you point out. I'd much rather have to spend a minute
analyzing why one task triggers an issue than doing so under pressure after
commit.

I guess we could be add a "standardized" section at the top of each task
describing their oddities? Not sure it's worth it.

Greetings,

Andres Freund

Attachment

Re: Should we add debug_parallel_query=regress to CI?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2025-03-05 11:19:46 -0500, Tom Lane wrote:
>> However, we seem to be moving towards a situation where each type of CI run
>> is a special snowflake that differs in multiple dimensions from other types.
>> That might make it difficult to figure out which dimension is responsible
>> for a particular failure.

> True, but I don't really see an alternative. Having dedicated tasks for
> testing just debug_parallel_query=regress (and about half a dozen other
> things) on their own seems like a *lot* of resource usage for the gain.

Agreed.

> I guess we could be add a "standardized" section at the top of each task
> describing their oddities? Not sure it's worth it.

I think this does need to be documented somewhere/somehow, just so
that people don't waste time focusing on "it's failing on FreeBSD"
when the actual cause is some other thing we happened to load
onto that task.

            regards, tom lane



Re: Should we add debug_parallel_query=regress to CI?

From
Andres Freund
Date:
Hi,

On 2025-03-05 12:29:15 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I guess we could be add a "standardized" section at the top of each task
> > describing their oddities? Not sure it's worth it.
> 
> I think this does need to be documented somewhere/somehow, just so
> that people don't waste time focusing on "it's failing on FreeBSD"
> when the actual cause is some other thing we happened to load
> onto that task.

0002 is a first draft for one way to do this.

Of course this still requires somebody analyzing a failure to look at
cirrus.tasks.yml, but I don't know if we can do a whole lot about that?

I wondered about making the SPECIAL thing an environment variable instead of a
comment, that way it'd probably be visible to cfbot. But I don't really see
what it could do with that information.

I guess we could make the SPECIAL: comments into echos in a
  special_script:

that way it would show up as an explicit "section" in the per-task CI
output. But I don't know how much that would help, the failures due to the
tasks specialness could be during build, testing etc, so the "special" section
won't necessarily be close to the failing step.

Greetings,

Andres Freund

Attachment

Re: Should we add debug_parallel_query=regress to CI?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2025-03-05 12:29:15 -0500, Tom Lane wrote:
>> I think this does need to be documented somewhere/somehow, just so
>> that people don't waste time focusing on "it's failing on FreeBSD"
>> when the actual cause is some other thing we happened to load
>> onto that task.

> 0002 is a first draft for one way to do this.

Works for me.

> Of course this still requires somebody analyzing a failure to look at
> cirrus.tasks.yml, but I don't know if we can do a whole lot about that?

If it's far away from the actual task-spec code it probably won't
get maintained very well, so putting it here seems OK to me.

            regards, tom lane



Re: Should we add debug_parallel_query=regress to CI?

From
Andres Freund
Date:
Hi,

On 2025-03-05 13:10:00 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2025-03-05 12:29:15 -0500, Tom Lane wrote:
> >> I think this does need to be documented somewhere/somehow, just so
> >> that people don't waste time focusing on "it's failing on FreeBSD"
> >> when the actual cause is some other thing we happened to load
> >> onto that task.
> 
> > 0002 is a first draft for one way to do this.
> 
> Works for me.

Cool. Thanks for looking at it.

Pushed both patches.

Greetings,

Andres Freund



Re: Should we add debug_parallel_query=regress to CI?

From
Peter Geoghegan
Date:
On Wed, Mar 5, 2025 at 1:34 PM Andres Freund <andres@anarazel.de> wrote:
> Pushed both patches.

Thanks!
--
Peter Geoghegan