Thread: Should we add debug_parallel_query=regress to CI?
Hi, In https://postgr.es/m/CAH2-WzkYXSnM60ZNo-vQLxFoGzHLHFD0x%3DiPHF6VGxiZmWUuwQ%40mail.gmail.com Peter wrote: On 2025-03-05 09:37:05 -0500, Peter Geoghegan wrote: > Committed just now. Thanks again. But since had to revert, due to BF issues: commit d00107cd63e780753aa25563fa37603369997d0c Author: Peter Geoghegan <pg@bowt.ie> Date: 2025-03-05 10:27:31 -0500 Revert "Show index search count in EXPLAIN ANALYZE." This reverts commit 5ead85fbc81162ab1594f656b036a22e814f96b3. This commit shows test failures with debug_parallel_query=regress. The underlying issue needs to be debugged, so revert for now. Post-commit issues due to debug_parallel_query=regress seem rather common, surely not helped by CI/cfbot not flagging them. I wonder if we ought to make one of the CI tasks use debug_parallel_query=regress, to avoid that problem? Greetings, Andres
On Wed, Mar 5, 2025 at 11:06 AM Andres Freund <andres@anarazel.de> wrote: > Post-commit issues due to debug_parallel_query=regress seem rather common, > surely not helped by CI/cfbot not flagging them. I wonder if we ought to make > one of the CI tasks use debug_parallel_query=regress, to avoid that problem? That would certainly be nice. -- Peter Geoghegan
Andres Freund <andres@anarazel.de> writes: > Post-commit issues due to debug_parallel_query=regress seem rather common, > surely not helped by CI/cfbot not flagging them. I wonder if we ought to make > one of the CI tasks use debug_parallel_query=regress, to avoid that problem? Yeah, it certainly seems like a test coverage gap. However, we seem to be moving towards a situation where each type of CI run is a special snowflake that differs in multiple dimensions from other types. That might make it difficult to figure out which dimension is responsible for a particular failure. (OTOH, the same can be said of the buildfarm, and we've survived regardless. So maybe I'm worried over nothing.) regards, tom lane
Hi, On 2025-03-05 11:19:46 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Post-commit issues due to debug_parallel_query=regress seem rather common, > > surely not helped by CI/cfbot not flagging them. I wonder if we ought to make > > one of the CI tasks use debug_parallel_query=regress, to avoid that problem? > > Yeah, it certainly seems like a test coverage gap. I decided to use freebsd, as it's a relatively fast task. Additionally I thought it might be interesting to do this testing on the test run that also does debug_write_read_parse_plan_trees etc. I tested it by intentionally not including the revert, and it indeed finds the problem (not that that was really in doubt, but it seemed worth verifying). https://cirrus-ci.com/task/5782413399293952?logs=test_world#L214 > However, we seem to be moving towards a situation where each type of CI run > is a special snowflake that differs in multiple dimensions from other types. > That might make it difficult to figure out which dimension is responsible > for a particular failure. True, but I don't really see an alternative. Having dedicated tasks for testing just debug_parallel_query=regress (and about half a dozen other things) on their own seems like a *lot* of resource usage for the gain. > (OTOH, the same can be said of the buildfarm, and we've survived > regardless. So maybe I'm worried over nothing.) The alternative seems to be to figure out the problem after commit, with similar issues, as you point out. I'd much rather have to spend a minute analyzing why one task triggers an issue than doing so under pressure after commit. I guess we could be add a "standardized" section at the top of each task describing their oddities? Not sure it's worth it. Greetings, Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes: > On 2025-03-05 11:19:46 -0500, Tom Lane wrote: >> However, we seem to be moving towards a situation where each type of CI run >> is a special snowflake that differs in multiple dimensions from other types. >> That might make it difficult to figure out which dimension is responsible >> for a particular failure. > True, but I don't really see an alternative. Having dedicated tasks for > testing just debug_parallel_query=regress (and about half a dozen other > things) on their own seems like a *lot* of resource usage for the gain. Agreed. > I guess we could be add a "standardized" section at the top of each task > describing their oddities? Not sure it's worth it. I think this does need to be documented somewhere/somehow, just so that people don't waste time focusing on "it's failing on FreeBSD" when the actual cause is some other thing we happened to load onto that task. regards, tom lane
Hi, On 2025-03-05 12:29:15 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I guess we could be add a "standardized" section at the top of each task > > describing their oddities? Not sure it's worth it. > > I think this does need to be documented somewhere/somehow, just so > that people don't waste time focusing on "it's failing on FreeBSD" > when the actual cause is some other thing we happened to load > onto that task. 0002 is a first draft for one way to do this. Of course this still requires somebody analyzing a failure to look at cirrus.tasks.yml, but I don't know if we can do a whole lot about that? I wondered about making the SPECIAL thing an environment variable instead of a comment, that way it'd probably be visible to cfbot. But I don't really see what it could do with that information. I guess we could make the SPECIAL: comments into echos in a special_script: that way it would show up as an explicit "section" in the per-task CI output. But I don't know how much that would help, the failures due to the tasks specialness could be during build, testing etc, so the "special" section won't necessarily be close to the failing step. Greetings, Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes: > On 2025-03-05 12:29:15 -0500, Tom Lane wrote: >> I think this does need to be documented somewhere/somehow, just so >> that people don't waste time focusing on "it's failing on FreeBSD" >> when the actual cause is some other thing we happened to load >> onto that task. > 0002 is a first draft for one way to do this. Works for me. > Of course this still requires somebody analyzing a failure to look at > cirrus.tasks.yml, but I don't know if we can do a whole lot about that? If it's far away from the actual task-spec code it probably won't get maintained very well, so putting it here seems OK to me. regards, tom lane
Hi, On 2025-03-05 13:10:00 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2025-03-05 12:29:15 -0500, Tom Lane wrote: > >> I think this does need to be documented somewhere/somehow, just so > >> that people don't waste time focusing on "it's failing on FreeBSD" > >> when the actual cause is some other thing we happened to load > >> onto that task. > > > 0002 is a first draft for one way to do this. > > Works for me. Cool. Thanks for looking at it. Pushed both patches. Greetings, Andres Freund
On Wed, Mar 5, 2025 at 1:34 PM Andres Freund <andres@anarazel.de> wrote: > Pushed both patches. Thanks! -- Peter Geoghegan