Re: [PATCH] llvmjit: always add the simplifycfg pass - Mailing list pgsql-hackers

From Pierre Ducroquet
Subject Re: [PATCH] llvmjit: always add the simplifycfg pass
Date
Msg-id pJBA_YlJwSojFSBFctsdfSOfoSv2cPS9u68eH1niIUFzYj8eImTRvNCx1jaKGbBsHMM2o6plKbQZlBcoLqG7GjK0scAeuior6SkmggWrmLs=@pinaraf.info
In response to Re: [PATCH] llvmjit: always add the simplifycfg pass  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Thursday, January 29, 2026 at 12:19 AM, Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2026-01-28 07:56:46 +0000, Pierre Ducroquet wrote:
>
> > Here is a rebased version of the patch with a rewrite of the comment. Thank
> > you again for your previous review. FYI, I've tried adding other passes, but
> > none had a similar benefit-to-cost ratio. The gains would more likely come
> > from replacing O3 with an explicit list of passes.
>
>
> I agree that we should have a better list of passes. I'm a bit worried that
> having an explicit list of passes that we manage ourselves is going to be
> somewhat of a pain to maintain across llvm versions, but ...
>
> WRT passes that might be worth having even with -O0 - running duplicate
> function merging early on could be quite useful, particularly because we won't
> inline the deform routines anyway.
>
> > > I did some benchmarks on some TPCH queries (1 and 4) and I got these
> > > results. Note that for these tests I set jit_optimize_above_cost=1000000
> > > so that it forces the use of the default<O0> pipeline with simplifycfg.
>
>
> FYI, you can use -1 to just disable it, instead of having to rely on a specific
> cost.
>
> > > Master Q1:
> > > Timing: Generation 1.553 ms (Deform 0.573 ms), Inlining 0.052 ms, Optimization 95.571 ms, Emission 58.941 ms, Total 156.116 ms
> > > Execution Time: 38221.318 ms
> > >
> > > Patch Q1:
> > > Timing: Generation 1.477 ms (Deform 0.534 ms), Inlining 0.040 ms, Optimization 95.364 ms, Emission 58.046 ms, Total 154.927 ms
> > > Execution Time: 38257.797 ms
> > >
> > > Master Q4:
> > > Timing: Generation 0.836 ms (Deform 0.309 ms), Inlining 0.086 ms, Optimization 5.098 ms, Emission 6.963 ms, Total 12.983 ms
> > > Execution Time: 19512.134 ms
> > >
> > > Patch Q4:
> > > Timing: Generation 0.802 ms (Deform 0.294 ms), Inlining 0.090 ms, Optimization 5.234 ms, Emission 6.521 ms, Total 12.648 ms
> > > Execution Time: 16051.483 ms
> > >
> > > For Q4 I see a small increase in the Optimization phase, but we get a good
> > > improvement in execution time. For Q1 the results are almost
> > > the same.
>
>
> These queries are all simple enough that I'm not sure this is a particularly
> good benchmark for optimization speed. In particular, the deform routines
> don't have to deal with a lot of columns and there aren't a lot of functions
> (although I guess that shouldn't really matter WRT simplifycfg).
>
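As a side note on the -1 suggestion above, here is a minimal configuration sketch for benchmarking the O0-only path. The database name `tpch` and the `demo` table are placeholders for illustration; the GUC names and the -1 semantics are the standard PostgreSQL JIT settings:

```shell
# Force the cheap (O0) JIT path: compile every query, but never run the
# expensive second-stage optimization pipeline.
psql -d tpch <<'SQL'
SET jit = on;
SET jit_above_cost = 0;              -- JIT-compile everything
SET jit_optimize_above_cost = -1;    -- disable the optimized pipeline
EXPLAIN (ANALYZE) SELECT * FROM demo WHERE a = 42;
SQL
```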

simplifycfg seems to do more on the deforming functions than I initially anticipated, which explains the performance
benefits. I've written patches to our C code to generate better IR, but I discovered quite a puzzle.
The biggest gain I see in the generated amd64 code for a very simple query (SELECT * FROM demo WHERE a = 42) with
simplifycfg is that it prevents spilling to the stack: it does what mem2reg was supposed to be doing.


Running opt -debug-pass-manager on a deform function, I get:
- with default<O0>,mem2reg

Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
Running analysis: TargetLibraryAnalysis on deform_0_1
Running pass: PromotePass on deform_0_1 (56 instructions)
Running analysis: DominatorTreeAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1
Running analysis: TargetIRAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    %rax, -48(%rsp)                 # 8-byte Spill
        movq    32(%rdi), %rax
        movq    %rax, -40(%rsp)                 # 8-byte Spill
        movq    %rdi, %rax
        addq    $4, %rax
        movq    %rax, -32(%rsp)                 # 8-byte Spill
        movq    %rdi, %rax
        addq    $6, %rax
        movq    %rax, -24(%rsp)                 # 8-byte Spill
        movq    %rdi, %rax
        addq    $72, %rax
        movq    %rax, -16(%rsp)                 # 8-byte Spill
...



- with default<O0>,simplifycfg

Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
Running analysis: TargetLibraryAnalysis on deform_0_1
Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
Running analysis: TargetIRAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    32(%rdi), %rsi
        movq    64(%rdi), %rcx
        movq    16(%rcx), %rcx
        movzbl  22(%rcx), %edx
        movslq  %edx, %rdx
        addq    %rdx, %rcx
        movl    72(%rdi), %edx
...

- with default<O0>,simplifycfg,mem2reg

Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
Running analysis: TargetIRAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1
Running pass: PromotePass on deform_0_1 (46 instructions)
Running analysis: DominatorTreeAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    32(%rdi), %rsi
        movq    64(%rdi), %rcx
        movq    16(%rcx), %rcx
        movzbl  22(%rcx), %edx
        movb    $0, (%rsi)
...


So even when running only simplifycfg, the stack allocation goes away.
I am trying to figure that one out, but I suspect we are no longer doing the optimizations we thought we were doing
with mem2reg only, hence the (surprising) speed gains with simplifycfg.
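The comparison above can be reproduced outside the server. A sketch, assuming the JIT module has been dumped to a file named deform.ll (the file name is hypothetical; one way to obtain it is the jit_dump_bitcode developer option plus llvm-dis), using the new-pass-manager syntax of opt from LLVM 19:

```shell
# Run the two pipelines over the same input module.
opt -passes='default<O0>,mem2reg'     -debug-pass-manager -S deform.ll -o m2r.ll
opt -passes='default<O0>,simplifycfg' -debug-pass-manager -S deform.ll -o scfg.ll

# Count stack spills in the generated amd64 code for each variant.
llc -O0 m2r.ll  -o - | grep -c 'Spill'
llc -O0 scfg.ll -o - | grep -c 'Spill'
```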


Note:
Ubuntu LLVM version 19.1.7
  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: znver5



