Re: Why is pq_begintypsend so slow? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Why is pq_begintypsend so slow?
Date
Msg-id 20200114224520.gh4qktr5kuhu5pyi@alap3.anarazel.de
Whole thread Raw
In response to Re: Why is pq_begintypsend so slow?  (Andres Freund <andres@anarazel.de>)
Responses Re: Why is pq_begintypsend so slow?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2020-01-13 15:18:04 -0800, Andres Freund wrote:
> On 2020-01-11 22:32:45 -0500, Tom Lane wrote:
> > I wrote:
> > > I saw at this point that the remaining top spots were
> > > enlargeStringInfo and appendBinaryStringInfo, so I experimented
> > > with inlining them (again, see patch below).  That *did* move
> > > the needle: down to 72691 ms, or 20% better than HEAD.
> >
> > Oh ... marking the test in the inline part of enlargeStringInfo()
> > as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
> > Might be over-optimizing for this particular case, perhaps
> >
> > (But ... I'm not finding these numbers to be super reproducible
> > across different ASLR layouts.  So take it with a grain of salt.)
>
> FWIW, I've also observed, in another thread (the node func generation
> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> allows the compiler to sometimes optimize away the strlen. But if
> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> unconditionally, successive appends cannot optimize away memory accesses
> for ->len/->data.

With a set of patches doing so, int4send itself is not a significant
factor for my test benchmark [1] anymore. The assembly looks about as
good as one could hope, I think:

# save rbx on the stack
   0x00000000004b7f90 <+0>:    push   %rbx
   0x00000000004b7f91 <+1>:    sub    $0x20,%rsp
# store integer to be sent into rbx
   0x00000000004b7f95 <+5>:    mov    0x20(%rdi),%rbx
# palloc length argument
   0x00000000004b7f99 <+9>:    mov    $0x9,%edi
   0x00000000004b7f9e <+14>:    callq  0x5d9aa0 <palloc>
# store integer in buffer (ebx is 4 byte portion of rbx)
   0x00000000004b7fa3 <+19>:    movbe  %ebx,0x4(%rax)
# store varlena header
   0x00000000004b7fa8 <+24>:    movl   $0x20,(%rax)
# restore stack and rbx registerz
   0x00000000004b7fae <+30>:    add    $0x20,%rsp
   0x00000000004b7fb2 <+34>:    pop    %rbx
   0x00000000004b7fb3 <+35>:    retq

All the $0x20 constants are a bit confusing, but they just happen to be
the same for int4send. It's the size of the stack frame,
offset for FunctionCallInfoBaseData->args[0], the varlena header (and then the stack
frame again) respectively.

Note that I had to annotate palloc with __attribute__((malloc)) to make
the compiler understand that palloc's returned value will not alias with
anything problematic (e.g. the potential of aliasing with fcinfo
prevents optimizing to the above without that annotation).  I think such
annotations would be a good idea anyway, precisely because they allow
the compiler to optimize code significantly better.


These together yields about a 1.8x speedup for me. The profile shows
that the overhead now is overwhelmingly elsewhere:
+   26.30%  postgres  postgres          [.] CopyOneRowTo
+   13.40%  postgres  postgres          [.] tts_buffer_heap_getsomeattrs
+   10.61%  postgres  postgres          [.] AllocSetAlloc
+    9.26%  postgres  libc-2.29.so      [.] __memmove_avx_unaligned_erms
+    7.32%  postgres  postgres          [.] SendFunctionCall
+    6.02%  postgres  postgres          [.] palloc
+    4.45%  postgres  postgres          [.] int4send
+    3.68%  postgres  libc-2.29.so      [.] _IO_fwrite
+    2.71%  postgres  postgres          [.] heapgettup_pagemode
+    1.96%  postgres  postgres          [.] AllocSetReset
+    1.83%  postgres  postgres          [.] CopySendEndOfRow
+    1.75%  postgres  libc-2.29.so      [.] _IO_file_xsputn@@GLIBC_2.2.5
+    1.60%  postgres  postgres          [.] ExecStoreBufferHeapTuple
+    1.57%  postgres  postgres          [.] DoCopyTo
+    1.16%  postgres  postgres          [.] memcpy@plt
+    1.07%  postgres  postgres          [.] heapgetpage

Even without using the new pq_begintypesend_ex()/initStringInfoEx(), the
generated code is still considerably better than before, yielding a
1.58x speedup. Tallocator overhead unsurprisingly is higher:
+   24.93%  postgres  postgres          [.] CopyOneRowTo
+   17.10%  postgres  postgres          [.] AllocSetAlloc
+   10.09%  postgres  postgres          [.] tts_buffer_heap_getsomeattrs
+    6.50%  postgres  libc-2.29.so      [.] __memmove_avx_unaligned_erms
+    5.99%  postgres  postgres          [.] SendFunctionCall
+    5.11%  postgres  postgres          [.] palloc
+    3.95%  postgres  libc-2.29.so      [.] _int_malloc
+    3.38%  postgres  postgres          [.] int4send
+    2.54%  postgres  postgres          [.] heapgettup_pagemode
+    2.11%  postgres  libc-2.29.so      [.] _int_free
+    2.06%  postgres  postgres          [.] MemoryContextReset
+    2.02%  postgres  postgres          [.] AllocSetReset
+    1.97%  postgres  libc-2.29.so      [.] _IO_fwrite
+    1.47%  postgres  postgres          [.] DoCopyTo
+    1.14%  postgres  postgres          [.] ExecStoreBufferHeapTuple
+    1.06%  postgres  libc-2.29.so      [.] _IO_file_xsputn@@GLIBC_2.2.5
+    1.04%  postgres  libc-2.29.so      [.] malloc


Adding a few pg_restrict*, and using appendBinaryStringInfoNT instead of
appendBinaryStringInfo in CopySend* gains another 1.05x.


This does result in some code growth, but given the size of the
improvements, and that the improvements are significant even without
code changes to callsites, that seems worth it.

before:
   text       data        bss        dec        hex    filename
8482739     172304     204240    8859283     872e93    src/backend/postgres
after:
   text       data        bss        dec        hex    filename
8604300     172304     204240    8980844     89096c    src/backend/postgres

Regards,

Andres

[1]
CREATE TABLE lotsaints4(c01 int4 NOT NULL, c02 int4 NOT NULL, c03 int4 NOT NULL, c04 int4 NOT NULL, c05 int4 NOT NULL,
c06int4 NOT NULL, c07 int4 NOT NULL, c08 int4 NOT NULL, c09 int4 NOT NULL, c10 int4 NOT NULL);
 
INSERT INTO lotsaints4 SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
((random()* 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
((random()* 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int FROM generate_series(1,
2000000);
VACUUM FREEZE lotsaints4;
COPY lotsaints4 TO '/dev/null' WITH binary;

CREATE TABLE lotsaints8(c01 int8 NOT NULL, c02 int8 NOT NULL, c03 int8 NOT NULL, c04 int8 NOT NULL, c05 int8 NOT NULL,
c06int8 NOT NULL, c07 int8 NOT NULL, c08 int8 NOT NULL, c09 int8 NOT NULL, c10 int8 NOT NULL);
 
INSERT INTO lotsaints8 SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
((random()* 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int,
((random()* 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int FROM generate_series(1,
2000000);
VACUUM FREEZE lotsaints8;
COPY lotsaints8 TO '/dev/null' WITH binary;

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Add FOREIGN to ALTER TABLE in pg_dump
Next
From: Alexander Korotkov
Date:
Subject: Re: Avoid full GIN index scan when possible