Thread: Make printtup a bit faster

Make printtup a bit faster

From: Andy Fan
Usually I see printtup in perf reports with a noticeable share of the time.
Taking "SELECT * FROM pg_class" as an example, we can see:

85.65%     3.25%  postgres  postgres  [.] printtup

The high-level design of printtup is:

1. A pre-allocated StringInfo, DR_printtup.buf, is used to store the data
for each tuple.
2. For each datum in the tuple, the type-specific out function is called
to get a cstring.
3. After getting the cstring, we compute the "len" and append both the len
and the 'data' to DR_printtup.buf.
4. After all the datums are handled, socket_putmessage copies the buffer
into PqSendBuffer.
5. When PqSendBuffer fills up to PqSendBufferSize, the send syscall is used
to send the data to the client (copying it from user space to kernel space
again).
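
For reference, a heavily simplified sketch of steps 2-4 follows (this is
not the exact backend code; null handling, detoasting and the binary-format
path are omitted, and the real code wraps the length+data append up in
pq_sendcountedtext()):

    /* per-attribute loop, loosely following printtup.c */
    for (int i = 0; i < natts; i++)
    {
        Datum   attr = slot->tts_values[i];
        char   *outputstr;
        int     len;

        /* step 2: the type-specific out function returns a palloc'd cstring */
        outputstr = OutputFunctionCall(&thisState->finfo, attr);

        /* step 3: re-measure the string, then copy length + bytes into buf */
        len = strlen(outputstr);
        pq_sendint32(buf, len);
        appendBinaryStringInfo(buf, outputstr, len);
    }
    /* step 4: socket_putmessage() later copies buf into PqSendBuffer */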

Part of the slowness is caused by the "memcpy", "strlen" and palloc calls
in the out function.

8.35%     8.35%  postgres  libc.so.6  [.] __strlen_avx2     
4.27%     0.00%  postgres  libc.so.6  [.] __memcpy_avx_unaligned_erms
3.93%     3.93%  postgres  postgres  [.] palloc (partly from the out functions)
5.70%     5.70%  postgres  postgres  [.] AllocSetAlloc (partly from printtup)

My high-level proposal is to define type-specific print functions like:

oidprint(Datum datum, StringInfo buf)
textprint(Datum datum, StringInfo buf)

These functions would append both the data and the len directly into buf.

For the oidprint case, we can avoid:

a. the dedicated palloc in the oid out function, and
b. the memcpy from that palloc'd memory into DR_printtup.buf.

For the textprint case, we can additionally avoid:

c. the strlen, since we can get the length from varlena vl_len.

int2/4/8, timestamp, date and time are similar to oid, and numeric and
varchar are similar to text. That covers almost all of the commonly used
types.
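
A minimal sketch of what such print functions could look like (purely
hypothetical; a real version would use the backend's integer-printing
helpers instead of snprintf and would still have to handle nulls and the
protocol framing):

    /* append "len + data" straight into the caller's buf, no palloc'd cstring */
    static void
    oidprint(Datum datum, StringInfo buf)
    {
        char    tmp[16];
        int     len;

        len = snprintf(tmp, sizeof(tmp), "%u", DatumGetObjectId(datum));
        pq_sendint32(buf, len);
        appendBinaryStringInfo(buf, tmp, len);
    }

    static void
    textprint(Datum datum, StringInfo buf)
    {
        text   *t = DatumGetTextPP(datum);

        /* no strlen(): the length comes from the varlena header */
        pq_sendint32(buf, VARSIZE_ANY_EXHDR(t));
        appendBinaryStringInfo(buf, VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));
    }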

Hard-coding the mapping between the commonly used types and their
{type}print function OIDs does not look great, but adding a new attribute
to pg_type looks too aggressive. Anyway, that is the next topic to discuss.

If a type's print function is not defined, we can still use the out
function (and PrinttupAttrInfo caches the FmgrInfo rather than the
FunctionCallInfo, so there is some optimization possible in this step as
well).

This proposal covers steps 2 & 3. If we wanted to do something more
aggressive, we could let the xxxprint functions write to PqSendBuffer
directly, but that is more complex and needs some infrastructure changes.
The memcpy in step 4 accounts for "1.27% __memcpy_avx_unaligned_erms" in
my case above.

What do you think?

-- 
Best Regards
Andy Fan




Re: Make printtup a bit faster

From: David Rowley
On Thu, 29 Aug 2024 at 21:40, Andy Fan <zhihuifan1213@163.com> wrote:
>
>
> Usually I see printtup in perf reports with a noticeable share of the time.

> Part of the slowness is caused by the "memcpy", "strlen" and palloc calls
> in the out function.

Yeah, it's a pretty inefficient API from a performance point of view.

> My high-level proposal is to define type-specific print functions like:
>
> oidprint(Datum datum, StringInfo buf)
> textprint(Datum datum, StringInfo buf)

I think what we should do instead is make the output functions take a
StringInfo and just pass it the StringInfo where we'd like the bytes
written.

That of course would require rewriting all the output functions for
all the built-in types, so not a small task.  Extensions make that job
harder. I don't think it would be good to force extensions to rewrite
their output functions, so perhaps some wrapper function could help us
align the APIs for extensions that have not been converted yet.
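
To make that concrete, a rough sketch of the shape this could take
(hypothetical names only, not a worked-out design): converted output
functions would take the destination buffer as an extra argument, and a
compatibility wrapper would bridge types that still return a cstring.

    /* hypothetical new-style output function: writes into the caller's buffer */
    Datum
    int4out_buf(PG_FUNCTION_ARGS)
    {
        int32       arg1 = PG_GETARG_INT32(0);
        StringInfo  buf = (StringInfo) PG_GETARG_POINTER(1);

        appendStringInfo(buf, "%d", arg1);
        PG_RETURN_VOID();
    }

    /* hypothetical wrapper for not-yet-converted (e.g. extension) types */
    static void
    output_to_stringinfo(FmgrInfo *flinfo, Datum value, StringInfo buf)
    {
        if (supports_new_output_api(flinfo))            /* hypothetical check */
            NewOutputFunctionCall(flinfo, value, buf);  /* hypothetical call */
        else
        {
            char   *str = OutputFunctionCall(flinfo, value);    /* old API */

            appendStringInfoString(buf, str);
            pfree(str);
        }
    }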

There's a similar problem with input functions not having knowledge of
the input length. You only have to look at textin() to see how useful
that could be. Fixing that would probably make COPY FROM horrendously
faster. Team that up with SIMD for the delimiter char search and COPY
would go a bit faster still. Neil Conway did propose the SIMD part in [1],
but it's just not nearly as good as it could be when it still has to
perform the strlen() calls.
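
As an illustration of the textin() point: today the function receives a
NUL-terminated cstring and the length gets rediscovered via strlen()
inside cstring_to_text(); a length-aware variant could go straight to
cstring_to_text_with_len() (a sketch only; plumbing the length through to
the input functions is the actual hard part):

    /* roughly what textin() does today */
    Datum
    textin(PG_FUNCTION_ARGS)
    {
        char   *inputText = PG_GETARG_CSTRING(0);

        PG_RETURN_TEXT_P(cstring_to_text(inputText));
    }

    /* hypothetical length-aware variant: no strlen() needed */
    static text *
    textin_with_len(const char *str, int len)
    {
        return cstring_to_text_with_len(str, len);
    }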

I had planned to work on this for PG18, but I'd be happy for some
assistance if you're willing.

David

[1] https://postgr.es/m/CAOW5sYb1HprQKrzjCsrCP1EauQzZy+njZ-AwBbOUMoGJHJS7Sw@mail.gmail.com



Re: Make printtup a bit faster

From: Tom Lane
David Rowley <dgrowleyml@gmail.com> writes:
> [ redesign I/O function APIs ]
> I had planned to work on this for PG18, but I'd be happy for some
> assistance if you're willing.

I'm skeptical that such a thing will ever be practical.  To avoid
breaking un-converted data types, all the call sites would have to
support both old and new APIs.  To avoid breaking non-core callers,
all the I/O functions would have to support both old and new APIs.
That probably adds enough overhead to negate whatever benefit you'd
get.

            regards, tom lane



Re: Make printtup a bit faster

From: Andy Fan
David Rowley <dgrowleyml@gmail.com> writes:

Hello David,

>> My high-level proposal is to define type-specific print functions like:
>>
>> oidprint(Datum datum, StringInfo buf)
>> textprint(Datum datum, StringInfo buf)
>
> I think what we should do instead is make the output functions take a
> StringInfo and just pass it the StringInfo where we'd like the bytes
> written.
>
> That of course would require rewriting all the output functions for
> all the built-in types, so not a small task.  Extensions make that job
> harder. I don't think it would be good to force extensions to rewrite
> their output functions, so perhaps some wrapper function could help us
> align the APIs for extensions that have not been converted yet.

I have a similar concern to Tom's: this method looks too aggressive.
That's why I said:

"If a type's print function is not defined, we can still using the out 
function."

and

"Hard coding the relationship between [common] used type and {type}print
function OID looks not cool, Adding a new attribute in pg_type looks too 
aggressive however. Anyway this is the next topic to talk about."

What would be the extra benefit of redesigning all the out functions?

> There's a similar problem with input functions not having knowledge of
> the input length. You only have to look at textin() to see how useful
> that could be. Fixing that would probably make COPY FROM horrendously
> faster. Team that up with SIMD for the delimiter char search and COPY
> would go a bit faster still. Neil Conway did propose the SIMD part in [1],
> but it's just not nearly as good as it could be when it still has to
> perform the strlen() calls.

OK, I think I understand the need to make the in-function know the input
length, and it is good to know about the SIMD part for the delimiter char
search. strlen looks like a delimiter char search (for '\0') as well. I'm
not sure whether "strlen" has been implemented with SIMD, but if not, why
not?

> I had planned to work on this for PG18, but I'd be happy for some
> assistance if you're willing.

I have seen you do a lot of amazing work with cache-line-friendly data
structure design, branch prediction optimization and SIMD optimization.
I'd like to try this myself. I'm not sure if I can meet the target; what
if we handle the out/in functions separately (possibly by different
people)?

-- 
Best Regards
Andy Fan




Re: Make printtup a bit faster

From: David Rowley
On Fri, 30 Aug 2024 at 03:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > [ redesign I/O function APIs ]
> > I had planned to work on this for PG18, but I'd be happy for some
> > assistance if you're willing.
>
> I'm skeptical that such a thing will ever be practical.  To avoid
> breaking un-converted data types, all the call sites would have to
> support both old and new APIs.  To avoid breaking non-core callers,
> all the I/O functions would have to support both old and new APIs.
> That probably adds enough overhead to negate whatever benefit you'd
> get.

Scepticism is certainly good when it comes to such a large API change.
I don't want to argue with you, but I'd like to state a few things
about why I think you're wrong on this...

So, we currently return cstrings in our output functions. Let's take
jsonb_out() as an example: to build that cstring, we make a *new*
StringInfoData on *every call* inside JsonbToCStringWorker(). That
gives you 1024 bytes before you need to enlarge it. However, it's
maybe not all bad, as we have some size estimation there to call
enlargeStringInfo(), only that's a bit wasteful as it does a
repalloc() which memcpys the freshly allocated 1024 bytes from
initStringInfo() when the buffer doesn't yet contain any data. After
jsonb_out() has returned and we have the cstring, we've forgotten the
length of the string, so most places will immediately call strlen() or
do that indirectly via appendStringInfoString(). For larger JSON
documents, that'll likely require pulling cachelines back into L1
again. I don't know how modern CPU cacheline eviction works, but if it
were as simple as FIFO then the strlen() would flush all those
cachelines only for memcpy() to have to read them back again for
output strings larger than L1.

If we rewrote all of core's output functions to use the new API, then
the branch to test the function signature would be perfectly
predictable and amount to an extra cmp and jne/je opcode. So, I just
don't agree with the overheads negating the benefits comment. You're
probably off by 1 order of magnitude at the minimum and for
medium/large varlena types likely 2-3+ orders. Even a simple int4out
requires a palloc()/memcpy. If we were outputting lots of data, e.g.
in a COPY operation, the output buffer would seldom need to be
enlarged as it would quickly adjust to the correct size.

For the input functions, the possible gains are extensive too.
textin() is a good example: it uses cstring_to_text(), but could be
changed to use cstring_to_text_with_len(). Knowing the input string
length also opens the door to SIMD. Take int4in() as an example: if
pg_strtoint32_safe() knew its input length, there are a bunch of
prechecks that could be done with either 64-bit SWAR or with SIMD.
For example, if you knew you had an 8-char string of decimal digits,
then converting that to an int32 is quite cheap. It's impossible to
overflow an int32 with 8 decimal digits, so no overflow checks need to
be done until there are at least 10 decimal digits. ca6fde922 seems
like a good enough example of the possible gains of SIMD vs
byte-at-a-time processing. I saw some queries go 4x faster there, and
that was me trying to keep the JSON document sizes realistic.
Byte-at-a-time is just not enough to saturate RAM speed. Take DDR5,
for example: Wikipedia says it has a bandwidth of 32-64 GB/s, so
unless we discover room-temperature superconductors, we're not going
to see any massive jump in clock speeds any time soon, and with 5 or
6GHz CPUs, there's just no way to get anywhere near that bandwidth by
processing byte-at-a-time. For some sort of naive strcpy() type
function, you're going to need at least a cmp and a mov, and even if
those were latency=1 (which they're not, see [1]), you can only do 2.5
billion of those two per second on a 5GHz processor. I haven't tested,
but hypothetically (assuming latency=1) that amounts to processing
2.5GB/s, i.e. a long way from DDR5 RAM speed, and that's not taking
into account having to increment pointers to the next byte on each
loop.
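
A tiny sketch of the kind of fast path meant by the 8-digit example
(assuming the caller has already verified the input is exactly 8 ASCII
decimal digits; a real version would use SWAR or SIMD rather than a byte
loop, but even this needs no overflow checks):

    /* hypothetical: convert exactly 8 known-valid decimal digits to int32 */
    static inline int32
    parse_8_digits(const char *s)
    {
        int32   v = 0;

        for (int i = 0; i < 8; i++)
            v = v * 10 + (s[i] - '0');  /* at most 99999999, fits in int32 */
        return v;
    }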

David

[1] https://www.agner.org/optimize/instruction_tables.pdf



Re: Make printtup a bit faster

From: David Rowley
On Fri, 30 Aug 2024 at 12:10, Andy Fan <zhihuifan1213@163.com> wrote:
> What would be the extra benefit of redesigning all the out functions?

If I've understood your proposal correctly, it sounds like you want to
invent a new "print" output function for each type to output the Datum
onto a StringInfo. If that's the case, what would be the point of
having both versions? If there's anywhere we call output functions
where the resulting value isn't directly appended to a StringInfo,
then we could just use a temporary StringInfo to obtain the cstring
and its length.

David



Re: Make printtup a bit faster

From: Andy Fan
David Rowley <dgrowleyml@gmail.com> writes:

> On Fri, 30 Aug 2024 at 12:10, Andy Fan <zhihuifan1213@163.com> wrote:
>> What would be the extra benefit of redesigning all the out functions?
>
> If I've understood your proposal correctly, it sounds like you want to
> invent a new "print" output function for each type to output the Datum
> onto a StringInfo.

Mostly yes, but not for [every type at once], just for the [commonly used
types], like int2/4/8, float4/8, date/time/timestamp, text and so on.

> If that's the case, what would be the point of having both versions?

The biggest benefit would be compatibility.

In my opinion, the print function (it does not need to be in pg_type at
all) is an optional optimization: in some performance-critical paths we
can replace the out function with the print function, as in printtup. If
such a performance-critical path finds a type without a print function
defined, it just keeps the old way.

It would be kind of like the support functions we have for procs, but for
data types. This way, the changes would be much smaller and could be done
step by step.
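
A minimal sketch of that dispatch (hypothetical helper and field names;
whether the mapping is hardcoded or comes from a new pg_type column is
exactly the open question above):

    typedef void (*PrintFunc) (Datum datum, StringInfo buf);

    /* hypothetical lookup: a print function for common types, else NULL */
    static PrintFunc
    lookup_print_func(Oid typid)
    {
        switch (typid)
        {
            case OIDOID:
                return oidprint;
            case TEXTOID:
                return textprint;
            default:
                return NULL;    /* fall back to the type's out function */
        }
    }

    /* in a performance-critical path such as printtup: */
    if (thisState->printfunc != NULL)
        thisState->printfunc(attr, buf);    /* fast path */
    else
    {
        char   *s = OutputFunctionCall(&thisState->finfo, attr);   /* old way */
        int     len = strlen(s);

        pq_sendint32(buf, len);
        appendBinaryStringInfo(buf, s, len);
    }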

> If there's anywhere we call output functions
> where the resulting value isn't directly appended to a StringInfo,
> then we could just use a temporary StringInfo to obtain the cstring
> and its length.

I think this is true, but it requires some changes in the callers' code.

-- 
Best Regards
Andy Fan




Re: Make printtup a bit faster

From: David Rowley
On Fri, 30 Aug 2024 at 13:04, Andy Fan <zhihuifan1213@163.com> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > If there's anywhere we call output functions
> > where the resulting value isn't directly appended to a StringInfo,
> > then we could just use a temporary StringInfo to obtain the cstring
> > and its length.
>
> I think this is true, but it requires some changes in the callers' code.

Yeah, calling code would need to be changed to take advantage of the
new API; however, the differences in which types support which API
could be hidden inside OutputFunctionCall(). That function could just
fake up a StringInfo for any types that only support the old cstring
API. That means we don't need to add handling for both cases
everywhere we need to call the output function. It's possible that
could make some operations slightly slower when only the old API is
available, but then maybe not as we do now have read-only StringInfos.
Maybe the StringInfoData.data field could just be set to point to the
given cstring using initReadOnlyStringInfo() rather than doing
appendBinaryStringInfo() onto yet another buffer for the old API.
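
A rough sketch of what hiding that inside OutputFunctionCall() could look
like (the flag check and the new-style call are hypothetical names;
initReadOnlyStringInfo() is the existing read-only StringInfo initializer
mentioned above):

    /* hypothetical: fill 'result' either directly (new API) or by wrapping
     * the old API's cstring without another copy */
    static void
    OutputFunctionCallInfo(FmgrInfo *flinfo, Datum value, StringInfoData *result)
    {
        if (flinfo_supports_stringinfo(flinfo))     /* hypothetical check */
        {
            initStringInfo(result);
            OutputFunctionCallTo(flinfo, value, result);    /* hypothetical */
        }
        else
        {
            /* old cstring API: just point the StringInfo at the palloc'd string */
            char   *str = OutputFunctionCall(flinfo, value);

            initReadOnlyStringInfo(result, str, strlen(str));
        }
    }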

David



Re: Make printtup a bit faster

From: Andy Fan
David Rowley <dgrowleyml@gmail.com> writes:

> On Fri, 30 Aug 2024 at 13:04, Andy Fan <zhihuifan1213@163.com> wrote:
>>
>> David Rowley <dgrowleyml@gmail.com> writes:
>> > If there's anywhere we call output functions
>> > where the resulting value isn't directly appended to a StringInfo,
>> > then we could just use a temporary StringInfo to obtain the cstring
>> > and its length.
>>
>> I think this is true, but it requires some changes in the callers' code.
>
> Yeah, calling code would need to be changed to take advantage of the
> new API; however, the differences in which types support which API
> could be hidden inside OutputFunctionCall(). That function could just
> fake up a StringInfo for any types that only support the old cstring
> API. That means we don't need to add handling for both cases
> everywhere we need to call the output function.

We can do this, but then the printtup case (which stands for any
performance-critical path) would still need to change and stop using
OutputFunctionCall(), since with the fake StringInfo a memcpy is needed
again, IIUC.

Besides the above, my major concern about your proposal is that it needs
to change [all the types' out functions at once], which is too aggressive
for me. In a fresh setup without any extensions created, "SELECT count(*)
FROM pg_type" already returns 627, so the other part of my previous reply
is more important to me.

It is great that both of us feel that the current strategy is not good
for performance :)

-- 
Best Regards
Andy Fan




Re: Make printtup a bit faster

From: Andreas Karlsson
On 8/29/24 1:51 PM, David Rowley wrote:
> I had planned to work on this for PG18, but I'd be happy for some
> assistance if you're willing.

I am interested in working on this, unless Andy Fan wants to do this 
work. :) I believe that optimizing the out, in and send functions would 
be worth the pain. I get Tom's objections but I do not think adding a 
small check would add much overhead compared to the gains we can get.

And given that all of in, out and send could be optimized I do not like 
the idea of duplicating all three in the catalog.

David, have you given any thought to the cleanest way to check whether
the new API or the old one should be used for these functions? If not, I
can figure out something myself; just wondering if you already had
something in mind.

Andreas




Re: Make printtup a bit faster

From: Andy Fan
Andy Fan <zhihuifan1213@163.com> writes:

> The attached is a PoC of this idea. No matter which method is adopted
> (rewriting all the out functions or an optional print function), I think
> the benefit will be similar. In the test case below, it shows a 10%+
> improvement (0.134ms vs 0.110ms).

After working on more {type}_print functions, I'm thinking they are
pretty much like a third IO function, which adds confusing maintenance
effort, so I agree that refactoring the existing out functions is the
better idea. I'd like to work on the _print function bodies first for
easy review and testing; after all, if some common issues exist in these
changes, it is better to know that before we work on the 700+ out
functions.

-- 
Best Regards
Andy Fan




Re: Make printtup a bit faster

From: Andy Fan
Hello David & Andreas, 

> On 8/29/24 1:51 PM, David Rowley wrote:
>> I had planned to work on this for PG18, but I'd be happy for some
>> assistance if you're willing.
>
> I am interested in working on this, unless Andy Fan wants to do this
> work. :) I believe that optimizing the out, in and send functions would
> be worth the pain. I get Tom's objections but I do not think adding a
> small check would add much overhead compared to the gains we can get.

Just to be clear, I'd like to work on the out functions only, due to my
internal assignment. (Since David planned this for PG18, it is better to
say this clearly early on.) I'll post part of the out (print) function
refactoring in the next 2 days. I think it deserves a double check before
working on *all* the out functions.

select count(*), count(distinct typoutput) from pg_type;
 count | count 
-------+-------
   621 |    97
(1 row)

select typoutput, count(*) from pg_type group by typoutput having
    count(*) > 1 order by 2 desc;
    
    typoutput    | count 
-----------------+-------
 array_out       |   296
 record_out      |   214
 multirange_out  |     6
 range_out       |     6
 varcharout      |     3
 int4out         |     2
 timestamptz_out |     2
 nameout         |     2
 textout         |     2
(9 rows)

-- 
Best Regards
Andy Fan




Re: Make printtup a bit faster

From: Tom Lane
Andy Fan <zhihuifan1213@163.com> writes:
> Just to be clear, I'd like to work on the out functions only, due to my
> internal assignment. (Since David planned this for PG18, it is better to
> say this clearly early on.) I'll post part of the out (print) function
> refactoring in the next 2 days. I think it deserves a double check before
> working on *all* the out functions.

Well, sure.  You *cannot* write a patch that breaks existing output
functions.  Not at the start, and not at the end either.  You
should focus on writing the infrastructure and, for starters,
converting just a few output functions as a demonstration.  If
that gets accepted then you can work on converting other output
functions a few at a time.  But they'll never all be done, because
we can't realistically force extensions to convert.

There are lots of examples of similar incremental conversions in our
project's history.  I think the most recent example is the "soft error
handling" work (d9f7f5d32, ccff2d20e, and many follow-on patches).

            regards, tom lane