Thread: Make printtup a bit faster
Usually I see printtup in the perf report with a noticeable ratio. Take "SELECT * FROM pg_class" for example, we can see:

85.65% 3.25% postgres postgres [.] printtup

The high-level design of printtup is:

1. Use a pre-allocated StringInfo, DR_printtup.buf, to store the data for each tuple.
2. For each datum in the tuple, call the type-specific out function and get a cstring.
3. After getting the cstring, figure out its "len" and append both len and data into DR_printtup.buf.
4. After all the datums are handled, socket_putmessage copies them into PqSendBuffer.
5. When the usage of PqSendBuffer reaches PqSendBufferSize, use the send syscall to send the data to the client (copying the data from user space to kernel space again).

Part of the slowness is caused by "memcpy", "strlen" and palloc in the out functions:

8.35% 8.35% postgres libc.so.6 [.] __strlen_avx2
4.27% 0.00% postgres libc.so.6 [.] __memcpy_avx_unaligned_erms
3.93% 3.93% postgres postgres [.] palloc (partly caused by the out functions)
5.70% 5.70% postgres postgres [.] AllocSetAlloc (partly caused by printtup)

My high-level proposal is to define type-specific print functions like:

oidprint(Datum datum, StringInfo buf)
textprint(Datum datum, StringInfo buf)

Such a function appends both data and len into buf directly. For the oidprint case, we can avoid:

1. the dedicated palloc in the oid out function;
2. the memcpy from that memory into DR_printtup.buf.

For the textprint case, we can additionally avoid:

3. the strlen, since we can figure out the length from varlena.vl_len.

int2/4/8/timestamp/date/time are similar to oid, and numeric/varchar are similar to text. This covers almost all the commonly used types. Hard-coding the relationship between the commonly used types and the {type}print function OIDs doesn't look elegant, but adding a new attribute to pg_type looks too aggressive. Anyway, that is the next topic to talk about.
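For illustration, the shape of the proposed functions can be sketched outside the server with a simplified stand-in for StringInfo (the Buf struct and both function bodies below are hypothetical models, not the actual PostgreSQL code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for PostgreSQL's StringInfo (illustrative only). */
typedef struct
{
    char    data[256];
    int     len;
} Buf;

/*
 * oidprint-style: format the datum straight into the buffer.  The length
 * falls out of snprintf(), so there is no separately palloc'd cstring,
 * no strlen(), and no extra memcpy into DR_printtup.buf.
 */
static void
oidprint(uint32_t oid, Buf *buf)
{
    buf->len += snprintf(buf->data + buf->len,
                         sizeof(buf->data) - buf->len, "%u", oid);
}

/*
 * textprint-style: for varlena types the payload length is already known
 * from the header, so the bytes can be appended with a single memcpy and
 * no strlen() pass.
 */
static void
textprint(const char *vl_dat, int vl_len, Buf *buf)
{
    memcpy(buf->data + buf->len, vl_dat, vl_len);
    buf->len += vl_len;
}
```

The point of the sketch is that each call leaves the running length in the buffer itself, so the caller never has to recompute it.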
If a type's print function is not defined, we can still use the out function (and PrinttupAttrInfo caches FmgrInfo rather than FunctionCallInfo, so there is some optimization in this step as well). This proposal covers steps 2 & 3. If we wanted to do something more aggressive, we could let the xxxprint functions print to PqSendBuffer directly, but that is more complex and needs some infrastructure changes. The memcpy in step 4 is the "1.27% __memcpy_avx_unaligned_erms" in my above case.

What do you think?

-- 
Best Regards
Andy Fan
On Thu, 29 Aug 2024 at 21:40, Andy Fan <zhihuifan1213@163.com> wrote:
>
> Usually I see printtup in the perf-report with a noticeable ratio.
> Part of the slowness is caused by "memcpy", "strlen" and palloc in
> outfunction.

Yeah, it's a pretty inefficient API from a performance point of view.

> My high level proposal is define a type specific print function like:
>
> oidprint(Datum datum, StringInfo buf)
> textprint(Datum datum, StringInfo buf)

I think what we should do instead is make the output functions take a StringInfo and just pass it the StringInfo where we'd like the bytes written.

That of course would require rewriting all the output functions for all the built-in types, so not a small task. Extensions make that job harder. I don't think it would be good to force extensions to rewrite their output functions, so perhaps some wrapper function could help us align the APIs for extensions that have not been converted yet.

There's a similar problem with input functions not having knowledge of the input length. You only have to look at textin() to see how useful that could be. Fixing that would probably make COPY FROM horrendously faster. Team that up with SIMD for the delimiter char search and COPY would go a bit better still. Neil Conway did propose the SIMD part in [1], but it's just not nearly as good as it could be when having to still perform the strlen() calls.

I had planned to work on this for PG18, but I'd be happy for some assistance if you're willing.

David

[1] https://postgr.es/m/CAOW5sYb1HprQKrzjCsrCP1EauQzZy+njZ-AwBbOUMoGJHJS7Sw@mail.gmail.com
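The length-aware idea for COPY can be sketched in plain C, using memchr() as a stand-in for an explicit SIMD search (glibc's memchr is itself vectorized); next_field() here is a hypothetical helper, not an actual PostgreSQL function:

```c
#include <assert.h>
#include <string.h>

/*
 * Find the next delimiter and hand the input function the field together
 * with its length, so the input function never needs its own strlen()
 * pass over bytes the delimiter search already touched.
 */
static const char *
next_field(const char *p, const char *end, char delim, int *fieldlen)
{
    const char *d = memchr(p, delim, (size_t) (end - p));

    if (d == NULL)
        d = end;                    /* last field on the line */
    *fieldlen = (int) (d - p);
    return d < end ? d + 1 : end;   /* start of the following field */
}
```

With this shape the per-field length is a by-product of the delimiter scan, which is exactly what a length-taking input function could consume directly.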
David Rowley <dgrowleyml@gmail.com> writes:
> [ redesign I/O function APIs ]
> I had planned to work on this for PG18, but I'd be happy for some
> assistance if you're willing.

I'm skeptical that such a thing will ever be practical. To avoid breaking un-converted data types, all the call sites would have to support both old and new APIs. To avoid breaking non-core callers, all the I/O functions would have to support both old and new APIs. That probably adds enough overhead to negate whatever benefit you'd get.

regards, tom lane
David Rowley <dgrowleyml@gmail.com> writes:

Hello David,

>> My high level proposal is define a type specific print function like:
>>
>> oidprint(Datum datum, StringInfo buf)
>> textprint(Datum datum, StringInfo buf)
>
> I think what we should do instead is make the output functions take a
> StringInfo and just pass it the StringInfo where we'd like the bytes
> written.
>
> That of course would require rewriting all the output functions for
> all the built-in types, so not a small task. Extensions make that job
> harder. I don't think it would be good to force extensions to rewrite
> their output functions, so perhaps some wrapper function could help us
> align the APIs for extensions that have not been converted yet.

I have a similar concern to Tom's: this method looks too aggressive. That's why I said:

"If a type's print function is not defined, we can still use the out function."

AND

"Hard-coding the relationship between the commonly used types and the {type}print function OIDs doesn't look elegant, but adding a new attribute to pg_type looks too aggressive. Anyway, that is the next topic to talk about."

What would be the extra benefit if we redesigned all the out functions?

> There's a similar problem with input functions not having knowledge of
> the input length. You only have to look at textin() to see how useful
> that could be. Fixing that would probably make COPY FROM horrendously
> faster. Team that up with SIMD for the delimiter char search and COPY
> would go a bit better still. Neil Conway did propose the SIMD part in
> [1], but it's just not nearly as good as it could be when having to
> still perform the strlen() calls.

OK, I think I can understand the need to let the in-function know the input length, and it's good to know about the SIMD part for the delimiter char search. strlen looks like a delimiter char search ('\0') as well. I'm not sure whether strlen has been implemented with SIMD, but if not, why not?
> I had planned to work on this for PG18, but I'd be happy for some
> assistance if you're willing.

I see you have done much amazing work with cache-line-friendly data structure design, branch prediction optimization and SIMD optimization. I'd like to try this one myself. I'm not sure if I can meet the target; what if we handle the out/in functions separately (they could be done by different people)?

-- 
Best Regards
Andy Fan
On Fri, 30 Aug 2024 at 03:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > [ redesign I/O function APIs ]
> > I had planned to work on this for PG18, but I'd be happy for some
> > assistance if you're willing.
>
> I'm skeptical that such a thing will ever be practical. To avoid
> breaking un-converted data types, all the call sites would have to
> support both old and new APIs. To avoid breaking non-core callers,
> all the I/O functions would have to support both old and new APIs.
> That probably adds enough overhead to negate whatever benefit you'd
> get.

Scepticism is certainly good when it comes to such a large API change. I don't want to argue with you, but I'd like to state a few things about why I think you're wrong on this...

So, we currently return cstrings in our output functions. Let's take jsonb_out() as an example: to build that cstring, we make a *new* StringInfoData on *every call* inside JsonbToCStringWorker(). That gives you 1024 bytes before you need to enlarge it. However, it's maybe not all bad, as we have some size estimations there to call enlargeStringInfo(), only that's a bit wasteful as it does a repalloc() which memcpys the freshly allocated 1024 bytes from initStringInfo() when they don't yet contain any data.

After jsonb_out() has returned and we have the cstring, we've forgotten the length of the string, so most places will immediately call strlen() or do that indirectly via appendStringInfoString(). For larger JSON documents, that'll likely require pulling cachelines back into L1 again. I don't know how modern CPU cacheline eviction works, but if it were as simple as FIFO, then the strlen() would flush all those cachelines only for memcpy() to have to read them back again for output strings larger than L1.
If we rewrote all of core's output functions to use the new API, then the branch to test the function signature would be perfectly predictable and amount to an extra cmp and jne/je opcode. So, I just don't agree with the overheads negating the benefits comment. You're probably off by 1 order of magnitude at the minimum, and for medium/large varlena types likely 2-3+ orders. Even a simple int4out requires a palloc()/memcpy. If we were outputting lots of data, e.g. in a COPY operation, the output buffer would seldom need to be enlarged as it would quickly adjust to the correct size.

For the input functions, the possible gains are extensive too. textin() is a good example: it uses cstring_to_text(), but could be changed to use cstring_to_text_with_len(). Knowing the input string length also opens the door to SIMD. Take int4in() as an example: if pg_strtoint32_safe() knew its input length, there are a bunch of prechecks that could be done with either 64-bit SWAR or with SIMD. For example, if you knew you had an 8-char string of decimal digits, then converting that to an int32 is quite cheap. It's impossible to overflow an int32 with 8 decimal digits, so no overflow checks need to be done until there are at least 10 decimal digits.

ca6fde922 seems like a good enough example of the possible gains of SIMD vs byte-at-a-time processing. I saw some queries go 4x faster there, and that was with me trying to keep the JSON document sizes realistic. Byte-at-a-time is just not enough to saturate RAM speed. Take DDR5, for example: Wikipedia says it has a bandwidth of 32–64 GB/s, so unless we discover room-temperature superconductors, we're not going to see any massive jump in clock speeds any time soon, and with 5 or 6GHz CPUs, there's just no way to get anywhere near that bandwidth by processing byte-at-a-time.
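The 8-digit case above can be done with plain 64-bit SWAR, no SIMD registers needed. This is a well-known technique rather than PostgreSQL code; the sketch assumes a little-endian machine and input already validated to be exactly eight ASCII digits:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Parse exactly eight ASCII decimal digits with 64-bit SWAR.
 * Each step folds adjacent lanes together: bytes -> digit pairs ->
 * 4-digit groups -> the full 8-digit value.  No overflow checks are
 * needed since 99999999 fits comfortably in an int32.
 */
static uint32_t
parse8digits(const char *s)
{
    uint64_t    v;

    memcpy(&v, s, 8);                   /* one unaligned 8-byte load */
    v -= 0x3030303030303030ULL;         /* ASCII '0'..'9' -> 0..9 per byte */
    v = (v * (1 + (10ULL << 8)) >> 8) & 0x00FF00FF00FF00FFULL;    /* pairs */
    v = (v * (1 + (100ULL << 16)) >> 16) & 0x0000FFFF0000FFFFULL; /* quads */
    v = v * (1 + (10000ULL << 32)) >> 32;                         /* all 8 */
    return (uint32_t) v;
}
```

Three multiply-shift-mask steps replace eight byte-at-a-time multiply-by-10 iterations, which is where the throughput win comes from.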
For some sort of naive strcpy() type function, you're going to need at least a cmp and a mov; even if those were latency=1 (which they're not, see [1]), you can only do 2.5 billion of those two per second on a 5GHz processor. I haven't tested, but hypothetically (assuming latency=1) that amounts to processing 2.5GB/s, i.e. a long way from DDR5 RAM speed, and that's not taking into account having to increment pointers to the next byte on each loop.

David

[1] https://www.agner.org/optimize/instruction_tables.pdf
On Fri, 30 Aug 2024 at 12:10, Andy Fan <zhihuifan1213@163.com> wrote:
> What would be the extra benefit we redesign all the out functions?

If I've understood your proposal correctly, it sounds like you want to invent a new "print" output function for each type to output the Datum onto a StringInfo; if that's the case, what would be the point of having both versions? If there's anywhere we call output functions where the resulting value isn't directly appended to a StringInfo, then we could just use a temporary StringInfo to obtain the cstring and its length.

David
David Rowley <dgrowleyml@gmail.com> writes:
> On Fri, 30 Aug 2024 at 12:10, Andy Fan <zhihuifan1213@163.com> wrote:
>> What would be the extra benefit we redesign all the out functions?
>
> If I've understood your proposal correctly, it sounds like you want to
> invent a new "print" output function for each type to output the Datum
> onto a StringInfo,

Mostly yes, but not for every type at once, just for the commonly used types, like int2/4/8, float4/8, date/time/timestamp, text/... and so on.

> if that's the case, what would be the point of having both versions?

The biggest benefit would be compatibility. In my opinion, the print function (it doesn't need to be in pg_type at all) is an optimization and optional; on some performance-critical paths we can replace the out function with the print function (like in printtup). If such a performance-critical path finds a type without a print function defined, it just keeps the old way. Kind of like a support function for a proc, but for a data type? This way, the changes would be much smaller and could be made step by step.

> If there's anywhere we call output functions
> where the resulting value isn't directly appended to a StringInfo,
> then we could just use a temporary StringInfo to obtain the cstring
> and its length.

I think this is true, but it requires some changes to the callers' code.

-- 
Best Regards
Andy Fan
On Fri, 30 Aug 2024 at 13:04, Andy Fan <zhihuifan1213@163.com> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > If there's anywhere we call output functions
> > where the resulting value isn't directly appended to a StringInfo,
> > then we could just use a temporary StringInfo to obtain the cstring
> > and its length.
>
> I think this is true, but it requests some caller's code change.

Yeah, calling code would need to be changed to take advantage of the new API; however, the differences in which types support which API could be hidden inside OutputFunctionCall(). That function could just fake up a StringInfo for any types that only support the old cstring API. That means we don't need to add handling for both cases everywhere we need to call the output function. It's possible that could make some operations slightly slower when only the old API is available, but then maybe not, as we do now have read-only StringInfos. Maybe the StringInfoData.data field could just be set to point to the given cstring using initReadOnlyStringInfo() rather than doing appendBinaryStringInfo() onto yet another buffer for the old API.

David
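A rough model of that fallback, with hypothetical names: output_call() and StringView stand in for OutputFunctionCall() and a StringInfo set up by initReadOnlyStringInfo(). The key point is that the old-API result is pointed at, not copied into yet another buffer:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Read-only view over an existing string; the data is borrowed and never
 * appended to (models a StringInfo after initReadOnlyStringInfo()).
 */
typedef struct
{
    const char *data;
    int         len;
} StringView;

typedef char *(*out_fn) (int32_t datum);

/* Old-API out function: allocates and returns a cstring (palloc-style). */
static char *
int4out_old(int32_t v)
{
    char   *s = malloc(16);

    snprintf(s, 16, "%d", v);
    return s;
}

/*
 * Wrapper model: when a type only supports the old cstring API, wrap the
 * returned cstring in a read-only view rather than memcpy'ing it, so the
 * caller still gets (data, len) without an extra copy.
 */
static StringView
output_call(out_fn oldfn, int32_t datum)
{
    char       *cstr = oldfn(datum);
    StringView  sv = {cstr, (int) strlen(cstr)};

    return sv;
}
```

The single strlen() here is the residual cost of the old API; converted types would skip it by appending to the buffer directly.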
David Rowley <dgrowleyml@gmail.com> writes:
> Yeah, calling code would need to be changed to take advantage of the
> new API, however, the differences in which types support which API
> could be hidden inside OutputFunctionCall(). That function could just
> fake up a StringInfo for any types that only support the old cstring
> API. That means we don't need to add handling for both cases
> everywhere we need to call the output function.

We can do this, but then the printtup case (standing in for any performance-critical path) still needs to be changed to stop using OutputFunctionCall(), since with the faked StringInfo a memcpy is needed again, IIUC.

Besides the above, my major concern about your proposal is that it needs to change all the types' out functions at once, which is too aggressive for me. In a fresh setup, without any extension created, "SELECT count(*) FROM pg_type" already returns 627. So the other part of my previous reply is more important to me. It is great that both of us feel the current strategy is not good for performance. :)

-- 
Best Regards
Andy Fan
On 8/29/24 1:51 PM, David Rowley wrote:
> I had planned to work on this for PG18, but I'd be happy for some
> assistance if you're willing.

I am interested in working on this, unless Andy Fan wants to do this work. :) I believe that optimizing the out, in and send functions would be worth the pain. I get Tom's objections, but I do not think adding a small check would add much overhead compared to the gains we can get. And given that all of in, out and send could be optimized, I do not like the idea of duplicating all three in the catalog.

David, have you given any thought to the cleanest way to check whether the new API or the old one is to be used for these functions? If not, I can figure out something myself; just wondering if you already had something in mind.

Andreas
Andy Fan <zhihuifan1213@163.com> writes:
> The attached is a PoC of this idea. No matter which method is adopted
> (rewriting all the out functions or an optional print function), I think
> the benefit will be similar. In the test case below, it shows a 10%+
> improvement (0.134ms vs 0.110ms).

After working on more {type}_print functions, I'm thinking it is pretty much a third I/O function, which adds some confusing maintenance effort, so I agree refactoring the existing out functions is a better idea.

I'd like to work on the _print function bodies first for easier review and testing. After all, if some common issue exists in these changes, it is better to know that before working on the 700+ out functions.

-- 
Best Regards
Andy Fan
Hello David & Andreas,

> On 8/29/24 1:51 PM, David Rowley wrote:
>> I had planned to work on this for PG18, but I'd be happy for some
>> assistance if you're willing.
>
> I am interested in working on this, unless Andy Fan wants to do this
> work. :) I believe that optimizing the out, in and send functions would
> be worth the pain. I get Tom's objections but I do not think adding a
> small check would add much overhead compared to the gains we can get.

Just to be clearer, I'd like to work on the out functions only, due to my internal assignment. (Since David planned this for PG18, it is better to say so clearly earlier.) I'll post part of the out (print) function refactoring in the next 2 days. I think it deserves a double check before working on *all* the out functions.

select count(*), count(distinct typoutput) from pg_type;
 count | count
-------+-------
   621 |    97
(1 row)

select typoutput, count(*) from pg_type group by typoutput having count(*) > 1 order by 2 desc;
    typoutput    | count
-----------------+-------
 array_out       |   296
 record_out      |   214
 multirange_out  |     6
 range_out       |     6
 varcharout      |     3
 int4out         |     2
 timestamptz_out |     2
 nameout         |     2
 textout         |     2
(9 rows)

-- 
Best Regards
Andy Fan
Andy Fan <zhihuifan1213@163.com> writes:
> Just to be clearer, I'd like work on the out function only due to my
> internal assignment. (Since David planned it for PG18, so it is better
> say things clearer eariler). I'd put parts of out(print) function
> refactor in the next 2 days. I think it deserves a double check before
> working on *all* the out function.

Well, sure. You *cannot* write a patch that breaks existing output functions. Not at the start, and not at the end either. You should focus on writing the infrastructure and, for starters, converting just a few output functions as a demonstration. If that gets accepted then you can work on converting other output functions a few at a time. But they'll never all be done, because we can't realistically force extensions to convert.

There are lots of examples of similar incremental conversions in our project's history. I think the most recent example is the "soft error handling" work (d9f7f5d32, ccff2d20e, and many follow-on patches).

regards, tom lane