Thread: speed up text_position() for utf-8
Hi, Commit 9556aa01c69 (Use single-byte Boyer-Moore-Horspool search even with multibyte encodings), was a speed improvement for the majority of cases, but when the match was toward the end of the string, the slowdown in text_position_get_match_pos() was noticeable. It was found that there was a lot of overhead in pg_mblen(). [1] The attached exploratory PoC improves this for utf-8. It applies on top v25 of my utf-8 verification patch in [2], since one approach relies on the DFA from it. The other three approaches are: - a version of pg_utf_mblen() that uses a lookup table [3] - an inlined copy of pg_utf_mblen() - an ascii fast path with a fallback to the inlined copy of pg_utf_mblen() The test is attached and the test function is part of the patch. It's based on the test used in the commit above. The test searches for a string that's at the end of a ~1 million byte string. This is on gcc 11 with 2-3 runs to ensure repeatability, but I didn't bother with statistics because the differences are pretty big: patch | no match | ascii | mulitbyte -----------------------------------------+----------+-------+----------- PG11 | 1120 | 1100 | 900 master | 381 | 2350 | 1900 DFA | 386 | 1640 | 1640 branchless utf mblen | 387 | 4100 | 2600 inline pg_utf_mblen() | 380 | 1080 | 920 inline pg_utf_mblen() + ascii fast path | 382 | 470 | 918 Neither of the branchless approaches worked well. The DFA can't work as well here as in verification because it must do additional work. Inlining pg_utf_mblen() restores worst-case performance to PG11 levels. The ascii fast path is a nice improvement on top of that. A similar approach could work for pg_mbstrlen() as well, but I haven't looked into that yet. There are other callers of pg_mblen(), but I haven't looked into whether they are performance-sensitive. A more general application would be preferable to a targeted one. [1] https://www.postgresql.org/message-id/b65df3d8-1f59-3bd7-ebbe-68b81d5a76a4%40iki.fi [2] https://www.postgresql.org/message-id/CAFBsxsHG%3Dg6W8Mie%2B_NO8dV6O0pO2stxrnS%3Dme5ZmGqk--fd5g%40mail.gmail.com [3] https://github.com/skeeto/branchless-utf8/blob/master/utf8.h -- John Naylor EDB: http://www.enterprisedb.com
Attachment
I wrote: > The test is attached and the test function is part of the patch. It's > based on the test used in the commit above. The test searches for a > string that's at the end of a ~1 million byte string. This is on gcc > 11 with 2-3 runs to ensure repeatability, but I didn't bother with > statistics because the differences are pretty big: > > patch | no match | ascii | mulitbyte > -----------------------------------------+----------+-------+----------- > PG11 | 1120 | 1100 | 900 > master | 381 | 2350 | 1900 > DFA | 386 | 1640 | 1640 > branchless utf mblen | 387 | 4100 | 2600 > inline pg_utf_mblen() | 380 | 1080 | 920 > inline pg_utf_mblen() + ascii fast path | 382 | 470 | 918 I failed to mention that the above numbers are milliseconds, so smaller is better. -- John Naylor EDB: http://www.enterprisedb.com
Attached is a short patch series to develop some ideas of inlining pg_utf_mblen(). 0001 puts the main implementation of pg_utf_mblen() into an inline function and uses this in pg_mblen(). This is somewhat faster in the strpos tests, so that gives some measure of the speedup expected for other callers. Text search seems to call this a lot, so this might have noticeable benefit. 0002 refactors text_position_get_match_pos() to use pg_mbstrlen_with_len(). This itself is significantly faster when combined with 0001, likely because the latter can inline the call to pg_mblen(). The intention is to speed up more than just text_position. 0003 explicitly specializes for the inline version of pg_utf_mblen() into pg_mbstrlen_with_len(), but turns out to be almost as slow as master for ascii. It doesn't help if I undo the previous change in pg_mblen(), and I haven't investigated why yet. 0002 looks good now, but the experience with 0003 makes me hesitant to propose this seriously until I can figure out what's going on there. The test is as earlier, a worst-case substring search, times in milliseconds. patch | no match | ascii | multibyte --------+----------+-------+----------- PG11 | 1220 | 1220 | 1150 master | 385 | 2420 | 1980 0001 | 390 | 2180 | 1670 0002 | 389 | 1330 | 1100 0003 | 391 | 2100 | 1360 -- John Naylor EDB: http://www.enterprisedb.com
Attachment
I wrote: > 0001 puts the main implementation of pg_utf_mblen() into an inline > function and uses this in pg_mblen(). This is somewhat faster in the > strpos tests, so that gives some measure of the speedup expected for > other callers. Text search seems to call this a lot, so this might > have noticeable benefit. > > 0002 refactors text_position_get_match_pos() to use > pg_mbstrlen_with_len(). This itself is significantly faster when > combined with 0001, likely because the latter can inline the call to > pg_mblen(). The intention is to speed up more than just text_position. > > 0003 explicitly specializes for the inline version of pg_utf_mblen() > into pg_mbstrlen_with_len(), but turns out to be almost as slow as > master for ascii. It doesn't help if I undo the previous change in > pg_mblen(), and I haven't investigated why yet. > > 0002 looks good now, but the experience with 0003 makes me hesitant to > propose this seriously until I can figure out what's going on there. > > The test is as earlier, a worst-case substring search, times in milliseconds. > > patch | no match | ascii | multibyte > --------+----------+-------+----------- > PG11 | 1220 | 1220 | 1150 > master | 385 | 2420 | 1980 > 0001 | 390 | 2180 | 1670 > 0002 | 389 | 1330 | 1100 > 0003 | 391 | 2100 | 1360 I tried this test on a newer CPU, and 0003 had no regression. Both systems used gcc 11.2. Rather than try to figure out why an experiment had unexpected behavior, I plan to test 0001 and 0002 on a couple different compilers/architectures and call it a day. It's also worth noting that 0002 by itself seemed to be decently faster on the newer machine, but not as fast as 0001 and 0002 together. Looking at the assembly, pg_mblen is inlined into pg_mbstrlen_[with_len] and pg_mbcliplen, so the specialization for utf-8 in 0001 would be inlined in the other 3 as well. That's only a few bytes, so I think it's fine. -- John Naylor EDB: http://www.enterprisedb.com
I wrote: > I tried this test on a newer CPU, and 0003 had no regression. Both > systems used gcc 11.2. Rather than try to figure out why an experiment > had unexpected behavior, I plan to test 0001 and 0002 on a couple > different compilers/architectures and call it a day. It's also worth > noting that 0002 by itself seemed to be decently faster on the newer > machine, but not as fast as 0001 and 0002 together. I tested four machines with various combinations of patches, and it seems the only thing they all agree on is that 0002 is a decent improvement (full results attached). The others can be faster or slower. 0002 also simplifies things, so it has that going for it. I plan to commit that this week unless there are objections. -- John Naylor EDB: http://www.enterprisedb.com