Thread: Speeding up text_position_next with multibyte encodings
Attached is a patch to speed up text_position_setup/next(), in some common cases with multibyte encodings.

text_position_next() uses the Boyer-Moore-Horspool search algorithm, with a skip table. Currently, with a multi-byte encoding, we first convert both input strings to arrays of wchars, and run the algorithm on those arrays.

This patch removes the mb->wchar conversion, and instead runs the single-byte version of the algorithm directly on the inputs, even with multi-byte encodings. That avoids the mb->wchar conversion altogether when there is no match. Even when there is a match, we don't need to convert the whole input string to wchars: it's enough to count the characters up to the match, using pg_mblen(). And when text_position_setup/next() are used as part of the split_part() or replace() functions, we're not actually interested in the character position at all, so we can skip that too.

Avoiding the large allocation is nice, too. That was actually why I started to look into this: a customer complained that text_position fails with strings larger than 256 MB.

Using the byte-oriented search on multibyte strings might return false positives, though. To make that work, after finding a match, we verify that the match doesn't fall in the middle of a multi-byte character. However, as an important special case, that cannot happen with UTF-8, because it has the property that the byte sequence of one character never appears as part of another character. I think some other encodings might have the same property, but I wasn't sure.

This is a win in most cases. One case is slower: calling position() with a large haystack input, where the match is near the end of the string. Counting the characters up to that point is slower than the mb->wchar conversion. I think we could avoid that regression too, if we had a more optimized function to count characters. Currently, the code calls pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(), the similar loop through all the characters is a tight loop within the function.

Thoughts?

- Heikki
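For illustration, the byte-oriented Boyer-Moore-Horspool search described above can be sketched like this. This is a simplified sketch with made-up names (bmh_search), not the actual varlena.c code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified Boyer-Moore-Horspool byte search with a 256-entry skip
 * table.  Returns a pointer to the first match, or NULL.  (Hypothetical
 * helper for illustration; the real code lives in varlena.c.) */
static const char *
bmh_search(const char *haystack, size_t hlen,
           const char *needle, size_t nlen)
{
    size_t      skip[256];
    size_t      i;
    const char *hptr;

    if (nlen == 0 || nlen > hlen)
        return NULL;

    /* A byte not present in the needle lets us skip the whole needle. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    /* Bytes in the needle (except its last byte) get smaller skips. */
    for (i = 0; i + 1 < nlen; i++)
        skip[(unsigned char) needle[i]] = nlen - 1 - i;

    hptr = haystack;
    while (hptr + nlen <= haystack + hlen)
    {
        if (memcmp(hptr, needle, nlen) == 0)
            return hptr;
        /* Advance by the skip distance of the window's last byte. */
        hptr += skip[(unsigned char) hptr[nlen - 1]];
    }
    return NULL;
}
```

With a multibyte encoding, a pointer returned by such a search is only a candidate match; as described above, it still has to be checked against character boundaries (except in UTF-8).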
Attachment
Hello.

At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in <3173d989-bc1c-fc8a-3b69-f24246f73876@iki.fi>
> Attached is a patch to speed up text_position_setup/next(), in some
> common cases with multibyte encodings.
>
> text_position_next() uses the Boyer-Moore-Horspool search algorithm,
> with a skip table. Currently, with a multi-byte encoding, we first
> convert both input strings to arrays of wchars, and run the algorithm
> on those arrays.
>
> This patch removes the mb->wchar conversion, and instead runs the
> single-byte version of the algorithm directly on the inputs, even with
> multi-byte encodings. That avoids the mb->wchar conversion altogether,
> when there is no match. Even when there is a match, we don't need to
> convert the whole input string to wchars. It's enough to count the
> characters up to the match, using pg_mblen(). And when
> text_position_setup/next() are used as part of split_part() or
> replace() functions, we're not actually even interested in the
> character position, so we can skip that too.

Sounds reasonable. A partially matching character is simply ignored; the conversion is simply useless.

> Avoiding the large allocation is nice, too. That was actually why I
> started to look into this: a customer complained that text_position
> fails with strings larger than 256 MB.
>
> Using the byte-oriented search on multibyte strings might return false
> positives, though. To make that work, after finding a match, we verify
> that the match doesn't fall in the middle of a multi-byte
> character. However, as an important special case, that cannot happen
> with UTF-8, because it has the property that the byte sequence of one
> character never appears as part of another character. I think some
> other encodings might have the same property, but I wasn't sure.

Yes. That happens for at least EUC_JP, where the same byte can appear in an arbitrary position. Maybe we can hard-code to restrict that to UTF-8 for the first step. (I'm not sure the second step happens.)

> This is a win in most cases. One case is slower: calling position()
> with a large haystack input, where the match is near the end of the
> string. Counting the characters up to that point is slower than the
> mb->wchar conversion. I think we could avoid that regression too, if
> we had a more optimized function to count characters. Currently, the
> code calls pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(),
> the similar loop through all the characters is a tight loop within the
> function.
>
> Thoughts?

Couldn't we count the characters while searching for a match?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
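The boundary verification for encodings like EUC_JP, where a character's bytes can appear inside another character, can be sketched roughly like this. The helper names and the mblen callback are hypothetical (standing in for pg_mblen()); this is not the patch's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Walk the string from the start, one character at a time, and check
 * whether a byte-level match offset lands on a character boundary.
 * In a self-synchronizing encoding like UTF-8 this check always
 * succeeds, so it can be skipped there. */
static bool
match_on_char_boundary(const char *str, size_t match_off,
                       int (*mblen) (const unsigned char *))
{
    size_t      i = 0;

    while (i < match_off)
        i += mblen((const unsigned char *) str + i);

    return i == match_off;
}

/* Toy "encoding" where every character is exactly two bytes wide,
 * just to exercise the boundary check. */
static int
two_byte_mblen(const unsigned char *s)
{
    (void) s;
    return 2;
}
```

The same walk could also keep a running character count, so a confirmed match would yield the character position without a second pass over the prefix.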
> At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote
> Attached is a patch to speed up text_position_setup/next(), in some
> common cases with multibyte encodings.

Hi,

Unfortunately, the patch doesn't compile anymore, due to:

varlena.c: In function ‘text_position_next_internal’:
varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this function)
  Assert(start_ptr >= haystack && start_ptr <= haystack_end);

Could you please send an updated version? For now I'm moving it to the next CF.
On 11/30/18, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Unfortunately, patch doesn't compile anymore due:
>
> varlena.c: In function ‘text_position_next_internal’:
> varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this
> function)
>   Assert(start_ptr >= haystack && start_ptr <= haystack_end);
>
> Could you please send an updated version? For now I'm moving it to the next
> CF.

I signed up to be a reviewer, and I will be busy next month, so I went ahead and fixed the typo in the patch that broke assert-enabled builds. While at it, I standardized on the spelling "start_ptr" in a few places to match the rest of the file. It's a bit concerning that it wouldn't compile with asserts, but the patch was written by a committer and seems to work.

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This is a win in most cases. One case is slower: calling position() with
> a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script, which runs position() in three scenarios:

short: a 1-character needle at the end of a ~1000-character haystack, repeated 1 million times
medium: a 50-character needle at the end of a ~1-million-character haystack, repeated 1 million times
long: a 250-character needle at the end of an ~80-million-character haystack (~230MB, comfortably below the 256MB limit Heikki reported), done just one time

I took the average of 5 runs using Korean characters in both UTF-8 (3-byte) and EUC-KR (2-byte) encodings.

UTF-8

length  master  patch
short   2.26s   2.51s
med     2.28s   2.54s
long    3.07s   3.11s

EUC-KR

length  master  patch
short   2.29s   2.37s
med     2.29s   2.36s
long    1.75s   1.71s

With UTF-8, the patch is 11-12% slower on short and medium strings, and about the same on long strings. With EUC-KR, the patch is about 3% slower on short and medium strings, and 2-3% faster on long strings. It seems the worst case is not that bad, and could be optimized, as Heikki said.

-John Naylor
Attachment
On 12/14/18, John Naylor <jcnaylor@gmail.com> wrote:
> I signed up to be a reviewer, and I will be busy next month, so I went
> ahead and fixed the typo in the patch that broke assert-enabled
> builds. While at it, I standardized on the spelling "start_ptr" in a
> few places to match the rest of the file. It's a bit concerning that
> it wouldn't compile with asserts, but the patch was written by a
> committer and seems to work.

I just noticed that the contrib/citext test fails. I've set the status to waiting on author.

-John Naylor
On 14/12/2018 20:20, John Naylor wrote:
> I signed up to be a reviewer, and I will be busy next month, so I went
> ahead and fixed the typo in the patch that broke assert-enabled
> builds. While at it, I standardized on the spelling "start_ptr" in a
> few places to match the rest of the file. It's a bit concerning that
> it wouldn't compile with asserts, but the patch was written by a
> committer and seems to work.

Thanks for fixing that!

> On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> This is a win in most cases. One case is slower: calling position() with
>> a large haystack input, where the match is near the end of the string.
>
> I wanted to test this case in detail, so I ran the attached script, ...

I'm afraid that script doesn't work as a performance test. The position() function is immutable, so the result gets cached in the plan cache. All you're measuring is the speed to get the constant from the plan cache :-(.

I rewrote the same tests with a little C helper function, attached, to fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the loop counts so that the runtimes are reasonable, now that it actually does some work.

UTF-8

length  master  patch
short   2.7s    10.9s
med     2.6s    11.4s
long    3.9s    9.9s

EUC_KR

length  master  patch
short   2.2s    7.5s
med     2.2s    3.5s
long    3.4s    3.6s

This doesn't look very flattering for the patch. Although, this is deliberately testing the worst case.

You chose interesting characters for the UTF-8 test. The haystack is a repeating pattern of the byte sequence EC 99 84, and the needle is a repeating pattern of EC 84 B1. In the 'long' test, the lengths in the skip table are '2', '1' and '250'. But the search bounces between the '2' and '1' cases, and never hits the byte that would allow it to skip 250 bytes. Interesting case, I had not realized that that can happen. But I don't think we need to put much weight on that; you could come up with other scenarios where the current code has skip table collisions, too.

So overall, I think it's still a worthwhile tradeoff, given that that is a worst-case scenario. If you choose less pathological UTF-8 codepoints, or there is no match or the match is closer to the beginning of the string, the patch wins.

- Heikki
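To see concretely why those byte patterns hurt, here is a textbook Horspool skip-table builder over raw bytes (the server's actual table is organized differently, so its exact distances can differ, but the effect is the same). For a needle that repeats the bytes EC 84 B1, the bytes EC and 84 end up with skip distances of only 2 and 1, so a haystack that keeps presenting those bytes makes the search crawl forward a byte or two at a time:

```c
#include <assert.h>
#include <stddef.h>

/* Textbook Boyer-Moore-Horspool skip table: each byte that occurs in
 * the needle, other than its last byte, gets a skip distance equal to
 * its distance from the end of the needle.  All other bytes allow
 * skipping the whole needle length. */
static void
build_skip_table(const unsigned char *needle, size_t nlen, size_t skip[256])
{
    size_t      i;

    for (i = 0; i < 256; i++)
        skip[i] = nlen;         /* byte absent from needle: big skip */
    for (i = 0; i + 1 < nlen; i++)
        skip[needle[i]] = nlen - 1 - i;
}
```

The haystack's third byte, 99, would permit a large skip, but as noted above the search window's last byte keeps landing on EC or 84 instead.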
Attachment
On 14/12/2018 23:40, John Naylor wrote:
> I just noticed that the contrib/citext test fails. I've set the status
> to waiting on author.

Hmm, it works for me. What failure did you see?

- Heikki
On 23/12/2018 02:28, Heikki Linnakangas wrote:
> On 14/12/2018 23:40, John Naylor wrote:
>> I just noticed that the contrib/citext test fails. I've set the status
>> to waiting on author.
>
> Hmm, it works for me. What failure did you see?

Never mind, I'm seeing it now, with assertions enabled. Thanks, I'll investigate!

- Heikki
On 23/12/2018 02:32, Heikki Linnakangas wrote:
> On 23/12/2018 02:28, Heikki Linnakangas wrote:
>> On 14/12/2018 23:40, John Naylor wrote:
>>> I just noticed that the contrib/citext test fails. I've set the status
>>> to waiting on author.
>>
>> Hmm, it works for me. What failure did you see?
>
> Never mind, I'm seeing it now, with assertions enabled. Thanks, I'll
> investigate!

The bug was in the handling of empty inputs. text_position_setup assumed, and asserted, that neither the needle nor the haystack is empty, expecting the callers to have handled those special cases already, but not all callers did. Here is a fixed version.

- Heikki
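The empty-input special cases that text_position_setup now expects its callers to handle can be sketched as follows. This is an illustration only, with a made-up name (byte_position) and a naive byte scan standing in for the real search machinery; it relies on SQL's convention that POSITION() returns 1 for an empty needle and 0 when there is no match:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns a 1-based byte position of needle in haystack, or 0 if
 * there is no match.  The empty-input cases are handled up front,
 * before any search machinery runs. */
static int
byte_position(const char *haystack, size_t hlen,
              const char *needle, size_t nlen)
{
    size_t      i;

    if (nlen == 0)
        return 1;               /* SQL: POSITION('' IN x) is 1 */
    if (hlen == 0 || nlen > hlen)
        return 0;               /* needle can't fit: no match */

    for (i = 0; i + nlen <= hlen; i++)
    {
        if (memcmp(haystack + i, needle, nlen) == 0)
            return (int) i + 1;
    }
    return 0;
}
```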
Attachment
On 12/23/18 1:26 AM, Heikki Linnakangas wrote:
> On 14/12/2018 20:20, John Naylor wrote:
>> I signed up to be a reviewer, and I will be busy next month, so I went
>> ahead and fixed the typo in the patch that broke assert-enabled
>> builds. While at it, I standardized on the spelling "start_ptr" in a
>> few places to match the rest of the file. It's a bit concerning that
>> it wouldn't compile with asserts, but the patch was written by a
>> committer and seems to work.
>
> Thanks for fixing that!
>
>> On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> This is a win in most cases. One case is slower: calling position() with
>>> a large haystack input, where the match is near the end of the string.
>>
>> I wanted to test this case in detail, so I ran the attached script, ...
>
> I'm afraid that script doesn't work as a performance test. The
> position() function is immutable, so the result gets cached in the plan
> cache. All you're measuring is the speed to get the constant from the
> plan cache :-(.
>
> I rewrote the same tests with a little C helper function, attached, to
> fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the
> loop counts so that the runtimes are reasonable, now that it actually
> does some work.
>
> UTF-8
>
> length  master  patch
> short   2.7s    10.9s
> med     2.6s    11.4s
> long    3.9s    9.9s
>
> EUC_KR
>
> length  master  patch
> short   2.2s    7.5s
> med     2.2s    3.5s
> long    3.4s    3.6s
>
> This doesn't look very flattering for the patch. Although, this is
> deliberately testing the worst case.
>
> You chose interesting characters for the UTF-8 test. The haystack is a
> repeating pattern of byte sequence EC 99 84, and the needle is a
> repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
> skip table are '2', '1' and '250'. But the search bounces between the
> '2' and '1' cases, and never hits the byte that would allow it to skip
> 250 bytes. Interesting case, I had not realized that that can happen.
>
> But I don't think we need to put much weight on that, you could come up
> with other scenarios where the current code has skip table collisions, too.
>
> So overall, I think it's still a worthwhile tradeoff, given that that is
> a worst case scenario. If you choose less pathological UTF-8 codepoints,
> or there is no match or the match is closer to the beginning of the
> string, the patch wins.

So, what is the expected speedup in a "good/average" case? Do we have some reasonable real-world workload mixing these cases that could be used as a realistic benchmark?

regards

-- 
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/22/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 14/12/2018 20:20, John Naylor wrote:
> I'm afraid that script doesn't work as a performance test. The
> position() function is immutable, so the result gets cached in the plan
> cache. All you're measuring is the speed to get the constant from the
> plan cache :-(.

That makes perfect sense now. I should have been more skeptical about the small and medium sizes having similar times. :/

> I rewrote the same tests with a little C helper function, attached, to
> fix that, and to eliminate any PL/pgSQL overhead.

Thanks for that, I'll probably have occasion to do something like this for other tests.

> You chose interesting characters for the UTF-8 test. The haystack is a
> repeating pattern of byte sequence EC 99 84, and the needle is a
> repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
> skip table are '2', '1' and '250'. But the search bounces between the
> '2' and '1' cases, and never hits the byte that would allow it to skip
> 250 bytes. Interesting case, I had not realized that that can happen.

Me neither, that was unintentional.

> But I don't think we need to put much weight on that, you could come up
> with other scenarios where the current code has skip table collisions, too.

Okay.

> So overall, I think it's still a worthwhile tradeoff, given that that is
> a worst case scenario. If you choose less pathological UTF-8 codepoints,
> or there is no match or the match is closer to the beginning of the
> string, the patch wins.

On 12/23/18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> So, what is the expected speedup in a "good/average" case? Do we have
> some reasonable real-world workload mixing these cases that could be
> used as a realistic benchmark?

I'll investigate some "better" cases.

-John Naylor
On Sun, Dec 23, 2018 at 9:33 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> So, what is the expected speedup in a "good/average" case? Do we have
> some reasonable real-world workload mixing these cases that could be
> used as a realistic benchmark?

Not sure about a realistic mix, but I investigated the tradeoffs.

First, using the test function Heikki posted upthread [1], I tried a 2-character needle with different haystack sizes, and looked for the position where master and patch were roughly equal (average of 3, within 1%). Beyond this point, the patch regresses. To keep it simple I used UTF-8 only.

haystack  position
<=23      (patch always faster)
30        23
100       58
1000      ~550
1000000   ~550000

For very short haystacks, the patch is always faster, or regresses only when the needle is close to the end. For longer haystacks, the patch will be faster if the position searched for is less than ~55% of the way to the end, and slower after that.

Next, I tested finding the position of a 2-character needle in a couple of positions towards the front of a 1000-character haystack, plus not present and at the end for comparison. As seen earlier [1], the worst-case regression for this haystack length was 4.4x slower for UTF-8, but that had a skip table collision, and this time I used characters with no bytes in common. Average of 3, with a loop count of 1,000,000:

UTF-8

pos.  master  patch    diff
*     3880ms  144ms    -96%
100   4440ms  1410ms   -68%
333   5700ms  4280ms   -25%
999   9370ms  12500ms  +34%

EUC-KR

pos.  master  patch    diff
*     3900ms  112ms    -97%
100   4410ms  1289ms   -71%
333   5680ms  3970ms   -30%
999   9370ms  11600ms  +24%

The patch is much faster than master when the needle is near the front of a large haystack or not there at all. The majority of cases are measurably faster, and the best case is at least 20x faster. On the whole I'd say this patch is a performance win even without further optimization. I'm marking it ready for committer.

[1] https://www.postgresql.org/message-id/05d95c5c-1fb7-29ea-1c5d-e7e941a0a14d%40iki.fi

-- 
-John Naylor
On 15/01/2019 02:52, John Naylor wrote:
> The majority of cases are measurably faster, and the best case is at
> least 20x faster. On the whole I'd say this patch is a performance win
> even without further optimization. I'm marking it ready for committer.

I read through the patch one more time, tweaked the comments a little bit, and committed. Thanks for the review!

I did a little profiling of the worst case, where this is slower than the old approach. There's a lot of function call overhead coming from walking the string with pg_mblen(). That could be improved. If we inlined pg_mblen() into the loop, it becomes much faster, and I think this code would be faster even in the worst case. (Except for the very worst cases, where the hash table with the new code happens to have a collision at a different point than before, but that doesn't seem like a fair comparison.)

I think this is good enough as it is, but if I have the time, I'm going to try optimizing the pg_mblen() loop, as well as similar loops e.g. in pg_mbstrlen(). Or if someone else wants to give that a go, feel free.

- Heikki
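For UTF-8 specifically, the character-counting loop could be written with the length computation inlined, along these lines. This is a sketch in the spirit of pg_mbstrlen(), not its actual code, and it assumes well-formed UTF-8 input (as PostgreSQL guarantees for stored text):

```c
#include <assert.h>
#include <stddef.h>

/* Count characters in a valid UTF-8 string with the per-character
 * length computation inlined, avoiding a function call per character.
 * The sequence length is determined entirely by the first byte. */
static size_t
utf8_strlen(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *) s;
    const unsigned char *end = p + len;
    size_t      count = 0;

    while (p < end)
    {
        unsigned char b = *p;

        if (b < 0x80)
            p += 1;             /* ASCII */
        else if ((b & 0xE0) == 0xC0)
            p += 2;             /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0)
            p += 3;             /* 3-byte sequence */
        else
            p += 4;             /* 4-byte sequence */
        count++;
    }
    return count;
}
```

A tight loop like this keeps the branchy length logic in one place, which is essentially the "inline the UTF8 case" idea.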
On Fri, Jan 25, 2019 at 04:33:54PM +0200, Heikki Linnakangas wrote:
> On 15/01/2019 02:52, John Naylor wrote:
> >The majority of cases are measurably faster, and the best case is at
> >least 20x faster. On the whole I'd say this patch is a performance win
> >even without further optimization. I'm marking it ready for committer.
> 
> I read through the patch one more time, tweaked the comments a little bit,
> and committed. Thanks for the review!
> 
> I did a little profiling of the worst case, where this is slower than the
> old approach. There's a lot of function call overhead coming from walking
> the string with pg_mblen(). That could be improved. If we inlined pg_mblen()
> into the loop, it becomes much faster, and I think this code would be faster
> even in the worst case. (Except for the very worst cases, where the hash table
> with the new code happens to have a collision at a different point than
> before, but that doesn't seem like a fair comparison.)
> 
> I think this is good enough as it is, but if I have the time, I'm going to
> try optimizing the pg_mblen() loop, as well as similar loops e.g. in
> pg_mbstrlen(). Or if someone else wants to give that a go, feel free.

It might be valuable to just inline the UTF8 case.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +