Thread: improve Chinese locale performance
Hi hackers,

I tried to improve performance when the database locale is Chinese.

Under openSUSE, creating an index on a table with 54996 rows:
locale=C, 140ms
locale=zh_CN, 985ms

I think the strcoll() function on Linux is too slow. So I made a new
UTF-8 to GB18030 map and stored the Chinese collation order in it, so
that strcoll() is not called at all. With my modified code, the same
operation with locale=zh_CN takes 203ms.

My English is too limited to describe the idea well; please see the
attached patch. Users in China tend to choose locale=C because it is
faster, and call convert() when they need ordering. I also found some
incorrect orderings in the zh_CN locale on Linux.

In my tests under Windows there is a performance improvement too.
Windows XP SP3, in a VM, locale=Chinese_People's Republic of China.936:
original code, 343ms
modified code, 235ms

Some other Asian languages may have the same problem.

Regards,
Quan Zongliang
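For readers unfamiliar with the approach, a rough sketch of a table-driven comparison along these lines might look as follows. The table name and layout here are purely illustrative and are not taken from the attached patch; the sketch assumes the UTF-8 input has already been decoded to code points.

    /*
     * Rough sketch only: assumes zh_weight_of[] holds a precomputed
     * GB18030-based sort weight for every Unicode code point, so that
     * two strings can be compared by their weights without strcoll().
     */
    #include <stddef.h>
    #include <stdint.h>

    extern const uint32_t zh_weight_of[0x110000];   /* hypothetical map */

    static int
    zh_compare(const uint32_t *a, size_t alen, const uint32_t *b, size_t blen)
    {
        size_t n = (alen < blen) ? alen : blen;

        for (size_t i = 0; i < n; i++)
        {
            uint32_t wa = zh_weight_of[a[i]];
            uint32_t wb = zh_weight_of[b[i]];

            if (wa != wb)
                return (wa < wb) ? -1 : 1;
        }
        /* equal prefix: the shorter string sorts first */
        return (alen < blen) ? -1 : (alen > blen) ? 1 : 0;
    }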
On 07/22/2013 12:17 PM, Quan Zongliang wrote:
> Hi hackers,
>
> I tried to improve performance when the database locale is Chinese.
>
> Under openSUSE, creating an index on a table with 54996 rows:
> locale=C, 140ms
> locale=zh_CN, 985ms
>
> I think the strcoll() function on Linux is too slow. So I made a new
> UTF-8 to GB18030 map and stored the Chinese collation order in it, so
> that strcoll() is not called at all. With my modified code, the same
> operation with locale=zh_CN takes 203ms.

It might be worth looking at glibc's strcoll() implementation. See if it performs better with the latest glibc, and if not, try to improve glibc's strcoll().

I'd be interested in seeing a test case for this that shows that the results of your new collation are exactly the same as the original strcoll()-based approach.

-- 
Craig Ringer                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 07/22/2013 03:54 PM, Craig Ringer wrote:
> On 07/22/2013 12:17 PM, Quan Zongliang wrote:
>> Hi hackers,
>>
>> I tried to improve performance when the database locale is Chinese.
>>
>> Under openSUSE, creating an index on a table with 54996 rows:
>> locale=C, 140ms
>> locale=zh_CN, 985ms
>>
>> I think the strcoll() function on Linux is too slow. So I made a new
>> UTF-8 to GB18030 map and stored the Chinese collation order in it, so
>> that strcoll() is not called at all. With my modified code, the same
>> operation with locale=zh_CN takes 203ms.
>
> It might be worth looking at glibc's strcoll() implementation. See if it
> performs better with the latest glibc, and if not, try to improve
> glibc's strcoll().
>
> I'd be interested in seeing a test case for this that shows that the
> results of your new collation are exactly the same as the original
> strcoll()-based approach.

The results are not exactly the same. I found some ordering errors in glibc's strcoll() when sorting Chinese characters: Chinese has a lot of special characters, and glibc's strcoll() does not handle some of them, or misses part of them.

Yes, the best way would be to improve glibc's strcoll(), but I don't know how to do that.

Thanks,
Quan Zongliang
On 7/22/13 3:54 AM, Craig Ringer wrote:
> It might be worth looking at glibc's strcoll() implementation. See if it
> performs better with the latest glibc, and if not, try to improve
> glibc's strcoll().

I think part of the problem is that we call strcoll() for each comparison, instead of doing strxfrm() once for each datum and then just strcmp() for each comparison. That is effectively equivalent to what the proposal implements.
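To make the scheme concrete, here is a minimal sketch of the strxfrm()-then-strcmp() approach. It is illustrative only; the names are not from any patch in this thread, and allocation failures are only partly handled.

    #include <stdlib.h>
    #include <string.h>

    /*
     * Build a locale-aware sort key once per datum; later comparisons
     * can then use plain strcmp() instead of repeated strcoll() calls.
     */
    static char *
    make_sort_key(const char *s)
    {
        size_t  len = strxfrm(NULL, s, 0);   /* key length, excl. the NUL */
        char   *key = malloc(len + 1);

        if (key != NULL)
            strxfrm(key, s, len + 1);
        return key;
    }

    /* qsort() comparator over an array of pre-built keys */
    static int
    cmp_keys(const void *a, const void *b)
    {
        return strcmp(*(char *const *) a, *(char *const *) b);
    }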
On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> I think part of the problem is that we call strcoll() for each comparison,
> instead of doing strxfrm() once for each datum and then just strcmp() for
> each comparison. That is effectively equivalent to what the proposal
> implements.

FWIW, I used to be a big proponent of using strxfrm(), but upon further analysis I realized it is a really difficult tradeoff. strxfrm() potentially saves a lot of CPU cost, but at the expense of expanding the size of the sort key. If the sort spills to disk, or even if it is just memory-bandwidth limited, it might actually be slower than doing the additional CPU work of calling strcoll().

It's hard to see how to decide in advance which way will be faster. I suspect strxfrm() is still the better bet, especially for complex, large-character-set locales like Chinese. strcoll() might still win by a large margin on simple, mostly-ASCII character sets.

-- 
greg
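To get a feel for the size side of this tradeoff, one can simply ask strxfrm() how large a key it would produce. A throwaway sketch; the locale name and sample string are arbitrary, and the locale may not be installed on every system:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *sample = "中文排序测试";   /* arbitrary UTF-8 sample */
        size_t      keylen;

        if (setlocale(LC_COLLATE, "zh_CN.UTF-8") == NULL)
        {
            fprintf(stderr, "locale not available\n");
            return 1;
        }

        keylen = strxfrm(NULL, sample, 0);   /* size of the sort key */
        printf("input %zu bytes -> sort key %zu bytes (%.1fx)\n",
               strlen(sample), keylen, (double) keylen / strlen(sample));
        return 0;
    }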
On 07/22/2013 12:49 PM, Greg Stark wrote:
> On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> I think part of the problem is that we call strcoll() for each comparison,
>> instead of doing strxfrm() once for each datum and then just strcmp() for
>> each comparison. That is effectively equivalent to what the proposal
>> implements.
>
> FWIW, I used to be a big proponent of using strxfrm(), but upon further
> analysis I realized it is a really difficult tradeoff. strxfrm()
> potentially saves a lot of CPU cost, but at the expense of expanding the
> size of the sort key. If the sort spills to disk, or even if it is just
> memory-bandwidth limited, it might actually be slower than doing the
> additional CPU work of calling strcoll().
>
> It's hard to see how to decide in advance which way will be faster. I
> suspect strxfrm() is still the better bet, especially for complex,
> large-character-set locales like Chinese. strcoll() might still win by a
> large margin on simple, mostly-ASCII character sets.

Perhaps we need a bit of performance testing to prove the point. Maybe the behaviour should be locale-dependent.

cheers

andrew
On Mon, Jul 22, 2013 at 12:49 PM, Greg Stark <stark@mit.edu> wrote:
> FWIW, I used to be a big proponent of using strxfrm(), but upon further
> analysis I realized it is a really difficult tradeoff. strxfrm()
> potentially saves a lot of CPU cost, but at the expense of expanding the
> size of the sort key. If the sort spills to disk, or even if it is just
> memory-bandwidth limited, it might actually be slower than doing the
> additional CPU work of calling strcoll().
>
> It's hard to see how to decide in advance which way will be faster. I
> suspect strxfrm() is still the better bet, especially for complex,
> large-character-set locales like Chinese. strcoll() might still win by a
> large margin on simple, mostly-ASCII character sets.

The storage blow-up on systems I've tested is on the order of 10x. That's possibly fine if the data still fits in memory, but it pretty much sucks if it makes your sort spill to disk, which seems like a likely outcome in many cases.

But I don't have much trouble believing the OP's contention that he's coded a locale-specific version that is faster than the version that ships with the OS. On glibc, for example, we copy the strings we want to compare so that we can add a terminating zero byte. The first thing that glibc does is call strlen(). That's pretty horrible, and I'm not at all sure the horror ends there, either.

It would be great to have support for user-defined collations in PostgreSQL. Let the user provide their own comparison function and whatever else is needed, and use that instead of the OS-specific support. Aside from the performance advantages, one could even create collations that have the same names and orderings on all platforms we support. Our support team has gotten more than one inquiry of the form "what's the equivalent of Linux collation XYZ on Windows?" - and telling them that there is no exact equivalent is not the answer they want to hear.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
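For context, a simplified sketch of the copy-and-terminate pattern being described. This is not the actual varstr_cmp() code, and error handling is omitted; it only illustrates why the extra copies plus strcoll()'s internal strlen() waste work:

    #include <stdlib.h>
    #include <string.h>

    /*
     * Compare two non-NUL-terminated strings with strcoll().  The copies
     * exist only to append the terminating zero byte that strcoll()
     * requires; the C library will then walk the strings again just to
     * find their lengths.  (Allocation failures are ignored here.)
     */
    static int
    varstr_cmp_sketch(const char *a, size_t alen, const char *b, size_t blen)
    {
        char   *abuf = malloc(alen + 1);
        char   *bbuf = malloc(blen + 1);
        int     result;

        memcpy(abuf, a, alen);
        abuf[alen] = '\0';
        memcpy(bbuf, b, blen);
        bbuf[blen] = '\0';

        result = strcoll(abuf, bbuf);

        free(abuf);
        free(bbuf);
        return result;
    }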
(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.
On Tue, Jul 23, 2013 at 9:42 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> (Replying on phone, please forgive bad quoting)
>
> Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

Yes.

> I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.

I pretty much lost interest in ICU upon reading that they use UTF-16 as their internal format.

http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

What that would mean for us is that instead of copying the input strings into a temporary buffer and passing the buffer to strcoll(), we'd need to convert them to ICU's representation (which means writing twice as many bytes as the length of the input string in cases where the input string is mostly single-byte characters) and then call ICU's strcoll() equivalent. I agree that it might be worth testing, but I can't work up much optimism. It seems to me that something that operates directly on the server encoding could run a whole lot faster.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/23/2013 09:42 PM, Craig Ringer wrote:
> (Replying on phone, please forgive bad quoting)
>
> Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

Yes, we need OS-independent collations.

> I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.
On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:
> I pretty much lost interest in ICU upon reading that they use UTF-16
> as their internal format.
>
> http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

The UTF-8 support has been steadily improving:

    For example, icu::Collator::compareUTF8() compares two UTF-8 strings
    incrementally, without converting all of the two strings to UTF-16 if
    there is an early base letter difference.

    http://userguide.icu-project.org/strings/utf-8

For all other encodings you should be able to use an iterator. As to performance I have no idea.

The main issue with strxfrm() is its lame API. If it supported returning prefixes you'd be set, but as it is you need >10MB of memory just to transform a 10MB string, even if only the first few characters would be enough to sort...

Mvg,

-- 
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
>   -- Arthur Schopenhauer
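For comparison, a hedged sketch of what a UTF-8 comparison through ICU's C API could look like. ucol_strcollUTF8() is only available in newer ICU releases, the collator name is arbitrary, and the fallback on error is deliberately crude; this is not proposed PostgreSQL code.

    #include <string.h>
    #include <unicode/ucol.h>

    /*
     * Compare two UTF-8 strings with an ICU collator.  ucol_strcollUTF8()
     * works incrementally, without converting the whole strings to UTF-16.
     */
    static int
    icu_utf8_cmp(UCollator *coll, const char *a, int32_t alen,
                 const char *b, int32_t blen)
    {
        UErrorCode       status = U_ZERO_ERROR;
        UCollationResult r = ucol_strcollUTF8(coll, a, alen, b, blen, &status);

        if (U_FAILURE(status))
        {
            /* crude byte-wise fallback on error */
            int c = memcmp(a, b, alen < blen ? (size_t) alen : (size_t) blen);
            return c != 0 ? c : (int) (alen - blen);
        }
        return (r == UCOL_LESS) ? -1 : (r == UCOL_GREATER) ? 1 : 0;
    }

    /*
     * Usage (illustrative):
     *     UErrorCode st = U_ZERO_ERROR;
     *     UCollator *coll = ucol_open("zh", &st);
     *     ... icu_utf8_cmp(coll, s1, l1, s2, l2) ...
     *     ucol_close(coll);
     */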
On Sun, Jul 28, 2013 at 5:39 AM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:
>> I pretty much lost interest in ICU upon reading that they use UTF-16
>> as their internal format.
>>
>> http://userguide.icu-project.org/strings#TOC-Strings-in-ICU
>
> The UTF-8 support has been steadily improving:
>
>     For example, icu::Collator::compareUTF8() compares two UTF-8 strings
>     incrementally, without converting all of the two strings to UTF-16 if
>     there is an early base letter difference.
>
>     http://userguide.icu-project.org/strings/utf-8
>
> For all other encodings you should be able to use an iterator. As to
> performance I have no idea.
>
> The main issue with strxfrm() is its lame API. If it supported
> returning prefixes you'd be set, but as it is you need >10MB of memory
> just to transform a 10MB string, even if only the first few characters
> would be enough to sort...

Yep, definitely. And by ">10MB" you mean ">90MB", at least on my Mac, which is really outrageous.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/23/2013 09:42 PM, Craig Ringer wrote:
> (Replying on phone, please forgive bad quoting)
>
> Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?
>
> I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.

I have a new idea: add a compare-method column to pg_collation, so that every collation has its own compare function (or null). When varstr_cmp() is called and the specified collation has a compare function, call it instead of strcoll().

How about this?

Regards.

Quan Zongliang
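A rough sketch of the dispatch being proposed. The catalog lookup helper, comparator type, and fallback function are invented purely for illustration and do not exist in PostgreSQL; the Oid typedef stands in for the real one.

    #include <string.h>

    typedef unsigned int Oid;               /* stand-in for PostgreSQL's Oid */

    /* hypothetical per-collation comparator type */
    typedef int (*collation_cmp_fn) (const char *a, int alen,
                                     const char *b, int blen);

    /* invented lookup: would read the new pg_collation column; NULL = none */
    extern collation_cmp_fn lookup_collation_compare_fn(Oid collid);

    /* stands in for today's strcoll()-based varstr_cmp() body */
    extern int strcoll_based_cmp(const char *a, int alen,
                                 const char *b, int blen);

    static int
    varstr_cmp_sketch(const char *a, int alen, const char *b, int blen,
                      Oid collid)
    {
        collation_cmp_fn cmp = lookup_collation_compare_fn(collid);

        if (cmp != NULL)
            return cmp(a, alen, b, blen);    /* collation-specific comparator */

        return strcoll_based_cmp(a, alen, b, blen);  /* existing OS path */
    }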
On Wed, Sep 4, 2013 at 11:02 PM, Quan Zongliang <quanzongliang@gmail.com> wrote:
> I have a new idea: add a compare-method column to pg_collation, so that
> every collation has its own compare function (or null). When varstr_cmp()
> is called and the specified collation has a compare function, call it
> instead of strcoll().

I think we're going to need to have two kinds of collations: OS-derived collations (which get all of their smarts from the OS), and PG-internal collations (which use PG-aware code for everything). Which I suspect is a bit more involved than what you're imagining, but mixing and matching doesn't seem likely to end well.

However, what you're proposing might serve as a useful demonstration of how much performance there is to be gained here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/06/2013 01:02 AM, Robert Haas wrote:
> On Wed, Sep 4, 2013 at 11:02 PM, Quan Zongliang <quanzongliang@gmail.com> wrote:
>> I have a new idea: add a compare-method column to pg_collation, so that
>> every collation has its own compare function (or null). When varstr_cmp()
>> is called and the specified collation has a compare function, call it
>> instead of strcoll().
>
> I think we're going to need to have two kinds of collations:
> OS-derived collations (which get all of their smarts from the OS), and
> PG-internal collations (which use PG-aware code for everything).
> Which I suspect is a bit more involved than what you're imagining, but
> mixing and matching doesn't seem likely to end well.
>
> However, what you're proposing might serve as a useful demonstration
> of how much performance there is to be gained here.

Understood. I am only trying to speed up text comparison, not redesign locale support.

Do you have a plan to do this?

Thank you.
Quan Zongliang
On Mon, Sep 9, 2013 at 5:22 AM, Quan Zongliang <quanzongliang@gmail.com> wrote:
> Understood. I am only trying to speed up text comparison, not redesign
> locale support.
>
> Do you have a plan to do this?

Not any time soon, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company