Thread: improve Chinese locale performance
Hi hackers,

I tried to improve performance when the database locale is Chinese.

Under openSUSE, creating an index on a table with 54996 rows:
locale=C, 140ms
locale=zh_CN, 985ms

I think the strcoll() function on Linux is too slow. So I made a new
UTF-8 to GB18030 map and stored the Chinese collation order in it, so
that strcoll() is not called at all. With my modified code, the same
operation with locale=zh_CN takes 203ms.

My English is too limited to describe the idea well; please see the
attached patch. Users in China tend to choose locale=C because it is
faster, and call convert() when they need ordering. I also found some
incorrect orderings in the zh_CN locale on Linux.

In my tests under Windows there is a performance improvement too.
Windows XP SP3, in a VM, locale=Chinese_People's Republic of China.936:
original code, 343ms
modified code, 235ms

Some other Asian languages may have the same problem.

Regards,
Quan Zongliang
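For readers unfamiliar with the approach, a rough sketch of a table-driven comparison along these lines might look as follows. The table name and layout here are purely illustrative and are not taken from the attached patch; the sketch assumes the UTF-8 input has already been decoded to code points.

    /*
     * Rough sketch only: assumes zh_weight_of[] holds a precomputed
     * GB18030-based sort weight for every Unicode code point, so that
     * two strings can be compared by their weights without strcoll().
     */
    #include <stddef.h>
    #include <stdint.h>

    extern const uint32_t zh_weight_of[0x110000];   /* hypothetical map */

    static int
    zh_compare(const uint32_t *a, size_t alen, const uint32_t *b, size_t blen)
    {
        size_t n = (alen < blen) ? alen : blen;

        for (size_t i = 0; i < n; i++)
        {
            uint32_t wa = zh_weight_of[a[i]];
            uint32_t wb = zh_weight_of[b[i]];

            if (wa != wb)
                return (wa < wb) ? -1 : 1;
        }
        /* equal prefix: the shorter string sorts first */
        return (alen < blen) ? -1 : (alen > blen) ? 1 : 0;
    }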
On 07/22/2013 12:17 PM, Quan Zongliang wrote:
> Hi hackers,
>
> I tried to improve performance when the database locale is Chinese.
>
> Under openSUSE, creating an index on a table with 54996 rows:
> locale=C, 140ms
> locale=zh_CN, 985ms
>
> I think the strcoll() function on Linux is too slow. So I made a new
> UTF-8 to GB18030 map and stored the Chinese collation order in it, so
> that strcoll() is not called at all. With my modified code, the same
> operation with locale=zh_CN takes 203ms.

It might be worth looking at glibc's strcoll() implementation. See if it performs better with the latest glibc, and if not, try to improve glibc's strcoll().

I'd be interested in seeing a test case for this that shows that the results of your new collation are exactly the same as the original strcoll()-based approach.

-- 
Craig Ringer                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 07/22/2013 03:54 PM, Craig Ringer wrote:
> On 07/22/2013 12:17 PM, Quan Zongliang wrote:
>> Hi hackers,
>>
>> I tried to improve performance when the database locale is Chinese.
>>
>> Under openSUSE, creating an index on a table with 54996 rows:
>> locale=C, 140ms
>> locale=zh_CN, 985ms
>>
>> I think the strcoll() function on Linux is too slow. So I made a new
>> UTF-8 to GB18030 map and stored the Chinese collation order in it, so
>> that strcoll() is not called at all. With my modified code, the same
>> operation with locale=zh_CN takes 203ms.
>
> It might be worth looking at glibc's strcoll() implementation. See if it
> performs better with the latest glibc, and if not, try to improve
> glibc's strcoll().
>
> I'd be interested in seeing a test case for this that shows that the
> results of your new collation are exactly the same as the original
> strcoll()-based approach.

The results are not exactly the same. I found some ordering errors in glibc's strcoll() when sorting Chinese characters: Chinese has a lot of special characters, and glibc's strcoll() does not handle some of them, or misses part of them.

Yes, the best way would be to improve glibc's strcoll(), but I don't know how to do that.

Thanks,
Quan Zongliang
On 7/22/13 3:54 AM, Craig Ringer wrote:
> It might be worth looking at glibc's strcoll() implementation. See if it
> performs better with the latest glibc, and if not, try to improve
> glibc's strcoll().

I think part of the problem is that we call strcoll() for each comparison, instead of doing strxfrm() once for each datum and then just strcmp() for each comparison. That is effectively equivalent to what the proposal implements.
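To make the scheme concrete, here is a minimal sketch of the strxfrm()-then-strcmp() approach. It is illustrative only; the names are not from any patch in this thread, and allocation failures are only partly handled.

    #include <stdlib.h>
    #include <string.h>

    /*
     * Build a locale-aware sort key once per datum; later comparisons
     * can then use plain strcmp() instead of repeated strcoll() calls.
     */
    static char *
    make_sort_key(const char *s)
    {
        size_t  len = strxfrm(NULL, s, 0);   /* key length, excl. the NUL */
        char   *key = malloc(len + 1);

        if (key != NULL)
            strxfrm(key, s, len + 1);
        return key;
    }

    /* qsort() comparator over an array of pre-built keys */
    static int
    cmp_keys(const void *a, const void *b)
    {
        return strcmp(*(char *const *) a, *(char *const *) b);
    }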
On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> I think part of the problem is that we call strcoll() for each comparison,
> instead of doing strxfrm() once for each datum and then just strcmp() for
> each comparison. That is effectively equivalent to what the proposal
> implements.

FWIW, I used to be a big proponent of using strxfrm(), but upon further analysis I realized it is a really difficult tradeoff. strxfrm() potentially saves a lot of CPU cost, but at the expense of expanding the size of the sort key. If the sort spills to disk, or even if it is just memory-bandwidth limited, it might actually be slower than doing the additional CPU work of calling strcoll().

It's hard to see how to decide in advance which way will be faster. I suspect strxfrm() is still the better bet, especially for complex, large-character-set locales like Chinese. strcoll() might still win by a large margin on simple, mostly-ASCII character sets.

-- 
greg
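To get a feel for the size side of this tradeoff, one can simply ask strxfrm() how large a key it would produce. A throwaway sketch; the locale name and sample string are arbitrary, and the locale may not be installed on every system:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *sample = "中文排序测试";   /* arbitrary UTF-8 sample */
        size_t      keylen;

        if (setlocale(LC_COLLATE, "zh_CN.UTF-8") == NULL)
        {
            fprintf(stderr, "locale not available\n");
            return 1;
        }

        keylen = strxfrm(NULL, sample, 0);   /* size of the sort key */
        printf("input %zu bytes -> sort key %zu bytes (%.1fx)\n",
               strlen(sample), keylen, (double) keylen / strlen(sample));
        return 0;
    }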
On 07/22/2013 12:49 PM, Greg Stark wrote:
> On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> I think part of the problem is that we call strcoll() for each comparison,
>> instead of doing strxfrm() once for each datum and then just strcmp() for
>> each comparison. That is effectively equivalent to what the proposal
>> implements.
>
> FWIW, I used to be a big proponent of using strxfrm(), but upon further
> analysis I realized it is a really difficult tradeoff. strxfrm()
> potentially saves a lot of CPU cost, but at the expense of expanding the
> size of the sort key. If the sort spills to disk, or even if it is just
> memory-bandwidth limited, it might actually be slower than doing the
> additional CPU work of calling strcoll().
>
> It's hard to see how to decide in advance which way will be faster. I
> suspect strxfrm() is still the better bet, especially for complex,
> large-character-set locales like Chinese. strcoll() might still win by a
> large margin on simple, mostly-ASCII character sets.

Perhaps we need a bit of performance testing to prove the point. Maybe the behaviour should be locale-dependent.

cheers

andrew
On Mon, Jul 22, 2013 at 12:49 PM, Greg Stark <stark@mit.edu> wrote:
> FWIW, I used to be a big proponent of using strxfrm(), but upon further
> analysis I realized it is a really difficult tradeoff. strxfrm()
> potentially saves a lot of CPU cost, but at the expense of expanding the
> size of the sort key. If the sort spills to disk, or even if it is just
> memory-bandwidth limited, it might actually be slower than doing the
> additional CPU work of calling strcoll().
>
> It's hard to see how to decide in advance which way will be faster. I
> suspect strxfrm() is still the better bet, especially for complex,
> large-character-set locales like Chinese. strcoll() might still win by a
> large margin on simple, mostly-ASCII character sets.

The storage blow-up on systems I've tested is on the order of 10x. That's possibly fine if the data still fits in memory, but it pretty much sucks if it makes your sort spill to disk, which seems like a likely outcome in many cases.

But I don't have much trouble believing the OP's contention that he's coded a locale-specific version that is faster than the version that ships with the OS. On glibc, for example, we copy the strings we want to compare so that we can add a terminating zero byte. The first thing that glibc does is call strlen(). That's pretty horrible, and I'm not at all sure the horror ends there, either.

It would be great to have support for user-defined collations in PostgreSQL. Let the user provide their own comparison function and whatever else is needed, and use that instead of the OS-specific support. Aside from the performance advantages, one could even create collations that have the same names and orderings on all platforms we support. Our support team has gotten more than one inquiry of the form "what's the equivalent of Linux collation XYZ on Windows?" - and telling them that there is no exact equivalent is not the answer they want to hear.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
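For context, a simplified sketch of the copy-and-terminate pattern being described. This is not the actual varstr_cmp() code, and error handling is omitted; it only illustrates why the extra copies plus strcoll()'s internal strlen() waste work:

    #include <stdlib.h>
    #include <string.h>

    /*
     * Compare two non-NUL-terminated strings with strcoll().  The copies
     * exist only to append the terminating zero byte that strcoll()
     * requires; the C library will then walk the strings again just to
     * find their lengths.  (Allocation failures are ignored here.)
     */
    static int
    varstr_cmp_sketch(const char *a, size_t alen, const char *b, size_t blen)
    {
        char   *abuf = malloc(alen + 1);
        char   *bbuf = malloc(blen + 1);
        int     result;

        memcpy(abuf, a, alen);
        abuf[alen] = '\0';
        memcpy(bbuf, b, blen);
        bbuf[blen] = '\0';

        result = strcoll(abuf, bbuf);

        free(abuf);
        free(bbuf);
        return result;
    }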
(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.
On Tue, Jul 23, 2013 at 9:42 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> (Replying on phone, please forgive bad quoting)
>
> Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

Yes.

> I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.

I pretty much lost interest in ICU upon reading that they use UTF-16 as their internal format.

http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

What that would mean for us is that instead of copying the input strings into a temporary buffer and passing the buffer to strcoll(), we'd need to convert them to ICU's representation (which means writing twice as many bytes as the length of the input string in cases where the input string is mostly single-byte characters) and then call ICU's strcoll() equivalent. I agree that it might be worth testing, but I can't work up much optimism. It seems to me that something that operates directly on the server encoding could run a whole lot faster.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/23/2013 09:42 PM, Craig Ringer wrote:
> (Replying on phone, please forgive bad quoting)
>
> Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

Yes, we need OS-independent collations.

> I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.
On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:
> I pretty much lost interest in ICU upon reading that they use UTF-16
> as their internal format.
>
> http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

The UTF-8 support has been steadily improving:

    For example, icu::Collator::compareUTF8() compares two UTF-8 strings
    incrementally, without converting all of the two strings to UTF-16 if
    there is an early base letter difference.

    http://userguide.icu-project.org/strings/utf-8

For all other encodings you should be able to use an iterator. As to performance I have no idea.

The main issue with strxfrm() is its lame API. If it supported returning prefixes you'd be set, but as it is you need >10MB of memory just to transform a 10MB string, even if only the first few characters would be enough to sort...

Mvg,

-- 
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
>   -- Arthur Schopenhauer
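For comparison, a hedged sketch of what a UTF-8 comparison through ICU's C API could look like. ucol_strcollUTF8() is only available in newer ICU releases, the collator name is arbitrary, and the fallback on error is deliberately crude; this is not proposed PostgreSQL code.

    #include <string.h>
    #include <unicode/ucol.h>

    /*
     * Compare two UTF-8 strings with an ICU collator.  ucol_strcollUTF8()
     * works incrementally, without converting the whole strings to UTF-16.
     */
    static int
    icu_utf8_cmp(UCollator *coll, const char *a, int32_t alen,
                 const char *b, int32_t blen)
    {
        UErrorCode       status = U_ZERO_ERROR;
        UCollationResult r = ucol_strcollUTF8(coll, a, alen, b, blen, &status);

        if (U_FAILURE(status))
        {
            /* crude byte-wise fallback on error */
            int c = memcmp(a, b, alen < blen ? (size_t) alen : (size_t) blen);
            return c != 0 ? c : (int) (alen - blen);
        }
        return (r == UCOL_LESS) ? -1 : (r == UCOL_GREATER) ? 1 : 0;
    }

    /*
     * Usage (illustrative):
     *     UErrorCode st = U_ZERO_ERROR;
     *     UCollator *coll = ucol_open("zh", &st);
     *     ... icu_utf8_cmp(coll, s1, l1, s2, l2) ...
     *     ucol_close(coll);
     */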
On Sun, Jul 28, 2013 at 5:39 AM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:
>> I pretty much lost interest in ICU upon reading that they use UTF-16
>> as their internal format.
>>
>> http://userguide.icu-project.org/strings#TOC-Strings-in-ICU
>
> The UTF-8 support has been steadily improving:
>
>     For example, icu::Collator::compareUTF8() compares two UTF-8 strings
>     incrementally, without converting all of the two strings to UTF-16 if
>     there is an early base letter difference.
>
>     http://userguide.icu-project.org/strings/utf-8
>
> For all other encodings you should be able to use an iterator. As to
> performance I have no idea.
>
> The main issue with strxfrm() is its lame API. If it supported
> returning prefixes you'd be set, but as it is you need >10MB of memory
> just to transform a 10MB string, even if only the first few characters
> would be enough to sort...

Yep, definitely. And by ">10MB" you mean ">90MB", at least on my Mac, which is really outrageous.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/23/2013 09:42 PM, Craig Ringer wrote:
> (Replying on phone, please forgive bad quoting)
>
> Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?
>
> I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.

I have a new idea: add a compare-method column to pg_collation, so that every collation has its own compare function (or null). When varstr_cmp() is called and the specified collation has a compare function, call it instead of strcoll().

How about this?

Regards.

Quan Zongliang
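A rough sketch of the dispatch being proposed. The catalog lookup helper, comparator type, and fallback function are invented purely for illustration and do not exist in PostgreSQL; the Oid typedef stands in for the real one.

    #include <string.h>

    typedef unsigned int Oid;               /* stand-in for PostgreSQL's Oid */

    /* hypothetical per-collation comparator type */
    typedef int (*collation_cmp_fn) (const char *a, int alen,
                                     const char *b, int blen);

    /* invented lookup: would read the new pg_collation column; NULL = none */
    extern collation_cmp_fn lookup_collation_compare_fn(Oid collid);

    /* stands in for today's strcoll()-based varstr_cmp() body */
    extern int strcoll_based_cmp(const char *a, int alen,
                                 const char *b, int blen);

    static int
    varstr_cmp_sketch(const char *a, int alen, const char *b, int blen,
                      Oid collid)
    {
        collation_cmp_fn cmp = lookup_collation_compare_fn(collid);

        if (cmp != NULL)
            return cmp(a, alen, b, blen);    /* collation-specific comparator */

        return strcoll_based_cmp(a, alen, b, blen);  /* existing OS path */
    }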
On Wed, Sep 4, 2013 at 11:02 PM, Quan Zongliang <quanzongliang@gmail.com> wrote:
> I have a new idea: add a compare-method column to pg_collation, so that
> every collation has its own compare function (or null). When varstr_cmp()
> is called and the specified collation has a compare function, call it
> instead of strcoll().

I think we're going to need to have two kinds of collations: OS-derived collations (which get all of their smarts from the OS), and PG-internal collations (which use PG-aware code for everything). Which I suspect is a bit more involved than what you're imagining, but mixing and matching doesn't seem likely to end well.

However, what you're proposing might serve as a useful demonstration of how much performance there is to be gained here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/06/2013 01:02 AM, Robert Haas wrote:
> On Wed, Sep 4, 2013 at 11:02 PM, Quan Zongliang <quanzongliang@gmail.com> wrote:
>> I have a new idea: add a compare-method column to pg_collation, so that
>> every collation has its own compare function (or null). When varstr_cmp()
>> is called and the specified collation has a compare function, call it
>> instead of strcoll().
>
> I think we're going to need to have two kinds of collations:
> OS-derived collations (which get all of their smarts from the OS), and
> PG-internal collations (which use PG-aware code for everything).
> Which I suspect is a bit more involved than what you're imagining, but
> mixing and matching doesn't seem likely to end well.
>
> However, what you're proposing might serve as a useful demonstration
> of how much performance there is to be gained here.

Understood. I am only trying to speed up text comparison, not redesign locale support.

Do you have a plan to do this?

Thank you.
Quan Zongliang
On Mon, Sep 9, 2013 at 5:22 AM, Quan Zongliang <quanzongliang@gmail.com> wrote:
> Understood. I am only trying to speed up text comparison, not redesign
> locale support.
>
> Do you have a plan to do this?

Not any time soon, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company