Thread: the impact of encoding on performance.

the impact of encoding on performance.

From

Michael Ben-Nes

Date:

10 March 2005, 10:57:36

Hi All

<<< snip
The drawback of using locales other than C or POSIX in PostgreSQL is
its performance impact. It slows character handling and prevents
ordinary indexes from being used by LIKE. For this reason use locales
only if you actually need them.
snip >>>

What is the impact of the locale on the server? Is it irrelevant, small,
or huge?

Does the encoding of the DB impact performance too? UTF8, 8859-8?

Thanks

--
--------------------------
Canaan Surfing Ltd.
Internet Service Providers
Ben-Nes Michael - Manager
Tel: 972-4-6991122
Cel: 972-52-8555757
Fax: 972-4-6990098
http://www.canaan.net.il
--------------------------

Re: the impact of encoding on performance.

From

Tom Lane

Date:

11 March 2005, 02:46:28

Michael Ben-Nes <miki@canaan.co.il> writes:
>>  The drawback of using locales other than C or POSIX in PostgreSQL is
>> its performance impact. It slows character handling and prevents
>> ordinary indexes from being used by LIKE. For this reason use locales
>> only if you actually need them.

> What is the impact of the locale on the server? Is it irrelevant, small,
> or huge?

> Does the encoding of the DB impact performance too? UTF8, 8859-8?

These aren't really separable since you generally don't get to choose
the encoding independently of the locale.

I'm working on some simple benchmarking consisting of running MySQL's
sql-bench against a PG 8.0.1 server on a Fedora Core 3 machine.  Mostly
I'm interested in understanding in detail why sql-bench makes us look
so bad, but as long as I'm at it, it can provide one datapoint in answer
to your question.  In two runs that were identical except one used
en_US.utf8 locale and UTF8 encoding while the other used C locale and
SQL_ASCII encoding, most of the tests didn't show any meaningful
difference, but a couple of tests showed as much as a 2X advantage for C
locale.  These were tests that were heavily dependent on comparison of
strings, such as a SELECT COUNT(DISTINCT foo) across a large table.

So it would depend on your workload.  Certainly it's possible that
locale would make a big difference to you, but it might not.

Also, this all depends quite a bit on how efficiently your libc
implements strcoll() for non-C locales.  I believe there are some
platforms out there that are much slower than glibc, and would have
a correspondingly higher penalty for using a non-C locale.  You could
investigate this by timing "sort" on a large file in both locales.

            regards, tom lane
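
The suggestion above to time a sort in both locales could be approximated
without a large input file. The following is only a rough sketch, not part
of the original message: it sorts randomly generated strings in Python
using the C library's collation through locale.strcoll. The helper names
(make_words, time_sort), the word count, and the locale name "en_US.UTF-8"
are assumptions made for illustration, and that locale may not be installed
on every machine. Interpreter overhead is included in both timings, so only
the ratio between the two runs is meaningful.

```python
import locale
import random
import string
import time
from functools import cmp_to_key


def make_words(count, length=12, seed=0):
    """Build a reproducible list of random ASCII words (hypothetical test data)."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters
    return ["".join(rng.choice(alphabet) for _ in range(length))
            for _ in range(count)]


def time_sort(words, locale_name):
    """Return the seconds needed to sort `words` using locale-aware collation.

    Returns None if `locale_name` is not available on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        start = time.perf_counter()
        sorted(words, key=cmp_to_key(locale.strcoll))
        return time.perf_counter() - start
    finally:
        # Restore the previous collation locale so callers are unaffected.
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    words = make_words(100_000)
    # "en_US.UTF-8" is an assumed locale name; substitute one installed locally.
    for name in ("C", "en_US.UTF-8"):
        elapsed = time_sort(words, name)
        if elapsed is None:
            print(f"{name}: locale not available, skipped")
        else:
            print(f"{name}: {elapsed:.3f} s")
```

If the non-C run is many times slower than the C run, that hints at a
costly collation implementation in the platform's libc, which is the
situation the message above warns about. Real data in the target language
would give a more representative figure than random ASCII words.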