Thread: Impact of UNICODE encoding on performance

Impact of UNICODE encoding on performance

From

Harry Mantheakis

Date:

16 March 2004, 06:44:06

Hello

I am just setting out on a new project, having recently switched to
PostgreSQL.

My immediate requirements would be satisfied with ISO-8859-1 (LATIN-1)
encoding, but it is conceivable that, if things go really well, somewhere in
the future my character encoding requirements will broaden.

So I am tempted to specify UNICODE from the outset, and be done with it.

But I cannot help wondering how much of a performance penalty this entails.

If the performance hit is not significant, I shall be happy to stick with
UNICODE.

But if anyone has any strong views (or experience) on this issue I shall be
very grateful for some feedback.

Many thanks.

Harry Mantheakis
London, UK

Re: Impact of UNICODE encoding on performance

From

Aarni Ruuhimäki

Date:

16 March 2004, 19:02:18

Hi Harry,

Dunno about the performance penalty, but so far I am happy with LATIN1 dbase
system (RH and Trustix). Even with cyrillic characters. Then again, I work
with browser interfaces and it's not really up to me what encoding the client
has or has not installed. <if western, charset=iso-8859-1, if fellow
russki harasoo charset=windows-1251> is, I guess, a good bet. It's a windows
world, so far.

Soviet KOI-X X, KOI8-r, KOI8-RU, Mac Cyrillic (Standard), CyrWin Cyrillic and
the rest of the soup ...

Some experience and my half a pea.

BR,

Aarni

--
-------------------------------------------------
Aarni Ruuhimäki | Megative Tmi | KYMI.com
Pääsintie 26 | 45100 Kouvola | FINLAND
-------------------------------------------------
This is a bugfree broadcast to you from a linux system.

Re: Impact of UNICODE encoding on performance

From

Harry Mantheakis

Date:

17 March 2004, 12:25:49

Thanks for the feedback Aarni!

It's good to know that you are happy with LATIN-1. That is my fallback at
the moment.

I just wonder how much of a performance price there is to pay for going with
UNICODE.

I guess I might just have to suck it and see!

Kind regards

Harry


Re: Impact of UNICODE encoding on performance

From

Reshat Sabiq

Date:

17 March 2004, 23:27:26

I'm not very knowledgeable on this, but I think you should try UTF-8 from the start, given your expectations. I am able to save UTF-8 strings into a LATIN-1 db and retrieve them using JDBC, but viewing them in pgAdmin III is not a pretty sight (understandably). I haven't used this setup extensively, though, and I think that queries (comparisons) might be affected by it (i.e., a string with 2 characters corresponding to 1 UTF-8 character would be equal to its UTF-8 counterpart, which is clearly not intended). On the other hand, it is also conceivable that queries won't be affected, if no meaningless overlaps like that can occur.
In general, I have read that Unicode is somewhat slower (understandably), but I don't think the difference is significant. One just needs a sensible character comparison method that does a bitmap first, so I don't think the overhead is big. There are probably studies on the web.
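
To make the overlap concern concrete, here is a minimal Python sketch with no database involved. It assumes that storing UTF-8 text in a LATIN-1 database amounts to reading each UTF-8 byte as one LATIN-1 character; the helper names are made up for illustration.

```python
def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then read every byte back as one LATIN-1 character."""
    return text.encode("utf-8").decode("latin-1")


def recover(misread: str) -> str:
    """Undo the misreading: turn the characters back into bytes, decode as UTF-8."""
    return misread.encode("latin-1").decode("utf-8")


original = "\u00e9"                 # e with acute accent, a single character
stored = misread_as_latin1(original)

print(len(original), len(stored))   # 1 2
print(stored == original)           # False
print(recover(stored) == original)  # True

# A genuine two-character LATIN-1 string (A with tilde, copyright sign)
# is indistinguishable from the misread form of the single character.
genuine = "\u00c3\u00a9"
print(genuine == stored)            # True
```

The round trip works as long as both ends agree on the trick, which would explain why JDBC can read the strings back while a tool that takes the declared encoding at face value shows two characters instead of one.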

-- 
Sincerely,
Reshat.


Re: Impact of UNICODE encoding on performance

From

Harry Mantheakis

Date:

18 March 2004, 04:34:26

Reshat, thanks for the input.

I probably will go with Unicode. Things can only get faster, anyway :-)

Regards

Harry

Re: Impact of UNICODE encoding on performance

From

"M. Bastin"

Date:

18 March 2004, 10:27:31

With UNICODE UTF-8 the basic (a-z, A-Z, 0-9, ...) 128 characters
(there are actually fewer than 128) are single-byte characters
identical to the original ASCII specification.  All other characters
take two or more bytes.

This means that as long as you are transferring Roman-alphabet-based
text, the impact will be very low, since the text will mostly consist
of those 128 characters.

For other languages, more characters consisting of multiple bytes
would be transferred.
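
The size difference can be checked with a few lines of Python. This is only a sketch of the byte counts involved, not a measurement of PostgreSQL itself; `encoded_size` is a helper name invented for the example, and the sample string is borrowed from an address earlier in the thread.

```python
def encoded_size(text: str, encoding: str = "utf-8") -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


# One ASCII letter, one accented Latin letter, one Cyrillic letter,
# the Euro sign and one CJK character.
for ch in ["a", "\u00e9", "\u042f", "\u20ac", "\u65e5"]:
    print(repr(ch), encoded_size(ch))   # sizes: 1, 2, 2, 3, 3

# Mostly-ASCII text grows only by its non-ASCII characters.
sample = "P\u00e4\u00e4sintie 26"
print(len(sample))                      # 12 characters
print(encoded_size(sample, "latin-1"))  # 12 bytes
print(encoded_size(sample, "utf-8"))    # 14 bytes
```

For this Finnish sample the UTF-8 form is two bytes longer than the LATIN-1 form, one extra byte per accented letter.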

I don't know about PostgreSQL's internal treatment of multi-byte
characters and whether this would require more CPU time.

After weighing the pros and cons, I'd definitely go with UNICODE.

Marc

Re: Impact of UNICODE encoding on performance

From

Harry Mantheakis

Date:

18 March 2004, 12:42:29

Marc, thanks for your input.

The more I think about it, the more it seems like Unicode is the right
answer.

In the meantime I have not seen anyone screaming out about performance
issues, so I am not going to worry about it any more.

Unicode also supports the Euro currency symbol, which helps a lot :-)
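
As a quick check of the Euro point, the following Python sketch tries to encode the symbol in three character sets. It only illustrates the encodings themselves and says nothing about database behaviour; `can_encode` is a hypothetical helper.

```python
def can_encode(text: str, encoding: str) -> bool:
    """True if every character of the text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


euro = "\u20ac"

print(can_encode(euro, "latin-1"))      # False: ISO-8859-1 has no Euro sign
print(can_encode(euro, "iso-8859-15"))  # True: ISO-8859-15 added it at byte 0xA4
print(euro.encode("utf-8"))             # b'\xe2\x82\xac' (three bytes)
```

So with plain LATIN-1 the Euro sign cannot be represented at all, while UTF-8 stores it at a cost of three bytes.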

Kind regards

Harry

