Re: Unicode combining characters - Mailing list pgsql-hackers

From	Tatsuo Ishii
Subject	Re: Unicode combining characters
Date	October 4, 2001 01:18:31
Msg-id	20011004111642R.t-ishii@sra.co.jp
In response to	Re: Unicode combining characters (Tatsuo Ishii <t-ishii@sra.co.jp>)
Responses	Re: Unicode combining characters Re: Unicode combining characters
List	pgsql-hackers

> Ok. I ran the modified test (the iteration count is now reduced to
> 100000 in liketest()). As you can see, there's a huge difference. MB
> seems to be up to ~8 times slower :-< There seem to be some problems
> in the implementation. Considering REGEX is not so slow, maybe we
> should employ the same design as REGEX, i.e. using wide characters,
> not multibyte streams...
> 
> MB+LIKE
> Total runtime: 1321.58 msec
> Total runtime: 1718.03 msec
> Total runtime: 2519.97 msec
> Total runtime: 4187.05 msec
> Total runtime: 7629.24 msec
> Total runtime: 14456.45 msec
> Total runtime: 17320.14 msec
> Total runtime: 17323.65 msec
> Total runtime: 17321.51 msec
> 
> noMB+LIKE
> Total runtime: 964.90 msec
> Total runtime: 993.09 msec
> Total runtime: 1057.40 msec
> Total runtime: 1192.68 msec
> Total runtime: 1494.59 msec
> Total runtime: 2078.75 msec
> Total runtime: 2328.77 msec
> Total runtime: 2326.38 msec
> Total runtime: 2330.53 msec

I did some trials with a wide-character implementation and saw
virtually no improvement. My guess is that the logic employed in LIKE
is too simple to hide the overhead of the conversion between multibyte
and wide characters. The reason REGEX with MB is not so slow is
probably the complexity of its own logic. As you can see in my previous
postings, the $1 ~ $2 operation (logically the same as LIKE '%a%')
is, for example, almost 80 times slower than LIKE (remember that
liketest() loops 10 times more than regextest()).

So I decided to use a completely different approach. LIKE now has two
matching engines: one for single byte encodings (MatchText()) and one
for multibyte encodings (MBMatchText()). MatchText() is identical to
the non-MB version, so there is virtually no performance penalty for
single byte encodings. MBMatchText() is identical to the one used in
7.1.
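
To illustrate the idea of two engines behind one entry point, here is a
minimal sketch. It is not the actual like.c code: the names sb_match(),
mb_match(), like_match() and the max_char_len parameter are made up for
this example, it assumes UTF-8 for the multibyte case, and it handles
only literal characters and '_' (no '%', no escapes).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 character at s,
 * clamped so that we never step past the terminating NUL. */
static size_t mb_char_len(const char *s)
{
    unsigned char lead = (unsigned char) s[0];
    size_t want = 1, got = 1;

    assert(lead != '\0');           /* caller must not be at end of string */
    if ((lead & 0xE0) == 0xC0) want = 2;
    else if ((lead & 0xF0) == 0xE0) want = 3;
    else if ((lead & 0xF8) == 0xF0) want = 4;
    while (got < want && s[got] != '\0')
        got++;
    return got;
}

/* Single byte engine: one character is always one byte. */
static int sb_match(const char *t, const char *p)
{
    while (*t && *p) {
        if (*p != '_' && *p != *t)
            return 0;
        t++;
        p++;
    }
    return *t == '\0' && *p == '\0';
}

/* Multibyte engine: every step has to work out the character length. */
static int mb_match(const char *t, const char *p)
{
    while (*t && *p) {
        size_t tl = mb_char_len(t);
        size_t pl = mb_char_len(p);

        /* '_' matches one whole character, however many bytes it has */
        if (*p != '_' && (tl != pl || memcmp(t, p, tl) != 0))
            return 0;
        t += tl;
        p += pl;
    }
    return *t == '\0' && *p == '\0';
}

/* Entry point: pick the engine once, based on the encoding's maximum
 * character length (passed in here; an assumption of this sketch). */
static int like_match(const char *t, const char *p, int max_char_len)
{
    return (max_char_len == 1) ? sb_match(t, p) : mb_match(t, p);
}
```

With a two-byte character in the text, like_match("a\xC3\xA9" "c", "a_c", 4)
matches, while the same call with max_char_len 1 does not, because the
single byte engine counts the two bytes as two characters.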

Here are the results for the MB build with SQL_ASCII encoding.

Total runtime: 901.69 msec
Total runtime: 939.08 msec
Total runtime: 993.60 msec
Total runtime: 1148.18 msec
Total runtime: 1434.92 msec
Total runtime: 2024.59 msec
Total runtime: 2288.50 msec
Total runtime: 2290.53 msec
Total runtime: 2316.00 msec

To accomplish this, I moved MatchText etc. to a separate file, and
like.c now includes it *twice* (a similar technique is used in
regexec()). This makes like.o a little bit larger, but I believe it is
worth it for the optimization.
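
For readers unfamiliar with the trick, here is a rough single-file
sketch of it. The real change includes a separate source file twice
with different macro definitions; to keep the example in one block, a
function-generating macro stands in for that file. All names
(DEFINE_MATCHER, SB_CHAR_LEN, utf8_char_len, sb_match_text,
mb_match_text) are invented for this sketch, UTF-8 is assumed for the
multibyte case, and escape characters are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Single byte case: the length is a constant, so the compiler can
 * reduce the generated matcher to plain pointer increments. */
#define SB_CHAR_LEN(s) ((size_t) 1)

/* Multibyte case (UTF-8 assumed): length from the lead byte, clamped
 * so that we never step past the terminating NUL. */
static size_t utf8_char_len(const char *s)
{
    unsigned char lead = (unsigned char) s[0];
    size_t want = 1, got = 1;

    assert(lead != '\0');           /* caller must not be at end of string */
    if ((lead & 0xE0) == 0xC0) want = 2;
    else if ((lead & 0xF0) == 0xE0) want = 3;
    else if ((lead & 0xF8) == 0xF0) want = 4;
    while (got < want && s[got] != '\0')
        got++;
    return got;
}

/* The shared matcher body, written once.  It plays the role of the
 * file that is included twice: NAME and CHARLEN are the only things
 * that differ between the two expansions.  Supports '%' and '_'. */
#define DEFINE_MATCHER(NAME, CHARLEN)                                  \
static int NAME(const char *t, const char *p)                          \
{                                                                      \
    while (*p) {                                                       \
        if (*p == '%') {                                               \
            p++;                                                       \
            if (*p == '\0')                                            \
                return 1;                                              \
            for (;;) {                                                 \
                if (NAME(t, p))                                        \
                    return 1;                                          \
                if (*t == '\0')                                        \
                    return 0;                                          \
                t += CHARLEN(t);                                       \
            }                                                          \
        }                                                              \
        if (*t == '\0')                                                \
            return 0;                                                  \
        size_t tl = CHARLEN(t);                                        \
        size_t pl = CHARLEN(p);                                        \
        if (*p != '_' && (tl != pl || memcmp(t, p, tl) != 0))          \
            return 0;                                                  \
        t += tl;                                                       \
        p += pl;                                                       \
    }                                                                  \
    return *t == '\0';                                                 \
}

/* "Include" the body twice: once per engine. */
DEFINE_MATCHER(sb_match_text, SB_CHAR_LEN)
DEFINE_MATCHER(mb_match_text, utf8_char_len)
```

The object code contains two copies of the matcher, which is the size
cost mentioned above, but the single byte copy carries none of the
character-length logic.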
--
Tatsuo Ishii
