Re: Unicode combining characters - Mailing list pgsql-hackers

From Tatsuo Ishii
Subject Re: Unicode combining characters
Msg-id 20011004111642R.t-ishii@sra.co.jp
In response to Re: Unicode combining characters  (Tatsuo Ishii <t-ishii@sra.co.jp>)
> Ok. I ran the modified test (the iteration count in liketest() is now
> reduced to 100000). As you can see, there's a huge difference: MB seems
> to be up to ~8 times slower :-< There seems to be some problem in the
> implementation. Considering that REGEX is not so slow, maybe we should
> employ the same design as REGEX, i.e. using wide characters, not
> multibyte streams...
> 
> MB+LIKE
> Total runtime: 1321.58 msec
> Total runtime: 1718.03 msec
> Total runtime: 2519.97 msec
> Total runtime: 4187.05 msec
> Total runtime: 7629.24 msec
> Total runtime: 14456.45 msec
> Total runtime: 17320.14 msec
> Total runtime: 17323.65 msec
> Total runtime: 17321.51 msec
> 
> noMB+LIKE
> Total runtime: 964.90 msec
> Total runtime: 993.09 msec
> Total runtime: 1057.40 msec
> Total runtime: 1192.68 msec
> Total runtime: 1494.59 msec
> Total runtime: 2078.75 msec
> Total runtime: 2328.77 msec
> Total runtime: 2326.38 msec
> Total runtime: 2330.53 msec

I did some trials with a wide-character implementation and saw
virtually no improvement. My guess is that the logic employed in LIKE is
too simple to hide the overhead of the multibyte and wide-character
conversion. The reason why REGEX with MB is not so slow would be the
complexity of its logic, I think. As you can see in my previous
postings, the $1 ~ $2 operation (logically the same as LIKE '%a%')
is, for example, almost 80 times slower than LIKE (remember that
liketest() loops 10 times more than regextest()).
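
To illustrate the guess above, here is a minimal sketch of the per-character
step in each case. It is not the PostgreSQL code: the names utf8_char_len,
skip_chars_sb and skip_chars_mb are hypothetical, and UTF-8 is assumed only
as an example of a multibyte encoding. When the surrounding matching logic
is as small as LIKE's, the extra length lookup per character can be a large
share of the total work.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of the UTF-8 character whose lead byte
 * is c. UTF-8 is an assumption made for this sketch only. */
static size_t utf8_char_len(unsigned char c)
{
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 1;                           /* ASCII, or invalid byte: step 1 */
}

/* Single-byte encoding: skipping n characters is plain pointer arithmetic. */
const char *skip_chars_sb(const char *p, const char *end, size_t n)
{
    return (size_t) (end - p) < n ? end : p + n;
}

/* Multibyte encoding: each character needs a length lookup before the
 * pointer can advance, so the cost grows with the number of characters. */
const char *skip_chars_mb(const char *p, const char *end, size_t n)
{
    while (n-- > 0 && p < end)
    {
        size_t len = utf8_char_len((unsigned char) *p);

        if (len > (size_t) (end - p))
            len = (size_t) (end - p);   /* truncated trailing character */
        p += len;
    }
    return p;
}
```

For the 4-byte string "a\xC3\xA9z" (three characters), skipping two
characters lands on byte offset 2 in the single-byte version and on byte
offset 3 in the multibyte version.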

So I decided to use a completely different approach. LIKE now has two
matching engines: one for single-byte encodings (MatchText()) and one
for multibyte encodings (MBMatchText()). MatchText() is identical to
the non-MB version, so there is virtually no performance penalty for
single-byte encodings. MBMatchText() handles multibyte encodings and is
identical to the one used in 7.1.

Here are the MB-case results with the SQL_ASCII encoding.

Total runtime: 901.69 msec
Total runtime: 939.08 msec
Total runtime: 993.60 msec
Total runtime: 1148.18 msec
Total runtime: 1434.92 msec
Total runtime: 2024.59 msec
Total runtime: 2288.50 msec
Total runtime: 2290.53 msec
Total runtime: 2316.00 msec

To accomplish this, I moved MatchText etc. to a separate file, and
like.c now includes it *twice* (a technique similar to the one used in
regexec()). This makes like.o a little bit larger, but I believe it is
worth it for the optimization.
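
The compile-the-matcher-twice idea can be sketched roughly as follows. This
is not the actual like.c: to keep the example in one file, a macro stands in
for the separately included source file, and the names DEFINE_LIKE_MATCHER,
sb_charlen, mb_charlen, sb_match_text, mb_match_text and like_match are all
hypothetical. UTF-8 is assumed as the multibyte encoding, and escape
characters are not handled.

```c
#include <stddef.h>
#include <string.h>

/* Single-byte engine: every character is one byte. */
static size_t sb_charlen(const char *s, size_t rem)
{
    (void) s;
    (void) rem;
    return 1;
}

/* Multibyte engine (UTF-8 assumed): length from the lead byte, clamped to
 * the bytes remaining. Only called with rem > 0. */
static size_t mb_charlen(const char *s, size_t rem)
{
    unsigned char c = (unsigned char) *s;
    size_t n = 1;

    if ((c & 0xE0) == 0xC0) n = 2;
    else if ((c & 0xF0) == 0xE0) n = 3;
    else if ((c & 0xF8) == 0xF0) n = 4;
    return n < rem ? n : rem;
}

/* The matcher body is written once and expanded twice with a different
 * CHARLEN. '%' matches any run of characters, '_' exactly one character,
 * anything else must match literally. */
#define DEFINE_LIKE_MATCHER(FUNC, CHARLEN)                                  \
static int FUNC(const char *t, size_t tlen, const char *p, size_t plen)     \
{                                                                           \
    while (plen > 0) {                                                      \
        if (*p == '%') {                                                    \
            p++; plen--;                                                    \
            for (;;) {                                                      \
                if (FUNC(t, tlen, p, plen)) return 1;                       \
                if (tlen == 0) return 0;                                    \
                size_t n = CHARLEN(t, tlen);                                \
                t += n; tlen -= n;                                          \
            }                                                               \
        }                                                                   \
        if (tlen == 0) return 0;                                            \
        size_t tn = CHARLEN(t, tlen);                                       \
        if (*p == '_') {                                                    \
            p++; plen--;                                                    \
        } else {                                                            \
            size_t pn = CHARLEN(p, plen);                                   \
            if (pn != tn || memcmp(t, p, tn) != 0) return 0;                \
            p += pn; plen -= pn;                                            \
        }                                                                   \
        t += tn; tlen -= tn;                                                \
    }                                                                       \
    return tlen == 0;                                                       \
}

DEFINE_LIKE_MATCHER(sb_match_text, sb_charlen)
DEFINE_LIKE_MATCHER(mb_match_text, mb_charlen)

/* Pick the engine once per call, based on the encoding's maximum
 * character width in bytes (a hypothetical parameter). */
int like_match(const char *t, const char *p, int max_char_bytes)
{
    size_t tlen = strlen(t), plen = strlen(p);

    return max_char_bytes == 1 ? sb_match_text(t, tlen, p, plen)
                               : mb_match_text(t, tlen, p, plen);
}
```

With this sketch, like_match("a\xC3\xA9z", "a_z", 1) fails because '_'
consumes only one byte, while the same call with max_char_bytes set to 4
succeeds. The single-byte path never pays for the length lookup, at the
price of carrying two copies of the matcher in the object file.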
--
Tatsuo Ishii

