Re: Patch: add conversion from pg_wchar to multibyte - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Patch: add conversion from pg_wchar to multibyte
Msg-id CAPpHfduGy+ubDZNumLe9b_z_AFmrNYUbfNR-_A1gkqYPwASKWw@mail.gmail.com
In response to Re: Patch: add conversion from pg_wchar to multibyte  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Patch: add conversion from pg_wchar to multibyte
List pgsql-hackers
On Wed, May 2, 2012 at 4:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 1, 2012 at 6:02 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Right. When the number of trigrams is big, it is slow to scan the posting
> lists of all of them. The solution in this case is to exclude the most
> frequent trigrams from the index scan. But that requires some kind of
> statistics on trigram frequencies, which we don't have. We could estimate
> frequencies using some hard-coded assumptions about natural languages, or
> we could exclude arbitrary trigrams, but I don't like either of these
> ideas. This problem is also relevant for LIKE/ILIKE searches using trigram
> indexes.

I was thinking you could perhaps do it just based on the *number* of
trigrams, not necessarily their frequency.

Imagine we have two queries:
1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';

The first query requires reading the posting lists of trigrams "abc" and "bcd".
The second query requires reading the posting lists of trigrams "abc", "bcd", "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
We could decide to use an index scan for the first query and a sequential scan for the second one because the number of posting lists to read is high. But that would be unreasonable, because the second query is actually narrower than the first: we can use the same index scan for it, and the recheck will remove all false positives. So when the number of trigrams is high, we can simply exclude some of them from the index scan, which is better than falling back to a sequential scan (see the sketch below). The question is which trigrams to exclude. Ideally we would keep the rarest ones to make the index scan cheaper.
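To make the idea concrete, here is a minimal standalone C sketch of consulting only a capped number of a pattern's trigrams and relying on recheck for correctness. This is illustration only, not pg_trgm code: the cap MAX_SCAN_TRGM and the helper extract_trigrams() are invented names, and padding trigrams are ignored.

/* Hypothetical sketch, not pg_trgm code: scan the index with only a
 * capped number of a long pattern's trigrams and let recheck remove
 * the extra false positives.  MAX_SCAN_TRGM and extract_trigrams()
 * are made-up names for illustration. */
#include <stdio.h>
#include <string.h>

#define MAX_SCAN_TRGM 4     /* hypothetical cap on trigrams per scan */

/* Collect the overlapping 3-grams of a LIKE substring (no padding). */
static int
extract_trigrams(const char *s, char out[][4], int maxout)
{
    int len = (int) strlen(s);
    int n = 0;

    for (int i = 0; i + 3 <= len && n < maxout; i++)
    {
        memcpy(out[n], s + i, 3);
        out[n][3] = '\0';
        n++;
    }
    return n;
}

int
main(void)
{
    char trgms[64][4];
    int  n = extract_trigrams("abcdefghijk", trgms, 64);
    int  nscan = (n > MAX_SCAN_TRGM) ? MAX_SCAN_TRGM : n;

    /* Read posting lists for only the first nscan trigrams; dropping the
     * rest can only add false positives, which the recheck against the
     * original pattern removes.  Ideally the kept trigrams would be the
     * rarest ones, but we have no frequency statistics to pick them by. */
    printf("pattern yields %d trigrams, index scan uses only %d:\n", n, nscan);
    for (int i = 0; i < nscan; i++)
        printf("  \"%s\"\n", trgms[i]);
    return 0;
}

With the cap in place the scan reads fewer posting lists, and the recheck against the original pattern keeps the result exact: dropping trigrams can only add false positives, never lose matches.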
 
> Probably you have some comments on the idea of conversion from pg_wchar to
> multibyte? Is it acceptable at all?

Well, I'm not an expert on encodings, but it seems like a logical
extension of what we're doing right now, so I don't really see why
not.  I'm confused by the diff hunks in pg_mule2wchar_with_len,
though.  Presumably either the old code is right (in which case, don't
change it) or the new code is right (in which case, there's a bug fix
needed here that ought to be discussed and committed separately from
the rest of the patch).  Maybe I am missing something.

Unfortunately, I didn't understand the original logic of pg_mule2wchar_with_len. I just made a proposal for how it could work. I hope somebody more familiar with this code will clarify the situation.
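For the wchar-to-multibyte direction itself, the UTF-8 case at least is mechanical. Below is a minimal standalone sketch, assuming (as holds for UTF-8 in PostgreSQL) that the pg_wchar value is the Unicode code point; it is illustration only, not the patch's code, and MULE and the other server encodings need their own logic.

/* Minimal standalone sketch of wchar-to-multibyte for the UTF-8 case
 * only, assuming the pg_wchar value holds the Unicode code point.
 * Illustration, not the patch's actual code. */
#include <stdio.h>

typedef unsigned int pg_wchar;

/* Encode one code point as UTF-8; returns the number of bytes written. */
static int
wchar_to_utf8(pg_wchar c, unsigned char *out)
{
    if (c <= 0x7F)
    {
        out[0] = (unsigned char) c;
        return 1;
    }
    else if (c <= 0x7FF)
    {
        out[0] = 0xC0 | (c >> 6);
        out[1] = 0x80 | (c & 0x3F);
        return 2;
    }
    else if (c <= 0xFFFF)
    {
        out[0] = 0xE0 | (c >> 12);
        out[1] = 0x80 | ((c >> 6) & 0x3F);
        out[2] = 0x80 | (c & 0x3F);
        return 3;
    }
    else
    {
        out[0] = 0xF0 | (c >> 18);
        out[1] = 0x80 | ((c >> 12) & 0x3F);
        out[2] = 0x80 | ((c >> 6) & 0x3F);
        out[3] = 0x80 | (c & 0x3F);
        return 4;
    }
}

int
main(void)
{
    unsigned char buf[4];
    int n = wchar_to_utf8(0x044F, buf);   /* CYRILLIC SMALL LETTER YA */

    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);          /* prints: D1 8F */
    printf("\n");
    return 0;
}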

------
With best regards,
Alexander Korotkov.
