Re: Patch: add conversion from pg_wchar to multibyte - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Patch: add conversion from pg_wchar to multibyte
Date
Msg-id CAPpHfdtFUmViWynMqO5Et4OXmxX-HREhOGLqDoedezXqh6EJMA@mail.gmail.com
In response to Re: Patch: add conversion from pg_wchar to multibyte  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, May 2, 2012 at 5:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 2, 2012 at 9:35 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Imagine we have two queries:
> 1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
> 2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';
>
> The first query requires reading the posting lists of trigrams "abc" and "bcd".
> The second query requires reading the posting lists of trigrams "abc", "bcd",
> "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
> We could decide to use an index scan for the first query and a sequential scan
> for the second because the number of posting lists to read is high. But that is
> unreasonable, because the second query is actually narrower than the first one.
> We can use the same index scan for it; recheck will remove all false positives.
> When the number of trigrams is high, we can just exclude some of them from the
> index scan. That would be better than simply deciding to do a sequential scan.
> But the question is which trigrams to exclude. Ideally we would keep the rarest
> trigrams to make the index scan cheaper.
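
To make the idea of keeping only the rarest trigrams concrete, here is a minimal sketch in Python. It is not the pg_trgm implementation: the function names, the frequency table, and the limit of 3 are all hypothetical, and the pattern is assumed to be a plain '%literal%' with no wildcards inside the literal.

```python
def like_trigrams(pattern):
    """Trigrams of the literal in a '%literal%' pattern, in order, without duplicates.

    Assumes the only wildcards are the leading and trailing '%'.
    """
    literal = pattern.strip("%")
    seen = []
    for i in range(len(literal) - 2):
        tri = literal[i:i + 3]
        if tri not in seen:
            seen.append(tri)
    return seen


def pick_rarest(trigrams, frequency, limit):
    """Keep at most `limit` trigrams, preferring the lowest estimated frequency."""
    # Assumption: a trigram missing from the table is treated as rare (0).
    ranked = sorted(trigrams, key=lambda t: frequency.get(t, 0))
    return ranked[:limit]


# Hypothetical per-trigram frequency estimates, not real statistics.
frequency = {"abc": 900, "bcd": 40, "cde": 700, "def": 15, "efg": 300,
             "fgh": 5, "ghi": 120, "hij": 60, "ijk": 10}

short = like_trigrams("%abcd%")         # trigrams for the first query
long_ = like_trigrams("%abcdefghijk%")  # trigrams for the second query
chosen = pick_rarest(long_, frequency, 3)
print(short, len(long_), chosen)
```

With these made-up frequencies the second query would scan only three posting lists instead of nine, and recheck would still discard any false positives.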

True.  I guess I was thinking more of the case where you've got
abc|def|ghi|jkl|mno|pqr|stu|vwx|yza|....  There's probably some point
at which it becomes silly to think about using the index.

Yes, such situations are also possible.
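
The difference between the two cases can be sketched with posting lists modelled as Python sets. This is only an illustration with hypothetical data and function names, not how GIN or pg_trgm is implemented. Within one literal the trigrams are ANDed, so dropping one merely widens the candidate set. Across alternatives the branches are ORed, so every branch has to be scanned.

```python
def scan_and(posting, trigrams):
    """Candidate rows for one literal: intersection of its trigrams' posting lists.

    `trigrams` must be non-empty.
    """
    lists = [posting.get(t, set()) for t in trigrams]
    return set.intersection(*lists)


def scan_or(posting, branches):
    """Candidate rows for an alternation: union of the per-branch candidates."""
    result = set()
    for trigrams in branches:
        result |= scan_and(posting, trigrams)
    return result


# Hypothetical posting lists: trigram -> set of row ids.
posting = {"abc": {1, 2, 3}, "bcd": {2, 3}, "def": {4, 5}, "ghi": {6}}

full = scan_and(posting, ["abc", "bcd"])   # all trigrams of 'abcd'
partial = scan_and(posting, ["abc"])       # one trigram dropped
all_branches = scan_or(posting, [["abc"], ["def"], ["ghi"]])
missing_branch = scan_or(posting, [["abc"], ["def"]])

print(full <= partial)                 # dropping an ANDed trigram keeps every match
print(all_branches <= missing_branch)  # dropping an ORed branch can lose matches
```

So the exclusion trick only helps within a single branch; with many alternatives the number of posting lists to read grows with the number of branches.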

>> Well, I'm not an expert on encodings, but it seems like a logical
>> extension of what we're doing right now, so I don't really see why
>> not.  I'm confused by the diff hunks in pg_mule2wchar_with_len,
>> though.  Presumably either the old code is right (in which case, don't
>> change it) or the new code is right (in which case, there's a bug fix
>> needed here that ought to be discussed and committed separately from
>> the rest of the patch).  Maybe I am missing something.
>
> Unfortunately I didn't understand the original logic of pg_mule2wchar_with_len.
> I just proposed how it could work. I hope somebody more familiar
> with this code can clarify the situation.

Well, do you think the current code is buggy, or not?

Probably, but I'm not sure. The conversion seems lossy to me, unless I'm missing something about the mule encoding.

------
With best regards,
Alexander Korotkov.
