Thread: strpos behavior change around empty substring in PG12
Greetings hackers,
Before PG12, select strpos('test', '') returns 1 (empty substring found at first position of the string), whereas starting with PG12 it returns 0 (empty substring not found).
Is this behavior change intentional? If so, it doesn't seem to be documented in the release notes...
First raised by Austin Drenski in https://github.com/npgsql/efcore.pg/pull/1068#issuecomment-546795826
Thanks,
Shay
On Mon, Oct 28, 2019 at 11:02 AM Shay Rojansky <roji@roji.org> wrote: > Before PG12, select strpos('test', '') returns 1 (empty substring found at first position of the string), whereas startingwith PG12 it returns 0 (empty substring not found). > > Is this behavior change intentional? If so, it doesn't seem to be documented in the release notes... > > First raised by Austin Drenski in https://github.com/npgsql/efcore.pg/pull/1068#issuecomment-546795826 It looks to me like this got broken here: commit 9556aa01c69a26ca726d8dda8e395acc7c1e30fc Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> Date: Fri Jan 25 16:25:05 2019 +0200 Use single-byte Boyer-Moore-Horspool search even with multibyte encodings. Not sure what happened exactly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 28, 2019 at 11:02 AM Shay Rojansky <roji@roji.org> wrote: >> Before PG12, select strpos('test', '') returns 1 (empty substring found at first position of the string), whereas startingwith PG12 it returns 0 (empty substring not found). > It looks to me like this got broken here: > commit 9556aa01c69a26ca726d8dda8e395acc7c1e30fc > Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Fri Jan 25 16:25:05 2019 +0200 > Use single-byte Boyer-Moore-Horspool search even with multibyte encodings. > Not sure what happened exactly. I think the problem is lack of clarity about the edge cases. The patch added this short-circuit right at the top of text_position(): + if (VARSIZE_ANY_EXHDR(t1) < 1 || VARSIZE_ANY_EXHDR(t2) < 1) + return 0; and as this example shows, that's the Wrong Thing. Fortunately, it also seems easily fixed. regards, tom lane
On 28/10/2019 17:57, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Oct 28, 2019 at 11:02 AM Shay Rojansky <roji@roji.org> wrote: >>> Before PG12, select strpos('test', '') returns 1 (empty substring found at first position of the string), whereas startingwith PG12 it returns 0 (empty substring not found). > >> It looks to me like this got broken here: > >> commit 9556aa01c69a26ca726d8dda8e395acc7c1e30fc >> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> >> Date: Fri Jan 25 16:25:05 2019 +0200 >> Use single-byte Boyer-Moore-Horspool search even with multibyte encodings. > >> Not sure what happened exactly. > > I think the problem is lack of clarity about the edge cases. > The patch added this short-circuit right at the top of text_position(): > > + if (VARSIZE_ANY_EXHDR(t1) < 1 || VARSIZE_ANY_EXHDR(t2) < 1) > + return 0; > > and as this example shows, that's the Wrong Thing. Fortunately, > it also seems easily fixed. Tom fixed this in commit bd1ef5799b; thanks! To be sure, I also checked the SQL standard for what POSITION('' IN 'test') is supposed to return. It agrees that 1 is correct: > If CHAR_LENGTH(CVE1) is 0 (zero), then the result is 1 (one). - Heikki
Thanks for the quick turnaround!
Tom Lane <tgl@sss.pgh.pa.us> schrieb am Mo., 28. Okt. 2019, 16:57:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Oct 28, 2019 at 11:02 AM Shay Rojansky <roji@roji.org> wrote:
>> Before PG12, select strpos('test', '') returns 1 (empty substring found at first position of the string), whereas starting with PG12 it returns 0 (empty substring not found).
> It looks to me like this got broken here:
> commit 9556aa01c69a26ca726d8dda8e395acc7c1e30fc
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Fri Jan 25 16:25:05 2019 +0200
> Use single-byte Boyer-Moore-Horspool search even with multibyte encodings.
> Not sure what happened exactly.
I think the problem is lack of clarity about the edge cases.
The patch added this short-circuit right at the top of text_position():
+ if (VARSIZE_ANY_EXHDR(t1) < 1 || VARSIZE_ANY_EXHDR(t2) < 1)
+ return 0;
and as this example shows, that's the Wrong Thing. Fortunately,
it also seems easily fixed.
regards, tom lane