Thread: BUG #8821: pg_trgm segfault with Turkish locale database
The following bug has been logged on the website: Bug reference: 8821 Logged by: Ufuk Kayserilioglu Email address: ufuk@paralaus.com PostgreSQL version: 9.3.2 Operating system: Linux - Ubuntu x86_64 Description: PLATFORM: -------- PostgreSQL 9.3.2 installed from precise-pgdg apt repository on Ubuntu 12.04.3 LTS and 13.10 x86_64 SETUP: ----- Given a database with encoding UTF-8, locale tr_TR.UTF-8, 'pg_trgm' extension enabled, and the following setup: CREATE TABLE car_makers (maker TEXT); INSERT INTO car_makers VALUES ('AUDI'), ('MINI'); Run any one of the following commands: - SELECT maker <-> 'MAZDA' FROM car_makers; - SELECT similarity(maker, 'MAZDA') FROM car_makers; - SELECT show_trgm('III'); ACTUAL OUTCOME: -------------- The database will crash and the log will output something along the lines of: LOG: server process (PID 24941) was terminated by signal 11: Segmentation fault DETAIL: Failed process was running: SELECT maker <-> 'MAZDA' FROM car_makers; LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. FATAL: the database system is in recovery mode LOG: all server processes terminated; reinitializing EXPECTED OUTCOME: ---------------- The commands return a result and the database process stays up. ANALYSIS: -------- I initially ran into this issue on a table with more records and have narrowed the case down to rows that have 4 letter words that contain the uppercase I character (that's why only 'AUDI' and 'MINI' in the example). I am not sure why comparison with 'MAZDA' triggers the crash (for example, a comparison with 'MAZDA ' does not), but it seems other 5 char strings also trigger it (eg. 'ABCDE' or '12345'). Also, there are other, longer comparison strings like 'RENAULT-OYAK ' that also trigger it. As I was trying to understand what exactly causes the problem, I tried to understand the trigrams generated by the values in the table and various test strings. I happened on the other crash during that experimentation. As a result, it seems there is a problem with trigrams generation and/or comparison when the Turkish locale is being used. The issue ultimately seems to be linked to the fact that the lowercase equivalent of "I" is "ı" (i without the dot), and the trigram codes does the proper conversion to lowercase which turns the string into a multibyte string. The same problem does NOT exist on Mac OSX 10.9.1 with PostgreSQL installed via Postgres.app. However, OSX has broken tr_TR locale support to begin with (for example, sorting does not work properly.) WORKAROUND: ---------- I was not able to find a reliable workaround since the code that does the lowercase conversion seems to only take into consideration the locale of the working database (which in my case needs to be tr_TR). Specifying an 'en_US' collation on the 'maker' column or an explicit collation specifier in the query do not work.
ufuk@paralaus.com writes: > Given a database with encoding UTF-8, locale tr_TR.UTF-8, 'pg_trgm' > extension enabled, and the following setup: > CREATE TABLE car_makers (maker TEXT); > INSERT INTO car_makers VALUES ('AUDI'), ('MINI'); > Run any one of the following commands: > - SELECT maker <-> 'MAZDA' FROM car_makers; > - SELECT similarity(maker, 'MAZDA') FROM car_makers; > - SELECT show_trgm('III'); > As a result, it seems there is a problem with trigrams generation and/or > comparison when the Turkish locale is being used. It looks like generate_trgm() is not considering the possibility that case-folding will make the string physically longer, so you get a buffer overrun when any of these I-containing strings are converted to trigrams. Will fix, thanks for the report! regards, tom lane
Hi Tom, Thanks for the quick response. Just a quick followup note though; running= : SELECT show=5Ftrgm(maker) =46ROM car=5Fmakers; behaves properly, but trying to make a similarity comparison triggers the= segfault (at least in my case). That's why I suspect there may be 2 rela= ted bugs, one related trigram generation and, maybe, one related to trigr= am comparison. Hope this info will help you as well. Regards, -- Ufuk Kayserilioglu =46rom:=C2=A0Tom Lane Tom Lane Reply:=C2=A0Tom Lane tgl=40sss.pgh.pa.us Date:=C2=A013 January 2014 at 18:09:46 To:=C2=A0ufuk=40paralaus.com ufuk=40paralaus.com Subject:=C2=A0 Re: =5BBUGS=5D BUG =238821: pg=5Ftrgm segfault with Turkis= h locale database =20 ufuk=40paralaus.com writes: =20 > Given a database with encoding UT=46-8, locale tr=5FTR.UT=46-8, 'pg=5Ft= rgm' =20 > extension enabled, and the following setup: =20 > CREATE TABLE car=5Fmakers (maker TEXT); =20 > INSERT INTO car=5Fmakers VALUES ('AUDI'), ('MINI'); =20 > Run any one of the following commands: =20 > - SELECT maker <-> 'MAZDA' =46ROM car=5Fmakers; =20 > - SELECT similarity(maker, 'MAZDA') =46ROM car=5Fmakers; =20 > - SELECT show=5Ftrgm('III'); =20 > As a result, it seems there is a problem with trigrams generation and/o= r =20 > comparison when the Turkish locale is being used. =20 It looks like generate=5Ftrgm() is not considering the possibility that =20 case-folding will make the string physically longer, so you get a buffer = =20 overrun when any of these I-containing strings are converted to trigrams.= =20 Will fix, thanks for the report=21 =20 regards, tom lane =20
Ufuk Kayserilioglu <ufuk@paralaus.com> writes: > Thanks for the quick response. Just a quick followup note though; running: > SELECT show_trgm(maker) FROM car_makers; > behaves properly, but trying to make a similarity comparison triggers the segfault (at least in my case). That's why Isuspect there may be 2 related bugs, one related trigram generation and, maybe, one related to trigram comparison. Hopethis info will help you as well. The given cases all work for me with the committed patch to generate_trgm(), http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c3ccc9ee584b9b015dd9c1931e261e21f3961e5f I'm not going to sit here and claim that there are now zero bugs in pg_trgm, but I'm not seeing evidence of a second issue. Since the bug is a buffer overrun of just a few bytes, whether it results in a crash in any particular trigram operation is hard to predict; it might just stomp on noncritical memory. regards, tom lane
Hi Tom, Thank you so much for the quick turnaround. Indeed, I just built the late= st version from Git and tested it on one of the original test machines an= d all cases are working perfectly. Regards, -- Ufuk Kayserilioglu =46rom:=C2=A0Tom Lane Tom Lane Reply:=C2=A0Tom Lane tgl=40sss.pgh.pa.us Date:=C2=A013 January 2014 at 21:24:31 To:=C2=A0Ufuk Kayserilioglu ufuk=40paralaus.com Subject:=C2=A0 Re: =5BBUGS=5D BUG =238821: pg=5Ftrgm segfault with Turkis= h locale database =20 Ufuk Kayserilioglu <ufuk=40paralaus.com> writes: =20 > Thanks for the quick response. Just a quick followup note though; runni= ng: =20 > SELECT show=5Ftrgm(maker) =46ROM car=5Fmakers; =20 > behaves properly, but trying to make a similarity comparison triggers t= he segfault (at least in my case). That's why I suspect there may be 2 re= lated bugs, one related trigram generation and, maybe, one related to tri= gram comparison. Hope this info will help you as well. =20 The given cases all work for me with the committed patch to generate=5Ftr= gm(), =20 http://git.postgresql.org/gitweb/=3Fp=3Dpostgresql.git;a=3Dcommitdiff;h=3D= c3ccc9ee584b9b015dd9c1931e261e21f3961e5f =20 I'm not going to sit here and claim that there are now zero bugs in =20 pg=5Ftrgm, but I'm not seeing evidence of a second issue. Since the bug i= s =20 a buffer overrun of just a few bytes, whether it results in a crash in an= y =20 particular trigram operation is hard to predict; it might just stomp on =20 noncritical memory. =20 regards, tom lane =20