Thread: Distinct oddity
Hi there, does this look right? FAKDB=# select count(distinct(f.land)) from firmen f where f.typlist='Redaktion';count ------- 1975 (1 row) FAKDB=# select count(distinct(f.land||'1')) from firmen f where f.typlist='Redaktion';count ------- 4944 (1 row) FAKDB=# select version(); version ---------------------------------------------------------------------------- ----------------------------------------------------------PostgreSQL 8.3.3 on powerpc-apple-darwin, compiled by GCC i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5370) (1 row) Why would the 2 queries give different results? No inserts occurred in the meantime. tia, Maximilian Tyrtania
Hello, I wonder if someone has an idea for this problem: I have a string that contains a serie of chars, separated by single spaces. e.g 'a b x n r a b c b' Having such a string, I d'like to get a list of all predecessors of a given character. In the example, the predecessors of b are a,a,c. If I now have the string 'a a a', the predecessors of 'a' are a,a I tried to use regexp_matches for this: select regexp_matches('a a a', '([a-z]) a','g'); => {"a "} only As the second parameter of the function matches the first 2 'a', only the trailing ' a' will be used to seek for further matching... Cheers, Marc Mamin
On Thu, May 7, 2009 at 2:21 AM, Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> wrote: > Hi there, > > does this look right? > > FAKDB=# select count(distinct(f.land)) from firmen f where > f.typlist='Redaktion'; > count > ------- > 1975 > (1 row) > > FAKDB=# select count(distinct(f.land||'1')) from firmen f where > f.typlist='Redaktion'; > count > ------- > 4944 > (1 row) Yeah, that does seem odd. Could it be something like nulls in your data set? just guessing really. If you could make a small test case that shows it happening and allows others to reproduce it you're likely to get more bites.
am 07.05.2009 20:54 Uhr schrieb Scott Marlowe unter scott.marlowe@gmail.com: > On Thu, May 7, 2009 at 2:21 AM, Maximilian Tyrtania > <maximilian.tyrtania@onlinehome.de> wrote: >> Hi there, >> >> does this look right? >> >> FAKDB=# select count(distinct(f.land)) from firmen f where >> f.typlist='Redaktion'; >> count >> ------- >> 1975 >> (1 row) >> >> FAKDB=# select count(distinct(f.land||'1')) from firmen f where >> f.typlist='Redaktion'; >> count >> ------- >> 4944 >> (1 row) > > Yeah, that does seem odd. Could it be something like nulls in your > data set? just guessing really. If you could make a small test case > that shows it happening and allows others to reproduce it you're > likely to get more bites. It doesn't seem to be related to null values (which wouldn't explain it anyway) nor to this particular field... FAKDB=# select count(*) from firmen where bezeichnung is null;count ------- 0 (1 row) FAKDB=# select count(distinct(f.bezeichnung)) from firmen f;count -------72698 (1 row) FAKDB=# select count(distinct(f.bezeichnung||'e')) from firmen f;count -------72892 (1 row) My attempts at reproducing this with a freshly created table failed, of course. FAKDB=# create table concattest(mytext text); CREATE TABLE FAKDB=# insert into concattest (mytext) select(generate_series(1,10000)::text); INSERT 0 10000 FAKDB=# insert into concattest (mytext) select(generate_series(1,10000)::text); INSERT 0 10000 FAKDB=# select count(distinct(mytext)) from concattest;count -------10000 (1 row) FAKDB=# select count(distinct(mytext||'2')) from concattest;count -------10000 (1 row) best, Maximilian Tyrtania
On Fri, May 8, 2009 at 3:28 AM, Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> wrote: > am 07.05.2009 20:54 Uhr schrieb Scott Marlowe unter scott.marlowe@gmail.com: > >> On Thu, May 7, 2009 at 2:21 AM, Maximilian Tyrtania >> <maximilian.tyrtania@onlinehome.de> wrote: >>> Hi there, >>> >>> does this look right? >>> >>> FAKDB=# select count(distinct(f.land)) from firmen f where >>> f.typlist='Redaktion'; >>> count >>> ------- >>> 1975 >>> (1 row) >>> >>> FAKDB=# select count(distinct(f.land||'1')) from firmen f where >>> f.typlist='Redaktion'; >>> count >>> ------- >>> 4944 >>> (1 row) >> >> Yeah, that does seem odd. Could it be something like nulls in your >> data set? just guessing really. If you could make a small test case >> that shows it happening and allows others to reproduce it you're >> likely to get more bites. > > It doesn't seem to be related to null values (which wouldn't explain it > anyway) nor to this particular field... > > FAKDB=# select count(*) from firmen where bezeichnung is null; > count > ------- > 0 > (1 row) That's not the same field as in the original query. > My attempts at reproducing this with a freshly created table failed, of > course. Instead of trying to create a test case from scratch, isolate some rows that cause this, put them in another table, and then pg_dump that one table, cleaned as needed for privacy, here.
Is firmen a table or a view?
From: Scott Marlowe <scott.marlowe@gmail.com>
To: Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de>
Cc: pgsql-sql@postgresql.org
Sent: Friday, May 8, 2009 5:35:21 AM
Subject: Re: [SQL] Distinct oddity
On Fri, May 8, 2009 at 3:28 AM, Maximilian Tyrtania
<maximilian.tyrtania@onlinehome.de> wrote:
> am 07.05.2009 20:54 Uhr schrieb Scott Marlowe unter scott.marlowe@gmail.com:
>
>> On Thu, May 7, 2009 at 2:21 AM, Maximilian Tyrtania
>> <maximilian.tyrtania@onlinehome.de> wrote:
>>> Hi there,
>>>
>>> does this look right?
>>>
>>> FAKDB=# select count(distinct(f.land)) from firmen f where
>>> f.typlist='Redaktion';
>>> count
>>> -------
>>> 1975
>>> (1 row)
>>>
>>> FAKDB=# select count(distinct(f.land||'1')) from firmen f where
>>> f.typlist='Redaktion';
>>> count
>>> -------
>>> 4944
>>> (1 row)
>>
>> Yeah, that does seem odd. Could it be something like nulls in your
>> data set? just guessing really. If you could make a small test case
>> that shows it happening and allows others to reproduce it you're
>> likely to get more bites.
>
> It doesn't seem to be related to null values (which wouldn't explain it
> anyway) nor to this particular field...
>
> FAKDB=# select count(*) from firmen where bezeichnung is null;
> count
> -------
> 0
> (1 row)
That's not the same field as in the original query.
> My attempts at reproducing this with a freshly created table failed, of
> course.
Instead of trying to create a test case from scratch, isolate some
rows that cause this, put them in another table, and then pg_dump that
one table, cleaned as needed for privacy, here.
--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql
Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> writes: > am 07.05.2009 20:54 Uhr schrieb Scott Marlowe unter scott.marlowe@gmail.com: >> Yeah, that does seem odd. Could it be something like nulls in your >> data set? just guessing really. If you could make a small test case >> that shows it happening and allows others to reproduce it you're >> likely to get more bites. > It doesn't seem to be related to null values (which wouldn't explain it > anyway) nor to this particular field... Can you reproduce it in other contexts than specifically count(distinct)? In particular I'd try select count(*) from (select distinct f.bezeichnung from firmen f) ss; select count(*) from (select distinct f.bezeichnung||'e' from firmen f) ss; If those give the same numbers as you're showing here, then the next step would be to dump out the actual results of the SELECT DISTINCT queries and compare them --- looking at the actual data values should give some insight as to what's happening. BTW, what is the datatype of f.bezeichnung, and what locale are you running in? regards, tom lane
am 08.05.2009 16:55 Uhr schrieb Rob Sargent unter robsargent@rocketmail.com: >Is firmen a table or a view? It's a table. am 08.05.2009 21:52 Uhr schrieb Tom Lane unter tgl@sss.pgh.pa.us: >> It doesn't seem to be related to null values (which wouldn't explain it >> anyway) nor to this particular field... > > Can you reproduce it in other contexts than specifically count(distinct)? > In particular I'd try > > select count(*) from > (select distinct f.bezeichnung from firmen f) ss; FAKDB=# select count(*) from FAKDB-# (select distinct f.bezeichnung from firmen f) ss;count -------73437 (1 row) > > select count(*) from > (select distinct f.bezeichnung||'e' from firmen f) ss; FAKDB=# select count(*) from (select distinct f.bezeichnung||'e' from firmen f) ss;count -------72535 (1 row) > If those give the same numbers as you're showing here, then the > next step would be to dump out the actual results of the SELECT DISTINCT > queries and compare them --- looking at the actual data values should > give some insight as to what's happening. FAKDB=# select distinct f.bezeichnung from firmen f order by 1 limit 5 FAKDB-# ; bezeichnung -----------------------------------------sterreichisches Verkehrsbro AG\x01Assistenz\x10Frohstoff Design & Textilveredelung"1.Mittelschule ""Am Kupferberg""" (5 rows) FAKDB=# select distinct f.bezeichnung||'e' from firmen f order by 1 limit 5 ; ?column? ------------------------------------------Österreich/Welt (Ltg.)esterreichisches Verkehrsbro AGe\x01Assistenze\x10FrohstoffDesign & Textilveredelunge"1. Mittelschule ""Am Kupferberg"""e (5 rows) Aha, the "Österreich/Welt (Ltg.)"-entry is missing in the 1st query. So that does smell like a locale problem. > BTW, what is the datatype of f.bezeichnung, It's character varying(255). Just for the record: FAKDB=# explain analyze select distinct f.bezeichnung||'e' from firmen f order by 1 limit 5; QUERY PLAN ---------------------------------------------------------------------------- ---------------------------------------------------------Limit (cost=16173.07..16174.08 rows=5 width=18) (actual time=1667.841..1667.855 rows=5 loops=1) -> Unique (cost=16173.07..16933.66 rows=3765 width=18) (actual time=1667.839..1667.851 rows=5 loops=1) -> Sort (cost=16173.07..16553.36 rows=152117 width=18) (actual time=1667.837..1667.844 rows=5 loops=1) Sort Key: (((bezeichnung)::text || 'e'::text)) Sort Method: external merge Disk: 4640kB -> Seq Scan on firmen f (cost=0.00..13646.46 rows=152117 width=18) (actual time=0.069..353.777 rows=152118 loops=1)Total runtime: 1669.998 ms (7 rows) > and what locale are you > running in? lc_collate | de_DE | Shows the collation order locale.lc_ctype | de_DE | Shows the character classification and case conversion locale. The encoding is UTF-8. Best, Maximilian Tyrtania
Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> writes: > FAKDB=# select distinct f.bezeichnung from firmen f order by 1 limit 5 > FAKDB-# ; > bezeichnung > ----------------------------------------- > �sterreichisches Verkehrsb�ro AG > \x01Assistenz > \x10Frohstoff Design & Textilveredelung > "1. Mittelschule ""Am Kupferberg""" > (5 rows) > FAKDB=# select distinct f.bezeichnung||'e' from firmen f order by 1 limit 5 > ; > ?column? > ------------------------------------------ > �sterreich/Welt (Ltg.)e > �sterreichisches Verkehrsb�ro AGe > \x01Assistenze > \x10Frohstoff Design & Textilveredelunge > "1. Mittelschule ""Am Kupferberg"""e > (5 rows) > Aha, the "�sterreich/Welt (Ltg.)"-entry is missing in the 1st query. So that > does smell like a locale problem. That only proves that adding the 'e' changes the sort order, which is completely unsurprising for any non-C locale. What you need to do is dump out the *entire* results of the DISTINCT queries and look for the unmatched lines. I'd try dumping to two files, stripping the 'e' with sed, and then sort/diff. regards, tom lane
On Sat, May 9, 2009 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > That only proves that adding the 'e' changes the sort order, which is > completely unsurprising for any non-C locale. What you need to do is > dump out the *entire* results of the DISTINCT queries and look for the > unmatched lines. I'd try dumping to two files, stripping the 'e' with > sed, and then sort/diff. How could adding an "e" change the sorting of "Österreich/Welt (Ltg.)" compared to "Šsterreichisches Verkehrsb ro AG" in de_DE or en_US (or any locale)? It's also odd that the "1. Mittelschule ..." line is getting sorted after those. -- Glenn Maynard
Maximilian Tyrtania wrote: > am 08.05.2009 16:55 Uhr schrieb Rob Sargent unter robsargent@rocketmail.com: > > and what locale are you running in? > > lc_collate | de_DE > | Shows the collation order locale. > lc_ctype | de_DE > | Shows the character classification and case conversion locale. > > The encoding is UTF-8. Note that the de_DE locale uses Latin9 encoding, which is incompatible with UTF8. I'd try checking if the problem is reproducible in de_DE.utf8 (you need to create a new database for testing, obviously). If it's not, then the incompatible locale definition is causing the problem. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
am 10.05.2009 4:58 Uhr schrieb Alvaro Herrera unter alvherre@commandprompt.com: >>> and what locale are you running in? >> >> lc_collate | de_DE >> | Shows the collation order locale. >> lc_ctype | de_DE >> | Shows the character classification and case conversion locale. >> >> The encoding is UTF-8. > > Note that the de_DE locale uses Latin9 encoding, which is incompatible > with UTF8. Ah, good catch. > I'd try checking if the problem is reproducible in > de_DE.utf8 (you need to create a new database for testing, obviously). Wait a minute. I need to re- initdb with de_DE.UTF-8, don't I? > If it's not, then the incompatible locale definition is causing the > problem. I'll try that, thanks. M
Maximilian Tyrtania wrote: > am 10.05.2009 4:58 Uhr schrieb Alvaro Herrera unter > alvherre@commandprompt.com: > > I'd try checking if the problem is reproducible in > > de_DE.utf8 (you need to create a new database for testing, obviously). > > Wait a minute. I need to re- initdb with de_DE.UTF-8, don't I? Well, either that, or create a new database with Latin9 encoding. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Maximilian Tyrtania wrote: > am 11.05.2009 16:38 Uhr schrieb Alvaro Herrera unter > alvherre@commandprompt.com: > > >>>Note that the de_DE locale uses Latin9 encoding, which is incompatible > >>>with UTF8. > >>> I'd try checking if the problem is reproducible in > >>> de_DE.utf8 (you need to create a new database for testing, obviously). > >> > >> Wait a minute. I need to re- initdb with de_DE.UTF-8, don't I? > > > > Well, either that, or create a new database with Latin9 encoding. > > FAKDB=# CREATE DATABASE "TestLatin9" > FAKDB-# WITH ENCODING='LATIN9' > FAKDB-# OWNER=postgres; > ERROR: encoding LATIN9 does not match server's locale de_DE > DETAIL: The server's LC_CTYPE setting requires encoding UTF8. > FAKDB=# > > Now i'm deeply confused...So do i have to re-initdb? Hmm, I didn't expect this. I guess I assumed de_DE was an alias for the Latin1- or Latin9- encoded locale, but it seems your system uses it as an alias for the UTF-8 encoded one. So my initial comment seems to be wrong as well. Please paste the output of the "locale" command. What platform are you using anyway? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
am 11.05.2009 16:38 Uhr schrieb Alvaro Herrera unter alvherre@commandprompt.com: >>>Note that the de_DE locale uses Latin9 encoding, which is incompatible >>>with UTF8. >>> I'd try checking if the problem is reproducible in >>> de_DE.utf8 (you need to create a new database for testing, obviously). >> >> Wait a minute. I need to re- initdb with de_DE.UTF-8, don't I? > > Well, either that, or create a new database with Latin9 encoding. FAKDB=# CREATE DATABASE "TestLatin9" FAKDB-# WITH ENCODING='LATIN9' FAKDB-# OWNER=postgres; ERROR: encoding LATIN9 does not match server's locale de_DE DETAIL: The server's LC_CTYPE setting requires encoding UTF8. FAKDB=# Now i'm deeply confused...So do i have to re-initdb? Best, Maximilian Tyrtania
am 12.05.2009 19:23 Uhr schrieb Alvaro Herrera unter alvherre@commandprompt.com: >> FAKDB=# CREATE DATABASE "TestLatin9" >> FAKDB-# WITH ENCODING='LATIN9' >> FAKDB-# OWNER=postgres; >> ERROR: encoding LATIN9 does not match server's locale de_DE >> DETAIL: The server's LC_CTYPE setting requires encoding UTF8. >> FAKDB=# >> >> Now i'm deeply confused...So do i have to re-initdb? > > Hmm, I didn't expect this. I guess I assumed de_DE was an alias for > the Latin1- or Latin9- encoded locale, but it seems your system uses it > as an alias for the UTF-8 encoded one. So my initial comment seems to > be wrong as well. Please paste the output of the "locale" command. Sputnik-Server:~ administrator$ locale LANG= LC_COLLATE="C" LC_CTYPE="C" LC_MESSAGES="C" LC_MONETARY="C" LC_NUMERIC="C" LC_TIME="C" LC_ALL="C" > What platform are you using anyway? Mac OS 10.4.11 Best, Maximilian Tyrtania
am 09.05.2009 16:33 Uhr schrieb Tom Lane unter tgl@sss.pgh.pa.us: > What you need to do is > dump out the *entire* results of the DISTINCT queries and look for the > unmatched lines. I'd try dumping to two files, stripping the 'e' with > sed, and then sort/diff. Okay, that's what I did, and the results are, well, surprising (for me, mind you). FAKDB=# \o /withapp FAKDB=# select distinct f.bezeichnung||'$' from firmen f order by 1; FAKDB=# \o /withoutapp FAKDB=# select distinct f.bezeichnung from firmen f order by 1; Opened those files (with textwrangler, as I 've never used sed), stripped off the '$', sorted and looked at the differences (using textwranglers "compare documents"-feature). The file "withoutapp" has those lines: "Abendschau""Abendschau" They don't look distinct to me. The "Abendschau"-Entry appears only once in "withapp". But then again the "withapp" file has these entries: Adformatie Adformatie which in turn only appear once in "withoutapp". I turned "show invisibles" on in textwrangler, so I don't think there is some gremlin business going on. Actually I'm interested in but in no way dependant on what's going on here, I just thought as a good pg-citizen I should report whatever might be wrong. Best, Maximilian Tyrtania
Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> writes: > am 12.05.2009 19:23 Uhr schrieb Alvaro Herrera unter > alvherre@commandprompt.com: >> What platform are you using anyway? > Mac OS 10.4.11 I have some vague recollection that UTF8-using locales don't actually work well on OSX ... check the archives ... regards, tom lane
I wrote: > Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> writes: >> am 12.05.2009 19:23 Uhr schrieb Alvaro Herrera unter >> alvherre@commandprompt.com: >>> What platform are you using anyway? >> Mac OS 10.4.11 > I have some vague recollection that UTF8-using locales don't actually > work well on OSX ... check the archives ... OK, the thread (or one of the threads) I was remembering is here: http://archives.postgresql.org//pgsql-general/2005-11/msg00047.php I am too lazy to boot up 10.4 right now, but looking on a 10.5.6 machine indicates that Apple is still being pretty lame about this: $ ls -l /usr/share/locale/de_DE total 40 lrwxr-xr-x 1 root wheel 28 Feb 27 2008 LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE lrwxr-xr-x 1 root wheel 17 Feb 27 2008 LC_CTYPE -> ../UTF-8/LC_CTYPE drwxr-xr-x 3 root wheel 102 Feb 27 2008 LC_MESSAGES lrwxr-xr-x 1 root wheel 30 Feb 27 2008 LC_MONETARY -> ../de_DE.ISO8859-1/LC_MONETARY lrwxr-xr-x 1 root wheel 29 Feb 27 2008 LC_NUMERIC -> ../de_DE.ISO8859-1/LC_NUMERIC -r--r--r-- 1 root wheel 370 Jan 2 2008 LC_TIME So it looks like they understand UTF-8 to the extent of supporting character classification fairly well, but sort order is "just ASCII". I'm not sure exactly how that might result in the observed odd behavior of DISTINCT, but I bet it's causing it somehow. You'd probably have better luck in the de_DE.ISO8859-1 or de_DE.ISO8859-15 locales. regards, tom lane
For purposes of DISTINCT, I'd expect any sort order should do; all it needs is for equal values to be grouped together. If strcoll() ever fails to do that, I'd call it a critical bug--even throwing total garbage at it should result in a consistent ordering, even if the ordering itself is totally meaningless. Many sort functions depend on this. On Wed, May 13, 2009 at 8:37 AM, Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> wrote: > Opened those files (with textwrangler, as I 've never used sed), stripped > off the '$', sorted and looked at the differences (using textwranglers > "compare documents"-feature). Can you narrow down what triggers this? Try copying off the table, and running this: ****** CREATE FUNCTION is_inconsistent() RETURNS BOOLEAN LANGUAGE SQL AS $$select (select count(distinct(f.bezeichnung)) fromfirmen_copy f) <> (select count(distinct(f.bezeichnung||'1')) from firmen_copy f) $$; -- If deleting [first,last] leaves the results inconsistent, return true; otherwise -- roll back the deletion and return false. CREATE FUNCTION test_delete_range(first BIGINT, last BIGINT) RETURNS BOOLEAN LANGUAGE plpgsql AS $$ BEGIN DELETE FROM firmen_copy WHERE id BETWEEN first AND last; IF is_inconsistent() THEN RETURN true; END IF; SELECT 1/0; EXCEPTION WHEN division_by_zero THEN RETURN false; END; $$; CREATE FUNCTION test_func() RETURNS INTEGER LANGUAGE plpgsql AS $$ DECLARE total bigint; BEGIN IF NOT is_inconsistent() THEN RETURN -1; END IF; LOOP total := (SELECT MAX(id) FROM firmen_copy); IF test_delete_range(0, total/2) THEN CONTINUE; END IF; IF test_delete_range(total*1/4, total*3/4)THEN CONTINUE; END IF; IF test_delete_range(total/2, total) THEN CONTINUE; END IF; RETURN 0; END LOOP; END; $$; SELECT test_func(); ****** This assumes you have a primary key named "id", and that your IDs start around 0 and are vaguely monotonic (renumber them in the copy if necessary). I can't easily test this code, of course, but it's a simple binary search. Depending on what's triggering this, it may or may not be able to narrow in on a test case. Tangentally, is there a better way of rolling back a function than a dumb hack like "SELECT 1/0"? -- Glenn Maynard
Glenn Maynard <glennfmaynard@gmail.com> writes: > For purposes of DISTINCT, I'd expect any sort order should do; all it > needs is for equal values to be grouped together. If strcoll() ever > fails to do that, I'd call it a critical bug--even throwing total > garbage at it should result in a consistent ordering, even if the > ordering itself is totally meaningless. Feel free to complain to the libc authors for your platform --- there are quite a number of platforms where we know that strcoll does in fact return garbage (as in inconsistent results) when given garbage input (which in this case typically means data that's invalid according to what strcoll thinks the encoding is). That's one reason why we've been tightening up the encoding-validity enforcement in recent PG releases. It's not entirely clear to me how things are going wrong in Maximilian's particular example, but I suspect that the root of it is a problem of this type. > Tangentally, is there a better way of rolling back a function than a > dumb hack like "SELECT 1/0"? RAISE ERROR? regards, tom lane