Thread: Distinct oddity

Distinct oddity

From

Maximilian Tyrtania

Date:

07 May 2009, 05:26:48

Hi there,

does this look right?

FAKDB=# select count(distinct(f.land)) from firmen f where
f.typlist='Redaktion';count 
------- 1975
(1 row)

FAKDB=# select count(distinct(f.land||'1')) from firmen f where
f.typlist='Redaktion';count 
------- 4944
(1 row)

FAKDB=# select version();                                                              version
----------------------------------------------------------------------------
----------------------------------------------------------PostgreSQL 8.3.3 on powerpc-apple-darwin, compiled by GCC
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5370)
(1 row)

Why would the 2 queries give different results? No inserts occurred in the
meantime.

tia,

Maximilian Tyrtania

select regexp_matches('a a a', '([a-z]) a','g');

From

"Marc Mamin"

Date:

07 May 2009, 12:19:59

Hello,

I wonder if someone has an idea for this problem:

I have a string that contains a serie of chars, separated by single
spaces.

e.g 'a b x n r a b c b'

Having such a string, I d'like to get a list of all predecessors of a
given character.
In the example, the predecessors of b  are a,a,c.

If I now have the string 'a a a', the predecessors of 'a' are a,a

I tried to use regexp_matches for this:

select regexp_matches('a a a', '([a-z]) a','g');
=> {"a "} only

As the second parameter of the function matches the first 2 'a',
only the trailing ' a' will be used to seek for further matching...

Cheers,

Marc Mamin

Re: Distinct oddity

From

Scott Marlowe

Date:

07 May 2009, 15:54:14

On Thu, May 7, 2009 at 2:21 AM, Maximilian Tyrtania
<maximilian.tyrtania@onlinehome.de> wrote:
> Hi there,
>
> does this look right?
>
> FAKDB=# select count(distinct(f.land)) from firmen f where
> f.typlist='Redaktion';
>  count
> -------
>  1975
> (1 row)
>
> FAKDB=# select count(distinct(f.land||'1')) from firmen f where
> f.typlist='Redaktion';
>  count
> -------
>  4944
> (1 row)

Yeah, that does seem odd.  Could it be something like nulls in your
data set?  just guessing really.  If you could make a small test case
that shows it happening and allows others to reproduce it you're
likely to get more bites.

Re: Distinct oddity

From

Maximilian Tyrtania

Date:

08 May 2009, 06:28:57

am 07.05.2009 20:54 Uhr schrieb Scott Marlowe unter scott.marlowe@gmail.com:

> On Thu, May 7, 2009 at 2:21 AM, Maximilian Tyrtania
> <maximilian.tyrtania@onlinehome.de> wrote:
>> Hi there,
>>
>> does this look right?
>>
>> FAKDB=# select count(distinct(f.land)) from firmen f where
>> f.typlist='Redaktion';
>>  count
>> -------
>>  1975
>> (1 row)
>>
>> FAKDB=# select count(distinct(f.land||'1')) from firmen f where
>> f.typlist='Redaktion';
>>  count
>> -------
>>  4944
>> (1 row)
>
> Yeah, that does seem odd.  Could it be something like nulls in your
> data set?  just guessing really.  If you could make a small test case
> that shows it happening and allows others to reproduce it you're
> likely to get more bites.

It doesn't seem to be related to null values (which wouldn't explain it
anyway) nor to this particular field...

FAKDB=# select count(*) from firmen where bezeichnung is null;count
-------    0
(1 row)

FAKDB=# select count(distinct(f.bezeichnung)) from firmen f;count
-------72698
(1 row)

FAKDB=# select count(distinct(f.bezeichnung||'e')) from firmen f;count
-------72892
(1 row)

My attempts at reproducing this with a freshly created table failed, of
course.

FAKDB=# create table concattest(mytext text);
CREATE TABLE
FAKDB=# insert into concattest (mytext)
select(generate_series(1,10000)::text);
INSERT 0 10000
FAKDB=# insert into concattest (mytext)
select(generate_series(1,10000)::text);
INSERT 0 10000
FAKDB=# select count(distinct(mytext)) from concattest;count
-------10000
(1 row)

FAKDB=# select count(distinct(mytext||'2')) from concattest;count
-------10000
(1 row)

best,

Maximilian Tyrtania

Re: Distinct oddity

From

Scott Marlowe

Date:

08 May 2009, 08:35:29

On Fri, May 8, 2009 at 3:28 AM, Maximilian Tyrtania
<maximilian.tyrtania@onlinehome.de> wrote:
> am 07.05.2009 20:54 Uhr schrieb Scott Marlowe unter scott.marlowe@gmail.com:
>
>> On Thu, May 7, 2009 at 2:21 AM, Maximilian Tyrtania
>> <maximilian.tyrtania@onlinehome.de> wrote:
>>> Hi there,
>>>
>>> does this look right?
>>>
>>> FAKDB=# select count(distinct(f.land)) from firmen f where
>>> f.typlist='Redaktion';
>>>  count
>>> -------
>>>  1975
>>> (1 row)
>>>
>>> FAKDB=# select count(distinct(f.land||'1')) from firmen f where
>>> f.typlist='Redaktion';
>>>  count
>>> -------
>>>  4944
>>> (1 row)
>>
>> Yeah, that does seem odd.  Could it be something like nulls in your
>> data set?  just guessing really.  If you could make a small test case
>> that shows it happening and allows others to reproduce it you're
>> likely to get more bites.
>
> It doesn't seem to be related to null values (which wouldn't explain it
> anyway) nor to this particular field...
>
> FAKDB=# select count(*) from firmen where bezeichnung is null;
>  count
> -------
>     0
> (1 row)

That's not the same field as in the original query.


> My attempts at reproducing this with a freshly created table failed, of
> course.

Instead of trying to create a test case from scratch, isolate some
rows that cause this, put them in another table, and then pg_dump that
one table, cleaned as needed for privacy, here.

Re: Distinct oddity

From

Rob Sargent

Date:

08 May 2009, 12:03:05

Is firmen a table or a view?

From: Scott Marlowe <scott.marlowe@gmail.com>
To: Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de>
Cc: pgsql-sql@postgresql.org
Sent: Friday, May 8, 2009 5:35:21 AM
Subject: Re: [SQL] Distinct oddity

On Fri, May 8, 2009 at 3:28 AM, Maximilian Tyrtania
<maximilian.tyrtania@onlinehome.de> wrote:
> am 07.05.2009 20:54 Uhr schrieb Scott Marlowe unter scott.marlowe@gmail.com:
>
>> On Thu, May 7, 2009 at 2:21 AM, Maximilian Tyrtania
>> <maximilian.tyrtania@onlinehome.de> wrote:
>>> Hi there,
>>>
>>> does this look right?
>>>
>>> FAKDB=# select count(distinct(f.land)) from firmen f where
>>> f.typlist='Redaktion';
>>> count
>>> -------
>>> 1975
>>> (1 row)
>>>
>>> FAKDB=# select count(distinct(f.land||'1')) from firmen f where
>>> f.typlist='Redaktion';
>>> count
>>> -------
>>> 4944
>>> (1 row)
>>
>> Yeah, that does seem odd. Could it be something like nulls in your
>> data set? just guessing really. If you could make a small test case
>> that shows it happening and allows others to reproduce it you're
>> likely to get more bites.
>
> It doesn't seem to be related to null values (which wouldn't explain it
> anyway) nor to this particular field...
>
> FAKDB=# select count(*) from firmen where bezeichnung is null;
> count
> -------
> 0
> (1 row)

That's not the same field as in the original query.

> My attempts at reproducing this with a freshly created table failed, of
> course.

Instead of trying to create a test case from scratch, isolate some
rows that cause this, put them in another table, and then pg_dump that
one table, cleaned as needed for privacy, here.

--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Re: Distinct oddity

From

Tom Lane

Date:

08 May 2009, 16:52:50

Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> writes:
> am 07.05.2009 20:54 Uhr schrieb Scott Marlowe unter scott.marlowe@gmail.com:
>> Yeah, that does seem odd.  Could it be something like nulls in your
>> data set?  just guessing really.  If you could make a small test case
>> that shows it happening and allows others to reproduce it you're
>> likely to get more bites.

> It doesn't seem to be related to null values (which wouldn't explain it
> anyway) nor to this particular field...

Can you reproduce it in other contexts than specifically count(distinct)?
In particular I'd try
select count(*) from  (select distinct f.bezeichnung from firmen f) ss;
select count(*) from  (select distinct f.bezeichnung||'e' from firmen f) ss;

If those give the same numbers as you're showing here, then the
next step would be to dump out the actual results of the SELECT DISTINCT
queries and compare them --- looking at the actual data values should
give some insight as to what's happening.

BTW, what is the datatype of f.bezeichnung, and what locale are you
running in?
        regards, tom lane

Re: Distinct oddity

From

Maximilian Tyrtania

Date:

09 May 2009, 04:22:44

am 08.05.2009 16:55 Uhr schrieb Rob Sargent unter robsargent@rocketmail.com:

>Is firmen a table or a view?

It's a table.

am 08.05.2009 21:52 Uhr schrieb Tom Lane unter tgl@sss.pgh.pa.us:

>> It doesn't seem to be related to null values (which wouldn't explain it
>> anyway) nor to this particular field...
>
> Can you reproduce it in other contexts than specifically count(distinct)?
> In particular I'd try
>
> select count(*) from
>  (select distinct f.bezeichnung from firmen f) ss;

FAKDB=# select count(*) from
FAKDB-#   (select distinct f.bezeichnung from firmen f) ss;count
-------73437
(1 row)

>
> select count(*) from
>  (select distinct f.bezeichnung||'e' from firmen f) ss;

FAKDB=# select count(*) from
(select distinct f.bezeichnung||'e' from firmen f) ss;count
-------72535
(1 row)

> If those give the same numbers as you're showing here, then the
> next step would be to dump out the actual results of the SELECT DISTINCT
> queries and compare them --- looking at the actual data values should
> give some insight as to what's happening.

FAKDB=# select distinct f.bezeichnung from firmen f order by 1 limit 5
FAKDB-# ;              bezeichnung
-----------------------------------------sterreichisches Verkehrsbro AG\x01Assistenz\x10Frohstoff Design &
Textilveredelung"1.Mittelschule ""Am Kupferberg""" 
(5 rows)


FAKDB=# select distinct f.bezeichnung||'e' from firmen f order by 1 limit 5
;                ?column?
------------------------------------------Österreich/Welt (Ltg.)esterreichisches Verkehrsbro
AGe\x01Assistenze\x10FrohstoffDesign & Textilveredelunge"1. Mittelschule ""Am Kupferberg"""e 
(5 rows)


Aha, the "Österreich/Welt (Ltg.)"-entry is missing in the 1st query. So that
does smell like a locale problem.

> BTW, what is the datatype of f.bezeichnung,

It's character varying(255). Just for the record:

FAKDB=# explain analyze select distinct f.bezeichnung||'e' from firmen f
order by 1 limit 5;                                                            QUERY PLAN
----------------------------------------------------------------------------
---------------------------------------------------------Limit  (cost=16173.07..16174.08 rows=5 width=18) (actual
time=1667.841..1667.855 rows=5 loops=1)  ->  Unique  (cost=16173.07..16933.66 rows=3765 width=18) (actual
time=1667.839..1667.851 rows=5 loops=1)        ->  Sort  (cost=16173.07..16553.36 rows=152117 width=18) (actual
time=1667.837..1667.844 rows=5 loops=1)              Sort Key: (((bezeichnung)::text || 'e'::text))              Sort
Method: external merge  Disk: 4640kB              ->  Seq Scan on firmen f  (cost=0.00..13646.46 rows=152117 
width=18) (actual time=0.069..353.777 rows=152118 loops=1)Total runtime: 1669.998 ms
(7 rows)

> and what locale are you
> running in?
lc_collate                      | de_DE
| Shows the collation order locale.lc_ctype                        | de_DE
| Shows the character classification and case conversion locale.

The encoding is UTF-8.

Best,

Maximilian Tyrtania

Re: Distinct oddity

From

Tom Lane

Date:

09 May 2009, 11:33:58

Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> writes:
> FAKDB=# select distinct f.bezeichnung from firmen f order by 1 limit 5
> FAKDB-# ;
>                bezeichnung
> -----------------------------------------
>  �sterreichisches Verkehrsb�ro AG
>  \x01Assistenz
>  \x10Frohstoff Design & Textilveredelung
>  "1. Mittelschule ""Am Kupferberg"""
> (5 rows)


> FAKDB=# select distinct f.bezeichnung||'e' from firmen f order by 1 limit 5
> ;
>                  ?column?
> ------------------------------------------
>  �sterreich/Welt (Ltg.)e
>  �sterreichisches Verkehrsb�ro AGe
>  \x01Assistenze
>  \x10Frohstoff Design & Textilveredelunge
>  "1. Mittelschule ""Am Kupferberg"""e
> (5 rows)


> Aha, the "�sterreich/Welt (Ltg.)"-entry is missing in the 1st query. So that
> does smell like a locale problem.

That only proves that adding the 'e' changes the sort order, which is
completely unsurprising for any non-C locale.  What you need to do is
dump out the *entire* results of the DISTINCT queries and look for the
unmatched lines.  I'd try dumping to two files, stripping the 'e' with
sed, and then sort/diff.
        regards, tom lane

Re: Distinct oddity

From

Glenn Maynard

Date:

09 May 2009, 16:21:21

On Sat, May 9, 2009 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That only proves that adding the 'e' changes the sort order, which is
> completely unsurprising for any non-C locale.  What you need to do is
> dump out the *entire* results of the DISTINCT queries and look for the
> unmatched lines.  I'd try dumping to two files, stripping the 'e' with
> sed, and then sort/diff.

How could adding an "e" change the sorting of "Österreich/Welt (Ltg.)"
compared to "Šsterreichisches Verkehrsb ro AG" in de_DE or en_US (or
any locale)?

It's also odd that the "1. Mittelschule ..." line is getting sorted after those.

--
Glenn Maynard

Re: Distinct oddity

From

Alvaro Herrera

Date:

09 May 2009, 23:58:51

Maximilian Tyrtania wrote:
> am 08.05.2009 16:55 Uhr schrieb Rob Sargent unter robsargent@rocketmail.com:

> > and what locale are you running in?
> 
>  lc_collate                      | de_DE
> | Shows the collation order locale.
>  lc_ctype                        | de_DE
> | Shows the character classification and case conversion locale.
> 
> The encoding is UTF-8.

Note that the de_DE locale uses Latin9 encoding, which is incompatible
with UTF8.  I'd try checking if the problem is reproducible in
de_DE.utf8 (you need to create a new database for testing, obviously).
If it's not, then the incompatible locale definition is causing the
problem.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Distinct oddity

From

Maximilian Tyrtania

Date:

11 May 2009, 03:28:59

am 10.05.2009 4:58 Uhr schrieb Alvaro Herrera unter
alvherre@commandprompt.com:

>>> and what locale are you running in?
>> 
>>  lc_collate                      | de_DE
>> | Shows the collation order locale.
>>  lc_ctype                        | de_DE
>> | Shows the character classification and case conversion locale.
>> 
>> The encoding is UTF-8.
> 
> Note that the de_DE locale uses Latin9 encoding, which is incompatible
> with UTF8. 

Ah, good catch.

> I'd try checking if the problem is reproducible in
> de_DE.utf8 (you need to create a new database for testing, obviously).

Wait a minute. I need to re- initdb with de_DE.UTF-8, don't I?

> If it's not, then the incompatible locale definition is causing the
> problem.

I'll try that, thanks.

M

Re: Distinct oddity

From

Alvaro Herrera

Date:

11 May 2009, 11:43:39

Maximilian Tyrtania wrote:
> am 10.05.2009 4:58 Uhr schrieb Alvaro Herrera unter
> alvherre@commandprompt.com:

> > I'd try checking if the problem is reproducible in
> > de_DE.utf8 (you need to create a new database for testing, obviously).
> 
> Wait a minute. I need to re- initdb with de_DE.UTF-8, don't I?

Well, either that, or create a new database with Latin9 encoding.


-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Distinct oddity

From

Alvaro Herrera

Date:

12 May 2009, 14:32:25

Maximilian Tyrtania wrote:
> am 11.05.2009 16:38 Uhr schrieb Alvaro Herrera unter
> alvherre@commandprompt.com:
> 
> >>>Note that the de_DE locale uses Latin9 encoding, which is incompatible
> >>>with UTF8. 
> >>> I'd try checking if the problem is reproducible in
> >>> de_DE.utf8 (you need to create a new database for testing, obviously).
> >> 
> >> Wait a minute. I need to re- initdb with de_DE.UTF-8, don't I?
> > 
> > Well, either that, or create a new database with Latin9 encoding.
> 
> FAKDB=# CREATE DATABASE "TestLatin9"
> FAKDB-#   WITH ENCODING='LATIN9'
> FAKDB-#        OWNER=postgres;
> ERROR:  encoding LATIN9 does not match server's locale de_DE
> DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.
> FAKDB=# 
> 
> Now i'm deeply confused...So do i have to re-initdb?

Hmm, I didn't expect this.  I guess I assumed de_DE was an alias for
the Latin1- or Latin9- encoded locale, but it seems your system uses it
as an alias for the UTF-8 encoded one.  So my initial comment seems to
be wrong as well.  Please paste the output of the "locale" command.

What platform are you using anyway?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Distinct oddity

From

Maximilian Tyrtania

Date:

12 May 2009, 23:16:32

am 11.05.2009 16:38 Uhr schrieb Alvaro Herrera unter
alvherre@commandprompt.com:

>>>Note that the de_DE locale uses Latin9 encoding, which is incompatible
>>>with UTF8. 
>>> I'd try checking if the problem is reproducible in
>>> de_DE.utf8 (you need to create a new database for testing, obviously).
>> 
>> Wait a minute. I need to re- initdb with de_DE.UTF-8, don't I?
> 
> Well, either that, or create a new database with Latin9 encoding.

FAKDB=# CREATE DATABASE "TestLatin9"
FAKDB-#   WITH ENCODING='LATIN9'
FAKDB-#        OWNER=postgres;
ERROR:  encoding LATIN9 does not match server's locale de_DE
DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.
FAKDB=# 

Now i'm deeply confused...So do i have to re-initdb?

Best,

Maximilian Tyrtania

Re: Distinct oddity

From

Maximilian Tyrtania

Date:

13 May 2009, 01:12:02

am 12.05.2009 19:23 Uhr schrieb Alvaro Herrera unter
alvherre@commandprompt.com:

>> FAKDB=# CREATE DATABASE "TestLatin9"
>> FAKDB-#   WITH ENCODING='LATIN9'
>> FAKDB-#        OWNER=postgres;
>> ERROR:  encoding LATIN9 does not match server's locale de_DE
>> DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.
>> FAKDB=# 
>> 
>> Now i'm deeply confused...So do i have to re-initdb?
> 
> Hmm, I didn't expect this.  I guess I assumed de_DE was an alias for
> the Latin1- or Latin9- encoded locale, but it seems your system uses it
> as an alias for the UTF-8 encoded one.  So my initial comment seems to
> be wrong as well.  Please paste the output of the "locale" command.

Sputnik-Server:~ administrator$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

> What platform are you using anyway?

Mac OS 10.4.11

Best,

Maximilian Tyrtania

Re: Distinct oddity

From

Maximilian Tyrtania

Date:

13 May 2009, 09:37:52

am 09.05.2009 16:33 Uhr schrieb Tom Lane unter tgl@sss.pgh.pa.us:

> What you need to do is
> dump out the *entire* results of the DISTINCT queries and look for the
> unmatched lines.  I'd try dumping to two files, stripping the 'e' with
> sed, and then sort/diff.

Okay, that's what I did, and the results are, well, surprising (for me, mind
you).

FAKDB=# \o /withapp
FAKDB=# select distinct f.bezeichnung||'$' from firmen f order by 1;

FAKDB=# \o /withoutapp
FAKDB=# select distinct f.bezeichnung from firmen f order by 1;

Opened those files (with textwrangler, as I 've never used sed), stripped
off the '$', sorted and looked at the differences (using textwranglers
"compare documents"-feature).

The file "withoutapp" has those lines:
"Abendschau""Abendschau"

They don't look distinct to me. The "Abendschau"-Entry appears only once in
"withapp". But then again the "withapp" file has these entries:

Adformatie
Adformatie 

which in turn only appear once in "withoutapp". I turned "show invisibles"
on in textwrangler, so I don't think there is some gremlin business going
on.

Actually I'm interested in but in no way dependant on what's going on here,
I just thought as a good pg-citizen I should report whatever might be wrong.

Best,

Maximilian Tyrtania

Re: Distinct oddity

From

Tom Lane

Date:

13 May 2009, 11:48:06

Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> writes:
> am 12.05.2009 19:23 Uhr schrieb Alvaro Herrera unter
> alvherre@commandprompt.com:
>> What platform are you using anyway?

> Mac OS 10.4.11

I have some vague recollection that UTF8-using locales don't actually
work well on OSX ... check the archives ...
        regards, tom lane

Re: Distinct oddity

From

Tom Lane

Date:

13 May 2009, 12:48:02

I wrote:
> Maximilian Tyrtania <maximilian.tyrtania@onlinehome.de> writes:
>> am 12.05.2009 19:23 Uhr schrieb Alvaro Herrera unter
>> alvherre@commandprompt.com:
>>> What platform are you using anyway?

>> Mac OS 10.4.11

> I have some vague recollection that UTF8-using locales don't actually
> work well on OSX ... check the archives ...

OK, the thread (or one of the threads) I was remembering is here:
http://archives.postgresql.org//pgsql-general/2005-11/msg00047.php

I am too lazy to boot up 10.4 right now, but looking on a 10.5.6 machine
indicates that Apple is still being pretty lame about this:

$ ls -l /usr/share/locale/de_DE 
total 40
lrwxr-xr-x  1 root  wheel   28 Feb 27  2008 LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x  1 root  wheel   17 Feb 27  2008 LC_CTYPE -> ../UTF-8/LC_CTYPE
drwxr-xr-x  3 root  wheel  102 Feb 27  2008 LC_MESSAGES
lrwxr-xr-x  1 root  wheel   30 Feb 27  2008 LC_MONETARY -> ../de_DE.ISO8859-1/LC_MONETARY
lrwxr-xr-x  1 root  wheel   29 Feb 27  2008 LC_NUMERIC -> ../de_DE.ISO8859-1/LC_NUMERIC
-r--r--r--  1 root  wheel  370 Jan  2  2008 LC_TIME

So it looks like they understand UTF-8 to the extent of supporting
character classification fairly well, but sort order is "just ASCII".
I'm not sure exactly how that might result in the observed odd behavior
of DISTINCT, but I bet it's causing it somehow.  You'd probably have
better luck in the de_DE.ISO8859-1 or de_DE.ISO8859-15 locales.
        regards, tom lane

Re: Distinct oddity

From

Glenn Maynard

Date:

13 May 2009, 15:50:02

For purposes of DISTINCT, I'd expect any sort order should do; all it
needs is for equal values to be grouped together.  If strcoll() ever
fails to do that, I'd call it a critical bug--even throwing total
garbage at it should result in a consistent ordering, even if the
ordering itself is totally meaningless.  Many sort functions depend on
this.

On Wed, May 13, 2009 at 8:37 AM, Maximilian Tyrtania
<maximilian.tyrtania@onlinehome.de> wrote:
> Opened those files (with textwrangler, as I 've never used sed), stripped
> off the '$', sorted and looked at the differences (using textwranglers
> "compare documents"-feature).

Can you narrow down what triggers this?  Try copying off the table,
and running this:

******
CREATE FUNCTION is_inconsistent() RETURNS BOOLEAN LANGUAGE SQL AS $$select (select count(distinct(f.bezeichnung))
fromfirmen_copy f) <>              (select count(distinct(f.bezeichnung||'1')) from firmen_copy f)

$$;

-- If deleting [first,last] leaves the results inconsistent, return
true; otherwise
-- roll back the deletion and return false.
CREATE FUNCTION test_delete_range(first BIGINT, last BIGINT)
RETURNS BOOLEAN LANGUAGE plpgsql AS $$
BEGIN       DELETE FROM firmen_copy WHERE id BETWEEN first AND last;       IF is_inconsistent() THEN RETURN true; END
IF;      SELECT 1/0;

EXCEPTION       WHEN division_by_zero THEN               RETURN false;
END;
$$;

CREATE FUNCTION test_func()
RETURNS INTEGER LANGUAGE plpgsql AS $$
DECLARE       total bigint;
BEGIN       IF NOT is_inconsistent() THEN RETURN -1; END IF;
       LOOP               total := (SELECT MAX(id) FROM firmen_copy);
               IF test_delete_range(0, total/2) THEN CONTINUE; END IF;               IF test_delete_range(total*1/4,
total*3/4)THEN

CONTINUE; END IF;               IF test_delete_range(total/2, total) THEN CONTINUE; END IF;
               RETURN 0;       END LOOP;
END;
$$;

SELECT test_func();
******

This assumes you have a primary key named "id", and that your IDs
start around 0 and are vaguely monotonic (renumber them in the copy if
necessary).  I can't easily test this code, of course, but it's a
simple binary search.  Depending on what's triggering this, it may or
may not be able to narrow in on a test case.

Tangentally, is there a better way of rolling back a function than a
dumb hack like "SELECT 1/0"?

-- 
Glenn Maynard

Re: Distinct oddity

From

Tom Lane

Date:

13 May 2009, 16:03:37

Glenn Maynard <glennfmaynard@gmail.com> writes:
> For purposes of DISTINCT, I'd expect any sort order should do; all it
> needs is for equal values to be grouped together.  If strcoll() ever
> fails to do that, I'd call it a critical bug--even throwing total
> garbage at it should result in a consistent ordering, even if the
> ordering itself is totally meaningless.

Feel free to complain to the libc authors for your platform --- there
are quite a number of platforms where we know that strcoll does in fact
return garbage (as in inconsistent results) when given garbage input
(which in this case typically means data that's invalid according to
what strcoll thinks the encoding is).  That's one reason why we've been
tightening up the encoding-validity enforcement in recent PG releases.

It's not entirely clear to me how things are going wrong in Maximilian's
particular example, but I suspect that the root of it is a problem of
this type.

> Tangentally, is there a better way of rolling back a function than a
> dumb hack like "SELECT 1/0"?

RAISE ERROR?
        regards, tom lane