Thread: DISTINCT is not quite distinct

DISTINCT is not quite distinct

From

Florian Weimer

Date:

06 November 2006, 10:14:37

I run this innocent query

CREATE TABLE foo AS SELECT DISTINCT bar FROM baz ORDER BY bar;

and the resulting table contains duplicate rows. 8-(

According to EXPLAIN, an index scan on the bar column is used (using
the underlying B-tree index).  This is with PostgreSQL 8.1.4 (Debian
package 8.1.4-6).  Is this a known problem?

If I drop the DISTINCT, the output is not correctly ordered, either.
Perhaps this is an index corruption issue?  The hardware itself seems
fine.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: DISTINCT is not quite distinct

From

Richard Huxton

Date:

06 November 2006, 10:39:17

Florian Weimer wrote:
> I run this innocent query
>
> CREATE TABLE foo AS SELECT DISTINCT bar FROM baz ORDER BY bar;
>
> and the resulting table contains duplicate rows. 8-(
>
> According to EXPLAIN, an index scan on the bar column is used (using
> the underlying B-tree index).  This is with PostgreSQL 8.1.4 (Debian
> package 8.1.4-6).  Is this a known problem?
>
> If I drop the DISTINCT, the output is not correctly ordered, either.
> Perhaps this is an index corruption issue?  The hardware itself seems
> fine.

Could be index corruption perhaps.

I take it SELECT DISTINCT bar... shows the same problem?

If so, can you do:
  SELECT OID,xmin,cmin,xmax,cmax,bar FROM baz
   WHERE bar = <something with duplicates>

--
   Richard Huxton
   Archonet Ltd

Re: DISTINCT is not quite distinct

From

Tom Lane

Date:

06 November 2006, 10:42:08

Florian Weimer <fweimer@bfk.de> writes:
> I run this innocent query
> CREATE TABLE foo AS SELECT DISTINCT bar FROM baz ORDER BY bar;
> and the resulting table contains duplicate rows. 8-(

> According to EXPLAIN, an index scan on the bar column is used (using
> the underlying B-tree index).

Do you mean an indexscan followed immediately by a Unique node?  If
so, yeah, that would depend entirely on correct ordering of the
indexscan output to produce distinct results.

> If I drop the DISTINCT, the output is not correctly ordered, either.
> Perhaps this is an index corruption issue?  The hardware itself seems
> fine.

Perhaps.  Do you want to save off a physical copy of the index and then
try REINDEXing?  If that fixes it, I'd be interested to compare the two
versions of the index.

            regards, tom lane

Re: DISTINCT is not quite distinct

From

Florian Weimer

Date:

06 November 2006, 13:10:38

* Tom Lane:

>> According to EXPLAIN, an index scan on the bar column is used (using
>> the underlying B-tree index).
>
> Do you mean an indexscan followed immediately by a Unique node?  If
> so, yeah, that would depend entirely on correct ordering of the
> indexscan output to produce distinct results.

Yes.

>> If I drop the DISTINCT, the output is not correctly ordered, either.
>> Perhaps this is an index corruption issue?  The hardware itself seems
>> fine.
>
> Perhaps.  Do you want to save off a physical copy of the index and then
> try REINDEXing?

The duplicate row is gone.

> If that fixes it, I'd be interested to compare the two versions of
> the index.

The index files are about 155 MB and 98 MB, compressed.  How shall we
transfer them?  (Their contents is not super-secret, but I don't want
to distribute them widely, either.)

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: DISTINCT is not quite distinct

From

Florian Weimer

Date:

06 November 2006, 13:20:33

* Richard Huxton:

> I take it SELECT DISTINCT bar... shows the same problem?

SELECT bar FROM baz does *not* show the duplicate row.

> If so, can you do:
>  SELECT OID,xmin,cmin,xmax,cmax,bar FROM baz
>   WHERE bar = <something with duplicates>

Even if I force a complete index scan, I get xmin = 1007617 for both
rows, the others are zero.  The table hasn't got OIDs.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99