Thread: DISTINCT is not quite distinct
I run this innocent query CREATE TABLE foo AS SELECT DISTINCT bar FROM baz ORDER BY bar; and the resulting table contains duplicate rows. 8-( According to EXPLAIN, an index scan on the bar column is used (using the underlying B-tree index). This is with PostgreSQL 8.1.4 (Debian package 8.1.4-6). Is this a known problem? If I drop the DISTINCT, the output is not correctly ordered, either. Perhaps this is an index corruption issue? The hardware itself seems fine. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
Florian Weimer wrote: > I run this innocent query > > CREATE TABLE foo AS SELECT DISTINCT bar FROM baz ORDER BY bar; > > and the resulting table contains duplicate rows. 8-( > > According to EXPLAIN, an index scan on the bar column is used (using > the underlying B-tree index). This is with PostgreSQL 8.1.4 (Debian > package 8.1.4-6). Is this a known problem? > > If I drop the DISTINCT, the output is not correctly ordered, either. > Perhaps this is an index corruption issue? The hardware itself seems > fine. Could be index corruption perhaps. I take it SELECT DISTINCT bar... shows the same problem? If so, can you do: SELECT OID,xmin,cmin,xmax,cmax,bar FROM baz WHERE bar = <something with duplicates> -- Richard Huxton Archonet Ltd
Florian Weimer <fweimer@bfk.de> writes: > I run this innocent query > CREATE TABLE foo AS SELECT DISTINCT bar FROM baz ORDER BY bar; > and the resulting table contains duplicate rows. 8-( > According to EXPLAIN, an index scan on the bar column is used (using > the underlying B-tree index). Do you mean an indexscan followed immediately by a Unique node? If so, yeah, that would depend entirely on correct ordering of the indexscan output to produce distinct results. > If I drop the DISTINCT, the output is not correctly ordered, either. > Perhaps this is an index corruption issue? The hardware itself seems > fine. Perhaps. Do you want to save off a physical copy of the index and then try REINDEXing? If that fixes it, I'd be interested to compare the two versions of the index. regards, tom lane
* Tom Lane: >> According to EXPLAIN, an index scan on the bar column is used (using >> the underlying B-tree index). > > Do you mean an indexscan followed immediately by a Unique node? If > so, yeah, that would depend entirely on correct ordering of the > indexscan output to produce distinct results. Yes. >> If I drop the DISTINCT, the output is not correctly ordered, either. >> Perhaps this is an index corruption issue? The hardware itself seems >> fine. > > Perhaps. Do you want to save off a physical copy of the index and then > try REINDEXing? The duplicate row is gone. > If that fixes it, I'd be interested to compare the two versions of > the index. The index files are about 155 MB and 98 MB, compressed. How shall we transfer them? (Their contents is not super-secret, but I don't want to distribute them widely, either.) -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
* Richard Huxton: > I take it SELECT DISTINCT bar... shows the same problem? SELECT bar FROM baz does *not* show the duplicate row. > If so, can you do: > SELECT OID,xmin,cmin,xmax,cmax,bar FROM baz > WHERE bar = <something with duplicates> Even if I force a complete index scan, I get xmin = 1007617 for both rows, the others are zero. The table hasn't got OIDs. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99