Home > mailing lists

Re: SQL - Indexing for performance on uniquness check... - Mailing list pgsql-novice

From	Josh Berkus
Subject	Re: SQL - Indexing for performance on uniquness check...
Date	July 18, 2004 20:10:09
Msg-id	200407181309.23862.josh@agliodbs.com Whole thread Raw
In response to	SQL - Indexing for performance on uniquness check... ("Loftis, Charles E" <charles.loftis@eds.com>)
Responses	Re: SQL - Indexing for performance on uniquness check...
List	pgsql-novice

Tree view

Charles,

> Assume the table has more attributes than those attributes being GROUPed
> on. Also, assume all attributes are of type varchar.
>
> Sample query to return non-uniqueness
>    SELECT A1, A2, A3, ..., An
>    FROM Table
>    GROUP BY A1, A2, A3, ..., An
>    HAVING Count(*)>1

In order for it to be even possible to use an index (a hashaggregate
operation, actually) on this table, you'd have to include *all* of the GROUP
BY columns in a single, multi-column index.

However, it would be unlikely for PG to use any kind of an index in the
operation above, because of the number of columns, the unlikelyness of
grouping (i.e. there will only be a minority of rows with count(*) > 1) and
the fact that you're running this against the whole table.  So any kind of an
index is liable to be useless.

If the table is so large that you *have* to use an index or it takes
absolutely forever to run, then you may wish to try different operations to
detect duplicate rows.   For example, if it could be assumed that there was a
column Ax which was not included as a unique identifier, and could be counted
on to be both (a) not null and (b) different for duplicate rows (a timestamp
for example), then you could do:

SELECT A1, A2, A3, ... An
FROM table
WHERE EXISTS (SELECT 1
    FROM table t2
    WHERE t2.A1 = table.A1
        AND t2.A2 = table.A2
        ...
        AND t2.An = table.An
        AND t2.Ax <> table.Ax
    );

This can be a much better structure for indexed searches because an index on
some-but-not-all of the columns A1 ... An can be used for the EXISTS join,
such as an index on A1, A2, A3.

Of course, if this is an import table, where some of the rows are *exact*
duplicates, the above doesn't help.   On the other hand, if the rows *are*
exact duplicates, why do you care?   Just do SELECT DISTINCT on transfer and
eliminate them.

--
Josh Berkus
Aglio Database Solutions
San Francisco

pgsql-novice by date:

From: "Loftis, Charles E"
Date: 18 July 2004, 19:21:07
Subject: SQL - Indexing for performance on uniquness check...

From: Frank Voellmann
Date: 18 July 2004, 20:32:22
Subject: Trouble with pg_dump in 7.3.4

Re: SQL - Indexing for performance on uniquness check... - Mailing list pgsql-novice

Previous

Next