SQL - Indexing for performance on uniqueness check...

From: "Loftis, Charles E"

When trying to find duplicates in a table, I need to know how to index the
table to optimize performance.
Should there be an index on each attribute (A1, A2, ..., An) in the GROUP
BY, or should there be one multi-attribute index on all the grouping
attributes?

Assume the table has more attributes than those attributes being GROUPed on.
Also, assume all attributes are of type varchar.

Sample query to return non-uniqueness
   SELECT A1, A2, A3, ..., An
   FROM Table
   GROUP BY A1, A2, A3, ..., An
   HAVING Count(*)>1
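
To make the question concrete (the index names below are made up), "an index
for each attribute" would mean something like

   CREATE INDEX t_a1_idx ON Table (A1);
   CREATE INDEX t_a2_idx ON Table (A2);
   -- ... one single-column index per grouping attribute

as opposed to a single multi-attribute index such as

   CREATE INDEX t_group_idx ON Table (A1, A2, A3 /* ... through An */);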

Re: SQL - Indexing for performance on uniqueness check...

From: Josh Berkus

Charles,

> Assume the table has more attributes than those attributes being GROUPed
> on. Also, assume all attributes are of type varchar.
>
> Sample query to return non-uniqueness
>    SELECT A1, A2, A3, ..., An
>    FROM Table
>    GROUP BY A1, A2, A3, ..., An
>    HAVING Count(*)>1

In order for it to be even possible to use an index (a hashaggregate
operation, actually) on this table, you'd have to include *all* of the GROUP
BY columns in a single, multi-column index.

However, it would be unlikely for PG to use any kind of an index in the
operation above, because of the number of columns, the unlikelihood of
grouping (i.e., only a minority of rows will have count(*) > 1), and the
fact that you're running this against the whole table.  So any kind of
index is liable to be useless.

If the table is so large that you *have* to use an index or it takes
absolutely forever to run, then you may wish to try different operations to
detect duplicate rows.  For example, if you can assume there is a column Ax
which is not among the grouping columns A1 ... An, and which can be counted
on to be both (a) not null and (b) different for duplicate rows (a
timestamp, for example), then you could do:

SELECT A1, A2, A3, ... An
FROM table t1
WHERE EXISTS (SELECT 1
    FROM table t2
    WHERE t2.A1 = t1.A1
        AND t2.A2 = t1.A2
        ...
        AND t2.An = t1.An
        AND t2.Ax <> t1.Ax
    );

This can be a much better structure for indexed searches because an index on
some-but-not-all of the columns A1 ... An can be used for the EXISTS join,
such as an index on A1, A2, A3.
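
For instance (sticking with your placeholder names), something like

CREATE INDEX t_a1_a2_a3_idx ON Table (A1, A2, A3);

would let the inner query be driven by an index scan on those three columns,
with the remaining equality conditions applied as a filter.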

Of course, if this is an import table, where some of the rows are *exact*
duplicates, the above doesn't help.   On the other hand, if the rows *are*
exact duplicates, why do you care?   Just do SELECT DISTINCT on transfer and
eliminate them.
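
For example (the table names here are just illustrative), the transfer step
could be as simple as

INSERT INTO final_table
SELECT DISTINCT *
FROM import_table;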

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: SQL - Indexing for performance on uniqueness check...

From: Tom Lane

Josh Berkus <josh@agliodbs.com> writes:
> Charles,
>> Sample query to return non-uniqueness
>> SELECT A1, A2, A3, ..., An
>> FROM Table
>> GROUP BY A1, A2, A3, ..., An
>> HAVING Count(*)>1

> In order for it to be even possible to use an index (a hashaggregate
> operation, actually) on this table, you'd have to include *all* of the GROUP
> BY columns in a single, multi-column index.

> However, it would be unlikely for PG to use any kind of an index in the
> operation above, because of the number of columns, the unlikelihood of
> grouping (i.e., only a minority of rows will have count(*) > 1), and the
> fact that you're running this against the whole table.  So any kind of
> index is liable to be useless.

Yeah.  If you are not expecting a huge number of groups, I think that it
would be more interesting to try a HashAggregate plan than a sort/group
plan.  For this you need 7.4 or later and a sort_mem setting large
enough to cover whatever the planner estimates the hashtable size to be.
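
For instance (the memory figure is only illustrative; size it to your
machine):

SET sort_mem = 65536;    -- sort_mem is in kilobytes, so this is 64MB

EXPLAIN
SELECT A1, A2, A3, ..., An
FROM Table
GROUP BY A1, A2, A3, ..., An
HAVING count(*) > 1;

If the plan shows a HashAggregate node rather than a Sort followed by
GroupAggregate, you are getting the hashed plan.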

            regards, tom lane