Re: arrays and indexes - Mailing list pgsql-performance

From Merlin Moncure
Subject Re: arrays and indexes
Date
Msg-id 6EE64EF3AB31D5448D0007DD34EEB34101AEF1@Herge.rcsinc.local
In response to arrays and indexes  ("Ross J. Reedstrom" <reedstrm@rice.edu>)
Responses Re: arrays and indexes
List pgsql-performance
Ross wrote:
> Hi all -
> I've got a schema I'm working on modifying, and I need some help
> getting the best performance out. The original schema has a many to
> many linkage between a couple of tables, using a two column linkage
> table. This is used to represent groups of people and their
> relationship to an object (authors, copyrightholders, maintainers).
> This worked fine, and, with the right indexes, is quite zippy.
> Approximate schema:
>
> table content (
> contentid serial,
> name text,
> <...>
> authorgroupid int,
> cpholdergroupid int,
> maintgroupid int)
>
> table groups (
> personid text,
> groupid int)
>
> Note that neither groupid nor personid is unique.
>
> Now the users want not just groups, but ordered lists. Well, that's
> just fine: we could do it with another column in the groups linkage
> table, and some additional logic in the middleware for detecting
> identical groups, but it occurred to me that PG's array types are just
> the ticket for ordered lists like this.
>
> So, by dropping arrays of personids (authors, copyrightholders,
> maintainers, ...) into the content table, I can do everything I need.
>
> Only one problem. Retrieving all the content for a particular
> person/role is fairly common. Queries of the form:
>
> SELECT * from content c join groups g on c.authorgroupid = g.groupid
>   where g.personid = 'ross';
>
> work fine and use the index on groups.personid.
>
> In the new schema, the same thing is:
>
> SELECT * from content where 42 = ANY (authors);
>
> Works fine, but for the life of me I can't find nor figure out how to
> build an index that will be used to speed this along. Any ideas?
>
> I'm using 7.4.3, BTW.

Arrays are usually a bad choice to put in your tables, with a couple of
exceptions.  Keep in mind that you can generate the array at query time
using a custom aggregate if you prefer to deal with arrays on the client
side.  The basic problem is that in-table arrays introduce flexibility
issues and are usually better handled by moving the data into a
dependent table.
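
For example, a polymorphic aggregate like the often-posted array_accum
builds the list at query time, so nothing has to be stored as an array.
An untested sketch (groups/groupid/personid are from your schema; element
order is simply whatever order the rows arrive in):

create aggregate array_accum (
    basetype = anyelement,
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);

-- one row per group, members collected into an array on the fly
select groupid, array_accum(personid) as members
  from groups
 group by groupid;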

Here are the cases where you might want to consider using arrays in your
tables:
1.  Your array bounds are small and known at design time (think: pay by
quarter example in the docs).
2.  Your array will not contain more than one or two dependent elements.
3.  You are dealing with an extreme performance situation and you have
tried doing things the proper way first (for your ordered lists, the
proper way is sketched just below).
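
For what it's worth, the 'proper way' for your ordered author lists is
just the linkage table you already have plus a position column.  A rough
sketch (all names below are made up, adjust to taste):

create table content_authors (
    contentid int,     -- matches content.contentid
    personid  text,
    pos       int,     -- position of this author within the list
    primary key (contentid, pos)
);

create index content_authors_personid_idx on content_authors (personid);

-- everything authored by ross (uses the personid index):
select c.*
  from content c
  join content_authors ca on ca.contentid = c.contentid
 where ca.personid = 'ross';

-- the ordered author list for one piece of content:
select personid
  from content_authors
 where contentid = 1
 order by pos;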

There are other exceptions...arrays can be a powerful tool, albeit a
dangerous one...just know what you are getting into.  A firm
understanding of relational principles is a tremendous help.

If your array bounds are known at design time, it is possible to get
around the index problem in limited cases by using a custom function and
a functional index:

-- returns true if any quarterly value in the array is at least 10000
create function any_quarter_over_10k (numeric[]) returns boolean as
'
    select
      case
        when $1[1] >= 10000 then true
        when $1[2] >= 10000 then true
        when $1[3] >= 10000 then true
        when $1[4] >= 10000 then true
        else false
      end;
' language 'sql' IMMUTABLE;

-- functional index over the array expression
create index t_q_10k_idx on t (any_quarter_over_10k(salary_qtr));

select * from t where any_quarter_over_10k(t.salary_qtr) = true;
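
Note that the planner will only pick up t_q_10k_idx when the query uses
the same expression the index was built on (and the function has to stay
IMMUTABLE), so it is worth confirming the plan actually shows an index
scan:

explain
select * from t where any_quarter_over_10k(t.salary_qtr) = true;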


Good luck!
Merlin
