Thread: arrays and indexes

arrays and indexes

From
"Ross J. Reedstrom"
Date:
Hi all -
I've got a schema I'm working on modifying, and I need some help getting
the best performance out. The original schema has a many-to-many linkage
between a couple of tables, using a two-column linkage table. This is used
to represent groups of people and their relationship to an object
(authors, copyright holders, maintainers). This worked fine and, with the
right indexes, is quite zippy. Approximate schema:

table content (
contentid serial,
name text,
<...>
authorgroupid int,
cpholdergroupid int,
maintgroupid int)

table groups (
personid text,
groupid int)

Note that neither groupid nor personid is unique.

Now the users want not just groups, but ordered lists. Well, that's just
fine: we could do it with another column in the groups linkage table,
and some additional logic in the middleware for detecting identical
groups, but it occurred to me that PG's array types are just the ticket
for ordered lists like this.
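For comparison, the extra-column version would look something like this
(the position column is hypothetical):

table groups (
personid text,
groupid int,
position int)

SELECT personid FROM groups
WHERE groupid = 42 ORDER BY position;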

So, by dropping arrays of personids (authors, copyrightholders,
maintainers, ...) into the content table, I can do everything I need.
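Approximately (column names illustrative, and integer personids as in
the query further down):

table content (
contentid serial,
name text,
<...>
authors int[],
cpholders int[],
maintainers int[])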

Only one problem. Retrieving all the content for a particular
person/role is fairly common. Queries of the form:

SELECT * from content c join groups g on c.authorgroupid = g.groupid
where g.personid = 'ross';

work fine and use the index on groups.personid.

In the new schema, the same thing is:

SELECT * from content where 42 = ANY (authors);

Works fine, but for the life of me I can't find nor figure out how to
build an index that will be used to speed this along. Any ideas?
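For what it's worth, a plain btree index on the array column never gets
picked for this (index name made up):

CREATE INDEX content_authors_idx ON content (authors);
EXPLAIN SELECT * FROM content WHERE 42 = ANY (authors);
-- the plan is still a Seq Scan on content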

I'm using 7.4.3, BTW.

Ross
--
Ross Reedstrom, Ph.D.                                 reedstrm@rice.edu
Research Scientist                                  phone: 713-348-6166
The Connexions Project      http://cnx.rice.edu       fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E  F888 D3AE 810E 88F0 BEDE



Re: arrays and indexes

From
Greg Stark
Date:
"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

> In the new schema, the same thing is:
>
> SELECT * from content where 42 = ANY (authors);
>
> Works fine, but for the life of me I can't find nor figure out how to
> build an index that will be used to speed this along. Any ideas?

Well that's basically the problem with denormalized data like this.

Have you resolved what you're going to do if two sessions try to add a user to
the same group at the same time? Or how you'll go about removing a user from
all his groups in one shot?

Basically, if you denormalize in this fashion it becomes hard to use the
groups as anything but single monolithic objects. Whereas normalized data can
be queried and updated from other points of view like in the case you name
above.

Postgres does have a way to do what you ask, though. It involves GiST indexes
and the operators from the contrib/intarray directory from the Postgres
source.
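From memory, the usage is along these lines (opclass and operator names
are from the contrib/intarray README, so double-check against your
version):

CREATE INDEX content_authors_gist ON content
USING gist (authors gist__int_ops);

-- && is intarray's overlap operator; overlap with a one-element
-- array is an indexable "contains 42" test
SELECT * FROM content WHERE authors && '{42}';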

However I warn you in advance that this is fairly esoteric stuff and will take
some time to get used to. And at least in my case I found the indexes didn't
actually help much for my data sets, probably because they just weren't big
enough to benefit.

--
greg

Re: arrays and indexes

From
"Merlin Moncure"
Date:
Ross wrote:
> Hi all -
> I've got a schema I'm working on modifying, and I need some help
> getting the best performance out. The original schema has a
> many-to-many linkage between a couple of tables, using a two-column
> linkage table. This is used to represent groups of people and their
> relationship to an object (authors, copyright holders, maintainers).
> This worked fine and, with the right indexes, is quite zippy.
> Approximate schema:
>
> table content (
> contentid serial,
> name text,
> <...>
> authorgroupid int,
> cpholdergroupid int,
> maintgroupid int)
>
> table groups (
> personid text,
> groupid int)
>
> Note that neither groupid nor personid is unique.
>
> Now the users want not just groups, but ordered lists. Well, that's
> just fine: we could do it with another column in the groups linkage
> table, and some additional logic in the middleware for detecting
> identical groups, but it occurred to me that PG's array types are just
> the ticket for ordered lists like this.
>
> So, by dropping arrays of personids (authors, copyrightholders,
> maintainers, ...) into the content table, I can do everything I need.
>
> Only one problem. Retrieving all the content for a particular
> person/role is fairly common. Queries of the form:
>
> SELECT * from content c join groups g on c.authorgroupid = g.groupid
> where g.personid = 'ross';
>
> work fine and use the index on groups.personid.
>
> In the new schema, the same thing is:
>
> SELECT * from content where 42 = ANY (authors);
>
> Works fine, but for the life of me I can't find nor figure out how to
> build an index that will be used to speed this along. Any ideas?
>
> I'm using 7.4.3, BTW.

Arrays are usually a bad choice to put in your tables, with a couple of
exceptions.  Keep in mind that you can generate the array at query time
using custom aggregates if you prefer to deal with arrays on the client
side.  The basic problem is that in-table arrays introduce flexibility
issues and are usually better handled by moving the data to a dependent
table.
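For example, a simple accumulator aggregate, essentially the array_accum
example from the 7.4 docs:

create aggregate array_accum (
    basetype = anyelement,
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);

-- build each group's member list at query time from the
-- normalized table
select groupid, array_accum(personid) as members
from groups
group by groupid;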

Here are cases where you might want to consider using arrays in your
tables:
1.  Your array bounds are small and known at design time (think: the
pay-by-quarter example in the docs).
2.  Your array will not contain more than one or two dependent elements.
3.  You are dealing with an extreme performance situation and you have
tried doing things the proper way first.

There are other exceptions...arrays can be a powerful tool, albeit a
dangerous one...just know what you are getting into.  A firm
understanding of relational principles is a tremendous help.

If your array bounds are known, it's possible to get around the index
problem in limited cases by using a custom function:

create function any_quarter_over_10k (numeric[]) returns boolean as
'
    select
      case
        when $1[1] >= 10000 then true
        when $1[2] >= 10000 then true
        when $1[3] >= 10000 then true
        when $1[4] >= 10000 then true
        else false
      end;
' language 'sql' IMMUTABLE;

create index t_q_10k_idx on t (any_quarter_over_10k(salary_qtr));

select * from t where any_quarter_over_10k(t.salary_qtr) = true;


Good luck!
Merlin

Re: arrays and indexes

From
"Ross J. Reedstrom"
Date:
On Mon, Jul 26, 2004 at 02:27:20AM -0400, Greg Stark wrote:
>
> "Ross J. Reedstrom" <reedstrm@rice.edu> writes:
>
> > In the new schema, the same thing is:
> >
> > SELECT * from content where 42 = ANY (authors);
> >
> > Works fine, but for the life of me I can't find nor figure out how to
> > build an index that will be used to speed this along. Any ideas?
>
> Well that's basically the problem with denormalized data like this.
>
> Have you resolved what you're going to do if two sessions try to add a user to
> the same group at the same time? Or how you'll go about removing a user from
> all his groups in one shot?

We've got plenty of interlocks in the middleware to handle the first
(mainly because this is an authoring system where everyone has to agree
to participate, and acknowledge the open license on the materials).

Second, they _can't_ be removed: we're effectively a write-only archive.
Even if we weren't, it would be a rare event and could go slowly (loop
over groups in the middleware, probably).

>
> Basically, if you denormalize in this fashion it becomes hard to use the
> groups as anything but single monolithic objects. Whereas normalized data can
> be queried and updated from other points of view like in the case you name
> above.

These groups _really are_ ideal for Joe Conway's work on arrays: we need
ordered vectors, so we'd be sorting all the time otherwise. They're
static, and they're read-only. The one thing they're not is of fixed,
known size (sorry, Merlin). They work fine for the query as shown: the
only issue is performance.

> Postgres does have a way to do what you ask, though. It involves GiST
> indexes and the operators from the contrib/intarray directory from the
> Postgres source.

Well, yes, that's how it used to be done. I figured the new array
support should be able to handle it without the addon, however.

> However I warn you in advance that this is fairly esoteric stuff and
> will take some time to get used to. And at least in my case I found
> the indexes didn't actually help much for my data sets, probably
> because they just weren't big enough to benefit.

I know that they should help in this case: we've got lots of content.
Any particular author or maintainer will be in a small fraction of
those, i.e. it's ideal for an index. And the current joined case uses
an index, when it's available. I'll take a look at the GiST/contrib work,
anyway.

Thanks -

Ross
--
Ross Reedstrom, Ph.D.                                 reedstrm@rice.edu
Research Scientist                                  phone: 713-348-6166
The Connexions Project      http://cnx.rice.edu       fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E  F888 D3AE 810E 88F0 BEDE

Re: arrays and indexes

From
Pierre-Frédéric Caillaud
Date:
>>> SELECT * from content where 42 = ANY (authors);

>> Postgres does have a way to do what you ask, though. It involves GiST
>> indexes and the operators from the contrib/intarray directory from the
>> Postgres source.

    I have tried to use these indexes, and the performance was very good. It
can be faster (in fact much faster) than a join with an additional table,
because you don't have a join. The SQL array syntax is a pain, though.

Re: arrays and indexes

From
Greg Stark
Date:
"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

> These groups _really are_ ideal for Joe Conway's work on arrays: we need
> ordered vectors, so we'd be sorting all the time otherwise. They're
> static, and they're read-only. The one thing they're not is of fixed,
> known size (sorry, Merlin). They work fine for the query as shown: the
> only issue is performance.

Well just as long as you understand the trade-offs. Denormalizing can be
useful but you need to know what flexibility you're losing too.

> > Postgres does have a way to do what you ask, though. It involves GiST
> > indexes and the operators from the contrib/intarray directory from the
> > Postgres source.
>
> Well, yes, that's how it used to be done. I figured the new array
> support should be able to handle it without the addon, however.

I think you can btree index arrays now, which is new, but it's not useful for
the kind of lookup you're doing. It would only be useful for joining on array
types or looking for groups with given content, or things like that.
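That is, something like this (names made up) only helps whole-array
comparisons, not element membership:

CREATE INDEX content_authors_btree ON content (authors);

-- can use the btree: whole-array equality
SELECT * FROM content WHERE authors = '{42,17}';

-- cannot use it: element membership
SELECT * FROM content WHERE 42 = ANY (authors);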

> > However I warn you in advance that this is fairly esoteric stuff and
> > will take some time to get used to. And at least in my case I found
> > the indexes didn't actually help much for my data sets, probably
> > because they just weren't big enough to benefit.
>
> I know that they should help in this case: we've got lots of content.
> Any particular author or maintainer will be in a small fraction of
> those, i.e. it's ideal for an index. And the current joined case uses
> an index, when it's available. I'll take a look at the GiST/contrib work,
> anyway.

I would be curious to know how it goes. My own project uses denormalized sets
stored as arrays as well, though in my case they're precalculated from the
fully normalized data. I tried to use GiST indexes but ran into problems
combining the btree-GiST code with array GiST code in a multicolumn index. I
still don't really know why it failed, but after two days building the index I
gave up.
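Roughly what I was attempting, with made-up names (btree_gist supplies
the scalar opclass, intarray the array one):

CREATE INDEX obj_cat_sections_gist ON obj
USING gist (category gist_int4_ops, sections gist__int_ops);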

--
greg

Re: arrays and indexes

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I would be curious to know how it goes. My own project uses
> denormalized sets stored as arrays as well, though in my case they're
> precalculated from the fully normalized data. I tried to use GiST
> indexes but ran into problems combining the btree-GiST code with array
> GiST code in a multicolumn index. I still don't really know why it
> failed, but after two days building the index I gave up.

Sounds like a bug to me.  Could you put together a test case?

            regards, tom lane

Re: arrays and indexes

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > I still don't really know why it failed, but after two days building the
> > index I gave up.
>
> Sounds like a bug to me.  Could you put together a test case?

At the time I contacted one of the GiST authors and we went over things
for a while. They diagnosed the problem as being caused by having a
poor-selectivity GiST btree column as the leading column in the index.

He seemed to think this was fairly fundamental and wasn't something they
were going to be able to address. And I was fairly certain I didn't want
to turn the index upside down to put the more selective columns first
(as is usual) for various reasons.

So I gave it up as a lost cause. In any case in my application it was unlikely
to really help. I expect that leading btree index to narrow the search to only
a few hundred or few thousand records in the normal case. So the access times
are already within reason even having to dig through all the records. And
since other queries are likely to need other records from that set I'll need
them all in cache eventually. There are a lot of array columns to search
through, so the added i/o to read all those indexes would probably be a net
loss when they push other things out of cache.

I could try setting up a test case, but I think all it took was having a
btree-gist index that was insufficiently selective. In my case I had
about 900 distinct integer values, each matching on the order of
100-1000 records.

--
greg