Thread: arrays and indexes
Hi all -

I've got a schema I'm working on modifying, and I need some help getting
the best performance out. The original schema has a many-to-many linkage
between a couple of tables, using a two-column linkage table. This is used
to represent groups of people and their relationship to an object
(authors, copyrightholders, maintainers). This worked fine, and, with the
right indexes, is quite zippy. Approximate schema:

table content (
    contentid serial,
    name text,
    <...>
    authorgroupid int,
    cpholdergroupid int,
    maintgroupid int)

table groups (
    personid text,
    groupid int)

Note that neither groupid nor personid is unique.

Now the users want not just groups, but ordered lists. Well, that's just
fine: we could do it with another column in the groups linkage table, and
some additional logic in the middleware for detecting identical groups,
but it occurred to me that PG's array types are just the ticket for
ordered lists like this.

So, by dropping arrays of personids (authors, copyrightholders,
maintainers, ...) into the content table, I can do everything I need.

Only one problem. Retrieving all the content for a particular person/role
is fairly common. Queries of the form:

SELECT * from content c join groups g on c.authorgroupid = g.groupid
  where g.personid = 'ross';

work fine and use the index on groups.personid.

In the new schema, the same thing is:

SELECT * from content where 42 = ANY (authors);

Works fine, but for the life of me I can't find nor figure out how to
build an index that will be used to speed this along. Any ideas?

I'm using 7.4.3, BTW.

Ross
--
Ross Reedstrom, Ph.D.                            reedstrm@rice.edu
Research Scientist                             phone: 713-348-6166
The Connexions Project   http://cnx.rice.edu     fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE
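For concreteness, the array-based layout Ross describes might look roughly
like this; the column names and element type are illustrative guesses, not
taken from his actual schema:

    -- hypothetical array-based version of the content table
    table content (
        contentid   serial,
        name        text,
        <...>
        authors     integer[],   -- ordered list of person ids, one array per role
        cpholders   integer[],
        maintainers integer[])

    -- the common lookup that now lacks a usable index
    SELECT * from content where 42 = ANY (authors);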
"Ross J. Reedstrom" <reedstrm@rice.edu> writes: > In the new schema, the same thing is: > > SELECT * from content where 42 = ANY (authors); > > Works fine, but for the life of me I can't find nor figure out how to > build an index that will be used to speed this along. Any ideas? Well that's basically the problem with denormalized data like this. Have you resolved what you're going to do if two sessions try to add a user to the same group at the same time? Or how you'll go about removing a user from all his groups in one shot? Basically, if you denormalize in this fashion it becomes hard to use the groups as anything but single monolithic objects. Whereas normalized data can be queried and updated from other points of view like in the case you name above. Postgres does have a way to do what you ask, though. It involves GiST indexes and the operators from the contrib/intarray directory from the Postgres source. However I warn you in advance that this is fairly esoteric stuff and will take some time to get used to. And at least in my case I found the indexes didn't actually help much for my data sets, probably because they just weren't big enough to benefit. -- greg
Ross wrote:
> Hi all -
> I've got a schema I'm working on modifying, and I need some help getting
> the best performance out. The original schema has a many-to-many linkage
> between a couple of tables, using a two-column linkage table. This is
> used to represent groups of people and their relationship to an object
> (authors, copyrightholders, maintainers). This worked fine, and, with
> the right indexes, is quite zippy. Approximate schema:
>
> table content (
>     contentid serial,
>     name text,
>     <...>
>     authorgroupid int,
>     cpholdergroupid int,
>     maintgroupid int)
>
> table groups (
>     personid text,
>     groupid int)
>
> Note that neither groupid nor personid is unique.
>
> Now the users want not just groups, but ordered lists. Well, that's just
> fine: we could do it with another column in the groups linkage table,
> and some additional logic in the middleware for detecting identical
> groups, but it occurred to me that PG's array types are just the ticket
> for ordered lists like this.
>
> So, by dropping arrays of personids (authors, copyrightholders,
> maintainers, ...) into the content table, I can do everything I need.
>
> Only one problem. Retrieving all the content for a particular
> person/role is fairly common. Queries of the form:
>
> SELECT * from content c join groups g on c.authorgroupid = g.groupid
>   where g.personid = 'ross';
>
> work fine and use the index on groups.personid.
>
> In the new schema, the same thing is:
>
> SELECT * from content where 42 = ANY (authors);
>
> Works fine, but for the life of me I can't find nor figure out how to
> build an index that will be used to speed this along. Any ideas?
>
> I'm using 7.4.3, BTW.

Arrays are usually a bad choice to put in your tables, with a couple of
exceptions. Keep in mind that you can generate the array at query time
using custom aggregates if you prefer to deal with arrays on the client
side. The basic problem is that they introduce flexibility issues and are
usually better handled by moving the data to a dependent table.

Here are cases where you might want to consider using arrays in your
tables:

1. Your array bounds are small and known at design time (think: the
   pay-by-quarter example in the docs).
2. Your array will not contain more than one or two dependent elements.
3. You are dealing with an extreme performance situation and you have
   tried doing things the proper way first.

There are other exceptions... arrays can be a powerful tool, albeit a
dangerous one... just know what you are getting into. A firm understanding
of relational principles is a tremendous help.

If your array bounds are known, it is possible to get around the index
problem in limited cases by using a custom function (but only when the
array bounds are known):

create function any_quarter_over_10k (numeric[]) returns boolean as '
    select case
        when $1[1] >= 10000 then true
        when $1[2] >= 10000 then true
        when $1[3] >= 10000 then true
        when $1[4] >= 10000 then true
        else false
    end;
' language 'sql' IMMUTABLE;

create index t_q_10k_idx on t(any_quarter_over_10k(salary_qtr));

select * from t where any_quarter_over_10k(t.salary_qtr) = true;

Good luck!
Merlin
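For comparison, the "dependent table" shape Merlin is advocating would
carry the ordering in an explicit position column; all names here are made
up for illustration, and the last query shows one way (the ARRAY(subquery)
constructor, rather than a custom aggregate) to rebuild the ordered array
at query time:

    -- one row per (content, role, position); ordering is explicit
    create table content_person (
        contentid integer not null,
        role      text    not null,   -- 'author', 'cpholder', 'maintainer'
        position  integer not null,
        personid  integer not null,
        primary key (contentid, role, position));

    create index content_person_personid on content_person (personid);

    -- all content a person authored, in order, via a plain btree index
    SELECT c.*
      from content c
      join content_person cp on cp.contentid = c.contentid
     where cp.personid = 42 and cp.role = 'author'
     order by cp.position;

    -- rebuild the ordered array on the fly when the client wants one
    SELECT ARRAY(SELECT personid
                   from content_person
                  where contentid = c.contentid and role = 'author'
                  order by position) as authors
      from content c;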
On Mon, Jul 26, 2004 at 02:27:20AM -0400, Greg Stark wrote:
>
> "Ross J. Reedstrom" <reedstrm@rice.edu> writes:
>
> > In the new schema, the same thing is:
> >
> > SELECT * from content where 42 = ANY (authors);
> >
> > Works fine, but for the life of me I can't find nor figure out how to
> > build an index that will be used to speed this along. Any ideas?
>
> Well that's basically the problem with denormalized data like this.
>
> Have you resolved what you're going to do if two sessions try to add a
> user to the same group at the same time? Or how you'll go about removing
> a user from all his groups in one shot?

We've got plenty of interlocks in the middleware to handle the first
(mainly because this is an authoring system where everyone has to agree to
participate, and acknowledge the open license on the materials). Second,
they _can't_ be removed: we're effectively a write-only archive. Even if
we weren't, it would be a rare event and could go slowly (loop over groups
in the middleware, probably).

> Basically, if you denormalize in this fashion it becomes hard to use the
> groups as anything but single monolithic objects. Whereas normalized
> data can be queried and updated from other points of view like in the
> case you name above.

These groups _really are_ ideal for Joe Conway's work on arrays: we need
ordered vectors, so we'd be sorting all the time otherwise. They're
static, and they're read-only. The one thing they're not is fixed, known
size (sorry Merlin). They work fine for the query as shown: the only issue
is performance.

> Postgres does have a way to do what you ask, though. It involves GiST
> indexes and the operators from the contrib/intarray directory from the
> Postgres source.

Well, yes, that's how it used to be done. I figured the new array support
should be able to handle it without the addon, however.

> However I warn you in advance that this is fairly esoteric stuff and
> will take some time to get used to. And at least in my case I found the
> indexes didn't actually help much for my data sets, probably because
> they just weren't big enough to benefit.

I know that they should help in this case: we've got lots of content. Any
particular author or maintainer will be in a small fraction of those,
i.e. it's ideal for an index. And the current joined case uses an index,
when it's available. I'll take a look at the GiST/contrib work, anyway.

Thanks -

Ross
--
Ross Reedstrom, Ph.D.                            reedstrm@rice.edu
Research Scientist                             phone: 713-348-6166
The Connexions Project   http://cnx.rice.edu     fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE
> > > SELECT * from content where 42 = ANY (authors);

> > Postgres does have a way to do what you ask, though. It involves GiST
> > indexes and the operators from the contrib/intarray directory from the
> > Postgres source.

I have tried to use these indexes, and the performance was very good. It
can be faster (in fact much faster) than a join with an additional table,
because you don't have a join. The SQL array syntax is a pain, though.
"Ross J. Reedstrom" <reedstrm@rice.edu> writes: > These groups _really are_ ideal for Joe Conway's work on arrays: we need > ordered vectors, so we'd be sorting all the time, otherwise. They're > static, and they're read only. The one thing they're not is fixed, known > size (Sorry Merlin). They work fine for the query as shown: the only > issue is performance. Well just as long as you understand the trade-offs. Denormalizing can be useful but you need to know what flexibility you're losing too. > > Postgres does have a way to do what you ask, though. It involves GiST > > indexes and the operators from the contrib/intarray directory from the > > Postgres source. > > Well, yes, that's how it used to be done. I figured the new array > support should be able to handle it without the addon, however. I think you can btree index arrays now, which is new, but it's not useful for the kind of lookup you're doing. It would only be useful for joining on array types or looking for groups with given content, or things like that. > > However I warn you in advance that this is fairly esoteric stuff and > > will take some time to get used to. And at least in my case I found > > the indexes didn't actually help much for my data sets, probably > > because they just weren't big enough to benefit. > > I know that they should help in this case: we've got lots of content. > Any particular author or maintainter will be in a small fraction of > those. i.e.: it's ideal for an index. And the current joined case uses > an index, when it's available. I'll take a look at the GiST/contrib work, > anyway. I would be curious to know how it goes. My own project uses denormalized sets stored as arrays as well, though in my case they're precalculated from the fully normalized data. I tried to use GiST indexes but ran into problems combining the btree-GiST code with array GiST code in a multicolumn index. I still don't really know why it failed, but after two days building the index I gave up. -- greg
Greg Stark <gsstark@mit.edu> writes:

> I would be curious to know how it goes. My own project uses denormalized
> sets stored as arrays as well, though in my case they're precalculated
> from the fully normalized data. I tried to use GiST indexes but ran into
> problems combining the btree-GiST code with the array GiST code in a
> multicolumn index. I still don't really know why it failed, but after
> two days building the index I gave up.

Sounds like a bug to me. Could you put together a test case?

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > I still don't really know why it failed, but after two days building
> > the index I gave up.
>
> Sounds like a bug to me. Could you put together a test case?

At the time I contacted one of the GiST authors and we went over things
for a while. They diagnosed the problem as being caused by having a
poor-selectivity GiST btree as the leading column in the index. He seemed
to think this was fairly fundamental and wasn't something they were going
to be able to address. And I was fairly certain I didn't want to turn the
index upside down to have the more selective columns first (as is usual)
for various reasons. So I gave it up as a lost cause.

In any case, in my application it was unlikely to really help. I expect
that leading btree index to narrow the search to only a few hundred or few
thousand records in the normal case, so the access times are already
within reason even having to dig through all the records. And since other
queries are likely to need other records from that set, I'll need them all
in cache eventually. There are a lot of array columns to search through,
so the added I/O to read all those indexes would probably be a net loss
when they push other things out of cache.

I could try setting up a test case, but I think all it took was having a
btree-gist index that was insufficiently selective. In my case the leading
column had about 900 distinct integer values, each matching on the order
of 100-1000 records.

--
greg
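A rough reconstruction of the kind of multicolumn index Greg describes,
assuming both contrib/btree_gist (to give the scalar column a GiST
opclass) and contrib/intarray are installed; the table and column names
are entirely hypothetical, with the leading integer column playing the
poorly-selective role he mentions:

    -- btree_gist supplies the GiST opclass for the leading int column,
    -- intarray supplies gist__int_ops for the array column
    create index t_cat_members_gist on t
        using gist (category_id, members gist__int_ops);

    -- intended use: narrow by category, then test array membership
    SELECT * from t where category_id = 7 and members && '{42}'::integer[];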