Thread: What is the best way to do attribute/values?
Hi list,

I have a database with 1M "people" in it. Each person has about 20 attributes, such as height, weight, eye color, etc. I need to be able to search for people based on these attributes. A search can be conducted on one attribute, all attributes, or any number in between. How would _you_ do this?

I have already attempted to answer this. My attempts are detailed here:

http://sh.nu/email.txt

This is the email I was originally going to send to this list. Since it's so large, I decided to link to it instead. If you feel that it belongs in a post to the list, let me know, and I'll post again.

I've discussed these attempts with people in #postgresql on irc.freenode.net. Agliodbs (I presume you know who this is) was very helpful, but in the end was at a loss. I find myself in the same position at this time. He suggested I contact this list.

My ultimate goal is performance. This _must_ be fast. And by fast, I mean < 1 second, for every permutation of the number of attributes searched for. Flexibility would be a bonus, but at this point I'll settle for something that's harder to maintain if I can get the speed gain I need.

Thanks,

Daniel Ceregatti
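For reference, a minimal sketch of the kind of attribute/value ("EAV") layout under discussion. The table name `person_attributes_vertical` follows the naming used later in the thread, but the exact DDL here is an assumption, not Daniel's actual schema:

```sql
-- Hypothetical EAV layout (a sketch; not the poster's actual DDL).
create table attributes (
    attribute_id  integer primary key,
    name          text not null          -- e.g. 'height', 'eye_color'
);

create table attribute_values (
    value_id      integer primary key,
    attribute_id  integer not null references attributes,
    label         text not null          -- e.g. 'blue', 'brown'
);

-- One row per (person, attribute): roughly 20M rows for 1M people.
create table person_attributes_vertical (
    person_id     integer not null,
    attribute_id  integer not null,
    value_id      integer not null,
    primary key (person_id, attribute_id)
);

-- The index shape discussed later in the thread:
create index pav_attr_val
    on person_attributes_vertical (attribute_id, value_id);
```

With 20 attributes per person, an N-criteria search against this layout means either N self-joins or some rewrite that avoids them, which is what most of the thread explores.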
Folks,

> I've discussed these attempts with people in #postgresql on
> irc.freenode.net. Agliodbs (I presume you know who this is) was very
> helpful, but in the end was at a loss. I find myself in the same position
> at this time. He suggested I contact this list.

There are a couple of issues here to attack:

1) PostgreSQL is not using the most optimal plan. First, it's ignoring the fact that all referenced columns are indexed and only using the first column, then filtering based on the other criteria. Second, testing has shown that a hash join would actually be faster. We've tried upping the statistics, but it doesn't seem to have an effect on the planner's erroneous estimates.

2) Even were it using the most optimal plan, it's still too slow. As you can see from the plan, each merge join takes about 1.5 to 2 seconds. (Hash joins are only about 0.5 seconds slower.) Mysteriously, a big chunk of this time is spent *in between* planner steps, as if there were some hold-up in retrieving the index or table pages. There may be, but Daniel and I have not been able to diagnose the cause. It's particularly mysterious since a filter-and-sort on a *single* criteria set, without a join, takes < 400ms.

Things we've already tried, to avoid going over old ground:

1) increasing statistics;
2) increasing sort_mem (to 256MB, which is overkill);
3) testing on 8.0 beta, which does not affect the issue.

At this point I'm looking for ideas. Suggestions, anyone?

-- Josh Berkus, Aglio Database Solutions, San Francisco
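For anyone retracing these steps, the knobs mentioned above look roughly like this. The values are illustrative; `sort_mem` (the pre-8.0 name for what later became `work_mem`) is set in kilobytes:

```sql
-- Raise the statistics target on the misestimated columns, then
-- re-analyze (1000 is the maximum target in this era of PostgreSQL):
alter table person_attributes_vertical
    alter column attribute_id set statistics 1000;
alter table person_attributes_vertical
    alter column value_id set statistics 1000;
analyze person_attributes_vertical;

-- Per-sort memory for this session; 262144 kB = the 256MB tried above:
set sort_mem = 262144;
```

As noted, neither change fixed the planner's estimates here, but they are the standard first moves before reaching for query rewrites.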
Daniel Ceregatti wrote:
> Hi list,
>
> I have a database with 1M "people" in it. Each person has about 20
> attributes, such as height, weight, eye color, etc. I need to be able to
> search for people based on these attributes. A search can be conducted
> on one attribute, all attributes, or any number in between. How would
> _you_ do this?
>
> I have already attempted to answer this. My attempts are detailed here:
>
> http://sh.nu/email.txt

Hmm... interesting. Shot in the dark - try a tsearch2 full-text index. Your problem could be translated into searching strings of the form "hair=black eyes=blue age=117". Not pretty, but might give you the speed you want.

-- Richard Huxton, Archonet Ltd
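To make the suggestion concrete: the idea is to flatten each person's attributes into one text column of made-up tokens and full-text index that. Everything below is a sketch - the table and column names are hypothetical, and single-word tokens like `a1v2` are used instead of `hair=black` because the default text parser may split on punctuation:

```sql
-- Sketch: encode each (attribute_id, value_id) pair as one token,
-- e.g. attribute 1 = value 2 becomes 'a1v2'.
alter table people add column attrs_text text;

update people p
set attrs_text = (
    select array_to_string(
             array(select 'a' || pav.attribute_id || 'v' || pav.value_id
                   from person_attributes_vertical pav
                   where pav.person_id = p.person_id), ' ')
);

-- With the tsearch2 contrib module installed, index and search it
-- (illustrative; check the tsearch2 docs for the exact setup):
-- create index people_attrs_idx on people
--     using gist (to_tsvector(attrs_text));
-- select person_id from people
--     where to_tsvector(attrs_text) @@ to_tsquery('a1v2 & a2v3');
```

The appeal is that the `&` in the tsquery gives AND semantics over all criteria in a single index probe, instead of one self-join per criterion.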
Josh Berkus wrote:
> Things we've already tried to avoid going over old ground:
>
> 1) increasing statistics;
> 2) increasing sort_mem (to 256MB, which is overkill)
> 3) testing on 8.0 beta, which does not affect the issue.
>
> At this point I'm looking for ideas. Suggestions, anyone?

With respect to query design: consider, instead of:

select
    pav1.person_id
from
    person_attributes_vertical pav1,
    person_attributes_vertical pav2
where
    pav1.attribute_id = 1
    and pav1.value_id in (2,3)
    and pav2.attribute_id = 2
    and pav2.value_id in (2,3)
    and pav1.person_id = pav2.person_id

try:

select
    pav1.person_id
from
    person_attributes_vertical pav1
where
    ( pav1.attribute_id = 1
      and pav1.value_id in (2,3))
    or ( pav1.attribute_id = 2
      and pav1.value_id in (2,3))

I am gambling that the 'or's might be less expensive than the multiple self joins (particularly in the more general cases!).

To make access work well you might want to have *several* concatenated indexes of 2 - 4 attributes, to work around Pg's inability to use more than 1 in a given query. For this query, indexing (attribute_id, value_id) is probably good. Consider playing with 'random_page_cost' and maybe 'effective_cache_size' to encourage the planner to use 'em.

regards

Mark
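The index and planner settings Mark mentions would be set up along these lines (the cost values are guesses to tune per machine, not recommendations):

```sql
-- The index that serves the rewritten OR query:
create index pav_attr_val
    on person_attributes_vertical (attribute_id, value_id);

-- Nudge the planner toward index scans: lower random_page_cost if the
-- working set is mostly cached, and tell it how much cache exists.
set random_page_cost = 2;          -- default is 4
set effective_cache_size = 100000; -- in 8kB pages, i.e. ~800MB
```

Lowering `random_page_cost` makes index access look cheaper relative to sequential scans, which is realistic when the 1M-row table fits in RAM.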
On Aug 25, 2004, at 4:22 AM, Mark Kirkwood wrote:
> select
>     pav1.person_id
> from
>     person_attributes_vertical pav1
> where
>     ( pav1.attribute_id = 1
>       and pav1.value_id in (2,3))
>     or ( pav1.attribute_id = 2
>       and pav1.value_id in (2,3))

You know... it may help if you toss in a group by, i.e.

select
    pav1.person_id, count(*)
from
    person_attributes_vertical pav1
where
    (pav1.attribute_id = 1 and pav1.value_id in (2,3))
    or ( ... )
    or ( ... )
group by pav1.person_id
order by count(*) desc

That should give you the person_id's that matched the most criteria. I've used similar things before, now that I've thought about it.

If you want an exact match you could put "having count(*) = $myNumAttributes" in there too. By definition an exact match would satisfy that condition. It has an added side effect of producing "closest matches" when an exact match cannot be found... granted, you may not want that for a dating site :)

"You asked for a blond female, blue eyes... but I couldn't find any... but I *DID* find a brown haired male with brown eyes! Is that good enough?"

-- Jeff Trout <jeff@jefftrout.com> http://www.jefftrout.com/ http://www.stuarthamm.net/
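Jeff's exact-match variant, written out in full for two criteria - this is the classic way to recover AND semantics from a single scan of the vertical table (sometimes called relational division). The constant in the HAVING clause must equal the number of OR'd criteria:

```sql
select
    pav1.person_id
from
    person_attributes_vertical pav1
where
    (pav1.attribute_id = 1 and pav1.value_id in (2,3))
    or (pav1.attribute_id = 2 and pav1.value_id in (2,3))
group by pav1.person_id
having count(*) = 2          -- matched every one of the 2 criteria
order by count(*) desc;
```

One caveat: `count(*) = 2` assumes at most one row per (person_id, attribute_id), so a person cannot satisfy the same criterion twice; the primary key on the vertical table should guarantee that.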
Hi,

On Aug 25, 2004, at 4:22 AM, Mark Kirkwood wrote:
> select
>     pav1.person_id
> from
>     person_attributes_vertical pav1
> where
>     ( pav1.attribute_id = 1
>       and pav1.value_id in (2,3))
>     or ( pav1.attribute_id = 2
>       and pav1.value_id in (2,3))
[...]

Why not combine attribute_id and value_id? Then you have nothing but an OR (or IN). It should, AFAICS, give you much better selectivity on your indexes: there will be a lot of attributes with the same ID, and there will also be a lot of attributes with the same value, but there should be far fewer attributes with a specific combination of (ID/Value). Right now I think it would be very hard to determine which field has better selectivity: attribute_id or value_id.

The combined attribute/value field could be an int8 or so, where the upper 4 bytes are for attribute_id and the lower 4 bytes for value_id. Depending on the number of attributes and possible values, a smaller datatype and/or a different split can be made. A smaller datatype will result in faster access.

What difference does that make?

regards,

--Tim
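A sketch of Tim's combined key, assuming attribute_id and value_id each fit in 32 bits; the column and index names are illustrative:

```sql
-- Pack (attribute_id, value_id) into one int8 column:
-- attr_val = attribute_id * 2^32 + value_id
alter table person_attributes_vertical
    add column attr_val int8;

update person_attributes_vertical
    set attr_val = (attribute_id::int8 * 4294967296) + value_id;

create index pav_attr_val_combined
    on person_attributes_vertical (attr_val);

-- The two-criteria search becomes one IN list over the packed keys:
-- (1,2) (1,3) (2,2) (2,3)  ->  1*2^32+2, 1*2^32+3, 2*2^32+2, 2*2^32+3
select person_id
from person_attributes_vertical
where attr_val in (4294967298, 4294967299, 8589934594, 8589934595);
```

Note that, as Josh points out in the next message, an IN list on its own still gives OR semantics; for AND semantics this would need to be combined with something like Jeff's GROUP BY / HAVING trick.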
Mark, Tim,

> select
>     pav1.person_id
> from
>     person_attributes_vertical pav1
> where
>     ( pav1.attribute_id = 1
>       and pav1.value_id in (2,3))
>     or ( pav1.attribute_id = 2
>       and pav1.value_id in (2,3))

Not the same query, sorry. Daniel's query yields all the person_id's which have criteria A AND criteria B. Yours gives all the person_id's which have criteria A OR criteria B.

> There will be a lot of attributes with the same ID; there will also be a
> lot of attributes with the same value. However, there should be much less
> attributes with a specific combination of (ID/Value). Right now I think it
> will be very hard to determine which field has a better selectivity:
> attribute_id or value_id.

Given that there is already an index on (attribute_id, value_id), I don't quite see what difference this makes. Unless you're suggesting this as a workaround for the PG planner's poor use of that index?

-- Josh Berkus, Aglio Database Solutions, San Francisco
Two more unusual suggestions:

1. Drop all the indexes and just do sequential scans (full table scans), aiming as hard as possible to get the whole people table to fit in memory (its size says that should be easy - various ways) - and make sure you're using 8.0 so you have the best cache manager. This will at least give you consistent performance regardless of which attribute values are searched on in user queries. Dropping all the indexes will also allow the query to optimize faster, since there is only one path choice.

Work out how many attributes it takes to reduce the list of candidates to a manageable number, and include only those factors in the table, effectively vertically partitioning it, thereby reducing the volume and increasing scan speed. Then redesign the user interface so that users see a two-stage process: the first stage searches the top N common attributes, and the second stage further reduces the result using the rarer attributes, as well as using the output from the first table to index into the second. Users then won't mind the additional wait time.

(Experiment with: horizontally partition the table into N pieces and issue N simultaneous queries to simulate a parallel query. Try N == 2 on your box.)

2. You can try forcing the use of a Star Join optimization here: concatenate the attribute values into a single column, then index it. This will be nearly unique. Cluster the table on that index. Permute the values of the requested attributes to produce the list of concatenated keys that would match, then join that list to the main table via the index. You can do this permutation by using a reference table per attribute, doing an unconstrained product join between all of the attribute tables (avoid any indexes on them), and assembling the candidate keys into a single temporary table. Then join the temp table to the main people table. This will only work effectively if people's attributes are selected with some discrimination, otherwise this optimisation will fail.
You'd need to constrain the user interface to "pick 20 out of the following 100 attribute values", or some other heuristic to ensure a relatively low count, or use a LIMIT on the query into the temp table. This sounds long winded, but it is essentially the route the Oracle optimizer takes in performing a star join... you clearly know your Oracle, so look that up to confirm what I'm saying. (May not work as well if you use a sub-select on PostgreSQL....)

Also, I'd categorise the Age, Height, Weight and Salary attributes, and anything else like them, into their most common ranges, so each becomes an equality search on an integer assigned to the category, rather than a > search.

Order by the distance, don't search on it; it'll be quicker since you'll only need to calculate it for the records that match... and even if you do get a few too many, it would be a shame to reject somebody because they lived 1 mile outside of the stated radius.

The database sounds < 1 Gb in total logical volume, so 4Gb of RAM should be easily sufficient.

Best Regards,

Simon Riggs

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Daniel
> Ceregatti
> Sent: 19 August 2004 19:03
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] What is the best way to do attribute/values?
>
> Hi list,
>
> I have a database with 1M "people" in it. Each person has about 20
> attributes, such as height, weight, eye color, etc. I need to be able to
> search for people based on these attributes. A search can be conducted
> on one attribute, all attributes, or any number in between. How would
> _you_ do this?
>
> I have already attempted to answer this. My attempts are detailed here:
>
> http://sh.nu/email.txt
>
> This is the email I was originally going to send to this list. Since
> it's so large, I decided to link to it instead. If you feel that it
> belongs in a post to the list, let me know, and I'll post again.
> I've discussed these attempts with people in #postgresql on
> irc.freenode.net. Agliodbs (I presume you know who this is) was very
> helpful, but in the end was at a loss. I find myself in the same position
> at this time. He suggested I contact this list.
>
> My ultimate goal is performance. This _must_ be fast. And by fast, I
> mean < 1 second, for every permutation of the number of attributes
> searched for. Flexibility would be a bonus, but at this point I'll
> settle for something that's harder to maintain if I can get the speed
> gain I need.
>
> Thanks,
>
> Daniel Ceregatti

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
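A rough sketch of Simon's star-join permutation, under the assumption of small per-attribute reference tables; every name here is hypothetical, and the pre-8.0 `CLUSTER index ON table` syntax is used:

```sql
-- People table carrying a concatenated, nearly-unique attribute key,
-- e.g. attr_key = hair || ':' || eyes || ':' || age_band.
create index people_attr_key_idx on people (attr_key);
cluster people_attr_key_idx on people;

-- One small reference table per attribute, holding only the values
-- the user asked for:
create temp table want_hair (v text);   -- e.g. 'blond', 'brown'
create temp table want_eyes (v text);   -- e.g. 'blue'
create temp table want_age  (v text);   -- e.g. '20-25'

-- The unconstrained product join permutes all candidate keys:
create temp table candidate_keys as
select h.v || ':' || e.v || ':' || a.v as attr_key
from want_hair h, want_eyes e, want_age a;

-- Finally, join the (small) key list to the main table via the index:
select p.person_id
from candidate_keys k
join people p on p.attr_key = k.attr_key;
```

The candidate-key count is the product of the per-attribute value counts, which is why Simon warns that the UI must keep the number of selected values per attribute low.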
Josh Berkus wrote:
> Mark, Tim,
>
>> select
>>     pav1.person_id
>> from
>>     person_attributes_vertical pav1
>> where
>>     ( pav1.attribute_id = 1
>>       and pav1.value_id in (2,3))
>>     or ( pav1.attribute_id = 2
>>       and pav1.value_id in (2,3))
>
> Not the same query, sorry. Daniel's query yields all the person_id's which
> have criteria A AND criteria B. Yours gives all the person_id's which have
> criteria A OR criteria B.

Apologies, not thinking clearly enough there... Maybe try out intersection:

select
    pav1.person_id
from
    person_attributes_vertical pav1
where
    ( pav1.attribute_id = 1
      and pav1.value_id in (2,3))
intersect
select
    pav1.person_id
from
    person_attributes_vertical pav1
where
    ( pav1.attribute_id = 2
      and pav1.value_id in (2,3))

In the event that is unhelpful, I wonder about simplifying the situation and investigating how

select
    pav1.person_id
from
    person_attributes_vertical pav1
where
    pav1.attribute_id = 1

performs, compared to

select
    pav1.person_id
from
    person_attributes_vertical pav1
where
    ( pav1.attribute_id = 1
      and pav1.value_id in (2,3))

If the first performs ok and the second does not, it may be possible to get better times by doing some horrible re-writes, e.g.:

select
    pav1.person_id
from
    person_attributes_vertical pav1
where
    ( pav1.attribute_id = 1
      and pav1.value_id||null in (2,3))

etc.

regards

Mark