Thread: What is the best way to do attribute/values?

What is the best way to do attribute/values?

From
Daniel Ceregatti
Date:
Hi list,

I have a database with 1M "people" in it. Each person has about 20
attributes, such as height, weight, eye color, etc. I need to be able to
search for people based on these attributes. A search can be conducted
on one attribute, all attributes, or any number in between. How would
_you_ do this?

I have already attempted to answer this. My attempts are detailed here:

http://sh.nu/email.txt

This is the email I was originally going to send to this list. Since
it's so large, I decided to link to it instead. If you feel that it
belongs in a post to the list, let me know, and I'll post again.

I've discussed these attempts with people in #postgresql on
irc.freenode.net. Agliodbs (I presume you know who this is) was very
helpful, but in end was at a loss. I find myself in the same postition
at this time. He suggested I contact this list.

My ultimate goal is performance. This _must_ be fast. And by fast, I
mean, < 1 second, for every permutation of the number of attributes
searched for. Flexibility would be a bonus, but at this point I'll
settle for something that's harder to maintain if I can get the speed
gain I need.

Thanks,

Daniel Ceregatti

Re: What is the best way to do attribute/values?

From
Josh Berkus
Date:
Folks,

> I've discussed these attempts with people in #postgresql on
> irc.freenode.net. Agliodbs (I presume you know who this is) was very
> helpful, but in end was at a loss. I find myself in the same postition
> at this time. He suggested I contact this list.

There's a couple of issues here to attack:

1) PostgreSQL is not using the most optimal plan.    First, it's ignoring the
fact that all referenced columns are indexed and only using the first column,
then filtering based on the other criteria.   Second, testing has shown that
a hash join would actually be faster.   We've tried upping the statistics,
but it doesn't seem to have an effect on the planner's erroneous estimates.

2) Even were it using the most optimal plan, it's still to slow.   As you can
see from the plan, each merge join takes about 1.5 to 2 seconds.    (hash
joins are only about 0.5 seconds slower).  Mysteriously, a big chunk of this
time is spent *in bewtween* planner steps, as if there was some hold-up in
retrieving the index or table pages.   There may be, but Daniel and I have
not been able to diagnose the cause.   It's particularly mysterious since a
filter-and-sort on a *single* criteria set, without join, takes < 400ms.

Things we've already tried to avoid going over old ground:
1) increasing statistics;
2) increasing sort_mem (to 256MB, which is overkill)
3) testing on 8.0 beta, which does not affect the issue.

At this point I'm looking for ideas.   Suggestions, anyone?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: What is the best way to do attribute/values?

From
Richard Huxton
Date:
Daniel Ceregatti wrote:
> Hi list,
>
> I have a database with 1M "people" in it. Each person has about 20
> attributes, such as height, weight, eye color, etc. I need to be able to
> search for people based on these attributes. A search can be conducted
> on one attribute, all attributes, or any number in between. How would
> _you_ do this?
>
> I have already attempted to answer this. My attempts are detailed here:
>
> http://sh.nu/email.txt

Hmm... interesting.

Shot in the dark - try a tsearch2 full-text index. Your problem could be
translated into searching strings of the form
   "hair=black eyes=blue age=117"

Not pretty, but might give you the speed you want.
--
   Richard Huxton
   Archonet Ltd

Re: What is the best way to do attribute/values?

From
Mark Kirkwood
Date:

Josh Berkus wrote:

> Things we've already tried to avoid going over old ground:
>
>1) increasing statistics;
>2) increasing sort_mem (to 256MB, which is overkill)
>3) testing on 8.0 beta, which does not affect the issue.
>
>At this point I'm looking for ideas.   Suggestions, anyone?
>
>
>
with respect to query design:

consider instead of:

select
    pav1.person_id
from
    person_attributes_vertical pav1,
    person_attributes_vertical pav2
where
    pav1.attribute_id = 1
    and pav1.value_id in (2,3)
    and pav2.attribute_id = 2
    and pav2.value_id in (2,3)
    and pav1.person_id = pav2.person_id

try:

select
    pav1.person_id
from
    person_attributes_vertical pav1
where
       (    pav1.attribute_id = 1
        and pav1.value_id in (2,3))
    or (    pav1.attribute_id = 2
        and pav1.value_id in (2,3))

I am gambling that the 'or's' might be less expensive than the multiple self joins (particularly in the more general
cases!).

To make access work well you might want to have *several* concatenated indexes of 2 -> 4 attributes - to work around Pg
inabilityto use more than 1 in a given query. 
For this query indexing (attribute_id, value_id) is probably good.

Consider playing with 'random_page_cost' and maybe 'effective_cache_size' to encourage the planner to use 'em.

regards

Mark





Re: What is the best way to do attribute/values?

From
Jeff
Date:
On Aug 25, 2004, at 4:22 AM, Mark Kirkwood wrote:

> select
>     pav1.person_id
> from
>     person_attributes_vertical pav1
> where
>        (    pav1.attribute_id = 1
>         and pav1.value_id in (2,3))
>     or (    pav1.attribute_id = 2
>         and pav1.value_id in (2,3))
>

You know..
It may help if you toss in a group by
ie

select pav1.person_id, count(*) from person_attributes_vertical pav1
where (pav1.attribute_id = 1 and pav1.value_id in (2,3)) or ( ... ) or
(...)
group by pav1.person_id
order by count(*) desc

that should give you the person_id's that matched the most
criteria........
I've used similar things before now that I've thought about it.

If you want an exact match you could put
"having count(*) = $myNumAttributes" in there too.. By definition an
exact match would match that definition..

it has an added side effect of producing "closest matches" when an
exact match cannot be found... granted you may not want that for a
dating site : )

"You asked for a blond female, blue eyes.. but I couldn't find any...
but I *DID* find a brown haired male with brown eyes! Is that good
enough?"

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: What is the best way to do attribute/values?

From
"Leeuw van der, Tim"
Date:
Hi,

On Aug 25, 2004, at 4:22 AM, Mark Kirkwood wrote:

> select
>     pav1.person_id
> from
>     person_attributes_vertical pav1
> where
>        (    pav1.attribute_id = 1
>         and pav1.value_id in (2,3))
>     or (    pav1.attribute_id = 2
>         and pav1.value_id in (2,3))
>

[...]

Why not combine attribute_id and value_id? Then you have nothing but an OR (or IN).

It should, AFAICS, give you much better selectivity on your indexes:

There will be a lot of attributes with the same ID; there will also be a lot of attributes with the same value.
However,there should be much less attributes with a specific combination of (ID/Value). 
Right now I think it will be very hard to determine which field has a better selectivity: attribute_id or value_id.


The combined attribute/value field could be an int8 or so, where the upper 4 bytes are for attribute_id and the lower 4
bytesfor value_id. 
Depending on the number of attributes and possible values a smaller datatype and / or a different split can be made. A
smallerdatatype will result in faster access. 

What difference does that make?

regards,

--Tim

Re: What is the best way to do attribute/values?

From
Josh Berkus
Date:
Mark, Tim,

> select
>     pav1.person_id
> from
>     person_attributes_vertical pav1
> where
>        (    pav1.attribute_id = 1
>         and pav1.value_id in (2,3))
>     or (    pav1.attribute_id = 2
>         and pav1.value_id in (2,3))

Not the same query, sorry.   Daniel's query yields all the person_id's which
have criteria A AND criteria B.   Yours gives all the person_id's which have
criteria A OR criteria B.

> There will be a lot of attributes with the same ID; there will also be a
> lot of attributes with the same value. However, there should be much less
> attributes with a specific combination of (ID/Value). Right now I think it
> will be very hard to determine which field has a better selectivity:
> attribute_id or value_id.

Given that there is already an index on ( attribute_id, value_id ) I don't
quite see what difference this makes.   Unless you're suggesting this as a
workaround for the PG Planner's poor use of the index?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: What is the best way to do attribute/values?

From
"Simon Riggs"
Date:
Two more unusual suggestions:

1. Drop all the indexes and just do sequential scans (full table scans),
aiming as hard as possible to get the whole people table to fit in memory
(size says that should be easy - various ways) - and make sure you're using
8.0 so you have the best cache manager. This will at least give you
consistent performance on whatever attribute values searched on in user
queries. Dropping all the indexes will allow the query to optimize faster,
since it has only one path choice. Work out how many attributes it takes to
reduce the list of candidates to a manageable number, and include only those
factors into the table, effectively vertically partitioning the table,
thereby reducing the volume and increasing scan speed. Then redesign the
user interface so that they see a two-stage process, first stage is top N
common attributes, second stage is to further reduce that down using rarer
attributes, as well as using the output from the first table to index into
the second. Users then won't mind additional wait time.
(Experiment with: Horizontally partition the table into N pieces. Issue N
simultaneous queries to simulate a parallel query. Try N == 2 on your box)

2. You can try forcing the use of a Star Join optimization here:
Concatenate the attribute values into a single column, then index it. This
will be nearly unique. Cluster the table.
Permute the values of the attributes, to give you a list of concatenated
keys that would match, then join that list to the main table, using a join
via the index.
You can do this permutation by using a reference table per attribute, then
doing an unconstrained product join between all of the attribute tables
(avoid any indexes on them) and assembling the candidate keys into a single
temporary table. Then join the temp table to the main people table. This
will only work effectively if people's attributes are selected with some
discrimination, otherwise this optimisation will fail. You'd need to
constrain the user interface to "pick 20 out of the following 100 attribute
values" or some other heuristic to ensure a relatively low count, or use a
LIMIT on the query into the temp table.
This sounds long winded, but is essentially the route the Oracle optimizer
takes in performing a star join....you clearly know you're Oracle, so look
that up to confirm what I'm saying. (May not work as well if you use a
sub-select on PostgreSQL....)

Also, I'd categorise the Age, Height, Weight and Salary attributes and
everything else based upon most common ranges, so it will be just an
equality search on an integer assigned to that category, rather than a >
search. Order by the distance, don't search on it, it'll be quicker since
you'll only need to calculate it for the records that match...even if you do
get a few too many, it would be a shame to avoid somebody because they lived
1 mile outside of the stated radius.

The database sounds < 1 Gb in total logical volume, so 4Gb of RAM should be
easily sufficient.

Best Regards, Simon Riggs

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Daniel
> Ceregatti
> Sent: 19 August 2004 19:03
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] What is the best way to do attribute/values?
>
>
> Hi list,
>
> I have a database with 1M "people" in it. Each person has about 20
> attributes, such as height, weight, eye color, etc. I need to be able to
> search for people based on these attributes. A search can be conducted
> on one attribute, all attributes, or any number in between. How would
> _you_ do this?
>
> I have already attempted to answer this. My attempts are detailed here:
>
> http://sh.nu/email.txt
>
> This is the email I was originally going to send to this list. Since
> it's so large, I decided to link to it instead. If you feel that it
> belongs in a post to the list, let me know, and I'll post again.
>
> I've discussed these attempts with people in #postgresql on
> irc.freenode.net. Agliodbs (I presume you know who this is) was very
> helpful, but in end was at a loss. I find myself in the same postition
> at this time. He suggested I contact this list.
>
> My ultimate goal is performance. This _must_ be fast. And by fast, I
> mean, < 1 second, for every permutation of the number of attributes
> searched for. Flexibility would be a bonus, but at this point I'll
> settle for something that's harder to maintain if I can get the speed
> gain I need.
>
> Thanks,
>
> Daniel Ceregatti
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster


Re: What is the best way to do attribute/values?

From
Mark Kirkwood
Date:
Josh Berkus wrote:

>Mark, Tim,
>
>
>
>>select
>>    pav1.person_id
>>from
>>    person_attributes_vertical pav1
>>where
>>       (    pav1.attribute_id = 1
>>        and pav1.value_id in (2,3))
>>    or (    pav1.attribute_id = 2
>>        and pav1.value_id in (2,3))
>>
>>
>
>Not the same query, sorry.   Daniel's query yields all the person_id's which
>have criteria A AND criteria B.   Yours gives all the person_id's which have
>criteria A OR criteria B.
>
>
>
Apologies, not thinking clearly enough there...


Maybe try out intersection :


select
    pav1.person_id
from
    person_attributes_vertical pav1
where
       (    pav1.attribute_id = 1
        and pav1.value_id in (2,3))
intersect
select
    pav1.person_id
from
    person_attributes_vertical pav1
where (    pav1.attribute_id = 2
        and pav1.value_id in (2,3))


In the advent that is unhelpful, I wonder about simplifying the
situation and investigating how


select
    pav1.person_id
from
    person_attributes_vertical pav1
where
       pav1.attribute_id = 1


performs, compared to


select
    pav1.person_id
from
    person_attributes_vertical pav1
where
       (    pav1.attribute_id = 1
        and pav1.value_id in (2,3))


If the first performs ok and the second does not, It may be possible to
get better times by doing some horrible re-writes :e.g:


select
    pav1.person_id
from
    person_attributes_vertical pav1
where
       (    pav1.attribute_id = 1
        and pav1.value_id||null in (2,3))


etc.


regards

Mark