Thread: What's the best way in Postgres9 to store a bunch of arbitrary boolean values for a row?

I have a database full of recipes, one recipe per row.  I need to
store a bunch of arbitrary "flags" for each recipe to mark various
properties such as Gluton-Free, No meat, No Red Meat, No Pork, No
Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and
Low Carb.  Users need to be able to search for recipes that contain
one or more of those flags by checking checkboxes in the UI.

I'm searching for the best way to store these properties in the
Recipes table.  My ideas so far:

1. Have a separate column for each property and create an index on
each of those columns.  I may have upwards of about 20 of these
properties, so I'm wondering if there's any drawbacks with creating a
whole bunch of BOOL columns on a single table.
2. Use a bitmask for all properties and store the whole thing in one
numeric column that contains the appropriate number of bits.  Create a
separate index on each bit so searches will be fast.
3. Create an ENUM with a value for each tag, then create a column that
has an ARRAY of that ENUM type.  I believe an ANY clause on an array
column can use an INDEX, but have never done this.
4. Create a separate table that has a one-to-many mapping of recipes
to tags.  Each tag would be a row in this table.  The table would
contain a link to the recipe, and an ENUM value for which tag is "on"
for that recipe.  When querying, I'd have to do a nested SELECT to
filter out recipes that didn't contain at least one of these tags.  I
think this is the more "normal" way of doing this, but it does make
certain queries more complicated - If I want to query for 100 recipes
and also display all their tags, I'd have to use an INNER JOIN and
consolidate the rows, or use a nested SELECT and aggregate on the fly.

Write performance is not too big of an issue here since recipes are
added by a backend process, and search speed is critical (there might
be a few hundred thousand recipes eventually).  I doubt I will add new
tags all that often, but I want it to be at least possible to do
without major headaches.

Thanks!

On 12/05/2011 12:10 PM, Mike Christensen wrote:
> I have a database full of recipes, one recipe per row.  I need to
> store a bunch of arbitrary "flags" for each recipe to mark various
> properties such as Gluton-Free, No meat, No Red Meat, No Pork, No
> Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and
> Low Carb.  Users need to be able to search for recipes that contain
> one or more of those flags by checking checkboxes in the UI.
>
> I'm searching for the best way to store these properties in the
> Recipes table.
I'd use hstore to store them as tags. You can then use hstore's GiST
index support to get quick lookups.
> 1. Have a separate column for each property and create an index on
> each of those columns.  I may have upwards of about 20 of these
> properties, so I'm wondering if there's any drawbacks with creating a
> whole bunch of BOOL columns on a single table.
It'll get frustrating as you start adding new categories, and will drive
you insane as soon as you want to let the user define their own
categories - which you will land up wanting to do in your problem space.
I'd avoid it.
> 2. Use a bitmask for all properties and store the whole thing in one
> numeric column that contains the appropriate number of bits.  Create a
> separate index on each bit so searches will be fast.
Same as above, it'll get annoying to manage when you want user tagging.
> 3. Create an ENUM with a value for each tag, then create a column that
> has an ARRAY of that ENUM type.  I believe an ANY clause on an array
> column can use an INDEX, but have never done this.
Same again.
> 4. Create a separate table that has a one-to-many mapping of recipes
> to tags.  Each tag would be a row in this table.  The table would
> contain a link to the recipe, and an ENUM value for which tag is "on"
> for that recipe.  When querying, I'd have to do a nested SELECT to
> filter out recipes that didn't contain at least one of these tags.  I
> think this is the more "normal" way of doing this, but it does make
> certain queries more complicated - If I want to query for 100 recipes
> and also display all their tags, I'd have to use an INNER JOIN and
> consolidate the rows, or use a nested SELECT and aggregate on the fly.
That'll get slow. It'll work and is IMO better than all the other
options you suggested, but I'd probably favour hstore over it.

--
Craig Ringer

>> I have a database full of recipes, one recipe per row.  I need to
>> store a bunch of arbitrary "flags" for each recipe to mark various
>> properties such as Gluton-Free, No meat, No Red Meat, No Pork, No
>> Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and
>> Low Carb.  Users need to be able to search for recipes that contain
>> one or more of those flags by checking checkboxes in the UI.
>>
>> I'm searching for the best way to store these properties in the
>> Recipes table.
>
> I'd use hstore to store them as tags. You can then use hstore's GiST index
> support to get quick lookups.
>>
>> 1. Have a separate column for each property and create an index on
>> each of those columns.  I may have upwards of about 20 of these
>> properties, so I'm wondering if there's any drawbacks with creating a
>> whole bunch of BOOL columns on a single table.
>
> It'll get frustrating as you start adding new categories, and will drive you
> insane as soon as you want to let the user define their own categories -
> which you will land up wanting to do in your problem space. I'd avoid it.
>>
>> 2. Use a bitmask for all properties and store the whole thing in one
>> numeric column that contains the appropriate number of bits.  Create a
>> separate index on each bit so searches will be fast.
>
> Same as above, it'll get annoying to manage when you want user tagging.
>>
>> 3. Create an ENUM with a value for each tag, then create a column that
>> has an ARRAY of that ENUM type.  I believe an ANY clause on an array
>> column can use an INDEX, but have never done this.
>
> Same again.
>>
>> 4. Create a separate table that has a one-to-many mapping of recipes
>> to tags.  Each tag would be a row in this table.  The table would
>> contain a link to the recipe, and an ENUM value for which tag is "on"
>> for that recipe.  When querying, I'd have to do a nested SELECT to
>> filter out recipes that didn't contain at least one of these tags.  I
>> think this is the more "normal" way of doing this, but it does make
>> certain queries more complicated - If I want to query for 100 recipes
>> and also display all their tags, I'd have to use an INNER JOIN and
>> consolidate the rows, or use a nested SELECT and aggregate on the fly.
>
> That'll get slow. It'll work and is IMO better than all the other options
> you suggested, but I'd probably favour hstore over it.

The hstore module sounds fantastic!

I'm curious as to how these columns are serialized back through the
driver, such as Npgsql.  Do I get the values as strings, such as a
comma delimited key/value pair list?  Or would I need to do some
custom logic to deserialize them?

Right now, I'm using Npgsql as a driver, and NHibernate/Castle
ActiveRecord as an ORM.

Mike

Mike Christensen wrote:
> I have a database full of recipes, one recipe per row.  I need to
> store a bunch of arbitrary "flags" for each recipe to mark various
> properties such as Gluton-Free, No meat, No Red Meat, No Pork, No
> Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and
> Low Carb.  Users need to be able to search for recipes that contain
> one or more of those flags by checking checkboxes in the UI.
>
> I'm searching for the best way to store these properties in the
> Recipes table.  My ideas so far:
>
> 1. Have a separate column for each property and create an index on
> each of those columns.  I may have upwards of about 20 of these
> properties, so I'm wondering if there's any drawbacks with creating a
> whole bunch of BOOL columns on a single table.
> 2. Use a bitmask for all properties and store the whole thing in one
> numeric column that contains the appropriate number of bits.  Create a
> separate index on each bit so searches will be fast.
> 3. Create an ENUM with a value for each tag, then create a column that
> has an ARRAY of that ENUM type.  I believe an ANY clause on an array
> column can use an INDEX, but have never done this.
> 4. Create a separate table that has a one-to-many mapping of recipes
> to tags.  Each tag would be a row in this table.  The table would
> contain a link to the recipe, and an ENUM value for which tag is "on"
> for that recipe.  When querying, I'd have to do a nested SELECT to
> filter out recipes that didn't contain at least one of these tags.  I
> think this is the more "normal" way of doing this, but it does make
> certain queries more complicated - If I want to query for 100 recipes
> and also display all their tags, I'd have to use an INNER JOIN and
> consolidate the rows, or use a nested SELECT and aggregate on the fly.
>
> Write performance is not too big of an issue here since recipes are
> added by a backend process, and search speed is critical (there might
> be a few hundred thousand recipes eventually).  I doubt I will add new
> tags all that often, but I want it to be at least possible to do
> without major headaches.

I would use a boolean column per property and a partial index on the
ones
where the property is selective, i.e. only a small percentage of all
recipes
match the property.

Yours,
Laurenz Albe

Re: What's the best way in Postgres9 to store a bunch of arbitrary boolean values for a row?

From
Filip Rembiałkowski
Date:
Do not fall too easilly into hstore :-)

while it sometimes good and extremely easy to setup, it has some drawbacks
- querying and searching has some limitations (keys/values not easily indexable, horrible syntax)
- storage not optimised (one hstore field = several dozens of boolean columns)
- only text values, no data type validation

I'd recommend option (4) - normalize!

FR


2011/12/5 Mike Christensen <mike@kitchenpc.com>
>> I have a database full of recipes, one recipe per row.  I need to
>> store a bunch of arbitrary "flags" for each recipe to mark various
>> properties such as Gluton-Free, No meat, No Red Meat, No Pork, No
>> Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and
>> Low Carb.  Users need to be able to search for recipes that contain
>> one or more of those flags by checking checkboxes in the UI.
>>
>> I'm searching for the best way to store these properties in the
>> Recipes table.
>
> I'd use hstore to store them as tags. You can then use hstore's GiST index
> support to get quick lookups.
>>
>> 1. Have a separate column for each property and create an index on
>> each of those columns.  I may have upwards of about 20 of these
>> properties, so I'm wondering if there's any drawbacks with creating a
>> whole bunch of BOOL columns on a single table.
>
> It'll get frustrating as you start adding new categories, and will drive you
> insane as soon as you want to let the user define their own categories -
> which you will land up wanting to do in your problem space. I'd avoid it.
>>
>> 2. Use a bitmask for all properties and store the whole thing in one
>> numeric column that contains the appropriate number of bits.  Create a
>> separate index on each bit so searches will be fast.
>
> Same as above, it'll get annoying to manage when you want user tagging.
>>
>> 3. Create an ENUM with a value for each tag, then create a column that
>> has an ARRAY of that ENUM type.  I believe an ANY clause on an array
>> column can use an INDEX, but have never done this.
>
> Same again.
>>
>> 4. Create a separate table that has a one-to-many mapping of recipes
>> to tags.  Each tag would be a row in this table.  The table would
>> contain a link to the recipe, and an ENUM value for which tag is "on"
>> for that recipe.  When querying, I'd have to do a nested SELECT to
>> filter out recipes that didn't contain at least one of these tags.  I
>> think this is the more "normal" way of doing this, but it does make
>> certain queries more complicated - If I want to query for 100 recipes
>> and also display all their tags, I'd have to use an INNER JOIN and
>> consolidate the rows, or use a nested SELECT and aggregate on the fly.
>
> That'll get slow. It'll work and is IMO better than all the other options
> you suggested, but I'd probably favour hstore over it.

The hstore module sounds fantastic!

I'm curious as to how these columns are serialized back through the
driver, such as Npgsql.  Do I get the values as strings, such as a
comma delimited key/value pair list?  Or would I need to do some
custom logic to deserialize them?

Right now, I'm using Npgsql as a driver, and NHibernate/Castle
ActiveRecord as an ORM.

Mike

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Le 5 décembre 2011 10:04, Albe Laurenz <laurenz.albe@wien.gv.at> a écrit :
> Mike Christensen wrote:
>> I have a database full of recipes, one recipe per row.  I need to
>> store a bunch of arbitrary "flags" for each recipe to mark various
>> properties such as Gluton-Free, No meat, No Red Meat, No Pork, No
>> Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and
>> Low Carb.  Users need to be able to search for recipes that contain
>> one or more of those flags by checking checkboxes in the UI.
>>
>> I'm searching for the best way to store these properties in the
>> Recipes table.  My ideas so far:
>>
>> 1. Have a separate column for each property and create an index on
>> each of those columns.  I may have upwards of about 20 of these
>> properties, so I'm wondering if there's any drawbacks with creating a
>> whole bunch of BOOL columns on a single table.
>> 2. Use a bitmask for all properties and store the whole thing in one
>> numeric column that contains the appropriate number of bits.  Create a
>> separate index on each bit so searches will be fast.
>> 3. Create an ENUM with a value for each tag, then create a column that
>> has an ARRAY of that ENUM type.  I believe an ANY clause on an array
>> column can use an INDEX, but have never done this.
>> 4. Create a separate table that has a one-to-many mapping of recipes
>> to tags.  Each tag would be a row in this table.  The table would
>> contain a link to the recipe, and an ENUM value for which tag is "on"
>> for that recipe.  When querying, I'd have to do a nested SELECT to
>> filter out recipes that didn't contain at least one of these tags.  I
>> think this is the more "normal" way of doing this, but it does make
>> certain queries more complicated - If I want to query for 100 recipes
>> and also display all their tags, I'd have to use an INNER JOIN and
>> consolidate the rows, or use a nested SELECT and aggregate on the fly.
>>
>> Write performance is not too big of an issue here since recipes are
>> added by a backend process, and search speed is critical (there might
>> be a few hundred thousand recipes eventually).  I doubt I will add new
>> tags all that often, but I want it to be at least possible to do
>> without major headaches.
>
> I would use a boolean column per property and a partial index on the
> ones
> where the property is selective, i.e. only a small percentage of all
> recipes
> match the property.

I would like to recommend to have a look at Bloom Filtering:
http://www.sai.msu.su/~megera/wiki/bloom

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


On 12/05/2011 03:31 PM, Mike Christensen wrote:
That'll get slow. It'll work and is IMO better than all the other options
you suggested, but I'd probably favour hstore over it.
The hstore module sounds fantastic!

I'm curious as to how these columns are serialized back through the
driver, such as Npgsql.  Do I get the values as strings, such as a
comma delimited key/value pair list?  Or would I need to do some
custom logic to deserialize them?

It depends on what Npgsql supports, really. The server sends hstore values as text; what the client does with them depends on the client. I don't really do C# and .NET so I'm not the one to turn to for advice on that side. Ideally a hstore would be parsed and converted to a hash map by the database driver. At present I don't know of any that do this natively, though I may well be out of date on this. For PgJDBC there's code around (not AFAIK yet integrated into PgJDBC proper) to do it.

In many (most?) cases you'll want to interact with hstore fields using the hstore-provided types and operators, eg.

SELECT somefield SET hstorecol = hstorecol - "somekey";

See: http://www.postgresql.org/docs/current/static/hstore.html

If you're working via some ORM layer (as it sounds like) you may have to use native queries or explain to it about the hstore types and operators. That's the usual problem when trying to use database-specific not-quite-relational features like hstore through a layer that tries to be db-independent and purely relational. I don't have any experience with Castle ActiveRecord. When I've used hstore with hibernate I've always done it by direct native queries.

--
Craig Ringer

Re: What's the best way in Postgres9 to store a bunch of arbitrary boolean values for a row?

From
"Francisco Figueiredo Jr."
Date:
Currently, Npgsql doesn't support hstore datatype.

It will be sent and received as text from Npgsql.

I hope it helps.



On Mon, Dec 5, 2011 at 09:25, Craig Ringer <ringerc@ringerc.id.au> wrote:
>
> On 12/05/2011 03:31 PM, Mike Christensen wrote:
>
> That'll get slow. It'll work and is IMO better than all the other options
> you suggested, but I'd probably favour hstore over it.
>
> The hstore module sounds fantastic!
>
> I'm curious as to how these columns are serialized back through the
> driver, such as Npgsql.  Do I get the values as strings, such as a
> comma delimited key/value pair list?  Or would I need to do some
> custom logic to deserialize them?
>
>
> It depends on what Npgsql supports, really. The server sends hstore values
> as text; what the client does with them depends on the client. I don't
> really do C# and .NET so I'm not the one to turn to for advice on that side.
> Ideally a hstore would be parsed and converted to a hash map by the database
> driver. At present I don't know of any that do this natively, though I may
> well be out of date on this. For PgJDBC there's code around (not AFAIK yet
> integrated into PgJDBC proper) to do it.
>
> In many (most?) cases you'll want to interact with hstore fields using the
> hstore-provided types and operators, eg.
>
> SELECT somefield SET hstorecol = hstorecol - "somekey";
>
> See: http://www.postgresql.org/docs/current/static/hstore.html
>
> If you're working via some ORM layer (as it sounds like) you may have to use
> native queries or explain to it about the hstore types and operators. That's
> the usual problem when trying to use database-specific not-quite-relational
> features like hstore through a layer that tries to be db-independent and
> purely relational. I don't have any experience with Castle ActiveRecord.
> When I've used hstore with hibernate I've always done it by direct native
> queries.
>
> --
> Craig Ringer



--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://gplus.to/franciscojunior
http://fxjr.blogspot.com
http://twitter.com/franciscojunior