Re: Bit datatype performance? - Mailing list pgsql-general

From Eduardo Piombino
Subject Re: Bit datatype performance?
Date
Msg-id CAGHqW7-ZS773vxYcwy5D0KqdmjX7UJ_1QycM7saReOvQRV-kGw@mail.gmail.com
Whole thread Raw
In response to Re: Bit datatype performance?  (Antonio Vieiro <antonio@antonioshome.net>)
List pgsql-general
Hi, if you are thinking to access data in that manner, what's the point of bits (or tags)?

The idea behind having a value and then using a bitmask is to be able to test the value against different bitmasks, each bitmask corresponding to a different tag or tag combination.

The where statement you are suggesting differs in nothing from a regular id or in this case a category id (instead of a combination of tags)You will be fetching all records that only have a specific mask.

I think you are a little bit confused:
Let's say you have several tags each with an identifier:
TAG_NATURE = 1
TAG_ANIMALS = 2
TAG_CARS = 4
TAG_SPORTS = 8

then you have a record ... the idea to use bits is to be able to assign that record a single value, formed by the combination of the different tags.

For example if an element corresponds to TAG_NATURE and TAG_ANIMALS, you would want to have that element with a value of TAG_NATURE + TAG_ANIMALS resulting in a tag value of 3.

Then if you want to extract all ANIMALS you just do:

... where value & TAG_ANIMALS = TAG_ANIMALS;

because if you just do:

... where value = TAG_ANIMALS

you will only get the elements that exclusively have the tag TAG_ANIMALS. You will miss for instance those that have the NATURE and ANIMALS (or any other tag).

So, your simple index on value willl not be of any help, since you won't be doing

... where value = ANY_SPECIFIC_TAG

because of the latter.

Now, if you are going to have a different "TAG" for every "TAG COMBINATION" well, you can do that, but that would be no different than any regular id, in this case, it would be more of a "CATEGORY", and elements will only be able to have one single category for them.

One alternative would be to try to make some helping indexes on expressions, maybe with the help of a function like:

create or replace function hasTag(data integer, tag integer) returns boolean as $$
declare
begin
    return (data & tag) = tag;
end;
$$ language plpgsql immutable;

-- this function would return
select hasTag(1, 1); -- true
select hasTag(3, 1); -- true
select hasTag(4, 1); -- false

This way, you could reformulate your query in the following fashion:

... where hasTag(value, TAG_NATURE);

and you could now build an index on yourtable based on that expression like:

create index idx_yourtable_hasTag_1 on yourtable (hasTag(value, 1 /* TAG_NATURE */));

If you would like to fetch a combination of tags, you could do:

... where hasTag(value, TAG_NATURE) and hasTag(value, TAG_ANIMALS)

requiring an extra index on (hasTag(value, TAG_ANIMALS)).

In this way, you will end up requiring 256 indexes :) (which can be from acceptable to ridiculous, depending on how much often the indexes should be updated, volume, etc), it's up to you. I'm not actually suggesting you use this approach, it's just a raw idea, and it's just the conclusion of one line of thought, that may or may have not crossed your mind. Maybe with some refinement, you can get to something more practical.

But nonetheless (if I'm not missing something huge), the where statement you provided is just the wrong approach to tags.

hope it helps,
regards,
eduardo


On Wed, Sep 14, 2011 at 12:58 PM, Antonio Vieiro <antonio@antonioshome.net> wrote:
Hi again,

Thanks for the tip. In fact I was thinking of creating an index on the bitmask, so I could use:

... where t.bits = :mymask

directly, avoiding a full table scan. I assume this is possible (indexing bit and comparing bits), isn't it?

Thanks,
Antonio

El 14/09/11 15:58, Radosław Smogura escribió:

On Wed, 14 Sep 2011 12:00:35 +0200, Antonio Vieiro wrote:
Hi all,

One of my entities 'E' may be 'tagged' with an arbitrary set of 256
tags 'T'.

A first approach could be to add a M:N relationship between 'E' and 'T'.

A second way to do this could be to add a BIT(256) datatype to 'E',
setting bits to '1' if the entity is tagged with each one of the 256
tags (i.e. using a 'bitmask' on the set of tags).

Since querying entities 'E' with a certain set of tags 'T' must be
very fast I was wondering if the second approach would be faster. What
do you think?

Thanks for any hints,
Antonio

I assume each entity may have one or more different tags.

Actually performing test like
... where (t.bits & :mymask) = :mymask
should be quite fast and faster then creating additional relations, but
only if it's highly probable that your query will almost always scan
whole table.

The advantage of indexes is that the index is used 1st and tail (slower)
parts of query will always get "subset" of table. In bitset, You will
probably scan whole table.

So I think, you should do some performance test for large number of
data, and compare both ways. I think bitset will be fast for really
small data, but M:N relations may be faster for really large data sets.

You need to measure size of your database too, in M:N case with 256 tags
it may be quite large.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

pgsql-general by date:

Previous
From: Radosław Smogura
Date:
Subject: Re: Bit datatype performance?
Next
From: "Bob Pawley"
Date:
Subject: Arrays