Thread: scoring differences between bitmasks


From
Ben
Date:
I'm looking for a quick way to find the number of bits that are
different between two bitmasks of the same size. I'll be doing this a
lot, so speed is desirable.... some kind of indexing would be even
better. It seems like this problem must have been solved before....

Re: scoring differences between bitmasks

From
Martijn van Oosterhout
Date:
On Sat, Oct 01, 2005 at 07:12:13PM -0700, Ben wrote:
> I'm looking for a quick way to find the number of bits that are
> different between two bitmasks of the same size. I'll be doing this a
> lot, so speed is desirable.... some kind of indexing would be even
> better. It seems like this problem must have been solved before....

Step 1: Use XOR to get the bits that are different.
Step 2: Count the bits.

Something like  x & ((~x) + 1)  will give you the value of the lowest
bit that is set; mask it out and repeat. If you need to do it a lot,
build a table of 256 values and then process 8 bits at a time. Should
be fairly straightforward...
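In Python, that two-step recipe with an 8-bit lookup table might look like this (POPCOUNT8 and hamming are illustrative names, not anything from a library):

```python
# Precompute the number of set bits for every possible byte value.
POPCOUNT8 = [bin(i).count("1") for i in range(256)]

def hamming(a: bytes, b: bytes) -> int:
    """Count differing bits: XOR each byte pair, then table-lookup."""
    return sum(POPCOUNT8[x ^ y] for x, y in zip(a, b))
```

For example, hamming(b"\x00", b"\xff") is 8.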

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


Re: scoring differences between bitmasks

From
Ben
Date:
Yes, that's the straightforward way to do it. But given that my
vectors are 256 bits in length, and that I'm going to eventually have
about 4 million of them to search through, I was hoping greater minds
than mine had figured out how to do it faster, or how to compute some
kind of indexing....... somehow.

On Oct 2, 2005, at 5:39 AM, Martijn van Oosterhout wrote:
>
> Step 1: Use XOR to get the bits that are different.
> Step 2: Count the bits.
>
> Something like  x & ((~x) +1)  will give you the value of the last
> bit that is set, mask it out and repeat. If you need to do it a lot,
> build a table of 256 values and then process 8 bits at a time. Should
> be fairly straight forward...
>
> Hope this helps,
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
>
>


Re: scoring differences between bitmasks

From
Martijn van Oosterhout
Date:
On Sun, Oct 02, 2005 at 08:28:53AM -0700, Ben wrote:
> Yes, that's the straightforward way to do it. But given that my
> vectors are 256 bits in length, and that I'm going to eventually have
> about 4 million of them to search through, I was hoping greater minds
> than mine had figured out how to do it faster, or how compute some
> kind of indexing....... somehow.

Ah right, that's another kettle of fish. Well, your problem space is a
metric space in the sense that the triangle inequality holds,
i.e. dist(A,C) <= dist(A,B) + dist(B,C). Maybe one of the geometric
index classes could help.

Here comes something I thought up myself, but you may be able to come
up with something better.

1. For each string, store its distance from the strings:
0000000000 etc
0101010101 etc
0011001100 etc
maybe more or fewer; you'll need to investigate

2. Index each of these columns.
3. For your target string calculate the distances from each of these
4. Do a query like:

SELECT key, * from table
where dist_zero_string > $dist_zero_search-2 and dist_zero_string < $dist_zero_search+2
  and dist_0101_string > $dist_0101_search-2 and dist_0101_string < $dist_0101_search+2
  and dist_0011_string > $dist_0011_search-2 and dist_0011_string < $dist_0011_search+2
order by dist(key,searchvalue);

PostgreSQL 8.1 has bitmap index scans which could make the above quite
fast. Some theory may help determine the best key strings. For example,
all-zeros and all-ones are redundant, since you can calculate the
distance to one from the distance to the other. You may also need to
vary the tolerance (the 2 above) depending on the average distance to
the key strings and where you search.
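The pruning rule behind that query follows from the triangle inequality: for any reference string R, |dist(v,R) - dist(q,R)| <= dist(v,q), so any vector within d of the query must have a precomputed reference distance within d of the query's. A sketch in Python, with illustrative names throughout (REFS, precompute, candidates are mine, not anything from PostgreSQL):

```python
# Byte-level popcount table, as in the earlier message.
POPCOUNT8 = [bin(i).count("1") for i in range(256)]

def hamming(a: bytes, b: bytes) -> int:
    return sum(POPCOUNT8[x ^ y] for x, y in zip(a, b))

# Three 256-bit reference strings: all zeros, 0101..., 0011...
REFS = [bytes(32), bytes([0b01010101]) * 32, bytes([0b00110011]) * 32]

def precompute(v: bytes):
    # Stored alongside each vector; these are the indexed columns.
    return [hamming(v, r) for r in REFS]

def candidates(vectors, q: bytes, d: int):
    """Yield vectors surviving the triangle-inequality prefilter."""
    qd = precompute(q)
    for v, vd in vectors:  # (vector, its precomputed distances)
        if all(abs(a - b) <= d for a, b in zip(vd, qd)):
            yield v
```

Only the survivors need the full 256-bit comparison, which is what the ORDER BY in the query above does.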

You could use functional indexes to avoid storing the actual values in
the tables. Or you could take two or three of these values and make
points and use an rtree index on those to find nearby points (which
will represent nearby strings).

Anyway, these are just some random thoughts; hope they help,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/


Re: scoring differences between bitmasks

From
Ben
Date:
Hrm, I'll think about it. Thanks!

On Oct 2, 2005, at 12:50 PM, Martijn van Oosterhout wrote:

> On Sun, Oct 02, 2005 at 10:17:10AM -0700, Ben wrote:
>
>> Yeah, I thought about this, but damn my 3-dimensional mind, I just
>> can't convince myself this will work for 256 dimensions. :) Or
>> rather, I can see how it could, *given* good placements of the grid
>> points to snap each vector to. Unfortunately, I don't know enough
>> theory to know what good placements would be.
>>
>
> My intuition tells me that just like real life, you want them to be as
> far apart as possible. That's why I picked those strings. If the
> string
> is length N, all those key strings are distance n/2 apart from
> eachother.
>
> As for proof, I think it will work due to the triangle inequality
> holding true. Try not to worry about the dimensions, if you don't
> assume the number of dimensions, it'll work for any.
>
> My Google search hasn't come up with anything, although searching for
> "binary vector space search" comes with some almost but not quite
> results.
>
> Good luck...
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
>
>


Re: scoring differences between bitmasks

From
Martijn van Oosterhout
Date:
On Sun, Oct 02, 2005 at 10:17:10AM -0700, Ben wrote:
> Yeah, I thought about this, but damn my 3-dimensional mind, I just
> can't convince myself this will work for 256 dimensions. :) Or
> rather, I can see how it could, *given* good placements of the grid
> points to snap each vector to. Unfortunately, I don't know enough
> theory to know what good placements would be.

My intuition tells me that, just like real life, you want them to be as
far apart as possible. That's why I picked those strings. If the string
is length N, all those key strings are distance N/2 apart from
each other.

As for proof, I think it will work because the triangle inequality
holds. Try not to worry about the dimensions: if you don't assume a
number of dimensions, it'll work for any.

My Google search hasn't come up with anything, although searching for
"binary vector space search" turns up some almost-but-not-quite
results.

Good luck...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/


Re: scoring differences between bitmasks

From
Ben
Date:
Sure, but unless I can figure out some way to choose a small number
of vectors, I'm left with computing the full N^2 set. Given that I'm
shooting for N to be 4 million or larger, that's a lot of data to
store.....

On Oct 2, 2005, at 12:14 PM, Todd A. Cook wrote:

> Ben wrote:
>
>> Just the number of bits, not which ones. Basically, the hamming
>> distance.
>>
>
> I see.  Could you pre-compute the bit counts for the vectors in the
> table?
> You could count the bits in the search vector as Martijn suggested,
> and then
> do a lookup based on the count.
>
> -- todd
>
>
>
>


Re: scoring differences between bitmasks

From
Ben
Date:
Yeah, I thought about this, but damn my 3-dimensional mind, I just
can't convince myself this will work for 256 dimensions. :) Or
rather, I can see how it could, *given* good placements of the grid
points to snap each vector to. Unfortunately, I don't know enough
theory to know what good placements would be.

On Oct 2, 2005, at 9:47 AM, Martijn van Oosterhout wrote:

> On Sun, Oct 02, 2005 at 08:28:53AM -0700, Ben wrote:
>
>> Yes, that's the straightforward way to do it. But given that my
>> vectors are 256 bits in length, and that I'm going to eventually have
>> about 4 million of them to search through, I was hoping greater minds
>> than mine had figured out how to do it faster, or how compute some
>> kind of indexing....... somehow.
>>
>
> Ah right, that's another kettle of fish. Well, your problem space is a
> metric space in the sense that the triangle inequality holds,
> i.e. dist(A,C) <= dist(A,B) + dist(B,C). Maybe one of the geometric
> index classes could help.
>
> Here comes something I thought up myself, but you may be able to come
> up with something better.
>
> 1. For each string, store its distance from the strings:
> 0000000000 etc
> 0101010101 etc
> 0011001100 etc
> maybe more or less, you'll need to investigate
>
> 2. Index each of these columns.
> 3. For your target string calculate the distances from each of these
> 4. Do a query like:
>
> SELECT key, * from table
> where dist_zero_string > $dist_zero_search-2 and dist_zero_string < $dist_zero_search+2
>   and dist_0101_string > $dist_0101_search-2 and dist_0101_string < $dist_0101_search+2
>   and dist_0011_string > $dist_0011_search-2 and dist_0011_string < $dist_0011_search+2
> order by dist(key,searchvalue);
>
> PostgreSQL 8.1 has bitmap index scans which could make the above quite
> fast. Some theory may help determine the best key strings. For example
> all-zeros and all-ones are redundant since you can calculate the
> distance for one from the other. You also might need to vary your
> distances to match with (the 2 above) depending on the average
> distance
> to the strings and where you search.
>
> You could use functional indexes to avoid storing the actual values in
> the tables. Or you could take two or three of these values and make
> points and use an rtree index on those to find nearby points (which
> will represent nearby strings).
>
> Anyway, this is just some random thoughts, hope it helps,
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
>
>


Re: scoring differences between bitmasks

From
"Todd A. Cook"
Date:
Ben wrote:
> Just the number of bits, not which ones. Basically, the hamming  distance.

I see.  Could you pre-compute the bit counts for the vectors in the table?
You could count the bits in the search vector as Martijn suggested, and then
do a lookup based on the count.

-- todd




Re: scoring differences between bitmasks

From
Ben
Date:
Just the number of bits, not which ones. Basically, the Hamming
distance.

On Oct 2, 2005, at 11:44 AM, Todd A. Cook wrote:

> Hi,
>
> It may be that I don't understand your problem. :)
>
> Are you searching the table for the closest vector?  If so, is
> "closeness" defined only as the number of bits that are different?
> Or, do you need to know which bits as well?
>
> -- todd
>
>
> Ben wrote:
>
>> Hrm, I don't understand. Can you give me an example with some
>> reasonably sized vectors?
>> On Oct 2, 2005, at 10:59 AM, Todd A. Cook wrote:
>>
>>> Hi,
>>>
>>> Try breaking the vector into 4 bigint columns and building a multi-column
>>> index, with index columns going from the most evenly distributed
>>> to  the
>>> least.  Depending on the distribution of your data, you may only
>>> need 2
>>> or 3 columns in the index.  If you can cluster the table in that
>>> order,
>>> it should be really fast.  (This structure is a tabular form of
>>> a  linked
>>> trie.)
>>>
>>> -- todd
>>>
>>>
>>> Ben wrote:
>>>
>>>
>>>> Yes, that's the straightforward way to do it. But given that
>>>> my   vectors are 256 bits in length, and that I'm going to
>>>> eventually  have  about 4 million of them to search through, I
>>>> was hoping  greater minds  than mine had figured out how to do
>>>> it faster, or  how compute some  kind of indexing....... somehow.
>>>>
>
>


Re: scoring differences between bitmasks

From
"Todd A. Cook"
Date:
Hi,

It may be that I don't understand your problem. :)

Are you searching the table for the closest vector?  If so, is
"closeness" defined only as the number of bits that are different?
Or, do you need to know which bits as well?

-- todd


Ben wrote:
> Hrm, I don't understand. Can you give me an example with some
> reasonably sized vectors?
>
> On Oct 2, 2005, at 10:59 AM, Todd A. Cook wrote:
>
>> Hi,
>>
>> Try breaking the vector into 4 bigint columns and building a multi-column
>> index, with index columns going from the most evenly distributed to  the
>> least.  Depending on the distribution of your data, you may only  need 2
>> or 3 columns in the index.  If you can cluster the table in that  order,
>> it should be really fast.  (This structure is a tabular form of a  linked
>> trie.)
>>
>> -- todd
>>
>>
>> Ben wrote:
>>
>>> Yes, that's the straightforward way to do it. But given that my
>>> vectors are 256 bits in length, and that I'm going to eventually
>>> have  about 4 million of them to search through, I was hoping
>>> greater minds  than mine had figured out how to do it faster, or  how
>>> compute some  kind of indexing....... somehow.


Re: scoring differences between bitmasks

From
Ben
Date:
Hrm, I don't understand. Can you give me an example with some
reasonably sized vectors?

On Oct 2, 2005, at 10:59 AM, Todd A. Cook wrote:

> Hi,
>
> Try breaking the vector into 4 bigint columns and building a multi-column
> index, with index columns going from the most evenly distributed to
> the
> least.  Depending on the distribution of your data, you may only
> need 2
> or 3 columns in the index.  If you can cluster the table in that
> order,
> it should be really fast.  (This structure is a tabular form of a
> linked
> trie.)
>
> -- todd
>
>
> Ben wrote:
>
>> Yes, that's the straightforward way to do it. But given that my
>> vectors are 256 bits in length, and that I'm going to eventually
>> have  about 4 million of them to search through, I was hoping
>> greater minds  than mine had figured out how to do it faster, or
>> how compute some  kind of indexing....... somehow.
>


Re: scoring differences between bitmasks

From
"Todd A. Cook"
Date:
Hi,

Try breaking the vector into 4 bigint columns and building a multi-column
index, with index columns going from the most evenly distributed to the
least.  Depending on the distribution of your data, you may only need 2
or 3 columns in the index.  If you can cluster the table in that order,
it should be really fast.  (This structure is a tabular form of a linked
trie.)
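The split itself is mechanical; a sketch in Python (the function name is mine, and signed=True is an assumption to keep each piece inside PostgreSQL's signed bigint range):

```python
def split_to_bigints(v: bytes):
    """Split a 256-bit vector into four 64-bit pieces for bigint columns."""
    assert len(v) == 32
    # signed=True so the high bit doesn't overflow a signed bigint.
    return [int.from_bytes(v[i:i + 8], "big", signed=True)
            for i in range(0, 32, 8)]
```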

-- todd


Ben wrote:
> Yes, that's the straightforward way to do it. But given that my  vectors
> are 256 bits in length, and that I'm going to eventually have  about 4
> million of them to search through, I was hoping greater minds  than mine
> had figured out how to do it faster, or how compute some  kind of
> indexing....... somehow.

Re: scoring differences between bitmasks

From
"Todd A. Cook"
Date:
Hi,

Sorry for being so obtuse; my 9-year-old has been helping me with my
work today. :)

The Hamming distance between two bit vectors can also be found as the
sum of distances between sub-vectors.  Let's assume you're looking
for all vectors less than d bits away from a search vector, and we'll
divide the vectors into k-bit subvectors.  The value of k will depend
on d, how sparse your data is, and how much memory you have for lookup
tables.  The goal is to eliminate the vectors that are more than d
bits away.  (This assumes that most of the search vectors are more
than d bits away; for the opposite case, just reverse the sense of
the comparisons below.)

Compute the set of k-bit vectors that are less than d away from the first k bits
of the search vector, using a lookup table if feasible.  Find the
set of full vectors whose initial k-vector is in this set.  For
those whose distance is d-1, the remaining bits have to be identical,
which is easy to check.  Repeat for the second set of k-bit subvectors,
whose distance has to be less than d-1, then d-2, etc.  You
can do these searches using the linked tries I mentioned earlier.
(I've described the searches as happening sequentially, but that
needn't be the case.  You can search the subvectors simultaneously
and then filter the results at the client.)

You don't have to use constant length subvectors.  You can start
with small subvectors to keep the size of the search sets down,
and then increase the vector length as the set of remaining
possibilities shrinks.  Also, you don't have to work from one
end of the full vector to the other; start with a subvector where
you expect a lot of variability.
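The first elimination step can be sketched like so (k, d, and all names are illustrative, and this does the pruning in memory rather than through a trie or index):

```python
# Byte-level popcount table, as earlier in the thread.
POPCOUNT8 = [bin(i).count("1") for i in range(256)]

def hamming(a: bytes, b: bytes) -> int:
    return sum(POPCOUNT8[x ^ y] for x, y in zip(a, b))

def within_d(vectors, q: bytes, d: int, k: int = 8):
    """Keep only vectors within d bits of q, pruning on k-bit prefixes."""
    out = []
    nbytes = k // 8
    for v in vectors:
        # Cheap test on the first k bits; skip the full comparison
        # when the prefix alone already exceeds the budget.
        if hamming(v[:nbytes], q[:nbytes]) > d:
            continue
        if hamming(v, q) <= d:
            out.append(v)
    return out
```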

I think this method is similar in spirit, if not detail, to what
Martijn suggested.

I've actually implemented this approach before, though using my
own file-based lookups rather than a real database.  I had a set
of 128-bit vectors used to filter an input stream; most of the
comparisons never looked past 64 bits.

-- todd


Ben wrote:
> Sure, but unless I can figure out some way to choose a small number  of
> vectors, I'm left with computing the full N^2 set. Given that I'm
> shooting for N to be 4 million or larger, that's a lot of data to
> store.....
>
> On Oct 2, 2005, at 12:14 PM, Todd A. Cook wrote:
>
>> Ben wrote:
>>
>>> Just the number of bits, not which ones. Basically, the hamming
>>> distance.
>>>
>>
>> I see.  Could you pre-compute the bit counts for the vectors in the
>> table?
>> You could count the bits in the search vector as Martijn suggested,
>> and then
>> do a lookup based on the count.