Thread: scoring differences between bitmasks
I'm looking for a quick way to find the number of bits that are different between two bitmasks of the same size. I'll be doing this a lot, so speed is desirable.... some kind of indexing would be even better. It seems like this problem must have been solved before....
On Sat, Oct 01, 2005 at 07:12:13PM -0700, Ben wrote:
> I'm looking for a quick way to find the number of bits that are
> different between two bitmasks of the same size. I'll be doing this a
> lot, so speed is desirable.... some kind of indexing would be even
> better. It seems like this problem must have been solved before....

Step 1: Use XOR to get the bits that are different.
Step 2: Count the bits.

Something like x & ((~x) + 1) will give you the value of the lowest bit
that is set; mask it out and repeat. If you need to do it a lot, build a
table of 256 values and then process 8 bits at a time. Should be fairly
straightforward...

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
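For concreteness, a minimal C sketch of the XOR-plus-lookup-table counting
described above. The four-by-64-bit word layout and all of the names
(popcount8, hamming256, and so on) are illustrative assumptions, not
anything taken from the thread.

    #include <stdint.h>

    /* popcount8[b] holds the number of 1 bits in the byte b. */
    static uint8_t popcount8[256];

    void init_popcount8(void)
    {
        for (int i = 0; i < 256; i++) {
            int n = 0;
            for (int b = i; b != 0; b &= b - 1)   /* clear the lowest set bit */
                n++;
            popcount8[i] = (uint8_t) n;
        }
    }

    /* A 256-bit vector held as four 64-bit words (layout is an assumption). */
    typedef struct { uint64_t w[4]; } bitvec256;

    /* Number of differing bits: XOR each word, then count 8 bits at a time. */
    int hamming256(const bitvec256 *a, const bitvec256 *b)
    {
        int dist = 0;
        for (int i = 0; i < 4; i++) {
            uint64_t x = a->w[i] ^ b->w[i];   /* bits that differ in this word */
            for (int j = 0; j < 8; j++)
                dist += popcount8[(x >> (8 * j)) & 0xff];
        }
        return dist;
    }

Call init_popcount8() once at startup; after that each comparison is four
XORs and thirty-two table lookups.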
Yes, that's the straightforward way to do it. But given that my vectors
are 256 bits in length, and that I'm going to eventually have about 4
million of them to search through, I was hoping greater minds than mine
had figured out how to do it faster, or how to compute some kind of
indexing... somehow.

On Oct 2, 2005, at 5:39 AM, Martijn van Oosterhout wrote:

> Step 1: Use XOR to get the bits that are different.
> Step 2: Count the bits.
>
> Something like x & ((~x) + 1) will give you the value of the lowest bit
> that is set; mask it out and repeat. If you need to do it a lot, build a
> table of 256 values and then process 8 bits at a time. Should be fairly
> straightforward...
>
> Hope this helps,
On Sun, Oct 02, 2005 at 08:28:53AM -0700, Ben wrote:
> Yes, that's the straightforward way to do it. But given that my
> vectors are 256 bits in length, and that I'm going to eventually have
> about 4 million of them to search through, I was hoping greater minds
> than mine had figured out how to do it faster, or how to compute some
> kind of indexing... somehow.

Ah right, that's another kettle of fish. Well, your problem space is a
metric space in the sense that the triangle inequality holds, i.e.
dist(A,C) <= dist(A,B) + dist(B,C). Maybe one of the geometric index
classes could help.

Here comes something I thought up myself, but you may be able to come
up with something better.

1. For each string, store its distance from the strings:
     0000000000 etc
     0101010101 etc
     0011001100 etc
   (maybe more or fewer; you'll need to investigate)

2. Index each of these columns.

3. For your target string, calculate the distances to each of these.

4. Do a query like:

   SELECT key, * from table
   where dist_zero_string > $dist_zero_search - 2
     and dist_zero_string < $dist_zero_search + 2
     and dist_0101_string > $dist_0101_search - 2
     and dist_0101_string < $dist_0101_search + 2
     and dist_0011_string > $dist_0011_search - 2
     and dist_0011_string < $dist_0011_search + 2
   order by dist(key, searchvalue);

PostgreSQL 8.1 has bitmap index scans which could make the above quite
fast. Some theory may help determine the best key strings. For example,
all-zeros and all-ones are redundant, since you can calculate the
distance to one from the other. You also might need to vary the widths
of the bands (the 2 above) depending on the average distance to the key
strings and where you search.

You could use functional indexes to avoid storing the actual values in
the tables. Or you could take two or three of these values, make points,
and use an rtree index on those to find nearby points (which will
represent nearby strings).

Anyway, this is just some random thoughts, hope it helps,
--
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
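A small C sketch of the per-vector values this scheme would precompute and
index, assuming the 256-bit vector is held as four 64-bit words; the names
and the exact 0x55.../0x33... constants for the alternating patterns are
assumptions (they depend on how the bits are numbered).

    #include <stdint.h>

    /* Bit count of one 64-bit word, using the clear-lowest-set-bit trick. */
    int popcount64(uint64_t x)
    {
        int n = 0;
        for (; x != 0; x &= x - 1)
            n++;
        return n;
    }

    /* The three values to store (and index) per 256-bit vector: its Hamming
     * distance to the all-zeros, 0101..., and 0011... key patterns. Whether
     * the alternating patterns come out as 0x55.../0x33... or 0xAA.../0xCC...
     * depends on bit numbering; either works if used consistently. */
    void ref_distances(const uint64_t v[4], int out[3])
    {
        out[0] = out[1] = out[2] = 0;
        for (int i = 0; i < 4; i++) {
            out[0] += popcount64(v[i]);                          /* vs 0000... */
            out[1] += popcount64(v[i] ^ 0x5555555555555555ULL);  /* vs 0101... */
            out[2] += popcount64(v[i] ^ 0x3333333333333333ULL);  /* vs 0011... */
        }
    }

The banded where-clause is safe because of the triangle inequality:
|dist(v, ref) - dist(target, ref)| <= dist(v, target), so any vector within
r bits of the target must have each stored value within r of the
corresponding value for the target. The bands only discard vectors that
cannot qualify; the exact distance still has to be checked on whatever
survives.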
Hrm, I'll think about it. Thanks!

On Oct 2, 2005, at 12:50 PM, Martijn van Oosterhout wrote:

> On Sun, Oct 02, 2005 at 10:17:10AM -0700, Ben wrote:
>
>> Yeah, I thought about this, but damn my 3-dimensional mind, I just
>> can't convince myself this will work for 256 dimensions. :) Or
>> rather, I can see how it could, *given* good placements of the grid
>> points to snap each vector to. Unfortunately, I don't know enough
>> theory to know what good placements would be.
>
> My intuition tells me that just like real life, you want them to be as
> far apart as possible. That's why I picked those strings. If the string
> is length N, all those key strings are distance N/2 apart from each
> other.
>
> As for proof, I think it will work due to the triangle inequality
> holding true. Try not to worry about the dimensions; if you don't
> assume the number of dimensions, it'll work for any.
>
> My Google search hasn't come up with anything, although searching for
> "binary vector space search" comes up with some almost-but-not-quite
> results.
>
> Good luck...
On Sun, Oct 02, 2005 at 10:17:10AM -0700, Ben wrote:
> Yeah, I thought about this, but damn my 3-dimensional mind, I just
> can't convince myself this will work for 256 dimensions. :) Or
> rather, I can see how it could, *given* good placements of the grid
> points to snap each vector to. Unfortunately, I don't know enough
> theory to know what good placements would be.

My intuition tells me that just like real life, you want them to be as
far apart as possible. That's why I picked those strings. If the string
is length N, all those key strings are distance N/2 apart from each
other.

As for proof, I think it will work due to the triangle inequality
holding true. Try not to worry about the dimensions; if you don't
assume the number of dimensions, it'll work for any.

My Google search hasn't come up with anything, although searching for
"binary vector space search" comes up with some almost-but-not-quite
results.

Good luck...
--
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
Sure, but unless I can figure out some way to choose a small number of
vectors, I'm left with computing the full N^2 set. Given that I'm
shooting for N to be 4 million or larger, that's a lot of data to
store...

On Oct 2, 2005, at 12:14 PM, Todd A. Cook wrote:

> Ben wrote:
>
>> Just the number of bits, not which ones. Basically, the Hamming
>> distance.
>
> I see. Could you pre-compute the bit counts for the vectors in the
> table? You could count the bits in the search vector as Martijn
> suggested, and then do a lookup based on the count.
>
> -- todd
Yeah, I thought about this, but damn my 3-dimensional mind, I just
can't convince myself this will work for 256 dimensions. :) Or rather,
I can see how it could, *given* good placements of the grid points to
snap each vector to. Unfortunately, I don't know enough theory to know
what good placements would be.

On Oct 2, 2005, at 9:47 AM, Martijn van Oosterhout wrote:

> On Sun, Oct 02, 2005 at 08:28:53AM -0700, Ben wrote:
>
>> Yes, that's the straightforward way to do it. But given that my
>> vectors are 256 bits in length, and that I'm going to eventually have
>> about 4 million of them to search through, I was hoping greater minds
>> than mine had figured out how to do it faster, or how to compute some
>> kind of indexing... somehow.
>
> Ah right, that's another kettle of fish. Well, your problem space is a
> metric space in the sense that the triangle inequality holds, i.e.
> dist(A,C) <= dist(A,B) + dist(B,C). Maybe one of the geometric index
> classes could help.
>
> Here comes something I thought up myself, but you may be able to come
> up with something better.
>
> 1. For each string, store its distance from the strings:
>      0000000000 etc
>      0101010101 etc
>      0011001100 etc
>    (maybe more or fewer; you'll need to investigate)
>
> 2. Index each of these columns.
>
> 3. For your target string, calculate the distances to each of these.
>
> 4. Do a query like:
>
>    SELECT key, * from table
>    where dist_zero_string > $dist_zero_search - 2
>      and dist_zero_string < $dist_zero_search + 2
>      and dist_0101_string > $dist_0101_search - 2
>      and dist_0101_string < $dist_0101_search + 2
>      and dist_0011_string > $dist_0011_search - 2
>      and dist_0011_string < $dist_0011_search + 2
>    order by dist(key, searchvalue);
>
> PostgreSQL 8.1 has bitmap index scans which could make the above quite
> fast. Some theory may help determine the best key strings. For example,
> all-zeros and all-ones are redundant, since you can calculate the
> distance to one from the other. You also might need to vary the widths
> of the bands (the 2 above) depending on the average distance to the key
> strings and where you search.
>
> You could use functional indexes to avoid storing the actual values in
> the tables. Or you could take two or three of these values, make points,
> and use an rtree index on those to find nearby points (which will
> represent nearby strings).
>
> Anyway, this is just some random thoughts, hope it helps,
Ben wrote:
> Just the number of bits, not which ones. Basically, the Hamming
> distance.

I see. Could you pre-compute the bit counts for the vectors in the
table? You could count the bits in the search vector as Martijn
suggested, and then do a lookup based on the count.

-- todd
Just the number of bits, not which ones. Basically, the Hamming distance.

On Oct 2, 2005, at 11:44 AM, Todd A. Cook wrote:

> Hi,
>
> It may be that I don't understand your problem. :)
>
> Are you searching the table for the closest vector? If so, is
> "closeness" defined only as the number of bits that are different?
> Or, do you need to know which bits as well?
>
> -- todd
>
> Ben wrote:
>
>> Hrm, I don't understand. Can you give me an example with some
>> reasonably sized vectors?
>>
>> On Oct 2, 2005, at 10:59 AM, Todd A. Cook wrote:
>>
>>> Hi,
>>>
>>> Try breaking the vector into 4 bigint columns and building a
>>> multi-column index, with index columns going from the most evenly
>>> distributed to the least. Depending on the distribution of your
>>> data, you may only need 2 or 3 columns in the index. If you can
>>> cluster the table in that order, it should be really fast. (This
>>> structure is a tabular form of a linked trie.)
>>>
>>> -- todd
>>>
>>> Ben wrote:
>>>
>>>> Yes, that's the straightforward way to do it. But given that my
>>>> vectors are 256 bits in length, and that I'm going to eventually
>>>> have about 4 million of them to search through, I was hoping
>>>> greater minds than mine had figured out how to do it faster, or
>>>> how to compute some kind of indexing... somehow.
Hi,

It may be that I don't understand your problem. :)

Are you searching the table for the closest vector? If so, is
"closeness" defined only as the number of bits that are different?
Or, do you need to know which bits as well?

-- todd

Ben wrote:
> Hrm, I don't understand. Can you give me an example with some
> reasonably sized vectors?
>
> On Oct 2, 2005, at 10:59 AM, Todd A. Cook wrote:
>
>> Hi,
>>
>> Try breaking the vector into 4 bigint columns and building a
>> multi-column index, with index columns going from the most evenly
>> distributed to the least. Depending on the distribution of your
>> data, you may only need 2 or 3 columns in the index. If you can
>> cluster the table in that order, it should be really fast. (This
>> structure is a tabular form of a linked trie.)
>>
>> -- todd
>>
>> Ben wrote:
>>
>>> Yes, that's the straightforward way to do it. But given that my
>>> vectors are 256 bits in length, and that I'm going to eventually
>>> have about 4 million of them to search through, I was hoping
>>> greater minds than mine had figured out how to do it faster, or how
>>> to compute some kind of indexing... somehow.
Hrm, I don't understand. Can you give me an example with some
reasonably sized vectors?

On Oct 2, 2005, at 10:59 AM, Todd A. Cook wrote:

> Hi,
>
> Try breaking the vector into 4 bigint columns and building a
> multi-column index, with index columns going from the most evenly
> distributed to the least. Depending on the distribution of your data,
> you may only need 2 or 3 columns in the index. If you can cluster the
> table in that order, it should be really fast. (This structure is a
> tabular form of a linked trie.)
>
> -- todd
>
> Ben wrote:
>
>> Yes, that's the straightforward way to do it. But given that my
>> vectors are 256 bits in length, and that I'm going to eventually
>> have about 4 million of them to search through, I was hoping greater
>> minds than mine had figured out how to do it faster, or how to
>> compute some kind of indexing... somehow.
Hi,

Try breaking the vector into 4 bigint columns and building a
multi-column index, with index columns going from the most evenly
distributed to the least. Depending on the distribution of your data,
you may only need 2 or 3 columns in the index. If you can cluster the
table in that order, it should be really fast. (This structure is a
tabular form of a linked trie.)

-- todd

Ben wrote:
> Yes, that's the straightforward way to do it. But given that my
> vectors are 256 bits in length, and that I'm going to eventually have
> about 4 million of them to search through, I was hoping greater minds
> than mine had figured out how to do it faster, or how to compute some
> kind of indexing... somehow.
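A possible client-side companion to this layout, sketched in C under the
assumption that the 256-bit vector arrives as 32 raw bytes; the byte order
and the function name are assumptions, not anything specified in the thread.

    #include <stdint.h>

    /* Split a 256-bit vector, held as 32 raw bytes (assumed most-significant
     * byte first), into the four values for the bigint columns. */
    void split_into_bigints(const uint8_t bytes[32], int64_t cols[4])
    {
        for (int i = 0; i < 4; i++) {
            uint64_t w = 0;
            for (int j = 0; j < 8; j++)
                w = (w << 8) | bytes[8 * i + j];   /* assemble word i */
            cols[i] = (int64_t) w;                 /* value for column i */
        }
    }

Words whose high bit is set come out negative as bigints, which is harmless
for the equality-style lookups this multi-column index supports.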
Hi,

Sorry for being so obtuse; my 9-year-old has been helping me with my
work today. :)

The Hamming distance between two bit vectors can also be found as the
sum of the distances between their sub-vectors. Let's assume you're
looking for all vectors less than d bits away from a search vector, and
we'll divide the vectors into k-bit subvectors. The value of k will
depend on d, how sparse your data is, and how much memory you have for
lookup tables.

The goal is to eliminate the vectors that are more than d bits away.
(This assumes that most of the vectors in the table are more than d bits
away; for the opposite case, just reverse the sense of the comparisons
below.) Compute the set of k-bit vectors that are less than d bits away
from the first k bits of the search vector, using a lookup table if
feasible. Find the set of full vectors whose initial k-bit subvector is
in this set. For those whose distance is already d-1, the remaining bits
have to be identical, which is easy to check. Repeat for the second set
of k-bit subvectors, where the distance has to be less than d-1, then
d-2, etc. You can do these searches using the linked tries I mentioned
earlier. (I've described the searches as happening sequentially, but
that needn't be the case. You can search the subvectors simultaneously
and then filter the results at the client.)

You don't have to use constant-length subvectors. You can start with
small subvectors to keep the size of the search sets down, and then
increase the subvector length as the set of remaining possibilities
shrinks. Also, you don't have to work from one end of the full vector to
the other; start with a subvector where you expect a lot of variability.

I think this method is similar in spirit, if not detail, to what Martijn
suggested.

I've actually implemented this approach before, though using my own
file-based lookups rather than a real database. I had a set of 128-bit
vectors used to filter an input stream; most of the comparisons never
looked past the first 64 bits.

-- todd

Ben wrote:
> Sure, but unless I can figure out some way to choose a small number of
> vectors, I'm left with computing the full N^2 set. Given that I'm
> shooting for N to be 4 million or larger, that's a lot of data to
> store...
>
> On Oct 2, 2005, at 12:14 PM, Todd A. Cook wrote:
>
>> Ben wrote:
>>
>>> Just the number of bits, not which ones. Basically, the Hamming
>>> distance.
>>
>> I see. Could you pre-compute the bit counts for the vectors in the
>> table? You could count the bits in the search vector as Martijn
>> suggested, and then do a lookup based on the count.
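The early-exit flavour of this idea is easy to see in C. A minimal sketch,
using 64-bit subvectors (k = 64) and illustrative names, in the spirit of
"most of the comparisons never looked past the first 64 bits":

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit count of one 64-bit word (a 256-entry table, as discussed
     * earlier in the thread, would also work here). */
    int popcount64(uint64_t x)
    {
        int n = 0;
        for (; x != 0; x &= x - 1)   /* clear the lowest set bit */
            n++;
        return n;
    }

    /* A 256-bit vector held as four 64-bit words (layout is an assumption). */
    typedef struct { uint64_t w[4]; } bitvec256;

    /* True if a and b differ in fewer than d bits. Subvectors are compared
     * in order, and the loop exits as soon as the running distance reaches
     * d, so rejected vectors often never have their later words examined. */
    bool within_distance(const bitvec256 *a, const bitvec256 *b, int d)
    {
        int dist = 0;
        for (int i = 0; i < 4; i++) {
            dist += popcount64(a->w[i] ^ b->w[i]);
            if (dist >= d)
                return false;   /* already too far; skip the remaining words */
        }
        return true;
    }

For thresholds well below the typical distance between vectors, most
non-matching candidates should fail on the first word or two.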