Thread: Todo item: Support amgettuple() in GIN

Todo item: Support amgettuple() in GIN

From
Andreas Karlsson
Date:
Hi,

I decided to look into how much work implementing the todo item about 
supporting amgettuple in GIN would be, since exclusion constraints on 
GIN would be neat. Robert Haas suggested a solution[1], but to fix it we 
also need to look into why the commit message mentions that it did not 
work anyway with the partial matches.

So I looked into that first, and here is my explanation for why it is 
broken for partial matches. I am sending this to the mailing list to 
check if I am correct and if so update the todo list with this new 
information.

= Explanation

When doing normal matching the code simply traverses the matching 
posting tree in TID order.

When doing partial matching the code need to be able to return the union 
of all TIDs in all the matching posting trees in TID order (to be able 
to do AND and OR operations with multiple search keys later). It does 
this by traversing them posting tree after posting tree and collecting 
them all in a TIDBitmap which is later iterated over.

This TIDBitmap becomes lossy if it too many TIDs are added to it, and 
this case is what broke amgettuple for partial matches.

To fix this it seems to me that either lossy pages would need to be 
rescanned by gingettuple or a version of collectMatchBitmap needs to be 
written which merges the matched posting trees in another way, probably 
by iterating over all of them at the same time.

Does this diagnosis sound correct?

= Footnotes

1. 
http://www.postgresql.org/message-id/CA+TgmobZhfRJNyz-fyw5kDtRurK0HjWP0vtP5fGZLE6eVSWCQw@mail.gmail.com

-- 
Andreas Karlsson



Re: Todo item: Support amgettuple() in GIN

From
Antonin Houska
Date:
On 11/29/2013 01:13 AM, Andreas Karlsson wrote:

> When doing partial matching the code need to be able to return the union 
> of all TIDs in all the matching posting trees in TID order (to be able 
> to do AND and OR operations with multiple search keys later). It does 
> this by traversing them posting tree after posting tree and collecting 
> them all in a TIDBitmap which is later iterated over.

I think it's not a plain union. My understanding is that - to evaluate a
single key (typically array) - you first need to get all the TID streams
for that key (i.e. one posting list/tree per element of the key array)
and then iterate all these streams in parallel and 'merge' them     using
consistent() function. That's how I understand ginget.c:keyGetItem().

So the problem of partial match is (IMO) that there can be too many TID
streams to merge - much more than the number of elements of the key array.

// Antonin Houska (Tony)




Re: Todo item: Support amgettuple() in GIN

From
Andreas Karlsson
Date:
On 11/29/2013 09:54 AM, Antonin Houska wrote:
> On 11/29/2013 01:13 AM, Andreas Karlsson wrote:
>
>> When doing partial matching the code need to be able to return the union
>> of all TIDs in all the matching posting trees in TID order (to be able
>> to do AND and OR operations with multiple search keys later). It does
>> this by traversing them posting tree after posting tree and collecting
>> them all in a TIDBitmap which is later iterated over.
>
> I think it's not a plain union. My understanding is that - to evaluate a
> single key (typically array) - you first need to get all the TID streams
> for that key (i.e. one posting list/tree per element of the key array)
> and then iterate all these streams in parallel and 'merge' them     using
> consistent() function. That's how I understand ginget.c:keyGetItem().

For partial matches the merging is done in two steps: first a simple 
union of all the streams per key and then second merging those union 
streams using the consistent() function.

It is the first step that can be lossy.

> So the problem of partial match is (IMO) that there can be too many TID
> streams to merge - much more than the number of elements of the key array.

Agreed.

-- 
Andreas Karlsson



Re: Todo item: Support amgettuple() in GIN

From
Antonin Houska
Date:
On 11/29/2013 01:57 PM, Andreas Karlsson wrote:
> On 11/29/2013 09:54 AM, Antonin Houska wrote:
>> On 11/29/2013 01:13 AM, Andreas Karlsson wrote:
>>
>>> When doing partial matching the code need to be able to return the union
>>> of all TIDs in all the matching posting trees in TID order (to be able
>>> to do AND and OR operations with multiple search keys later). It does
>>> this by traversing them posting tree after posting tree and collecting
>>> them all in a TIDBitmap which is later iterated over.
>>
>> I think it's not a plain union. My understanding is that - to evaluate a
>> single key (typically array) - you first need to get all the TID streams
>> for that key (i.e. one posting list/tree per element of the key array)
>> and then iterate all these streams in parallel and 'merge' them     using
>> consistent() function. That's how I understand ginget.c:keyGetItem().
> 
> For partial matches the merging is done in two steps: first a simple 
> union of all the streams per key and then second merging those union 
> streams using the consistent() function.

Yes, short after I sent my previous mail I realized that your "union"
probably referred to the things that collectMatchBitmap() does.

// Tony




Re: Todo item: Support amgettuple() in GIN

From
Tom Lane
Date:
Andreas Karlsson <andreas@proxel.se> writes:
> I decided to look into how much work implementing the todo item about 
> supporting amgettuple in GIN would be, since exclusion constraints on 
> GIN would be neat. Robert Haas suggested a solution[1], but to fix it we 
> also need to look into why the commit message mentions that it did not 
> work anyway with the partial matches.
> ...
> This TIDBitmap becomes lossy if it too many TIDs are added to it, and 
> this case is what broke amgettuple for partial matches.

Right, see
http://www.postgresql.org/message-id/49AC300F.1050903@enterprisedb.com

Note that fixing the potential lossiness in scanning is not the only
roadblock to re-enabling amgettuple.  Fast updates also pose problems:
http://www.postgresql.org/message-id/4974B002.3040202@sigaev.ru

Half of that is basically the same lossiness problem, but the other
half is that we're relying on the bitmap to suppress duplicate reports
of the same TID.  It's fairly hard to see how you'd avoid that without
creating other problems.

Note that Robert's proposed solution is no solution, because it just
puts you right back in the bind of needing guaranteed non-lossy
storage of a TID set that might be too big to fit in memory.
        regards, tom lane



Re: Todo item: Support amgettuple() in GIN

From
Heikki Linnakangas
Date:
On 11/29/2013 07:13 PM, Tom Lane wrote:
> Andreas Karlsson <andreas@proxel.se> writes:
>> I decided to look into how much work implementing the todo item about
>> supporting amgettuple in GIN would be, since exclusion constraints on
>> GIN would be neat. Robert Haas suggested a solution[1], but to fix it we
>> also need to look into why the commit message mentions that it did not
>> work anyway with the partial matches.
>> ...
>> This TIDBitmap becomes lossy if it too many TIDs are added to it, and
>> this case is what broke amgettuple for partial matches.
>
> Right, see
>
https://urldefense.proofpoint.com/v1/url?u=http://www.postgresql.org/message-id/49AC300F.1050903%40enterprisedb.com&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=xGch4oNJbpD%2BKPJECmgw4SLBhytSZLBX7UnkZhtNcpw%3D%0A&m=OqhHlGFG81LH1EqJLzTW8HuXdXslGEL%2FPu1f27HxV%2Bs%3D%0A&s=9f3fad064e2845bd2b99c85f684d237fbe96e542081e4b2dc49b1fe51f91f144
>
> Note that fixing the potential lossiness in scanning is not the only
> roadblock to re-enabling amgettuple.  Fast updates also pose problems:
>
https://urldefense.proofpoint.com/v1/url?u=http://www.postgresql.org/message-id/4974B002.3040202%40sigaev.ru&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=xGch4oNJbpD%2BKPJECmgw4SLBhytSZLBX7UnkZhtNcpw%3D%0A&m=OqhHlGFG81LH1EqJLzTW8HuXdXslGEL%2FPu1f27HxV%2Bs%3D%0A&s=0e08a781fcc17a3d68ce247344a3499a23a9f545b937f254439dadfaf7b9b8ab
>
> Half of that is basically the same lossiness problem, but the other
> half is that we're relying on the bitmap to suppress duplicate reports
> of the same TID.  It's fairly hard to see how you'd avoid that without
> creating other problems.
>
> Note that Robert's proposed solution is no solution, because it just
> puts you right back in the bind of needing guaranteed non-lossy
> storage of a TID set that might be too big to fit in memory.

You can always call amgetbitmap, and return the tuples from the bitmap 
one by one. For a lossy result, re-check all tuples on the page. IOW, do 
a bitmap index + heap scan. You could do that within indexam.c, and 
present the familiar index_getnext() interface for callers. Or you could 
modify the exclusion constraint code to do that if amgettuple is not 
available

- Heikki



Re: Todo item: Support amgettuple() in GIN

From
Andreas Karlsson
Date:
On 11/29/2013 06:13 PM, Tom Lane wrote:
> Note that Robert's proposed solution is no solution, because it just
> puts you right back in the bind of needing guaranteed non-lossy
> storage of a TID set that might be too big to fit in memory.

The solution should work if we could guarantee that a TIDBitmap based on 
the fast update pending list always will fit in the memory. That does 
not sound like a good assumption to me.

-- 
Andreas Karlsson