Re: Sequence Access Method WIP - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Sequence Access Method WIP
Date
Msg-id 545A1BEB.9030406@vmware.com
Whole thread Raw
In response to Re: Sequence Access Method WIP  (Petr Jelinek <petr@2ndquadrant.com>)
Responses Re: Sequence Access Method WIP  (Petr Jelinek <petr@2ndquadrant.com>)
List pgsql-hackers
On 11/04/2014 11:01 PM, Petr Jelinek wrote:
> On 04/11/14 13:11, Heikki Linnakangas wrote:
>> On 10/13/2014 01:01 PM, Petr Jelinek wrote:
>>> Only the alloc and reloptions methods are required (and implemented by
>>> the local AM).
>>>
>>> The caching, xlog writing, updating the page, etc is handled by backend,
>>> the AM does not see the tuple at all. I decided to not pass even the
>>> struct around and just pass the relevant options because I think if we
>>> want to abstract the storage properly then the AM should not care about
>>> how the pg_sequence looks like at all, even if it means that the
>>> sequence_alloc parameter list is bit long.
>>
>> Hmm. The division of labour between the seqam and commands/sequence.c
>> still feels a bit funny. sequence.c keeps track of how many values have
>> been WAL-logged, and thus usable immediately, but we still call
>> sequence_alloc even when using up those already WAL-logged values. If
>> you think of using this for something like a centralized sequence server
>> in a replication cluster, you certainly don't want to make a call to the
>> remote server for every value - you'll want to cache them.
>>
>> With the "local" seqam, there are two levels of caching. Each backend
>> caches some values (per the CACHE <value> option in CREATE SEQUENCE). In
>> addition to that, the server WAL-logs 32 values at a time. If you have a
>> remote seqam, it would most likely add a third cache, but it would
>> interact in strange ways with the second cache.
>>
>> Considering a non-local seqam, the locking is also a bit strange. The
>> server keeps the sequence page locked throughout nextval(). But if the
>> actual state of the sequence is maintained elsewhere, there's no need to
>> serialize the calls to the remote allocator, i.e. the sequence_alloc()
>> calls.
>>
>> I'm not exactly sure what to do about that. One option is to completely
>> move the maintenance of the "current" value, i.e. sequence.last_value,
>> to the seqam. That makes sense from an abstraction point of view. For
>> example with a remote server managing the sequence, storing the
>> "current" value in the local catalog table makes no sense as it's always
>> going to be out-of-date. The local seqam would store it as part of the
>> am-private data. However, you would need to move the responsibility of
>> locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to
>> provide an API that the seqam can call to do that. Perhaps just let the
>> seqam call heap_inplace_update on the sequence relation.
>
> My idea of how this works is - sequence_next handles the allocation and
> the level2 caching (WAL logged cache) via amdata if it supports it, or
> returns single value if it doesn't - then the WAL will always just write
> the 1 value and there will basically be no level2 cache, since it is the
> sequence_next who controls how much will be WAL-logged, what backend
> asks for is just "suggestion".

Hmm, so the AM might return an "nallocated" value less than the "fetch" 
value that sequence.c requested? As the patch stands, wouldn't that make 
sequence.c write a WAL record more often?

In fact, if the seqam manages the current value outside the database 
(e.g. a "remote" seqam that gets the value from another server), 
nextval() never needs to write a WAL record.

> The third level caching as you say should be solved by the
> sequence_request_update and sequence_update callback - that will enable
> the sequence AM to handle this kind of caching asynchronously without
> blocking the sequence_next unnecessarily.
>
> That way it's possible to implement many different strategies, from no
> cache at all, to three levels of caching, with and without blocking of
> calls to sequence_next.

It's nice that asynchronous operation is possible, but that seems like a 
more advanced feature. Surely an approach where you fetch a bunch of 
values when needed is going to be more common. Let's focus on the simple 
synchronous case first, and make sure the API works well for that.

>>> For the amdata handling (which is the AM's private data variable) the
>>> API assumes that (Datum) 0 is NULL, this seems to work well for
>>> reloptions so should work here also and it simplifies things a little
>>> compared to passing pointers to pointers around and making sure
>>> everything is allocated, etc.
>>>
>>> Sadly the fact that amdata is not fixed size and can be NULL made the
>>> page updates of the sequence relation quite more complex that it used to
>>> be.
>>
>> It would be nice if the seqam could define exactly the columns it needs,
>> with any datatypes. There would be a set of common attributes:
>> sequence_name, start_value, cache_value, increment_by, max_value,
>> min_value, is_cycled. The local seqam would add "last_value", "log_cnt"
>> and "is_called" to that. A remote seqam that calls out to some other
>> server might store the remote server's hostname etc.
>>
>> There could be a seqam function that returns a TupleDesc with the
>> required columns, for example.
>
> Wouldn't that somewhat bloat catalog if we had new catalog table for
> each sequence AM?

No, that's not what I meant. The number of catalog tables would be the 
same as today. Sequences look much like any other relation, with entries 
in pg_attribute catalog table for all the attributes for each sequence. 
Currently, all sequences have the same set of attributes, sequence_name, 
last_value and so forth. What I'm proposing is that there would a set of 
attributes that are common to all sequences, but in addition to that 
there could be any number of AM-specific attributes.

> It also does not really solve the amdata being dynamic
> size "issue".

Yes it would. There would not be a single amdata attribute, but the AM 
could specify any number of custom attributes, which could be fixed size 
or varlen. It would be solely the AM's responsibility to set the values 
of those attributes.

> I think just dynamic field where AM stores whatever it
> wants is enough (ie the current solution), it's just that the hackery
> that sequences storage implementation does is more visible once there
> are non-fixed fields.

Using a single amdata field is like creating a table with a single text 
column, and putting all your data in the single column separated by 
tabs. I'm sure you can make it work, but normalized data is much nicer 
to work with.

- Heikki




pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: tracking commit timestamps
Next
From: Alvaro Herrera
Date:
Subject: Re: WAL replay bugs