Re: Sequence Access Method WIP - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: Sequence Access Method WIP
Date
Msg-id 545A3D2C.1080404@2ndquadrant.com
In response to Re: Sequence Access Method WIP  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Sequence Access Method WIP  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On 05/11/14 13:45, Heikki Linnakangas wrote:
> On 11/04/2014 11:01 PM, Petr Jelinek wrote:
>> On 04/11/14 13:11, Heikki Linnakangas wrote:
>>> On 10/13/2014 01:01 PM, Petr Jelinek wrote:
>>>> Only the alloc and reloptions methods are required (and implemented by
>>>> the local AM).
>>>>
>>>> The caching, xlog writing, updating the page, etc is handled by
>>>> backend,
>>>> the AM does not see the tuple at all. I decided to not pass even the
>>>> struct around and just pass the relevant options because I think if we
>>> want to abstract the storage properly then the AM should not care about
>>> what pg_sequence looks like at all, even if it means that the
>>> sequence_alloc parameter list is a bit long.
>>>
>>> Hmm. The division of labour between the seqam and commands/sequence.c
>>> still feels a bit funny. sequence.c keeps track of how many values have
>>> been WAL-logged, and thus usable immediately, but we still call
>>> sequence_alloc even when using up those already WAL-logged values. If
>>> you think of using this for something like a centralized sequence server
>>> in a replication cluster, you certainly don't want to make a call to the
>>> remote server for every value - you'll want to cache them.
>>>
>>> With the "local" seqam, there are two levels of caching. Each backend
>>> caches some values (per the CACHE <value> option in CREATE SEQUENCE). In
>>> addition to that, the server WAL-logs 32 values at a time. If you have a
>>> remote seqam, it would most likely add a third cache, but it would
>>> interact in strange ways with the second cache.
>>>
>>> Considering a non-local seqam, the locking is also a bit strange. The
>>> server keeps the sequence page locked throughout nextval(). But if the
>>> actual state of the sequence is maintained elsewhere, there's no need to
>>> serialize the calls to the remote allocator, i.e. the sequence_alloc()
>>> calls.
>>>
>>> I'm not exactly sure what to do about that. One option is to completely
>>> move the maintenance of the "current" value, i.e. sequence.last_value,
>>> to the seqam. That makes sense from an abstraction point of view. For
>>> example with a remote server managing the sequence, storing the
>>> "current" value in the local catalog table makes no sense as it's always
>>> going to be out-of-date. The local seqam would store it as part of the
>>> am-private data. However, you would need to move the responsibility of
>>> locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to
>>> provide an API that the seqam can call to do that. Perhaps just let the
>>> seqam call heap_inplace_update on the sequence relation.
>>
>> My idea of how this works is: sequence_next handles the allocation and
>> the level-2 caching (the WAL-logged cache) via amdata if it supports it,
>> or returns a single value if it doesn't - then the WAL will always just
>> write the one value and there will basically be no level-2 cache, since
>> it is sequence_next that controls how much will be WAL-logged; what the
>> backend asks for is just a "suggestion".
>
> Hmm, so the AM might return an "nallocated" value less than the "fetch"
> value that sequence.c requested? As the patch stands, wouldn't that make
> sequence.c write a WAL record more often?
>

That's correct; that's also why you usually want some form of local 
caching when possible.

> In fact, if the seqam manages the current value outside the database
> (e.g. a "remote" seqam that gets the value from another server),
> nextval() never needs to write a WAL record.

Sure it does: you need to keep the current state in Postgres as well, at 
least the current value, so that you can pass correct input to 
sequence_alloc(). And you need to do this in a crash-safe way, so WAL is 
necessary.

I think sequences will cache in amdata if possible for that type of 
sequence, and in cases where that's not possible and the caching happens 
on some external server, you'll most likely get bigger overhead from the 
network than from WAL anyway...

>
>>>> For the amdata handling (which is the AM's private data variable) the
>>>> API assumes that (Datum) 0 is NULL, this seems to work well for
>>>> reloptions so should work here also and it simplifies things a little
>>>> compared to passing pointers to pointers around and making sure
>>>> everything is allocated, etc.
>>>>
>>>> Sadly, the fact that amdata is not fixed-size and can be NULL made
>>>> the page updates of the sequence relation quite a bit more complex
>>>> than they used to be.
>>>
>>> It would be nice if the seqam could define exactly the columns it needs,
>>> with any datatypes. There would be a set of common attributes:
>>> sequence_name, start_value, cache_value, increment_by, max_value,
>>> min_value, is_cycled. The local seqam would add "last_value", "log_cnt"
>>> and "is_called" to that. A remote seqam that calls out to some other
>>> server might store the remote server's hostname etc.
>>>
>>> There could be a seqam function that returns a TupleDesc with the
>>> required columns, for example.
>>
>> Wouldn't that somewhat bloat catalog if we had new catalog table for
>> each sequence AM?
>
> No, that's not what I meant. The number of catalog tables would be the
> same as today. Sequences look much like any other relation, with entries
> in pg_attribute catalog table for all the attributes for each sequence.
> Currently, all sequences have the same set of attributes: sequence_name,
> last_value and so forth. What I'm proposing is that there would be a set
> of attributes that are common to all sequences, but in addition to that
> there could be any number of AM-specific attributes.

Oh, that's an interesting idea: the AM interface would basically return 
an updated tuple, and there would be some descriptor function that 
returns the tupledesc. I am a bit worried that this would kill any 
possibility of ALTER SEQUENCE USING access_method. Plus, I don't think 
it actually solves any real problem: serializing the internal C structs 
into a bytea is not any harder than serializing them into a tuple, IMHO.

>
>> It also does not really solve the amdata being dynamic
>> size "issue".
>
> Yes it would. There would not be a single amdata attribute, but the AM
> could specify any number of custom attributes, which could be fixed size
> or varlen. It would be solely the AM's responsibility to set the values
> of those attributes.
>

That's not the issue I was referring to; I was talking about the page 
replacement code, which is not as simple now that we potentially have a 
dynamically sized tuple. If tuples were different for different AMs, the 
code would still have to handle that case. Setting the values in the 
tuple itself is not too complicated.

-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


