Re: Sequence Access Method WIP - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Sequence Access Method WIP
Date
Msg-id 5458C258.802@vmware.com
Whole thread Raw
In response to Re: Sequence Access Method WIP  (Petr Jelinek <petr@2ndquadrant.com>)
Responses Re: Sequence Access Method WIP  (Petr Jelinek <petr@2ndquadrant.com>)
List pgsql-hackers
On 10/13/2014 01:01 PM, Petr Jelinek wrote:
> Hi,
>
> I rewrote the patch with different API along the lines of what was
> discussed.

Thanks, that's better.

It would be good to see an alternative seqam to implement this API, to 
see how it really works. The "local" one is too dummy to expose any 
possible issues.

> The API now consists of following functions:
> sequence_alloc - allocating range of new values
> The function receives the sequence relation, current value, number of
> requested values amdata and relevant sequence options like min/max and
> returns new amdata, new current value, number of values allocated and
> also if it needs wal write (that should be returned if amdata has
> changed plus other reasons the AM might have to force the wal update).
>
> sequence_setval - notification that setval is happening
> This function gets sequence relation, previous value and new value plus
> the amdata and returns amdata (I can imagine some complex sequence AMs
> will want to throw error that setval can't be done on them).
>
> sequence_request_update/sequence_update - used for background processing
> Basically AM can call the sequence_request_update and backend will then
> call the sequence_update method of an AM with current amdata and will
> write the updated amdata to disk
>
> sequence_seqparams - function to process/validate the standard sequence
> options like start position, min/max, increment by etc by the AM, it's
> called in addition to the standard processing
>
> sequence_reloptions - this is the only thing that remained unchanged
> from previous patch, it's meant to pass custom options to the AM
>
> Only the alloc and reloptions methods are required (and implemented by
> the local AM).
>
> The caching, xlog writing, updating the page, etc is handled by backend,
> the AM does not see the tuple at all. I decided to not pass even the
> struct around and just pass the relevant options because I think if we
> want to abstract the storage properly then the AM should not care about
> how the pg_sequence looks like at all, even if it means that the
> sequence_alloc parameter list is bit long.

Hmm. The division of labour between the seqam and commands/sequence.c 
still feels a bit funny. sequence.c keeps track of how many values have 
been WAL-logged, and thus usable immediately, but we still call 
sequence_alloc even when using up those already WAL-logged values. If 
you think of using this for something like a centralized sequence server 
in a replication cluster, you certainly don't want to make a call to the 
remote server for every value - you'll want to cache them.

With the "local" seqam, there are two levels of caching. Each backend 
caches some values (per the CACHE <value> option in CREATE SEQUENCE). In 
addition to that, the server WAL-logs 32 values at a time. If you have a 
remote seqam, it would most likely add a third cache, but it would 
interact in strange ways with the second cache.

Considering a non-local seqam, the locking is also a bit strange. The 
server keeps the sequence page locked throughout nextval(). But if the 
actual state of the sequence is maintained elsewhere, there's no need to 
serialize the calls to the remote allocator, i.e. the sequence_alloc() 
calls.

I'm not exactly sure what to do about that. One option is to completely 
move the maintenance of the "current" value, i.e. sequence.last_value, 
to the seqam. That makes sense from an abstraction point of view. For 
example with a remote server managing the sequence, storing the 
"current" value in the local catalog table makes no sense as it's always 
going to be out-of-date. The local seqam would store it as part of the 
am-private data. However, you would need to move the responsibility of 
locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to 
provide an API that the seqam can call to do that. Perhaps just let the 
seqam call heap_inplace_update on the sequence relation.

> For the amdata handling (which is the AM's private data variable) the
> API assumes that (Datum) 0 is NULL, this seems to work well for
> reloptions so should work here also and it simplifies things a little
> compared to passing pointers to pointers around and making sure
> everything is allocated, etc.
>
> Sadly the fact that amdata is not fixed size and can be NULL made the
> page updates of the sequence relation quite more complex that it used to
> be.

It would be nice if the seqam could define exactly the columns it needs, 
with any datatypes. There would be a set of common attributes: 
sequence_name, start_value, cache_value, increment_by, max_value, 
min_value, is_cycled. The local seqam would add "last_value", "log_cnt" 
and "is_called" to that. A remote seqam that calls out to some other 
server might store the remote server's hostname etc.

There could be a seqam function that returns a TupleDesc with the 
required columns, for example.

- Heikki



pgsql-hackers by date:

Previous
From: Alexey Vasiliev
Date:
Subject: Re[2]: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Next
From: Alvaro Herrera
Date:
Subject: Re: tracking commit timestamps