Thread: "global" & shared sequences

"global" & shared sequences

From
Jonathan Vanasco
Date:
Hoping to glean some advice from the more experienced....

The major component of our application currently tracks a few dozen object types, and the total number of objects is in
the hundreds of millions. Postgres will potentially be tracking billions of objects.

Right now the primary key for our "core" objects is based on a per-table sequence, but each object has a secondary id
based on a global/shared sequence. We expose everything via a connected object graph, and basically needed a global
sequence. We are currently scaled vertically (1x writer, 2x reader).

I'd like to avoid assuming any more technical debt, and am not thrilled with the current setup. Our internal relations
are all by the table's primary key, but the external (API, WEB) queries use the global id. Every table has 2 indexes,
and we need to convert a 'global' id to a 'table id' before doing a query. If we're able to replace the per-table
primary key with the global id, we'd be freeing up some disk space from the indexes and tables -- and would not have to
keep our performance cache that maps table-to-global ids.

The concerns that I have before moving ahead are:

1. General performance at different stages of DB size. With 18 sequences, our keys/indexes are simply smaller than
   they'd be with 1 key. I wonder how this will impact lookups and joins.
2. Managing this sequence when next scaling the db (which would probably have to be sharding, unless others have a
   suggestion).

If anyone has insights, they would be greatly appreciated.

Re: "global" & shared sequences

From
Jim Nasby
Date:
On 10/1/15 6:48 PM, Jonathan Vanasco wrote:
> 1. general performance at different stages of DB size. with 18 sequences, our keys/indexes are simply smaller than
> they'd be with 1 key. i wonder how this will impact lookups and joins.

I'm not really following here... the size of an index is determined by
the number of tuples in it and the average width of each tuple. So as
long as you're using the same size of data type, 18 vs 1 sequence won't
change the size of your indexes.

> 2. managing this sequence when next scaling the db (which would probably have to be sharding, unless others have a
> suggestion)

Sequences are designed to be extremely fast to assign. If you ever did
find a single sequence being a bottleneck, you could always start
caching values in each backend. I think it'd be hard (if not impossible)
to turn a single global sequence into a real bottleneck.
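To make the per-backend caching idea concrete, here is a minimal Python sketch. It is a toy simulation, not PostgreSQL code: `GlobalSequence`, `Backend`, and `run` are hypothetical names, and a thread lock stands in for whatever contention a real shared sequence would see. It only illustrates that reserving values in blocks cuts the number of trips to the shared counter while keeping every value unique.

```python
import threading


class GlobalSequence:
    """Toy stand-in for one shared sequence (hypothetical, not PostgreSQL code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = 0
        self.lock_acquisitions = 0  # how often the shared counter was touched

    def reserve(self, count):
        """Atomically reserve `count` consecutive values; return the first one."""
        with self._lock:
            self.lock_acquisitions += 1
            first = self._last + 1
            self._last += count
            return first


class Backend:
    """One session handing out values from a locally cached block."""

    def __init__(self, sequence, cache_size):
        self._sequence = sequence
        self._cache_size = cache_size
        self._next = 0
        self._end = 0  # exclusive upper bound of the cached block

    def nextval(self):
        if self._next >= self._end:
            # Cache exhausted: go back to the shared counter for a new block.
            self._next = self._sequence.reserve(self._cache_size)
            self._end = self._next + self._cache_size
        value = self._next
        self._next += 1
        return value


def run(cache_size, backends=4, per_backend=1000):
    """Draw values from several concurrent backends; return (ids, shared hits)."""
    seq = GlobalSequence()
    results = [[] for _ in range(backends)]

    def work(slot):
        backend = Backend(seq, cache_size)
        for _ in range(per_backend):
            results[slot].append(backend.nextval())

    threads = [threading.Thread(target=work, args=(i,)) for i in range(backends)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    all_ids = [v for chunk in results for v in chunk]
    return all_ids, seq.lock_acquisitions
```

With `cache_size=1` every value costs one trip to the shared counter; with a larger cache each backend only returns to it once per block. PostgreSQL exposes a comparable per-session preallocation through the `CACHE` option of `CREATE SEQUENCE`; as in this sketch, values cached by a session that never uses them are simply skipped, so the resulting ids are unique but not gap-free or strictly ordered across sessions.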

If you start sharding you'll need to either (1) create a composite ID
where part of the ID is a shard identifier (say, the top 8 bits), or
(2) assign IDs in ranges that are assigned to each shard. There's work
being done right now to make option 2 a bit easier. Probably better
would be if you could shard based on something like object or customer;
that way you only have to look up which shard the customer lives in.
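The two ID schemes mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended layout: the 8 shard bits come from the message, but the 55-bit local part (chosen so the result stays positive in a signed 64-bit integer), the `RANGE_SIZE` value, and all function names are hypothetical.

```python
SHARD_BITS = 8    # from the thread: "say, the top 8 bits"
LOCAL_BITS = 55   # assumption: 8 + 55 = 63 bits, so IDs stay positive in a signed 64-bit int
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def make_id(shard, local):
    """Composite ID: pack the shard number above the shard-local id."""
    if not 0 <= shard < (1 << SHARD_BITS):
        raise ValueError("shard out of range")
    if not 0 <= local <= LOCAL_MASK:
        raise ValueError("local id out of range")
    return (shard << LOCAL_BITS) | local


def split_id(global_id):
    """Recover (shard, local) from a composite ID."""
    return global_id >> LOCAL_BITS, global_id & LOCAL_MASK


RANGE_SIZE = 10**12  # assumption: each shard owns a block of this many IDs


def shard_range(shard):
    """Range scheme: inclusive (first, last) ID range owned by `shard`."""
    first = shard * RANGE_SIZE + 1
    return first, first + RANGE_SIZE - 1


def shard_for_id(global_id):
    """Range scheme: which shard owns this ID."""
    return (global_id - 1) // RANGE_SIZE
```

In both schemes the shard can be derived from the ID alone, which is what lets an external caller holding only a global id be routed without a lookup table. The composite form fixes the number of shards at design time (256 here); the range form leaves that open but needs the range size agreed on in advance.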
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: "global" & shared sequences

From
Jonathan Vanasco
Date:
Thanks for the reply.


On Oct 2, 2015, at 3:26 PM, Jim Nasby wrote:

> I'm not really following here... the size of an index is determined by the number of tuples in it and the average
> width of each tuple. So as long as you're using the same size of data type, 18 vs 1 sequence won't change the size of
> your indexes.

I'm pretty much concerned with exactly that -- the general distribution of numbers, which affects the average
size/length of each key.

Using an even distribution as an example, the average width of the keys can increase by 2 places:

Since we have ~18 object types, the primary keys in each might range from 1 to      9,999,999
Using a shared sequence, the keys for the same dataset would range from  1 to 189,999,999

Each table is highly related, and may fkey onto 2-4 other tables... So I'm a bit wary of this change. But if it works
for others... I'm fine with that!


> Sequences are designed to be extremely fast to assign. If you ever did find a single sequence being a bottleneck, you
> could always start caching values in each backend. I think it'd be hard (if not impossible) to turn a single global
> sequence into a real bottleneck.

I don't think so either, but everything I've read has been theoretical -- so I was hoping that someone here can give
the "yeah, no issue!" from experience. The closest production stuff I found was via the BDR plugin (the only relevant
thing that came up during search), and there seemed to be anecdotal accounts of issues with sequences becoming
bottlenecks -- but that was from their code that pre-generated allowable sequence ids on each node.

Re: "global" & shared sequences

From
Jim Nasby
Date:
On 10/2/15 4:08 PM, Jonathan Vanasco wrote:
> Using an even distribution as an example, the average width of the keys can increase by 2 places:

Assuming you're using int4 or int8, then that doesn't matter. The only
other possible issue I can think of would be it somehow throwing the
planner stats off, but I think the odds of that are very small.
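As a rough illustration of why the extra digits do not matter for fixed-width integer types, the sketch below uses Python's `struct` module as a stand-in for a 4-byte and an 8-byte integer. It is only an analogy, not PostgreSQL's on-disk format, and `encoded_width` is a hypothetical helper; the two key values are the upper bounds quoted earlier in the thread.

```python
import struct

PER_TABLE_MAX = 9_999_999    # largest per-table key in the example from the thread
SHARED_MAX = 189_999_999     # largest shared-sequence key in the same example


def encoded_width(value, fmt):
    """Bytes used when `value` is packed as a fixed-width integer."""
    return len(struct.pack(fmt, value))


for label, fmt in (("4-byte int", "<i"), ("8-byte int", "<q")):
    widths = [encoded_width(v, fmt) for v in (1, PER_TABLE_MAX, SHARED_MAX)]
    print(label, widths)

# The decimal text grows by two places, but the binary width does not.
print(len(str(PER_TABLE_MAX)), len(str(SHARED_MAX)))
```

Each list printed by the loop holds three identical widths: the value stored has no effect on the space a fixed-width integer occupies. The concern about key length would apply to variable-width representations such as text or numeric keys, which is why the answer is conditioned on using int4 or int8.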

>> Sequences are designed to be extremely fast to assign. If you ever did find a single sequence being a bottleneck,
>> you could always start caching values in each backend. I think it'd be hard (if not impossible) to turn a single
>> global sequence into a real bottleneck.
> I don't think so either, but everything I've read has been theoretical -- so I was hoping that someone here can give
> the "yeah, no issue!" from experience. The closest production stuff I found was via the BDR plugin (only relevant
> thing that came up during search) and there seemed to be anecdotal accounts of issues with sequences becoming
> bottlenecks -- but that was from their code that pre-generated allowable sequence ids on each node.

You could always run a custom pgbench script that runs a PREPAREd
SELECT nextval() and compare that to a prepared SELECT currval(). You
might notice a difference at higher client counts with no caching, but
I doubt you'd see that much difference with caching turned on.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com