Thread: "global" & shared sequences
Hoping to glean some advice from the more experienced....

The major component of our application currently tracks a few dozen object types, and the total number of objects is in the hundreds-of-millions range; Postgres will potentially be tracking billions of objects.

Right now the primary key for our "core" objects is based on a per-table sequence, but each object also has a secondary id based on a global/shared sequence: we expose everything via a connected object graph, and basically needed a global sequence. We are currently scaled vertically (1x writer, 2x reader).

I'd like to avoid assuming any more technical debt, and am not thrilled with the current setup. Our internal relations are all by the table's primary key, but the external (API, web) queries use the global id. Every table has 2 indexes, and we need to convert a "global" id to a "table" id before doing a query. If we were able to replace the per-table primary key with the global id, we'd free up some disk space from the indexes and tables, and wouldn't have to keep our performance cache that maps table ids to global ids (a simplified sketch of both layouts is below).

The concerns that I have before moving ahead are:

1. General performance at different stages of DB size. With 18 sequences, our keys/indexes are simply smaller than they'd be with 1 key. I wonder how this will impact lookups and joins.

2. Managing this sequence when next scaling the db (which would probably have to be sharding, unless others have a suggestion).

If anyone has insights, they would be greatly appreciated.
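To make the comparison concrete, here is a minimal sketch of the current layout versus the proposed one, assuming a shared sequence named global_id_seq and a placeholder table name (widget); the real schema obviously differs:

    -- one sequence shared by every "core" object table
    CREATE SEQUENCE global_id_seq;

    -- current layout: a per-table primary key plus a globally unique secondary id
    -- (two indexes per table, and external lookups must map global_id -> id)
    CREATE TABLE widget (
        id        bigserial PRIMARY KEY,
        global_id bigint NOT NULL UNIQUE DEFAULT nextval('global_id_seq')
    );

    -- proposed layout: the shared sequence value *is* the primary key
    -- (one index per table, no global-to-table id mapping needed)
    CREATE TABLE widget_v2 (
        id bigint PRIMARY KEY DEFAULT nextval('global_id_seq')
    );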
On 10/1/15 6:48 PM, Jonathan Vanasco wrote:
> 1. General performance at different stages of DB size. With 18 sequences, our keys/indexes are simply smaller than they'd be with 1 key. I wonder how this will impact lookups and joins.

I'm not really following here... the size of an index is determined by the number of tuples in it and the average width of each tuple. So as long as you're using the same size of data type, 18 vs 1 sequence won't change the size of your indexes.

> 2. Managing this sequence when next scaling the db (which would probably have to be sharding, unless others have a suggestion)

Sequences are designed to be extremely fast to assign. If you ever did find a single sequence being a bottleneck, you could always start caching values in each backend. I think it'd be hard (if not impossible) to turn a single global sequence into a real bottleneck.

If you start sharding, you'll need to either create a composite ID where part of the ID is a shard identifier (say, the top 8 bits), or assign IDs in ranges that are allotted to each shard. There's work being done right now to make the second option a bit easier. Probably better would be if you could shard based on something like object or customer; that way you only have to look up which shard the customer lives in.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
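For reference, a rough sketch of the two ideas above, assuming the same hypothetical global_id_seq; the CACHE size, the function name, and the exact 8-bit shard split are only illustrative:

    -- per-backend caching: each backend grabs 64 values at a time, so most
    -- nextval() calls never touch shared state (at the cost of gaps and
    -- non-monotonic ordering across sessions)
    ALTER SEQUENCE global_id_seq CACHE 64;

    -- composite id: shard number in the high bits of a bigint (kept below the
    -- sign bit), local sequence value in the low 55 bits
    CREATE FUNCTION make_global_id(shard int) RETURNS bigint AS $$
        SELECT (shard::bigint << 55) | nextval('global_id_seq');
    $$ LANGUAGE sql;

The range-per-shard alternative would instead give each shard its own sequence constrained to a disjoint MINVALUE/MAXVALUE range.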
Thanks for the reply.

On Oct 2, 2015, at 3:26 PM, Jim Nasby wrote:
> I'm not really following here... the size of an index is determined by the number of tuples in it and the average width of each tuple. So as long as you're using the same size of data type, 18 vs 1 sequence won't change the size of your indexes.

I'm pretty much concerned with exactly that -- the general distribution of numbers, which affects the average size/length of each key. Using an even distribution as an example, the average width of the keys can increase by two digits:

    Since we have ~18 object types, the primary keys in each might range from 1 to 9,999,999.
    Using a shared sequence, the keys for the same dataset would range from 1 to 189,999,999.

Each table is highly related, and may fkey onto 2-4 other tables... so I'm a bit wary of this change. But if it works for others, I'm fine with that!

> Sequences are designed to be extremely fast to assign. If you ever did find a single sequence being a bottleneck, you could always start caching values in each backend. I think it'd be hard (if not impossible) to turn a single global sequence into a real bottleneck.

I don't think so either, but everything I've read has been theoretical -- so I was hoping that someone here could give the "yeah, no issue!" from experience. The closest production stuff I found was via the BDR plugin (the only relevant thing that came up during a search), and there seemed to be anecdotal accounts of issues with sequences becoming bottlenecks -- but that was from their code that pre-generated allowable sequence ids on each node.
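One quick sanity check on the width question, assuming the ids are stored as plain int4/int8 columns: the on-disk size of an integer key is fixed by its type, not by how many digits the value has. For example:

    -- both return 8: a bigint takes 8 bytes whether it holds 1 or 189,999,999
    SELECT pg_column_size(1::bigint), pg_column_size(189999999::bigint);

    -- both return 4: likewise an int4 is always 4 bytes
    SELECT pg_column_size(1::int), pg_column_size(9999999::int);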
On 10/2/15 4:08 PM, Jonathan Vanasco wrote:
> Using an even distribution as an example, the average width of the keys can increase by two digits:

Assuming you're using int4 or int8, that doesn't matter. The only other possible issue I can think of would be it somehow throwing the planner stats off, but I think the odds of that are very small.

>> Sequences are designed to be extremely fast to assign. If you ever did find a single sequence being a bottleneck, you could always start caching values in each backend. I think it'd be hard (if not impossible) to turn a single global sequence into a real bottleneck.
>
> I don't think so either, but everything I've read has been theoretical -- so I was hoping that someone here could give the "yeah, no issue!" from experience. The closest production stuff I found was via the BDR plugin (the only relevant thing that came up during a search), and there seemed to be anecdotal accounts of issues with sequences becoming bottlenecks -- but that was from their code that pre-generated allowable sequence ids on each node.

You could always run a custom pgbench that runs a PREPAREd SELECT nextval() and compare that to a prepared SELECT currval(). You might notice a difference at higher client counts with no caching, but I doubt you'd see that much difference with caching turned on.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
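A minimal version of that pgbench test might look like the following; the sequence name, file name, and client counts are only examples:

    -- nextval.sql: one prepared statement per "transaction",
    -- hammering the shared sequence from many clients at once
    SELECT nextval('global_id_seq');

    -- run with, e.g.:  pgbench -n -M prepared -c 32 -j 4 -T 30 -f nextval.sql
    -- then repeat with different -c values, and with and without CACHE on the sequence

For the currval() baseline, note that currval() raises an error until nextval() has been called at least once in the same session, so that script would need a leading nextval() call (which makes it a slightly less pure baseline).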