Re: Partial index-based load balancing - Mailing list pgsql-general
From | Andy Colson |
---|---|
Subject | Re: Partial index-based load balancing |
Date | |
Msg-id | 551B1A8E.7030200@squeakycode.net Whole thread Raw |
In response to | Re: Partial index-based load balancing (Fabio Ugo Venchiarutti <fabio@vuole.me>) |
List | pgsql-general |
On 3/31/2015 3:28 PM, Fabio Ugo Venchiarutti wrote: > > > On 01/04/15 06:12, Andy Colson wrote: >> On 3/31/2015 1:58 AM, Fabio Ugo Venchiarutti wrote: >>> Greetings >>> >>> >>> I'm working for a startup and our core DB is growing rather fast. >>> >>> Our target scale is large enough that we expect some of our core tables' >>> indexes to grow bigger than the memory on any single node over the next >>> couple years (our current intended design involves conventional >>> stream-replication-based, write-on-one-read-from-many load balancing). >>> >>> We don't fancy the idea of using inheritance through partitioning due to >>> the maintenance overhead and our reliance on validation >>> constraints/triggers. >>> >>> My proposal will be to instead create a number of partial indexes >>> covering predefined ranges of client IDs, then use a connection-level >>> routing mechanism that relies on what range the relevant client's data >>> belongs to in order to address the right node and match the right >>> partial index. >>> >>> The idea is to have any given read-only node hold just one of the >>> partial indexes in its cache and never fetch index pages off its >>> secondary storage. >>> Scaling would just be a matter of increasing the partitioning density. >>> >>> >>> I'm going to assume that I'm not the first one to come up with this >>> strategy (and that there already is a name for it. If so, what is it?). >>> >>> >>> Is it a valid setup or am I missing some key aspect of how index >>> partitioning is meant to work? >>> >>> >>> TIA >>> >>> >>> Best regards >>> >>> >>> Fabio Ugo Venchiarutti >>> >>> >> >> Have you timed it? It sounds like it should work ok, but just to play >> devils advocate it sounds a little over complex. >> >> Heres my thinking: Have one index, have lots of read-only nodes, do the >> connection-level routing. I think the PG caching and OS caching will be >> smart enough to cache the parts of the index that's hot per node. You >> are also less likely to confuse the planner. >> >> For example, I have one db, with web hits logging, with a timestamp >> index. Its upward of 25 gig, but I only every query the last 24 hours >> to get ip/session/hit counts. That part of the index/db is cached very >> well and responds very quick. If I ever hit data that's many days old >> it slow's down hitting the HD. >> >> I don't know for sure which method would be faster though. The only >> real difference between the two methods is the number of indexes. I'd >> bet there is no speed difference between them. PG will figure out what >> data is hot and cache it. >> >> -Andy >> >> > > > I always assumed that an index, or at least the root level of a BTREE > one, had to be completely cached in order to guarantee instantaneous > lookups, but your thinking is seems sound as after the first level > resolution only certain branches will be walked on any given slave. > > Will try both approaches and update you with about my findings if you're > interested (won't happen before a few weeks at least due to management > bottlenecks). > > Many thanks > > > F > > > Yes, interested. Everyone loves a good benchmark. And if your solution (with multiple partial index) (which I've never seen discussed on this list before) is faster, then I propose we name it after you. :-) -Andy
pgsql-general by date: