Re: Partial index-based load balancing - Mailing list pgsql-general

From Andy Colson
Subject Re: Partial index-based load balancing
Date
Msg-id 551B1A8E.7030200@squeakycode.net
Whole thread Raw
In response to Re: Partial index-based load balancing  (Fabio Ugo Venchiarutti <fabio@vuole.me>)
List pgsql-general
On 3/31/2015 3:28 PM, Fabio Ugo Venchiarutti wrote:
>
>
> On 01/04/15 06:12, Andy Colson wrote:
>> On 3/31/2015 1:58 AM, Fabio Ugo Venchiarutti wrote:
>>> Greetings
>>>
>>>
>>> I'm working for a startup and our core DB is growing rather fast.
>>>
>>> Our target scale is large enough that we expect some of our core tables'
>>> indexes to grow bigger than the memory on any single node over the next
>>> couple years (our current intended design involves conventional
>>> stream-replication-based, write-on-one-read-from-many load balancing).
>>>
>>> We don't fancy the idea of using inheritance through partitioning due to
>>> the maintenance overhead and our reliance on validation
>>> constraints/triggers.
>>>
>>> My proposal will be to instead create a number of partial indexes
>>> covering predefined ranges of client IDs, then use a connection-level
>>> routing mechanism that relies on what range the relevant client's data
>>> belongs to in order to address the right node and match the right
>>> partial index.
>>>
>>> The idea is to have any given read-only node hold just one of the
>>> partial indexes in its cache and never fetch index pages off its
>>> secondary storage.
>>> Scaling would just be a matter of increasing the partitioning density.
>>>
>>>
>>> I'm going to assume that I'm not the first one to come up with this
>>> strategy (and that there already is a name for it. If so, what is it?).
>>>
>>>
>>> Is it a valid setup or am I missing some key aspect of how index
>>> partitioning is meant to work?
>>>
>>>
>>> TIA
>>>
>>>
>>> Best regards
>>>
>>>
>>> Fabio Ugo Venchiarutti
>>>
>>>
>>
>> Have you timed it?  It sounds like it should work ok, but just to play
>> devils advocate it sounds a little over complex.
>>
>> Heres my thinking:  Have one index, have lots of read-only nodes, do the
>> connection-level routing.  I think the PG caching and OS caching will be
>> smart enough to cache the parts of the index that's hot per node.  You
>> are also less likely to confuse the planner.
>>
>> For example, I have one db, with web hits logging, with a timestamp
>> index.  Its upward of 25 gig, but I only every query the last 24 hours
>> to get ip/session/hit counts.  That part of the index/db is cached very
>> well and responds very quick.  If I ever hit data that's many days old
>> it slow's down hitting the HD.
>>
>> I don't know for sure which method would be faster though.  The only
>> real difference between the two methods is the number of indexes.  I'd
>> bet there is no speed difference between them.  PG will figure out what
>> data is hot and cache it.
>>
>> -Andy
>>
>>
>
>
> I always assumed that an index, or at least the root level of a BTREE
> one, had to be completely cached in order to guarantee instantaneous
> lookups, but your thinking is seems sound as after the first level
> resolution only certain branches will be walked on any given slave.
>
> Will try both approaches and update you with about my findings if you're
> interested (won't happen before a few weeks at least due to management
> bottlenecks).
>
> Many thanks
>
>
> F
>
>
>

Yes, interested.  Everyone loves a good benchmark.  And if your solution
(with multiple partial index) (which I've never seen discussed on this
list before) is faster, then I propose we name it after you.  :-)

-Andy


pgsql-general by date:

Previous
From: Fabio Ugo Venchiarutti
Date:
Subject: Re: Partial index-based load balancing
Next
From: Joshua Ma
Date:
Subject: Why does CREATE INDEX CONCURRENTLY need two scans?