Home > mailing lists

Re: On partitioning - Mailing list pgsql-hackers

From	Claudio Freire
Subject	Re: On partitioning
Date	December 15, 2014 09:43:03
Msg-id	CAGTBQpZBbfOQOmncJDUP0rsYEBYcRkp_ymACRs5B1Mucih1Ccg@mail.gmail.com Whole thread Raw
In response to	Re: On partitioning ("Amit Langote" <Langote_Amit_f8@lab.ntt.co.jp>)
Responses	Re: On partitioning Re: On partitioning
List	pgsql-hackers

Tree view

On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On egress you need some direct way to compare the scan quals with the
>> partitioning values.  I would imagine this to be similar to how scan
>> quals are compared to the values stored in a BRIN index: each scan qual
>> has a corresponding operator strategy and a scan key, and you can say
>> "aye" or "nay" based on a small set of operations that can be run
>> cheaply, again without any proof or running arbitrary expressions.
>>
>
> My knowledge of this is far from being perfect, though to clear any confusions -
>
> As far as planning is concerned, I could not imagine how index access method way of pruning partitions could be made
towork. Of course, I may be missing something.

Let me be overly verbose, don't take it as patronizing, just answering
in lots of detail why this could be a good idea to try.

Normal indexes store a pointer for each key value of sorts. So B-Tree
gets you a set of tids for each key, and so does GIN and hash.

But BRIN is different in that it does the mapping differently. BRIN
stores a compact, approximate representation of the set of keys within
a page range. It can tell with some degree of (in)accuracy whether a
key or key range could be part of that page range or not. The way it
does this is abstracted out, but at its core, it stores a "compressed"
representation of the key set that takes a constant amount of bits to
store, and no more, no matter how many elements. What changes as the
element it represents grows, is its accuracy.

Currently, BRIN only supports min-max representations. It will store,
for each page range, the minimum and maximum of some columns, and when
you query it, you can compare range vs range, and discard whole page
ranges.

Now, what are partitions, if not page ranges?

A BRIN tuple is a min-max pair. But BRIN in more general, it could use
other data structures to hold that "compressed representation", if
someone implemented them. Like bloom filters [0].

A BRIN index is a complex data structure because it has to account for
physically growing tables, but all the complexities vanish when you
fix a "block range" (the BR in BRIN) to a partition. Then, a mere
array of BRIN tuples would suffice.

BRIN already contains the machinery to turn quals into something that
filters out entire partitions, if you provide the BRIN tuples.

And you could even effectively matain a BRIN index for the partitions
(just a BRIN tuple per partition, dynamically updated with every
insertion).

If you do that, you start with empty partitions, and each insert
updates the BRIN tuple. Avoiding concurrency loss in this case would
be tricky, but in theory this could allow very general partition
exclusion. In fact it could even work with constraint exclusion right
now: you'd have a single-tuple BRIN index for each partition and
benefit from it.

But you don't need to pay the price of updating BRIN indexes, as
min-max tuples for each partition can be produced while creating the
partitions if the syntax already provides the information. Then, it's
just a matter of querying this meta-data which just happens to have
the form of a BRIN tuple for each partition.

[0] http://en.wikipedia.org/wiki/Bloom_filter

pgsql-hackers by date:

From: Michael Paquier
Date: 15 December 2014, 09:41:15
Subject: Re: [REVIEW] Re: Compression of full-page-writes

From: "Amit Langote"
Date: 15 December 2014, 10:12:48
Subject: Comment typo in typedef struct BrinTuple

Re: On partitioning - Mailing list pgsql-hackers

Previous

Next