Re: On partitioning - Mailing list pgsql-hackers
From | Claudio Freire |
---|---|
Subject | Re: On partitioning |
Date | |
Msg-id | CAGTBQpZBbfOQOmncJDUP0rsYEBYcRkp_ymACRs5B1Mucih1Ccg@mail.gmail.com Whole thread Raw |
In response to | Re: On partitioning ("Amit Langote" <Langote_Amit_f8@lab.ntt.co.jp>) |
Responses |
Re: On partitioning
Re: On partitioning |
List | pgsql-hackers |
On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On egress you need some direct way to compare the scan quals with the >> partitioning values. I would imagine this to be similar to how scan >> quals are compared to the values stored in a BRIN index: each scan qual >> has a corresponding operator strategy and a scan key, and you can say >> "aye" or "nay" based on a small set of operations that can be run >> cheaply, again without any proof or running arbitrary expressions. >> > > My knowledge of this is far from being perfect, though to clear any confusions - > > As far as planning is concerned, I could not imagine how index access method way of pruning partitions could be made towork. Of course, I may be missing something. Let me be overly verbose, don't take it as patronizing, just answering in lots of detail why this could be a good idea to try. Normal indexes store a pointer for each key value of sorts. So B-Tree gets you a set of tids for each key, and so does GIN and hash. But BRIN is different in that it does the mapping differently. BRIN stores a compact, approximate representation of the set of keys within a page range. It can tell with some degree of (in)accuracy whether a key or key range could be part of that page range or not. The way it does this is abstracted out, but at its core, it stores a "compressed" representation of the key set that takes a constant amount of bits to store, and no more, no matter how many elements. What changes as the element it represents grows, is its accuracy. Currently, BRIN only supports min-max representations. It will store, for each page range, the minimum and maximum of some columns, and when you query it, you can compare range vs range, and discard whole page ranges. Now, what are partitions, if not page ranges? A BRIN tuple is a min-max pair. But BRIN in more general, it could use other data structures to hold that "compressed representation", if someone implemented them. Like bloom filters [0]. A BRIN index is a complex data structure because it has to account for physically growing tables, but all the complexities vanish when you fix a "block range" (the BR in BRIN) to a partition. Then, a mere array of BRIN tuples would suffice. BRIN already contains the machinery to turn quals into something that filters out entire partitions, if you provide the BRIN tuples. And you could even effectively matain a BRIN index for the partitions (just a BRIN tuple per partition, dynamically updated with every insertion). If you do that, you start with empty partitions, and each insert updates the BRIN tuple. Avoiding concurrency loss in this case would be tricky, but in theory this could allow very general partition exclusion. In fact it could even work with constraint exclusion right now: you'd have a single-tuple BRIN index for each partition and benefit from it. But you don't need to pay the price of updating BRIN indexes, as min-max tuples for each partition can be produced while creating the partitions if the syntax already provides the information. Then, it's just a matter of querying this meta-data which just happens to have the form of a BRIN tuple for each partition. [0] http://en.wikipedia.org/wiki/Bloom_filter
pgsql-hackers by date: