Thread: Do BRIN indexes support MIN/MAX?
Hello,
We've started experimenting with BRIN indexes on some of our large tables, as a replacement for B-Tree indexes on "natural" timestamp columns that seem to be a good case for BRIN. Is it correct that BRIN indexes don't support MIN/MAX operations?
We see that a query like `SELECT max(timestamp_column) FROM table` does a sequential scan on the table instead of using the index.
We're on Postgres 10.23 at the moment if that's important.
--
Andrey Klochkov
> Is it correct that BRIN indexes don't support MIN/MAX operations ?
In theory it should be possible to implement min/max scan support for BRIN; however, it is not implemented yet.
Just in case: a min/max query would still require reading all BRIN pages, and then reading the corresponding pages in the table.
For instance, imagine the table has N pages. Then BRIN would have N/128 pages with the default pages_per_range=128, so your max(..) query would take N/128 + 128 pages to read. In theory the reads would be sequential; however, under concurrent load they might not be that sequential for the disk.
For instance, an 80GiB table would be about 10,000,000 pages, so the default BRIN index would take about 78,000 pages (625MiB), and the min/max scan would read about 626MiB.
If pages_per_range is increased to ~3162, then the index size would be ~3162 pages (25MiB), and each index entry would cover a 25MiB range. Then the query would have to read ~50MiB to fetch the min/max. It is not clear whether that is really practical, though.
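The arithmetic above can be sketched numerically. This is a toy calculation, not real planner logic, and it keeps the same simplification as the estimate above (one range summary per 8 kB index page); the names are made up for illustration:

```python
import math

PAGE = 8192                          # PostgreSQL block size in bytes
table_bytes = 80 * 2**30             # the 80 GiB example table
n_pages = table_bytes // PAGE        # ~10.5M heap pages

def pages_read(pages_per_range):
    # Pages touched by a hypothetical BRIN min/max scan:
    # all index pages (one summary per page, a simplification)
    # plus one heap range of pages_per_range pages.
    index_pages = math.ceil(n_pages / pages_per_range)
    return index_pages + pages_per_range

default_read = pages_read(128) * PAGE / 2**20      # ~640 MiB with defaults
best_ppr = round(math.sqrt(n_pages))               # N/x + x is minimal at x = sqrt(N)
tuned_read = pages_read(best_ppr) * PAGE / 2**20   # ~50 MiB with tuned pages_per_range
```

This also shows where the ~3162 figure comes from: it is roughly sqrt of the heap page count, the point where index read and range read balance out.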
What are your data volumes and expectations, by the way?
Vladimir
On Wed, 29 Mar 2023 at 22:07, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> > Is it correct that BRIN indexes don't support MIN/MAX operations ?
> In theory, it should be possible to implement min/max scan support for BRIN, however it is not implemented yet.
>
> Just in case, min/max query would require to read all BRIN pages, and then it would require to read the corresponding pages in table.
> For instance, imagine the table has N pages. Then BRIN would have N/128 pages with the default pages_per_range=128, so your max(..) query would take N/128 + 128 pages to read. In theory it would be sequential, however, under concurrent load it might not be that sequential for the disk.

I think BRIN would require N/128 RANGES, not pages, and if I am not mistaken it fits several ranges in an index page. The docs talk of summary tuples, and I suspect a summary tuple for, say, an integer is not going to be longer than 128 bytes, in which case you could fit 64 of them in an 8k page.

Also, if you account for possible concurrent load disturbing your index + partial scan, you also have to account for the (more likely) disruption of the full scan.

I.e., I have this table:

apc | apc_cdrs_p2022_12 | table | postgres | permanent | heap | 860 MB |

N/128 index pages would imply an index of about 6.7MB in your numbers, but:

apc | apc_cdrs_p2022_12_cuando_idx | index | postgres | apc_cdrs_p2022_12 | permanent | brin | 64 kB |
apc | apc_cdrs_p2022_12_raw_id_idx | index | postgres | apc_cdrs_p2022_12 | permanent | brin | 64 kB |

The 1st one is on a timestamp column, the second on an integer. And several empty partitions have 48kB indexes, so it seems the data is just 16k for the 860 ranges. That comes to about 20 bytes/range, which more or less fits a couple of values.

In my experience, BRIN indexes are ridiculously small. I use them on that particular table because both cuando and raw_id correlate with insertion order and I normally only read several-megabyte ranges indexed on them, so they work very well in limiting the scan range to nearly what I need.

> For instance, 80GiB table would be like 10’000’000 pages, so the default BRIN would take about 78’000 pages (625MiB), so the min/max scan would read 626 MiB
> If pages per range is increased to ~3162, then index size would be ~3162 pages (25MiB), and each index entry would cover 25MiB range. Then the query would have to read ~50MiB to fetch min/max. It is not clear if that is really practical though.

If you assume your index fits 64 tuples per page, your index read drops to about 10MB, plus the 1MB range. Also, I suspect you will have to read all unsummarized ranges (probably before the summarized ones, as unsummarized ranges can rule out summarized ones, but not the other way around).

Francisco Olarte.
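Francisco's corrected figures can be double-checked numerically. This is a sketch with assumed sizes (~64 summary tuples of ~128 bytes per 8 kB index page, as he suggests), applied to Vladimir's 80 GiB example:

```python
import math

PAGE = 8192                               # PostgreSQL block size in bytes
n_pages = 80 * 2**30 // PAGE              # ~10.5M heap pages (the 80 GiB example)
n_ranges = math.ceil(n_pages / 128)       # ~82k ranges at pages_per_range=128
index_pages = math.ceil(n_ranges / 64)    # 64 summary tuples per index page (assumed)

index_read_mb = index_pages * PAGE / 2**20  # ~10 MB of index to read
range_read_mb = 128 * PAGE / 2**20          # 1 MB for the single candidate heap range
```

So the index read falls from the ~625 MiB of the one-summary-per-page estimate to roughly 10 MB.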
BRIN indexes seem to work perfectly well for our purposes, and they are so tiny compared to B-Tree. Selecting min/max values is very expensive though.
In my case the table is ~2.5TB (530M records), while the whole BRIN index is 16MB. I think it'd be totally fine to scan all BRIN pages; it'd be way better than doing a table scan.
On Wed, Mar 29, 2023 at 1:47 PM Francisco Olarte <folarte@peoplecall.com> wrote:
On Wed, 29 Mar 2023 at 22:07, Vladimir Sitnikov
<sitnikov.vladimir@gmail.com> wrote:
> > Is it correct that BRIN indexes don't support MIN/MAX operations ?
> In theory, it should be possible to implement min/max scan support for BRIN, however it is not implemented yet.
>
> Just in case, min/max query would require to read all BRIN pages, and then it would require to read the corresponding pages in table.
> For instance, imagine the table has N pages. Then BRIN would have N/128 pages with the default pages_per_range=128, so your max(..) query would take N/128 + 128 pages to read. In theory it would be sequential, however, under concurrent load it might not be that sequential for the disk.
I think BRIN would require N/128 RANGES, not pages, and if I am not
mistaken it fits several ranges in an index page. The docs talk of summary
tuples, and I suspect a summary tuple for, say, an integer is not
going to be longer than 128 bytes, in which case you could fit 64 of
them in an 8k page.
Also, if you account for possible concurrent load disturbing your
index + partial scan, you also have to account for the (more likely)
disruption of the full scan.
I.e., I have this table
apc | apc_cdrs_p2022_12 | table | postgres | permanent | heap | 860 MB |

N/128 index pages would imply an index of about 6.7MB in your numbers, but:

apc | apc_cdrs_p2022_12_cuando_idx | index | postgres | apc_cdrs_p2022_12 | permanent | brin | 64 kB |
apc | apc_cdrs_p2022_12_raw_id_idx | index | postgres | apc_cdrs_p2022_12 | permanent | brin | 64 kB |
The 1st one is on a timestamp column, the second on an integer. And several
empty partitions have 48kB indexes, so it seems the data is just 16k for
the 860 ranges. That comes to about 20 bytes/range, which more or less
fits a couple of values.
In my experience, BRIN indexes are ridiculously small. I use them on that
particular table because both cuando and raw_id correlate with
insertion order and I normally only read several-megabyte ranges
indexed on them, so they work very well in limiting the scan range to
nearly what I need.
> For instance, 80GiB table would be like 10’000’000 pages, so the default BRIN would take about 78’000 pages (625MiB), so the min/max scan would read 626 MiB
> If pages per range is increased to ~3162, then index size would be ~3162 pages (25MiB), and each index entry would cover 25MiB range. Then the query would have to read ~50MiB to fetch min/max. It is not clear if that is really practical though.
If you assume your index fits 64 tuples per page, your index read drops
to about 10MB, plus the 1MB range.
Also, I suspect you will have to read all unsummarized ranges
(probably before the summarized ones, as unsummarized ranges can rule
out summarized ones, but not the other way around).
Francisco Olarte.
Andrey Klochkov
Right you are, I should have divided it by ranges per page.
Vladimir
On Thu, 30 Mar 2023, 05:03 Andrey Klochkov, <diggerk@gmail.com> wrote:
BRIN indexes seem to work perfectly well for our purposes, and they are so tiny compared to B-Tree. Selecting min/max values is very expensive though. In my case the table is ~2.5TB (530M records), while the whole BRIN index is 16MB. I think it'd be totally fine to scan all BRIN pages, it'd be way better than doing a table scan.
BRIN indexes don't work the way you would hope for. The stored min/max values per range guarantee that all values in the underlying relation pages are contained in that range, but they don't mean that those min/max values are still present in the table, so you can't deduce from them which range the current min or max value is in.
Julien Rouhaud <rjuju123@gmail.com> writes:
> brin indexes don't work the way you would hope for. the stored min/max
> values per range guarantees that all values in the underlying relation
> pages are contained in that range, but it doesn't mean that those min/max
> values are still present in the table, so you can't deduce in which range
> the current min or max value is from there.

Yeah. You could for example (when looking for a MAX) skip scanning block ranges whose indexed MAX is less than the indexed MIN of some other block range. But unless the column you are interested in is pretty well correlated with physical storage order, that seems unlikely to buy much.

[ ... thinks ... ] Actually, I'm not sure that even that optimization works without hackery. It requires that you know that the "other block range" is not empty according to your snapshot. I guess you could work by scanning block ranges in descending indexed-MIN order, and skipping ranges only once you have a candidate MAX in hand. But that sounds complicated and very prone to nonsequential table access, so I'm not sure how much of a win it is.

If you need fast min/max access, you probably need to bite the bullet and maintain a btree index.

regards, tom lane
> so you can't deduce in which range the current min or max value is from there.
That is why you select several candidate ranges and scan the table for those ranges.
For instance, if you have ranges
1) 1..4
2) 5..8
3) 6..9
Then you do something like
select x
from (
select max(col) x from tab t where rowid in 5..8 or rowid in 6..9
union all
select max(col) x from tab t where rowid in 1..4
)
limit 1
If the first two ranges (2 and 3) happen to be empty, then scanning range 1 would be needed.
Of course, it would degrade to scanning all the pages in table if all the ranges intersect or if all the rows are deleted from the table. However, it might work well in timestamp-like cases.
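The candidate-range idea above can be sketched in Python. This is a toy model, not PostgreSQL code: the hypothetical `brin_max` function works over in-memory (summary_max, rows) pairs standing in for BRIN summaries and heap ranges, and it falls back to less promising ranges exactly when the better-looking ones turn out to be empty:

```python
def brin_max(ranges):
    """Max over BRIN-style ranges.

    ranges: iterable of (summary_max, rows) pairs. A summary max may be
    stale (rows can have been deleted), but every live row is guaranteed
    not to exceed its range's summary max.
    """
    best = None
    # Scan the most promising range first, then fall through to the rest.
    for smax, rows in sorted(ranges, key=lambda r: r[0], reverse=True):
        # Any found value works as a cutoff: a range whose summary max
        # cannot beat the current best is skipped, and so are all later ones.
        if best is not None and smax <= best:
            break
        for value in rows:
            if best is None or value > best:
                best = value
    return best
```

With the three example ranges, if ranges 2 and 3 are empty the scan proceeds down to range 1, and the whole table is read only in the degenerate all-overlapping or all-deleted cases.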
Vladimir
On Thu, 30 Mar 2023 at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > brin indexes don't work the way you would hope for. the stored min/max
> > values per range guarantees that all values in the underlying relation
> > pages are contained in that range, but it doesn't mean that those min/max
> > values are still present in the table, so you can't deduce in which range
> > the current min or max value is from there.
>
> Yeah. You could for example (when looking for a MAX) skip scanning
> block ranges whose indexed MAX is less than the indexed MIN of some
> other block range. But unless the column you are interested in is
> pretty well correlated with physical storage order, that seems
> unlikely to buy much.

I might be missing something obvious here, but as I understand it, there's not really any API to ask an index AM what the maximum or minimum indexed value is. There are only amcanorder == true indexes that can give you values starting with the highest or lowest, depending on if it's a forward or backward index scan.

Tomas is doing some work in https://commitfest.postgresql.org/42/3949/ which I imagined would allow index scans of BRIN indexes, which in turn would lead to allowing the MIN/MAX aggregates to effectively be executed as SELECT <col> FROM <table> ORDER BY col LIMIT 1, as is currently done in build_minmax_path().

I'd assume the way to make this work with BRIN would be to allow ordered scans by first splitting all ranges into non-overlapping sets and then sorting tuples from each of those range sets in batches. Of course, that would only be efficient when ranges were reasonably non-overlapping.

(Glancing at Tomas's patch, I was surprised to see it didn't set amcanorder to true, so I'm a little unsure how that patch is adding more usable optimisations which the planner can make use of.)

David
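The non-overlapping-sets idea might look roughly like this. A toy sketch over in-memory (summary_min, summary_max, rows) tuples, ignoring stale summaries and unsummarized ranges; `brin_ordered_scan` is a hypothetical name, not anything in the patch:

```python
def brin_ordered_scan(ranges):
    """Toy sketch of an ordered BRIN scan: group ranges that overlap on
    their (summary_min, summary_max) bounds, then sort and emit each
    non-overlapping group of rows as one batch."""
    out = []
    group_rows, group_max = [], None
    for smin, smax, rows in sorted(ranges, key=lambda r: r[0]):
        if group_rows and smin > group_max:
            # This range cannot overlap the current group: flush the batch.
            out.extend(sorted(group_rows))
            group_rows, group_max = [], None
        group_rows.extend(rows)
        group_max = smax if group_max is None else max(group_max, smax)
    out.extend(sorted(group_rows))
    return out
```

As David notes, the batches stay small (and the scan cheap) only when the summaries don't all overlap; with fully overlapping ranges this degenerates into one big sort.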
> You could for example (when looking for a MAX) skip scanning
> block ranges whose indexed MAX is less than the indexed MIN of some
> other block range.
When looking for a max, you could scan the range with the maximal indexed MAX, and then skip all the ranges that have a MAX less than the already-found max value. You can use any found value as a cutoff point.
You do not need much hackery, as you can scan the most promising range first, and then scan the others only if the most promising one did not yield a value which is known to exceed all the other ranges.
So find_min and find_max can be pretty efficient with BRIN.
I'm not sure about the practical use of that, though.
Top N can be implemented in the same way by passing all the values to a binary heap and skipping all the ranges that are known to be less than what we already have in the heap.
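The binary-heap variant can be sketched the same way. As before, this is a toy over in-memory (summary_max, rows) pairs, with a hypothetical `brin_top_n` helper; the heap of the n best values seen so far supplies the cutoff:

```python
import heapq

def brin_top_n(ranges, n):
    """Top-n values over BRIN-style (summary_max, rows) ranges."""
    heap = []  # min-heap of the n largest values seen so far
    for smax, rows in sorted(ranges, key=lambda r: r[0], reverse=True):
        # Once the heap is full, a range whose summary max is not above
        # the smallest kept value can contribute nothing; since ranges
        # are visited in descending summary-max order, neither can the rest.
        if len(heap) == n and smax <= heap[0]:
            break
        for value in rows:
            if len(heap) < n:
                heapq.heappush(heap, value)
            elif value > heap[0]:
                heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True)
```

Setting n = 1 reduces this to the max lookup described above.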
Andrey, could you clarify the use cases for looking up the min/max record?
I am afraid PostgreSQL has no way to represent “this access method supports min/max retrieval”; what currently exists is “this access method supports ordered scans sorted by the indexed column’s value”.
Apparently, implementing an ordered BRIN scan would be harder (from the implementation, costing, and testing points of view), so it would be nice to hear about the use cases for having min/max AM scans or for having BRIN ordered scans.
Vladimir
Vladimir,
Here's the use case description. We have some large tables that are effectively append-only logs with some stats on our systems. The table I was experimenting with recently contains 6 months of data. It has a timestamp column that tells when the stats in the row were collected.

The system that consumes the data from this table normally does point queries that use B-Tree indexes which involve more than the timestamp column. Sometimes that system needs to reprocess a few hours to a few days of the most recent data, and that's when the BRIN index is used.

In addition, we sometimes check the recency of the data in the table by finding "MAX(timestamp_column)", and that's when the BRIN index is not helpful. For now we modified that query to limit the search to a few days of data, as we don't expect the data in the table to ever become more stale than that.
On Wed, Mar 29, 2023 at 9:58 PM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> You could for example (when looking for a MAX) skip scanning
> block ranges whose indexed MAX is less than the indexed MIN of some
> other block range.

When looking for a max, you could scan the range with the maximal indexed MAX, and then you could skip all the ranges that have MAX less than the already found max value. You can use any found value as a cutoff point. You do not need to have much hackery as you can scan the most promising range first, and then scan the others if the most promising did not yield a value which is known to exceed all the other ranges.

So find_min and find_max can be pretty efficient with BRIN. I’m not sure regarding the practical use of that though.

Top N can be implemented in the same way by passing all the values to a binary heap and skipping all the ranges that are known to be less than what we already have in the heap.

Andrey, could you clarify the use cases for looking up the min/max record?

I am afraid PostgreSQL has no way to represent “this access method supports min/max retrieval”; what currently exists is “this access method supports ordered scans sorted by the indexed column’s value”. Apparently, implementing an ordered BRIN scan would be harder (from the implementation, costing, and testing points of view), so it would be nice to hear on the use cases for having min/max am scans or for having BRIN ordered scans.

Vladimir
Andrey Klochkov