Re: New access method for b-tree. - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: New access method for b-tree. |
| Date | |
| Msg-id | b0c06e87-1c0a-43d6-8432-5704fbf131ab@vondra.me |
| In response to | Re: New access method for b-tree. (Matthias van de Meent <boekewurm+postgres@gmail.com>) |
| List | pgsql-hackers |
On 2/3/26 17:01, Matthias van de Meent wrote:
> On Mon, 2 Feb 2026 at 00:54, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hello Felipe,
>>
>> On 2/1/26 11:02, Alexandre Felipe wrote:
>>> Hello Hackers,
>>>
>>> Please check this out,
>>>
>>> It is an access method to scan a table, sorting by the second column
>>> of an index, filtered on the first.
>>> Queries like
>>>
>>>   SELECT x, y FROM grid
>>>   WHERE x IN (array of Nx elements)
>>>   ORDER BY y, x
>>>   LIMIT M
>>>
>>> can execute by streaming the rows directly from disk instead of
>>> loading everything.
>
> +1 for the idea, it does sound interesting. I haven't looked in depth
> at the patch, so no comments on the execution yet.
>
>> So how does this compare to skip scan in practice? It's hard to compare,
>> as the patch does not implement an actual access path, but I tried this:
> [...]
>> 1) master
>>
>>   SELECT * FROM merge_test_int
>>   WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
> [...]
>> 2) merge scan
>>
>>   SELECT * FROM test_btree_merge_scan_int(
> [...]
>>     ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
> [...]
>> And with EXPLAIN ANALYZE we get this:
>>
>> 1) Buffers: shared hit=26 read=25
>> 2) Buffers: shared hit=143 read=17
>
> (FYI: your first query was missing "2" from its IN list while it was
> present in the merge scan input; this makes the difference worse by a
> few pages.)
>
>> So it seems to access many more buffers, even if the number of reads
>> is lower. Presumably the merge scan is not always better than skip
>> scan, probably depending on the number of prefixes in the query etc.
>> What is the cost model to decide between those two?
>
> Skip scan always returns data in index order, while this merge scan
> would return tuples in suffix order. The cost model would thus weigh
> the cost of sorting the result of an index skip scan against the cost
> of doing a merge join on n_in_list_items distinct (partial) index
> scans.

Makes sense.
> As for when you would benefit in buffers accessed: the merge scan
> would mainly benefit in number of buffers accessed when the selected
> prefix values are non-sequential, the prefixes cover multiple pages
> at a time, and there is a LIMIT clause on the scan. Normal btree
> index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are
> directly adjacent, while merge scan would generally do at least one
> index descent for each of its N scan heads (*) - which in the
> proposed prototype patch guarantees O(index depth * num scan heads)
> buffer accesses.

Do we have sufficient information to reliably make the right decision?
Can we actually cost the two cases well enough?

> (*) It is theoretically possible to reuse an earlier index descent if
> the SAOP entry's key range of the last descent starts and ends on the
> leaf page that the next SAOP entry's key range also starts on
> (applying the ideas of 5bf748b86b to this new multi-headed index scan
> mode), but that infrastructure doesn't seem to be in place in the
> current patch. That commit is also why your buffer access count for
> master is so low compared to the merge scan's; if your chosen list of
> numbers was multiples of 5 (so that matching tuples are not all
> sequential) you'd probably see much more comparable buffer access
> counts.
>
>> If you had to construct the best case and worst case (vs. skip scan),
>> what would those look like?
>
> Presumably the best case would be:
>
>   -- mytable.a has very few distinct values (e.g. bool or enum);
>   -- mytable.b has many distinct values (e.g. uuid)
>   SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b;
>
> which the index's merge scan would turn into an index scan that
> behaves similar to the following, possibly with the merge join pushed
> down into the index:
>
>   SELECT * FROM (
>     SELECT ... FROM mytable WHERE a = 1
>     UNION
>     SELECT ... FROM mytable WHERE a = 2
>   ) ORDER BY b;
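The O(index depth * num scan heads) figure from the footnote above is easy to put numbers on. A back-of-the-envelope sketch (my own illustration, not a proposed cost model):

```python
def startup_buffer_accesses(index_depth, num_scan_heads):
    """Upper bound on buffer accesses just to position all scan heads:
    each of the N heads performs one root-to-leaf descent of the btree,
    touching one page per level."""
    return index_depth * num_scan_heads

# A 4-level btree with a 15-element IN list: up to 60 page accesses
# before the merge can produce its first row. A skip scan over
# directly adjacent key ranges can avoid most of those descents.
print(startup_buffer_accesses(4, 15))  # 60
```

This startup cost is paid even when LIMIT is tiny, which is consistent with the hit=143 observed for the merge scan versus hit=26 for master above.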
> The worst case would be the opposite:
>
>   -- mytable.a has many distinct values (random uuid); mytable.b few
>   -- (e.g. boolean, enum)
>   SELECT * FROM mytable WHERE a IN (... huge IN list) ORDER BY b;
>
> As the merge scan maintains one internal indexscan head per SAOP
> array element, it'd have significant in-memory and scan startup
> overhead, while few values are produced for each of those scan heads.

OK. It'll be interesting to see how this performs in practice across
the whole gamut between the best and worst case.

>> I'm also wondering how common the targeted query pattern is. How
>> common is it to have an IN condition on the leading column of an
>> index, and ORDER BY on the second one?
>
> I'm not sure, but it seems like it might be common/useful in
> queue-like access patterns:
>
> With an index on (state, updated_timestamp) you're probably
> interested in all messages in just a subset of states, ordered by
> recent state transitions. An index on (updated_timestamp, state)
> might be considered more optimal, but won't be able to efficiently
> serve queries that only want data on uncommon states: the leaf pages
> would mainly contain data on common states, reducing the value of
> those leaf pages.
>
> Right now, you can rewrite the "prefix IN (...) ORDER BY suffix"
> query using UNION, or add an index for each conceivable IN list, but
> it'd be great if the user didn't have to rewrite their query or
> create n_combinations indexes, with their respective space usage, to
> get this more efficient query execution.

I think the examples presented by Ants (with the timeline view) are
quite plausible in practice.

regards

--
Tomas Vondra
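P.S. The column-order point for the queue-like pattern can be illustrated with a toy model (my own, with made-up numbers): with the filter column leading, matching rows are contiguous in the index; with it second, matches for a rare state are scattered across the leaf level.

```python
from math import ceil

def leaf_pages_touched(total_rows, matching_rows, rows_per_page, clustered):
    """Rough count of leaf pages read to fetch `matching_rows` rows.
    clustered=True: matches are contiguous (index leads with the filter
    column, e.g. (state, updated_timestamp)); clustered=False: matches
    are spread evenly (filter column is second, e.g.
    (updated_timestamp, state))."""
    if clustered:
        return ceil(matching_rows / rows_per_page)
    # Scattered matches: with matches spread thinly, nearly every
    # match lands on a distinct leaf page, capped by the total pages.
    return min(matching_rows, ceil(total_rows / rows_per_page))

# 1M queue rows, 1000 in a rare state, 100 rows per leaf page:
print(leaf_pages_touched(1_000_000, 1000, 100, clustered=True))   # 10
print(leaf_pages_touched(1_000_000, 1000, 100, clustered=False))  # 1000
```

This is why (state, updated_timestamp) plus the proposed merge scan is attractive: it keeps rare-state rows clustered while still serving the ORDER BY updated_timestamp without a sort.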