$SUBJECT is still a very loosely formed idea, so forgive lack of detail or things I've likely missed, but I wanted to get it out there to see if it sounded at all intriguing to people.
Background: One of the big problems with non-local storage such as AWS EBS volumes or a SAN is that in a large database (really, working set, where working set includes reads) exceeds the size of buffer cache (and page cache) the cost of random page reads hitting the underlying disk system dominates. This is because networked disks have an order of magnitude higher latency than a bunch of RAIDed SSDs (even more so with NVMe storage). In some of our experiments on Aurora I've seen a 10x change versus pretty good physical hardware, and I'd assume RDS (since it's EBS-backed) is similar.
A specific area where this is particularly painful is btree index reads. Walking the tree to leaf pages isn't naturally prefetchable, and so for each level you pay the random page cost. Of course higher levels in the tree will almost certainly exhibit emergent behavior such that they (just by fact of the LRU caching) will be in the buffer cache, but for a large index lower levels likely won't be.
If we squint a bit, insertions look a whole lot like reads as well since we have to walk the tree to find the leaf insertion page for a new tuple. This is particularly true for indexes where inserts are roughly randomly distributed data, like a uuid.
The read-for-lookups problem is harder to solve, but the cost as it relates to table inserts is possibly more tractable. Tables typically have more than one index to update, so the obvious approach is "let's just parallelize the index insertions". Of course we know that's difficult given the multi-process approach Postgres uses for parallelism.
Another approach that at first glance seems like it fits better into Postgres (I'm not claiming it's easy or a small patch) would be to process a batch of indexes at once. For example, if the index access methods were extended to allow being given a list of indexes that need to be walked, then the btree code could process each layer in the walk as a group -- issuing IO fetches for all of the first level blocks in the tree, and then computing all of the next level blocks needed and issuing those IO requests at a time, and so on.
In some workloads we've been testing I believe such an approach could plausibly improve table insert (and update) performance by multiple hundreds of percent.
I don't have any code at the moment to show here, but I wanted to get the idea out there to see if there were any immediate reactions or other thoughts on the topic.
Thoughts?
James