Re: Add the ability to limit the amount of memory that can be allocated to backends. - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Add the ability to limit the amount of memory that can be allocated to backends. |
| Date | |
| Msg-id | 9bae2a6c-8bc0-45ce-940e-b4ae9d5b119f@vondra.me |
| In response to | Re: Add the ability to limit the amount of memory that can be allocated to backends. (James Hunter <james.hunter.pg@gmail.com>) |
| List | pgsql-hackers |
Hi,

Let me bump this dormant thread after about a year. I don't have any patch to share or anything like that, but I happened to read two quite interesting papers relevant to the topic of setting a per-query memory limit and distributing the memory between operators in a query plan:

1) Exploiting pipeline interruptions for efficient memory allocation
   Authors: Josep Aguilar-Saborit, Mohammad Jalali, Dave Sharpe, Victor Muntés-Mulero
   Published: 2008
   PDF: https://dl.acm.org/doi/epdf/10.1145/1458082.1458169

2) Saving Private Hash Join
   Authors: Laurens Kuiper, Paul Groß, Peter Boncz, Hannes Mühleisen
   Published: 2024-2025
   PDF: https://www.vldb.org/pvldb/vol18/p2748-kuiper.pdf

I read the (2) paper first, expecting to learn some new stuff about hash joins, but to my surprise it focuses on how to distribute memory between multiple hash joins in a single query. The hash joins serve mostly as an example of an actual/common operator in a query. The (1) paper establishes the theoretical framework and algorithms, and (2) presents a more precise/complete model for hash joins.

I think both papers are worth reading, and the framework seems quite helpful, even if some parts may need tweaks, as Postgres does certain things differently for whatever reason.

I'm not going to explain all the details (the papers are not that long anyway), but the proposed approach combines a couple of basic parts:

1) leverage "pipeline interruption" operations

Some operators materialize the intermediate results. This splits the plan into parts that don't need memory at the same time. It's enough to enforce the limit for these parts, not for the whole plan. Which means the optimization problems are smaller, and the operators can get more memory than when assuming the whole query runs at the same time.

Of course, the reality is more complicated. Some operators are only partial pipeline interruptions (e.g. a hash join breaks the pipeline for the inner subplan, not the outer one). And Postgres currently delays the Hash build until the first outer tuple, unlike DuckDB used in the (2) paper.

2) cost model for memory distribution

The memory is distributed between operators based on a simple cost model, instead of using a simple scheme where each operator gets 1/N of the memory. The (2) paper presents a hash join cost model combining "materialization cost" and "throughput". But it's a pretty generic approach, and should not be too difficult to do for other operators. Most operators don't even need to allocate that much memory, so the cost model can ignore those, I think. Only nodes that "break the pipeline" would matter. (There's a toy sketch of this in the PS below.)

The important part is that this is a second optimization phase. The optimizer picks a query plan (assuming some default memory sizes), and then the cost model determines how to distribute memory within that particular plan. The paper does that repeatedly during execution, but I suspect these adjustments are easier in DuckDB. In Postgres we'd probably do that only once, right after planning? Or maybe not, not sure.

The (1) paper does things a bit differently, so that it works on DB2 (so less dynamic), but it also focuses on plans with hash joins etc. The general framework seems to be mostly the same.

Anyway, that's all I have. I don't expect to work on this anytime soon, but I found those papers interesting.

regards

--
Tomas Vondra
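PS: To make the cost-model part a bit more concrete, here is a minimal standalone C sketch of what the "second phase" could do: take a fixed per-query budget and the pipeline-breaking operators, and split the budget by a crude weight instead of 1/N. The structures, the weight function and the numbers are all made up for illustration - this is not actual Postgres code, nor the exact model from the papers.

```c
/*
 * Toy sketch of the "second optimization phase": given a per-query
 * memory budget and the memory-hungry operators (those that break
 * the pipeline), distribute the budget using a simple cost model
 * instead of granting each operator budget/N.
 *
 * All names and the weight function are invented for illustration.
 */
#include <stdio.h>

typedef struct MemoryConsumer
{
	const char *name;		/* e.g. "HashJoin #1" */
	double		ideal_kb;	/* memory needed to avoid spilling */
	double		spill_cost; /* estimated cost of spilling 1kB to disk */
	double		grant_kb;	/* output: memory granted to this node */
} MemoryConsumer;

/*
 * Distribute query_budget_kb among the consumers.  If everything fits,
 * each node simply gets its ideal amount.  Otherwise the budget is split
 * in proportion to (ideal_kb * spill_cost), so nodes for which spilling
 * is more expensive get a larger share.  A real model would also
 * redistribute any leftover from nodes capped at their ideal size.
 */
static void
distribute_memory(MemoryConsumer *consumers, int n, double query_budget_kb)
{
	double		total_ideal = 0.0;
	double		weight_sum = 0.0;

	for (int i = 0; i < n; i++)
		total_ideal += consumers[i].ideal_kb;

	if (total_ideal <= query_budget_kb)
	{
		for (int i = 0; i < n; i++)
			consumers[i].grant_kb = consumers[i].ideal_kb;
		return;
	}

	for (int i = 0; i < n; i++)
		weight_sum += consumers[i].ideal_kb * consumers[i].spill_cost;

	for (int i = 0; i < n; i++)
	{
		double		weight = consumers[i].ideal_kb * consumers[i].spill_cost;
		double		share = query_budget_kb * (weight / weight_sum);

		/* never grant more than the node can actually use */
		consumers[i].grant_kb =
			(share < consumers[i].ideal_kb) ? share : consumers[i].ideal_kb;
	}
}

int
main(void)
{
	MemoryConsumer ops[] = {
		{"HashJoin #1", 4096.0, 2.0, 0.0},
		{"HashJoin #2", 1024.0, 8.0, 0.0},
		{"Sort", 2048.0, 1.0, 0.0},
	};
	int			n = sizeof(ops) / sizeof(ops[0]);
	double		budget_kb = 4096.0; /* per-query limit */

	distribute_memory(ops, n, budget_kb);

	for (int i = 0; i < n; i++)
		printf("%-12s ideal=%6.0f kB  granted=%6.0f kB\n",
			   ops[i].name, ops[i].ideal_kb, ops[i].grant_kb);

	return 0;
}
```

The point is only the shape of the interface: the planner has already picked the plan, and this step just decides how the per-query budget is carved up among the pipeline-breaking nodes.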