Thread: Let's make PostgreSQL multi-threaded
I spoke with some folks at PGCon about making PostgreSQL multi-threaded, so that the whole server runs in a single process, with multiple threads. It has been discussed many times in the past; the last thread on pgsql-hackers was back in 2017, when Konstantin made some experiments [0].

I feel that there is now pretty strong consensus that it would be a good thing, more so than before. Lots of work to get there, and lots of details to be hashed out, but no objections to the idea at a high level.

The purpose of this email is to make that silent consensus explicit. If you have objections to switching from the current multi-process architecture to a single-process, multi-threaded architecture, please speak up.

If there are no major objections, I'm going to update the developer FAQ, removing the excuses there for why we don't use threads [1]. And we can start to talk about the path to get there. Below is a list of some hurdles and proposed high-level solutions. This isn't an exhaustive list, just some of the most obvious problems:

# Transition period

The transition surely cannot be done fully in one release. Even if we could pull it off in core, extensions will need more time to adapt. There will be a transition period of at least one release, probably more, where you can choose the multi-process or multi-thread model with a GUC. Depending on how it goes, we can document it as experimental at first.

# Thread per connection

To get started, it's most straightforward to have one thread per connection, simply replacing each backend process with a backend thread. In the future, we might want a thread pool with some kind of scheduler to assign active queries to worker threads. Or multiple threads per connection, or additional helper threads spawned for specific tasks. But that's future work.
# Global variables

We have a lot of global and static variables:

$ objdump -t bin/postgres | grep -e "\.data" -e "\.bss" | grep -v "data.rel.ro" | wc -l
1666

Some of them are pointers to shared memory structures and can stay as they are. But many of them are per-connection state. The most straightforward conversion for those is to turn them into thread-local variables, like Konstantin did in [0].

It might be good to have some kind of a Session context struct that we pass everywhere, or maybe a single thread-local variable to hold it. Many of the global variables would become fields in the Session. But that's future work.

# Extensions

A lot of extensions also contain global variables or other things that break in a multi-threaded environment. We need a way to label extensions that support multi-threading. And in the future, also extensions that *require* a multi-threaded server.

Let's add flags to the control file to mark whether the extension is thread-safe and/or process-safe. If you try to load an extension that's not compatible with the server's mode, throw an error.

We might need new functions in addition to _PG_init, called at connection startup and shutdown. And the background worker API probably needs some changes.

# Exposed PIDs

We expose backend process PIDs to users in a few places, pg_stat_activity.pid and pg_terminate_backend() for example. They need to be replaced, or we can assign a fake PID to each connection when running in multi-threaded mode.

# Signals

We use signals for communication between backends: SIGURG in latches and SIGUSR1 in procsignal, for example. Those primitives need to be rewritten with some other signalling mechanism in multi-threaded mode. In principle, it's possible to set per-thread signal handlers and send a signal to a particular thread (pthread_kill), but I think it's better to just rewrite them.

We also document that you can send SIGINT, SIGTERM or SIGHUP to an individual backend process.
I think we need to deprecate that, and maybe come up with some convenient replacement: e.g. send a message with a backend ID to a Unix domain socket, and a new pg_kill executable to send those messages.

# Restart on crash

If a backend process crashes, postmaster terminates all other backends and restarts the system. That's hard (impossible?) to do safely if everything runs in one process. We can continue to have a separate postmaster process that just monitors the main process and restarts it on crash.

# Thread-safe libraries

We need to switch to thread-safe versions of library functions, e.g. uselocale() instead of setlocale().

The Python interpreter has a Global Interpreter Lock. It's not possible to create two completely independent Python interpreters in the same process; there will be some lock contention on the GIL. Fortunately, the Python community just accepted https://peps.python.org/pep-0684/. That's exactly what we need: it makes it possible for separate interpreters to have their own GILs. It's not clear to me if that's in Python 3.12 already, or under development for some future version, but by the time we make the switch in Postgres, there will probably be a solution in CPython.

At a quick glance, I think Perl and Tcl are fine; you can have multiple interpreters in one process. We need to check any other libraries we use.

[0] https://www.postgresql.org/message-id/flat/9defcb14-a918-13fe-4b80-a0b02ff85527%40postgrespro.ru
[1] https://wiki.postgresql.org/wiki/Developer_FAQ#Why_don.27t_you_use_raw_devices.2C_async-I.2FO.2C_.3Cinsert_your_favorite_wizz-bang_feature_here.3E.3F

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> I spoke with some folks at PGCon about making PostgreSQL multi-threaded,
> so that the whole server runs in a single process, with multiple
> threads. It has been discussed many times in the past, last thread on
> pgsql-hackers was back in 2017 when Konstantin made some experiments [0].

> I feel that there is now pretty strong consensus that it would be a good
> thing, more so than before. Lots of work to get there, and lots of
> details to be hashed out, but no objections to the idea at a high level.

> The purpose of this email is to make that silent consensus explicit. If
> you have objections to switching from the current multi-process
> architecture to a single-process, multi-threaded architecture, please
> speak up.

For the record, I think this will be a disaster. There is far too much
code that will get broken, largely silently, and much of it is not
under our control.

			regards, tom lane
On Mon Jun 5, 2023 at 9:51 AM CDT, Heikki Linnakangas wrote:
> # Global variables
>
> We have a lot of global and static variables:
>
> $ objdump -t bin/postgres | grep -e "\.data" -e "\.bss" | grep -v
> "data.rel.ro" | wc -l
> 1666
>
> Some of them are pointers to shared memory structures and can stay as
> they are. But many of them are per-connection state. The most
> straightforward conversion for those is to turn them into thread-local
> variables, like Konstantin did in [0].
>
> It might be good to have some kind of a Session context struct that we
> pass everywhere, or maybe have a single thread-local variable to hold
> it. Many of the global variables would become fields in the Session. But
> that's future work.

+1 to the Session context idea, after the simpler thread-local storage idea.

> # Extensions
>
> A lot of extensions also contain global variables or other things that
> break in a multi-threaded environment. We need a way to label extensions
> that support multi-threading. And in the future, also extensions that
> *require* a multi-threaded server.
>
> Let's add flags to the control file to mark if the extension is
> thread-safe and/or process-safe. If you try to load an extension that's
> not compatible with the server's mode, throw an error.
>
> We might need new functions in addition to _PG_init, called at connection
> startup and shutdown. And background worker API probably needs some changes.

It would be a good idea to start exposing a variable through pkg-config
to tell whether the backend is multi-threaded or multi-process.

> # Exposed PIDs
>
> We expose backend process PIDs to users in a few places.
> pg_stat_activity.pid and pg_terminate_backend(), for example. They need
> to be replaced, or we can assign a fake PID to each connection when
> running in multi-threaded mode.

Would it be possible to just transparently slot in the thread ID
instead?

> # Thread-safe libraries
>
> Need to switch to thread-safe versions of library functions, e.g.
> uselocale() instead of setlocale().

Seems like a good starting point.

> The Python interpreter has a Global Interpreter Lock. It's not possible
> to create two completely independent Python interpreters in the same
> process, there will be some lock contention on the GIL. Fortunately, the
> python community just accepted https://peps.python.org/pep-0684/. That's
> exactly what we need: it makes it possible for separate interpreters to
> have their own GILs. It's not clear to me if that's in Python 3.12
> already, or under development for some future version, but by the time
> we make the switch in Postgres, there probably will be a solution in
> cpython.

3.12 is the currently in-development version of Python, planned for
release in October of this year.

A workaround that some projects already use is to run multiple Python
interpreters in one process [0], though it seems uncommon. Whether that
is viable may depend on the minimum version of Python that Postgres aims
to support (I'm not sure what the policy is).

The Python C API also provides mechanisms for releasing the GIL. I am
not familiar with how Postgres uses Python, but I have seen huge
improvements to performance from well-placed GIL releases in
multi-threaded contexts. Presumably this API would just become a no-op
once the PEP is implemented.

[0]: https://peps.python.org/pep-0684/#existing-use-of-multiple-interpreters

-- 
Tristan Partin
Neon (https://neon.tech)
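To illustrate the point about releasing the GIL: a C extension that wraps its blocking work in Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS lets other Python threads run in the meantime. In pure Python, time.sleep() happens to release the GIL the same way, so it can stand in for such a call in a toy demonstration:

```python
import threading
import time

def call_into_c_library():
    # time.sleep() releases the GIL while it waits, just as a well-placed
    # Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS pair in a C extension
    # would around its blocking work.
    time.sleep(0.2)

start = time.monotonic()
threads = [threading.Thread(target=call_into_c_library) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# Because the GIL is released, the four 0.2s waits overlap instead of
# serializing into ~0.8s.
print(f"elapsed: {elapsed:.2f}s")
assert elapsed < 0.6
```

A CPU-bound loop in place of the sleep would show the opposite: without a GIL release (or per-interpreter GILs as in PEP 684), the threads serialize.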
On 05/06/2023 11:18, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> I spoke with some folks at PGCon about making PostgreSQL multi-threaded, >> so that the whole server runs in a single process, with multiple >> threads. It has been discussed many times in the past, last thread on >> pgsql-hackers was back in 2017 when Konstantin made some experiments [0]. > >> I feel that there is now pretty strong consensus that it would be a good >> thing, more so than before. Lots of work to get there, and lots of >> details to be hashed out, but no objections to the idea at a high level. > >> The purpose of this email is to make that silent consensus explicit. If >> you have objections to switching from the current multi-process >> architecture to a single-process, multi-threaded architecture, please >> speak up. > > For the record, I think this will be a disaster. There is far too much > code that will get broken, largely silently, and much of it is not > under our control. Noted. Other large projects have gone through this transition. It's not easy, but it's a lot easier now than it was 10 years ago. The platform and compiler support is there now, all libraries have thread-safe interfaces, etc. I don't expect you or others to buy into any particular code change at this point, or to contribute time into it. Just to accept that it's a worthwhile goal. If the implementation turns out to be a disaster, then it won't be accepted, of course. But I'm optimistic. -- Heikki Linnakangas Neon (https://neon.tech)
On 05/06/2023 11:28, Tristan Partin wrote:
> On Mon Jun 5, 2023 at 9:51 AM CDT, Heikki Linnakangas wrote:
>> # Extensions
>>
>> A lot of extensions also contain global variables or other things that
>> break in a multi-threaded environment. We need a way to label extensions
>> that support multi-threading. And in the future, also extensions that
>> *require* a multi-threaded server.
>>
>> Let's add flags to the control file to mark if the extension is
>> thread-safe and/or process-safe. If you try to load an extension that's
>> not compatible with the server's mode, throw an error.
>>
>> We might need new functions in addition to _PG_init, called at connection
>> startup and shutdown. And background worker API probably needs some changes.
>
> It would be a good idea to start exposing a variable through pkg-config
> to tell whether the backend is multi-threaded or multi-process.

I think we need to support both modes without having to recompile the
server or the extensions. So it needs to be a runtime check.

>> # Exposed PIDs
>>
>> We expose backend process PIDs to users in a few places.
>> pg_stat_activity.pid and pg_terminate_backend(), for example. They need
>> to be replaced, or we can assign a fake PID to each connection when
>> running in multi-threaded mode.
>
> Would it be possible to just transparently slot in the thread ID
> instead?

Perhaps. It might break applications that use the PID directly with
e.g. 'kill <PID>', though.

>> The Python interpreter has a Global Interpreter Lock. It's not possible
>> to create two completely independent Python interpreters in the same
>> process, there will be some lock contention on the GIL. Fortunately, the
>> python community just accepted https://peps.python.org/pep-0684/. That's
>> exactly what we need: it makes it possible for separate interpreters to
>> have their own GILs.
>> It's not clear to me if that's in Python 3.12
>> already, or under development for some future version, but by the time
>> we make the switch in Postgres, there probably will be a solution in
>> cpython.
>
> 3.12 is the currently in-development version of Python. 3.12 is planned
> for release in October of this year.
>
> A workaround that some projects seem to do is to use multiple Python
> interpreters[0], though it seems uncommon. It might be important to note
> depending on the minimum version of Python Postgres aims to support (not
> sure on this policy).
>
> The C-API of Python also provides mechanisms for releasing the GIL. I am
> not familiar with how Postgres uses Python, but I have seen huge
> improvements to performance with well-placed GIL releases in
> multi-threaded contexts. Surely this API would just become a no-op after
> the PEP is implemented.
>
> [0]: https://peps.python.org/pep-0684/#existing-use-of-multiple-interpreters

Oh, cool. I'm inclined to jump straight to PEP 684 and require Python
3.12 in multi-threaded mode, though, or just accept that it's slow. But
let's see what the state of the world is when we get there.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
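For what it's worth, the control-file labeling proposed earlier in the thread might look something like this. The `thread_safe` and `process_safe` keys are hypothetical, sketched from the proposal; they are not an existing PostgreSQL control-file option:

```
# myext.control -- extension control file (thread_safe/process_safe are
# proposed, not yet existing, keys)
comment = 'example extension'
default_version = '1.0'
relocatable = true

# Which server modes this extension supports; checked at load time,
# with an error thrown if the server is running in an unsupported mode.
thread_safe = true
process_safe = true
```

Defaulting both to the current behavior (process_safe = true, thread_safe = false) would let existing extensions keep working unchanged during the transition period.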
On 05/06/2023 11:18, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> I spoke with some folks at PGCon about making PostgreSQL multi-threaded,
>> so that the whole server runs in a single process, with multiple
>> threads. It has been discussed many times in the past, last thread on
>> pgsql-hackers was back in 2017 when Konstantin made some experiments [0].
>
>> I feel that there is now pretty strong consensus that it would be a good
>> thing, more so than before. Lots of work to get there, and lots of
>> details to be hashed out, but no objections to the idea at a high level.
>
>> The purpose of this email is to make that silent consensus explicit. If
>> you have objections to switching from the current multi-process
>> architecture to a single-process, multi-threaded architecture, please
>> speak up.
>
> For the record, I think this will be a disaster. There is far too much
> code that will get broken, largely silently, and much of it is not
> under our control.
I fully agree with Tom.

First, it is not clear what the benefits of this architecture change are.
Performance?

Development becomes much more complicated and error-prone.

There is still plenty of low-hanging fruit that can improve performance,
and we can gradually and safely remove multithreading barriers from the code:
1. gradual reduction of global variables
2. introduction of local context structures
3. shrink current structures (to fit in 32, 64 boundaries)
4. scope reduction
My 2c.
regards,
Ranier Vilela
On Mon, Jun 5, 2023 at 01:26:00PM -0300, Ranier Vilela wrote:
> On 05/06/2023 11:18, Tom Lane wrote:
> > For the record, I think this will be a disaster. There is far too much
> > code that will get broken, largely silently, and much of it is not
> > under our control.
>
> I fully agreed with Tom.
>
> First, it is not clear what are the benefits of architecture change?
>
> Performance?
>
> Development becomes much more complicated and error-prone.

I agree the costs of going threaded have been reduced with compiler and
library improvements, but I don't know if they are reduced enough for
the change to be a net benefit, except on Windows where the process
creation overhead is high.

-- 
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

Only you can decide what is important to you.
On Mon, Jun 5, 2023 at 05:51:57PM +0300, Heikki Linnakangas wrote:
> # Restart on crash
>
> If a backend process crashes, postmaster terminates all other backends and
> restarts the system. That's hard (impossible?) to do safely if everything
> runs in one process. We can continue to have a separate postmaster process
> that just monitors the main process and restarts it on crash.

It would be good to know what new class of errors would cause server
restarts, e.g., memory allocation failures?

-- 
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

Only you can decide what is important to you.
On 05/06/2023 12:26, Ranier Vilela wrote:
> First, it is not clear what are the benefits of architecture change?
>
> Performance?

I doubt it makes much performance difference, at least not initially. It
might help a little with backend startup time, and maybe some other
things. And it might reduce the overhead of context switches and TLB
cache misses.

In the long run, a single-process architecture makes it easier to have
shared catalog caches, plan cache, etc., which can improve performance.
And it can make it easier to launch helper threads for things where
worker processes would be too heavy-weight. But those benefits will
require more work; they won't happen just by replacing processes with
threads. The ease of developing things like that is my motivation.

> Development becomes much more complicated and error-prone.

I don't agree with that. We currently bend over backwards to make all
allocations fixed-sized in shared memory. You learn to live with that,
but a lot of things would be simpler if you could allocate and free in
shared memory more freely. It's no panacea: you still need to be careful
with locking and concurrency. But a lot simpler.

We have built dynamic shared memory etc. over the years to work around
the limitations of shared memory. But it's still a lot more complicated.

Code that doesn't need to communicate with other processes/threads is
simple to write in either model.

> There are still many low-hanging fruit to be had that can improve
> performance.
> And the code can gradually and safely remove multithreading barriers.
>
> 1. gradual reduction of global variables
> 2. introduction of local context structures
> 3. shrink current structures (to fit in 32, 64 boundaries)
> 4. scope reduction

Right, the reason I started this thread is to explicitly note that it is
a worthy goal. If it's not, the above steps would be pointless. But if
we agree that it is a worthy goal, we can start to incrementally work
towards it.
-- 
Heikki Linnakangas
Neon (https://neon.tech)
On 05/06/2023 13:10, Bruce Momjian wrote:
> On Mon, Jun 5, 2023 at 05:51:57PM +0300, Heikki Linnakangas wrote:
>> # Restart on crash
>>
>> If a backend process crashes, postmaster terminates all other backends and
>> restarts the system. That's hard (impossible?) to do safely if everything
>> runs in one process. We can continue to have a separate postmaster process
>> that just monitors the main process and restarts it on crash.
>
> It would be good to know what new class of errors would cause server
> restarts, e.g., memory allocation failures?

You mean "out of memory"? No, that would be horrible.

I don't think there would be any new class of errors that would cause
server restarts. In theory, having a separate address space for each
backend gives you some protection. In practice, there are a lot of
shared memory structures anyway that you can stomp over, and a segfault
or unexpected exit of any backend process causes postmaster to restart
the whole system anyway.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On 6/5/23 11:33 AM, Heikki Linnakangas wrote:
> On 05/06/2023 11:18, Tom Lane wrote:
>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>> I spoke with some folks at PGCon about making PostgreSQL multi-threaded,
>>> so that the whole server runs in a single process, with multiple
>>> threads. It has been discussed many times in the past, last thread on
>>> pgsql-hackers was back in 2017 when Konstantin made some experiments [0].
>>
>>> I feel that there is now pretty strong consensus that it would be a good
>>> thing, more so than before. Lots of work to get there, and lots of
>>> details to be hashed out, but no objections to the idea at a high level.
>>
>>> The purpose of this email is to make that silent consensus explicit. If
>>> you have objections to switching from the current multi-process
>>> architecture to a single-process, multi-threaded architecture, please
>>> speak up.
>>
>> For the record, I think this will be a disaster. There is far too much
>> code that will get broken, largely silently, and much of it is not
>> under our control.
>
> Noted. Other large projects have gone through this transition. It's not
> easy, but it's a lot easier now than it was 10 years ago. The platform
> and compiler support is there now, all libraries have thread-safe
> interfaces, etc.
>
> I don't expect you or others to buy into any particular code change at
> this point, or to contribute time to it. Just to accept that it's a
> worthwhile goal. If the implementation turns out to be a disaster, then
> it won't be accepted, of course. But I'm optimistic.

I don't have enough expertise in this area to comment on whether it'd be
a "disaster" or not. My zoomed-out observations are two-fold:

1. It seems like there's a lack of consensus on which of processes
vs. threads yields the best performance benefit, and from talking to
folks with greater expertise than me, this can vary between workloads. I
believe one DB even gives users a choice of running with processes
vs. threads.

2. While I wouldn't want to discourage a moonshot effort, I would ask if
developer time could be better spent on tackling some of the other
problems around vertical scalability. Per some PGCon discussions,
there's still room for improvement in how PostgreSQL can best utilize
the resources available on very large "commodity" machines (a 448-core /
24TB RAM instance comes to mind).

I'm purposely giving a non-answer on whether it's a worthwhile goal;
rather, I'd be curious where it could stack up against some other
efforts to continue to help PostgreSQL improve performance and handle
very large workloads.

Thanks,

Jonathan
On 05/06/2023 13:32, Merlin Moncure wrote:
> Would this help with oom killer in linux?

Hmm, I guess the OOM killer would better understand what Postgres is
doing; it's not very smart about accounting shared memory. You still
wouldn't want the OOM killer to kill Postgres, though, so I think you'd
still want to disable it in production systems.

> Isn't it true that pgbouncer provides a lot of the same benefits?

I guess there is some overlap, although I don't really think of it that
way. Firstly, pgbouncer has its own set of problems. Secondly, switching
to threads would not make connection poolers obsolete. Maybe in the
distant future, Postgres could handle thousands of connections with
ease, and threads would make that easier to achieve, but that would need
a lot more work.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On Mon, Jun 5, 2023 at 08:29:16PM +0300, Heikki Linnakangas wrote:
> On 05/06/2023 13:10, Bruce Momjian wrote:
> > On Mon, Jun 5, 2023 at 05:51:57PM +0300, Heikki Linnakangas wrote:
> > > # Restart on crash
> > >
> > > If a backend process crashes, postmaster terminates all other backends and
> > > restarts the system. That's hard (impossible?) to do safely if everything
> > > runs in one process. We can continue to have a separate postmaster process
> > > that just monitors the main process and restarts it on crash.
> >
> > It would be good to know what new class of errors would cause server
> > restarts, e.g., memory allocation failures?
>
> You mean "out of memory"? No, that would be horrible.
>
> I don't think there would be any new class of errors that would cause
> server restarts. In theory, having a separate address space for each
> backend gives you some protection. In practice, there are a lot of
> shared memory structures anyway that you can stomp over, and a segfault
> or unexpected exit of any backend process causes postmaster to restart
> the whole system anyway.

Uh, yes, but don't we detect failures while modifying shared memory and
force a restart? Wouldn't the scope of failures be much larger?

-- 
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

Only you can decide what is important to you.
On 05/06/2023 14:04, Bruce Momjian wrote:
> On Mon, Jun 5, 2023 at 08:29:16PM +0300, Heikki Linnakangas wrote:
>> I don't think there would be any new class of errors that would cause server
>> restarts. In theory, having a separate address space for each backend gives
>> you some protection. In practice, there are a lot of shared memory
>> structures anyway that you can stomp over, and a segfault or unexpected exit
>> of any backend process causes postmaster to restart the whole system anyway.
>
> Uh, yes, but don't we detect failures while modifying shared memory and
> force a restart? Wouldn't the scope of failures be much larger?

If one process writes over shared memory that it shouldn't, it can cause
a crash in that process or in some other process that reads it. Same
with multiple threads, no difference there.

With a single process, one thread can modify another thread's "backend
private" memory, and cause the other thread to crash. Perhaps that's
what you meant?

In practice, I don't think it's so bad. Even in a multi-threaded
environment, common bugs like buffer overflows and use-after-free are
still much more likely to access memory owned by the same thread, thanks
to how memory allocators work. And a completely random memory access is
still more likely to cause a segfault than to corrupt another thread's
memory. And tools like CLOBBER_FREED_MEMORY/MEMORY_CONTEXT_CHECKING and
valgrind are pretty good at catching memory access bugs at development
time, whether it's multiple processes or threads.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On 2023-06-05 Mo 11:18, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> I spoke with some folks at PGCon about making PostgreSQL multi-threaded,
>> so that the whole server runs in a single process, with multiple
>> threads. It has been discussed many times in the past, last thread on
>> pgsql-hackers was back in 2017 when Konstantin made some experiments [0].
>>
>> I feel that there is now pretty strong consensus that it would be a good
>> thing, more so than before. Lots of work to get there, and lots of
>> details to be hashed out, but no objections to the idea at a high level.
>>
>> The purpose of this email is to make that silent consensus explicit. If
>> you have objections to switching from the current multi-process
>> architecture to a single-process, multi-threaded architecture, please
>> speak up.
>
> For the record, I think this will be a disaster. There is far too much
> code that will get broken, largely silently, and much of it is not
> under our control.
If we were starting out today we would probably choose a threaded
implementation. But moving to threaded now seems to me like a
multi-year, multi-person project, with the prospect of years to come
chasing bugs, for fairly modest advantages. The risk-to-reward ratio
doesn't look great.
That's my initial reaction. I could be convinced otherwise.
cheers
andrew
-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 6/5/23 14:51, Andrew Dunstan wrote:
> On 2023-06-05 Mo 11:18, Tom Lane wrote:
>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>> I spoke with some folks at PGCon about making PostgreSQL multi-threaded,
>>> so that the whole server runs in a single process, with multiple
>>> threads. It has been discussed many times in the past, last thread on
>>> pgsql-hackers was back in 2017 when Konstantin made some experiments [0].
>>> I feel that there is now pretty strong consensus that it would be a good
>>> thing, more so than before. Lots of work to get there, and lots of
>>> details to be hashed out, but no objections to the idea at a high level.
>>> The purpose of this email is to make that silent consensus explicit. If
>>> you have objections to switching from the current multi-process
>>> architecture to a single-process, multi-threaded architecture, please
>>> speak up.
>> For the record, I think this will be a disaster. There is far too much
>> code that will get broken, largely silently, and much of it is not
>> under our control.
>
> If we were starting out today we would probably choose a threaded
> implementation. But moving to threaded now seems to me like a
> multi-year-multi-person project with the prospect of years to come
> chasing bugs and the prospect of fairly modest advantages. The risk to
> reward doesn't look great.
>
> That's my initial reaction. I could be convinced otherwise.

I read through the thread thus far, and Andrew's response is the one
that best aligns with my reaction.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Jun 5, 2023 at 09:30:28PM +0300, Heikki Linnakangas wrote:
> If one process writes over shared memory that it shouldn't, it can cause a
> crash in that process or some other process that reads it. Same with
> multiple threads, no difference there.
>
> With a single process, one thread can modify another thread's "backend
> private" memory, and cause the other thread to crash. Perhaps that's what
> you meant?
>
> In practice, I don't think it's so bad. Even in a multi-threaded
> environment, common bugs like buffer overflows and use-after-free are still
> much more likely to access memory owned by the same thread, thanks to how
> memory allocators work. And a completely random memory access is still more
> likely to cause a segfault than corrupting another thread's memory. And
> tools like CLOBBER_FREED_MEMORY/MEMORY_CONTEXT_CHECKING and valgrind are
> pretty good at catching memory access bugs at development time, whether it's
> multiple processes or threads.

I remember we used to have macros we called before we modified critical
parts of shared memory, and if a process exited while in those blocks,
the server would restart. Unfortunately, I can't find that in the code
now.

-- 
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

Only you can decide what is important to you.
On Mon, Jun 5, 2023 at 4:26 PM Bruce Momjian <bruce@momjian.us> wrote: > I remember we used to have macros we called before we modified critical > parts of shared memory, and if a process exited while in those blocks, > the server would restart. Unfortunately, I can't find that in the code > now. Isn't that what we call a critical section? They effectively "promote" any ERROR (e.g., from an OOM) into a PANIC. I thought that we only used critical sections for things that are WAL-logged, but I double checked just now. Turns out that I was wrong: PGSTAT_BEGIN_WRITE_ACTIVITY() contains its own START_CRIT_SECTION(), despite not being involved in WAL logging. And so critical sections could indeed be described as something that we use whenever shared memory cannot be left in an inconsistent state (which often coincides with WAL logging, but need not). -- Peter Geoghegan
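[Editor's note: Peter's description of critical sections can be sketched in miniature. The counter and macro names below follow the real source (src/include/miscadmin.h), but the severity codes and the promote function are illustrative stand-ins for the check that lives inside errstart() in elog.c, not the actual implementation.]

```c
#include <assert.h>

/* illustrative stand-ins for the severity codes in elog.h */
enum { ERROR_LEVEL = 21, PANIC_LEVEL = 23 };

/* nesting counter; nonzero means "we are inside a critical section" */
static volatile unsigned int CritSectionCount = 0;

/* these mirror START_CRIT_SECTION()/END_CRIT_SECTION() in miscadmin.h */
#define START_CRIT_SECTION() (CritSectionCount++)
#define END_CRIT_SECTION() \
    do { assert(CritSectionCount > 0); CritSectionCount--; } while (0)

/* errstart() contains a check along these lines: any ERROR raised while
 * the counter is nonzero is escalated to PANIC, which makes the
 * postmaster reinitialize shared memory and restart all backends */
static int
promote_elevel(int elevel)
{
    if (elevel >= ERROR_LEVEL && CritSectionCount > 0)
        return PANIC_LEVEL;
    return elevel;
}
```

So "exited while in those blocks" maps to: any error escaping a START/END pair becomes a PANIC, and a PANIC forces the crash-restart cycle Bruce remembers.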
On Mon, Jun 5, 2023 at 04:50:11PM -0700, Peter Geoghegan wrote: > On Mon, Jun 5, 2023 at 4:26 PM Bruce Momjian <bruce@momjian.us> wrote: > > I remember we used to have macros we called before we modified critical > > parts of shared memory, and if a process exited while in those blocks, > > the server would restart. Unfortunately, I can't find that in the code > > now. > > Isn't that what we call a critical section? They effectively "promote" > any ERROR (e.g., from an OOM) into a PANIC. > > I thought that we only used critical sections for things that are > WAL-logged, but I double checked just now. Turns out that I was wrong: > PGSTAT_BEGIN_WRITE_ACTIVITY() contains its own START_CRIT_SECTION(), > despite not being involved in WAL logging. And so critical sections > could indeed be described as something that we use whenever shared > memory cannot be left in an inconsistent state (which often coincides > with WAL logging, but need not). Yes, sorry, critical sections is what I was remembering. My question is whether all unexpected backend exits should be treated as critical sections? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On 6/5/23 2:07 PM, Jonah H. Harris wrote: > On Mon, Jun 5, 2023 at 8:18 AM Tom Lane <tgl@sss.pgh.pa.us > <mailto:tgl@sss.pgh.pa.us>> wrote: > > For the record, I think this will be a disaster. There is far too much > code that will get broken, largely silently, and much of it is not > under our control. > > > While I've long been in favor of a multi-threaded implementation, now in > my old age, I tend to agree with Tom. I'd be interested in Konstantin's > thoughts (and PostgresPro's experience) of multi-threaded vs. internal > pooling with the current process-based model. I recall looking at and > playing with Konstantin's implementations of both, which were > impressive. Yes, the latter doesn't solve the same issues, but many > real-world ones where multi-threaded is argued. Personally, I think > there would be not only a significant amount of time spent dealing with > in-the-field stability regressions before a multi-threaded > implementation matures, but it would also increase the learning curve > for anyone trying to start with internals development. To me, processes feel just a little easier to observe and inspect, a little easier to debug, and a little easier to reason about. Tooling does exist for threads - but operating systems track more things at a process level and I like having the full arsenal of unix process-based tooling at my disposal. Even simple things, like being able to see at a glance from "ps" or "top" output which process is the bgwriter or the checkpointer, and being able to attach gdb only on that process without pausing the whole system. Or to a single backend. A thread model certainly has advantages but I do feel that some useful things might be lost here. And for the record, just within the past few weeks I saw a small mistake in some C code which smashed the stack of another thread in the same process space. 
It manifested as unpredictable, periodic random SIGSEGV and SIGBUS with core dumps that were useless gibberish, and it was rather difficult to root-cause. But one interesting outcome of that incident was learning from my colleague Josh that SUSv2 and C99 apparently contradict each other: when snprintf() is called with size=0, SUSv2 stipulates an unspecified return value less than 1, while C99 allows str to be NULL in this case and specifies the return value (as always) as the number of characters that would have been written had the output buffer been large enough. So, long story short... I think the robustness angle on the process model shouldn't be underestimated either. -Jeremy -- http://about.me/jeremy_schneider
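[Editor's note: the C99 behavior Jeremy describes is what makes the common two-pass "measure, then format" idiom work. A minimal sketch, valid on any C99-conforming libc (which includes modern glibc, musl, and the BSDs); the function name is ours, for illustration only:]

```c
#include <stdio.h>
#include <stdlib.h>

/* Format "backend <pid>: <state>" into an exactly-sized buffer.
 * Pass 1 calls snprintf with str = NULL and size = 0, which C99
 * permits; the return value is the length the full result would
 * have had (excluding the NUL). Under SUSv2 this return value was
 * unspecified, hence the contradiction discussed above. */
static char *
format_backend_label(int pid, const char *state, int *len_out)
{
    int needed = snprintf(NULL, 0, "backend %d: %s", pid, state);
    char *buf = malloc(needed + 1);

    if (buf == NULL)
        return NULL;
    /* Pass 2: format for real into the exactly-sized buffer. */
    snprintf(buf, needed + 1, "backend %d: %s", pid, state);
    if (len_out)
        *len_out = needed;
    return buf;
}
```

Code that relies on the SUSv2 behavior (ignoring the size=0 return value) remains portable; code that relies on the C99 behavior silently breaks on a SUSv2-only libc, which is exactly the sort of latent difference that is painful to root-cause.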
On Mon, Jun 5, 2023 at 5:15 PM Bruce Momjian <bruce@momjian.us> wrote: > > Isn't that what we call a critical section? They effectively "promote" > > any ERROR (e.g., from an OOM) into a PANIC. > Yes, sorry, critical sections is what I was remembering. My question is > whether all unexpected backend exits should be treated as critical > sections? I think that it boils down to this: critical sections help us to avoid various inconsistencies that might otherwise be introduced to critical state, usually in shared memory. And so critical sections are mostly about protecting truly crucial state, even in the presence of irrecoverable problems (e.g., those caused by corruption that was missed before the critical section was reached, fsync() reporting failure on recent Postgres versions). This is mostly about the state itself -- it's not about cleaning up from routine errors at all. The server isn't supposed to PANIC, and won't unless some fundamental assumption that the system makes isn't met. I said that an OOM could cause a PANIC. But that really shouldn't be possible in practice, since it can only happen when code in a critical section actually attempts to allocate memory in the first place. There is an assertion in palloc() that will catch code that violates that rule. It has been known to happen from time to time, but theoretically it should never happen. Discussion about the robustness of threads versus processes seems to only be concerned with what can happen after something "impossible" takes place. Not before. Backend code is not supposed to corrupt memory, whether shared or local, with or without threads. Code in critical sections isn't supposed to even attempt memory allocation. Jeremy and others have suggested that processes have significant robustness advantages. Maybe they do, but it's hard to say either way because these benefits only apply "when the impossible happens". 
In any given case it's reasonable to wonder if the user was protected by our multi-process architecture, or protected by dumb luck. Could even be both. -- Peter Geoghegan
>> For the record, I think this will be a disaster. There is far too >> much >> code that will get broken, largely silently, and much of it is not >> under our control. >> >> > > > If we were starting out today we would probably choose a threaded > implementation. But moving to threaded now seems to me like a > multi-year-multi-person project with the prospect of years to come > chasing bugs and the prospect of fairly modest advantages. The risk to > reward doesn't look great. +1. A long time ago (PostgreSQL 7 days) I modified PostgreSQL into a threaded implementation so that it would run on Windows, because there was no Windows port of PostgreSQL at that time. I don't remember the details, but it was desperately hard for me. Best regards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Let me share my experience with porting Postgres to threads (by the way - repository is still alive - https://github.com/postgrespro/postgresql.pthreads
but I have not kept it in sync with recent versions of Postgres).
1. Solving the problem with static variables was not as difficult as I expected, thanks to TLS and its support in modern compilers.
So the only thing we need to do is add a special modifier to the variable declarations:
-static int MyLockNo = 0;
-static bool holdingAllLocks = false;
+static session_local int MyLockNo = 0;
+static session_local bool holdingAllLocks = false;
But there are about 2k such variables whose storage class has to be changed.
This is one of the reasons why I do not agree with the proposal to define some session context, place all session-specific variables in that context, and pass it everywhere. It would be very inconvenient to maintain a structure with 2k fields and to add a new field to it each time you need a non-local variable, even if it can be hidden behind a macro like DEF_SESSION_VAR(type, name).
It would also require changing all the Postgres code that works with these variables, not just the declarations.
So the patch would be 100x larger, and almost every line of Postgres code would have to be changed.
And I do not see any reason for it except portability and avoiding a dependency on compiler support.
The implementation of TLS is quite efficient (at least on x86): there is a special register pointing to the TLS area, so accessing a TLS variable is no more expensive than accessing a static variable.
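[Editor's note: the `session_local` qualifier from the diff above can be sketched as a thin macro over C11 `_Thread_local` (or GCC's `__thread`). The macro spelling and the demo below are illustrative, not Konstantin's actual patch; here the threaded case is hard-wired to show the isolation TLS buys:]

```c
#include <assert.h>
#include <pthread.h>

/* In a threaded build, "session_local" would expand to a thread-local
 * storage qualifier; in a process build, to nothing (each process
 * already has its own copy of every static variable). */
#define session_local _Thread_local

static session_local int MyLockNo = 0;

/* each "backend" thread mutates its own copy of MyLockNo;
 * no other thread can observe the change */
static void *
backend_thread(void *arg)
{
    MyLockNo = *(int *) arg;
    assert(MyLockNo == *(int *) arg);
    return NULL;
}
```

Because the compiler addresses such variables through the thread-pointer register (%fs/%gs on x86), the generated access is a single load, which is the efficiency point made above.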
2. The performance improvement from switching to threads was not that large (~10%). But please note that I did not change any Postgres sync primitives.
(I am still not sure that using, for example, pthread_rwlock instead of our own LWLock would yield any performance gain.)
3. Multithreading significantly simplifies concurrent query execution and interaction between workers.
Right now, with the dynamic shared memory machinery, we can support variable-size data in shared memory, but
in a multithreaded program it can be done much more easily.
4. The multithreaded model opens a way to fix many existing Postgres problems: the lack of a shared catalog and prepared-statement cache, changing the page pool size (shared_buffers) at runtime, ...
5. During this porting I had the most trouble with the following components: GUCs, signals, error handling, and the file descriptor cache. The file descriptor cache really becomes a bottleneck, because all backends now compete for file descriptors, whose number is usually limited to 1024 (without editing the system configuration). Protecting it with a mutex caused significant performance degradation, so I had to maintain a thread-local cache.
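[Editor's note: the thread-local fd cache mentioned in point 5 can be sketched like this. All names are illustrative, and the real fd.c machinery is an LRU with graceful EMFILE handling and eviction, which this toy omits; the point shown is only that a `_Thread_local` cache makes lookups lock-free:]

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define FD_CACHE_SIZE 8

typedef struct
{
    char path[256];
    int  fd;
} FdSlot;

/* each thread gets its own private cache: lookups need no mutex,
 * so threads never contend on the hot open()/lookup path */
static _Thread_local FdSlot fd_cache[FD_CACHE_SIZE];
static _Thread_local int fd_cache_used = 0;

static int
cached_open(const char *path)
{
    for (int i = 0; i < fd_cache_used; i++)
        if (strcmp(fd_cache[i].path, path) == 0)
            return fd_cache[i].fd;      /* hit: no syscall, no lock */

    int fd = open(path, O_RDONLY);
    if (fd >= 0 && fd_cache_used < FD_CACHE_SIZE)
    {
        snprintf(fd_cache[fd_cache_used].path,
                 sizeof fd_cache[0].path, "%s", path);
        fd_cache[fd_cache_used++].fd = fd;
    }
    return fd;
}
```

The trade-off is the one Konstantin notes: per-thread caches multiply the total number of open descriptors, so the process-wide RLIMIT_NOFILE (commonly 1024) is hit much sooner than with one cache per process.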
6. It is not clear how to support external extensions.
7. It will be hard to support non-multithreaded PL languages (like Python), but support for Java, for example, will be more natural and efficient.
I do not think that development of a multithreaded application is more complex or requires a steeper learning curve.
When you deal with parallel execution you have to be careful in any case.
The advantage of the process model is that there is a much clearer distinction between shared and private variables.
Concerning debugging and profiling: it is more convenient with multithreading in some cases and less convenient in others.
But programmers have been living with threads for more than 30 years, so most tools now support threads at least as well as processes.
And for many developers, working with threads is by now more natural and convenient.
OOM and local backend memory consumption seem to be among the main challenges for the multithreaded model:
right now some queries can cause high memory consumption. work_mem is just a hint, and real memory consumption can be much higher.
Even if it doesn't cause OOM, not all of the allocated memory is returned to the OS after query completion, which increases memory fragmentation.
Right now, restarting a single backend suffering from memory fragmentation eliminates the problem. But that will be impossible in multithreaded Postgres.
So, as I see from this thread, most of the authoritative members of the Postgres community are still very pessimistic (or conservative :))
about making Postgres multi-threaded. And it is really a huge piece of work which will cause significant code divergence. It significantly complicates
backpatching and support of external extensions. It cannot be done without the support and approval of most of the committers. This is why this work was stalled in PgPro.
My personal opinion is that Postgres has almost reached its "limit of evolution", or is close to it.
Major changes such as multithreading, an undo log, or a columnar store with a vectorized executor
require so many changes and cause so many conflicts with existing code that it may be easier to develop a new system from scratch
than to plug a new approach into the old architecture. Maybe I am wrong. It may be my personal fault that I was not able to bring multithreaded Postgres, the built-in connection pooler, the vectorized executor, libpq compression, and my other patches to commit.
I have a feeling that it is not possible to merge anything non-trivial affecting the Postgres core into mainstream without the interest and help of several
committers. On the other hand, the presence of Postgres forks such as TimescaleDB, OrioleDB, and Greenplum demonstrates that Postgres still has high potential for extension.
On Mon, Jun 5, 2023 at 10:52 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I spoke with some folks at PGCon about making PostgreSQL multi-threaded, > so that the whole server runs in a single process, with multiple > threads. It has been discussed many times in the past, last thread on > pgsql-hackers was back in 2017 when Konstantin made some experiments [0]. > > I feel that there is now pretty strong consensus that it would be a good > thing, more so than before. Lots of work to get there, and lots of > details to be hashed out, but no objections to the idea at a high level. I'm not sure that there's a strong consensus, but I do think it's a good idea. > # Transition period > > The transition surely cannot be done fully in one release. Even if we > could pull it off in core, extensions will need more time to adapt. > There will be a transition period of at least one release, probably > more, where you can choose multi-process or multi-thread model using a > GUC. Depending on how it goes, we can document it as experimental at first. I think the transition period should probably be effectively infinite. There might be some distant future day when we'd remove the process support, if things go incredibly well with threads, but I don't think it would be any time soon. If nothing else, considering that we don't want to force a hard compatibility break for extensions. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 6, 2023 at 9:40 AM Robert Haas <robertmhaas@gmail.com> wrote: > I'm not sure that there's a strong consensus, but I do think it's a good idea. Let me elaborate on this a bit. I think one of PostgreSQL's bigger problems right now is that it doesn't scale as far as users would like. Beyond a couple of hundred connections, everything goes to heck. Back in the day, the big scalability problems were around locking, but we've done a pretty good job cleaning that stuff up over the years. Now, the problem when you run a ton of PostgreSQL connections isn't so much that PostgreSQL stops working as it is that the OS stops working. PostgreSQL backends use a lot of memory, even if they're idle. Some of that is for stuff that we could optimize but haven't, like catcache and relcache entries, and some of it is for stuff that we can't do anything about, like per-process page tables. But the problem isn't just RAM, either. I've seen machines running >1000 PostgreSQL backends where kill -9 took many *minutes* to work because the OS was overwhelmed. I don't know exactly what goes wrong inside the kernel, but clearly something does. Not all databases have this problem, and PostgreSQL isn't going to be able to stop having it without some kind of major architectural change. Changing from a process model to a threaded model might be insufficient, because while I think that threads consume fewer OS resources than processes, what is really needed, in all likelihood, is the ability to have idle connections have neither a process nor a thread associated with them until they cease being idle. That's a huge project and I'm not volunteering to do it, but if we want to have the same kind of scalability as some competing products, that is probably a place to which we ultimately need to go. 
Getting out of the current model where every backend has an arbitrarily large amount of state hanging off of random global variables, not all of which are even known to any central system, is a critical step in that journey. Also, programming with DSA and shm_mq sucks. It's doable (proof by example) but it's awkward and it takes a long time and the performance isn't great. Here again, threads instead of processes is no panacea. For as long as we support a process model - and my guess is that we're talking about a very long time - new features are going to have to work with those systems or else be optional. But the amount of sheer mental energy that is required to deal with DSA means we're unlikely to ever have a rich library of parallel primitives. Maybe we wouldn't anyway, volunteer efforts are hard to predict, but this is certainly not helping. I do think that there's some danger that if sharing memory becomes as easy as calling palloc(), we'll end up with memory leaks that could eventually take the whole system down. We need to give some thought to how to avoid or manage that danger. Even think about something like the main lock table. That's a fixed size hash table, so lock exhaustion is a real possibility. If we weren't limited to a fixed-size shared memory segment, we could let that thing grow without a server restart. We might not want to let it grow infinitely, but we could raise the maximum size by 100x and allocate as required and I think we'd just be better off. Doing that as things stand would require nailing down that amount of memory forever whether it's ever needed or not, which doesn't seem like a good idea. But doing something where the memory can be allocated only if it's needed would avoid user-facing errors with relatively little cost. I think doing something like this is going to be a huge effort, and frankly, there's probably no point in anybody other than a handful of people (Heikki, Andres, a handful of others) even trying. 
There's too many ways to go wrong, and this has to be done really well to be worth doing at all. But if somebody with the requisite expertise wants to have a go at it, I don't think we should tell them "no, we don't want that" on principle. Let's talk about whether a specific proposal is good or bad, and why it's good or bad, rather than falling back on an essentially religious argument. It's not an article of faith that PostgreSQL should not use threads: it's a technology decision. The difficulty of reversing the decision made long ago should weigh heavily in evaluating any proposal to do so, but the potential benefits of such a change should be considered, too. -- Robert Haas EDB: http://www.enterprisedb.com
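[Editor's note: Robert's lock-table point is ultimately about address spaces. With threads, every backend sees the same heap, so a mutex-guarded realloc is all that growth takes; with processes, a realloc'd pointer means nothing to the other processes attached to a fixed-size shared memory segment. A toy sketch, with hypothetical names, under the assumption that all access happens with the mutex held:]

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct
{
    pthread_mutex_t lock;
    size_t      cap;        /* allocated slots */
    size_t      used;       /* occupied slots  */
    int        *entries;    /* toy stand-in for lock-table entries */
} LockTable;

static void
lock_table_init(LockTable *t, size_t cap)
{
    pthread_mutex_init(&t->lock, NULL);
    t->cap = cap;
    t->used = 0;
    t->entries = malloc(cap * sizeof(int));
}

/* In a single shared address space, running out of slots need not be
 * a user-facing "out of shared memory" error: take the lock, double
 * the array, keep going. Memory is committed only when needed. */
static void
lock_table_insert(LockTable *t, int lockid)
{
    pthread_mutex_lock(&t->lock);
    if (t->used == t->cap)
    {
        t->cap *= 2;
        t->entries = realloc(t->entries, t->cap * sizeof(int));
    }
    t->entries[t->used++] = lockid;
    pthread_mutex_unlock(&t->lock);
}
```

In the current architecture the equivalent structure must be sized at postmaster start (max_locks_per_transaction et al.) and can never move, which is exactly the "nailing down that amount of memory forever" cost described above.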
On 06/06/2023 09:40, Robert Haas wrote: > On Mon, Jun 5, 2023 at 10:52 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> I spoke with some folks at PGCon about making PostgreSQL multi-threaded, >> so that the whole server runs in a single process, with multiple >> threads. It has been discussed many times in the past, last thread on >> pgsql-hackers was back in 2017 when Konstantin made some experiments [0]. >> >> I feel that there is now pretty strong consensus that it would be a good >> thing, more so than before. Lots of work to get there, and lots of >> details to be hashed out, but no objections to the idea at a high level. > > I'm not sure that there's a strong consensus, but I do think it's a good idea. The consensus is not as strong as I hoped for... To summarize: Tom, Andrew, Joe are worried that it will break a lot of stuff. That's a valid point. The transition needs to be done well and not break things, I agree with that. But if we can make the transition smooth, that's not an objection to the idea itself. Many comments have been along the lines of "it's hard, not worth the effort". That's fair, but also not an objection to the idea itself, if someone decides to spend the time on it. Bruce was worried about the loss of isolation that the separate address spaces gives, and Jeremy shared an anecdote on that. That is an objection to the idea itself, i.e. even if transition was smooth, bug-free and effortless, that point remains. I personally think the isolation we get from separate address spaces is overrated. Yes, it gives you some protection, but given how much shared memory there is, the blast radius is large even with separate backend processes. So I think there's hope. I didn't hear any _strong_ objections to the idea itself, assuming the transition can be done smoothly. >> # Transition period >> >> The transition surely cannot be done fully in one release. Even if we >> could pull it off in core, extensions will need more time to adapt. 
>> There will be a transition period of at least one release, probably >> more, where you can choose multi-process or multi-thread model using a >> GUC. Depending on how it goes, we can document it as experimental at first. > > I think the transition period should probably be effectively infinite. > There might be some distant future day when we'd remove the process > support, if things go incredibly well with threads, but I don't think > it would be any time soon. I don't think this is worth it, unless we plan to eventually remove the multi-process mode. We could e.g. make lock table expandable in threaded mode, and fixed-size in process mode, but the big gains would come from being able to share things between threads and have variable-length shared data structures more easily. As long as you need to also support processes, you need to code to the lowest common denominator and don't really get the benefits. I don't know how long a transition period we need. Maybe 1 release, maybe 5. > If nothing else, considering that we don't want to force a hard > compatibility break for extensions. Extensions regularly need small tweaks to adapt to new major Postgres versions, I don't think this would be too different. -- Heikki Linnakangas Neon (https://neon.tech)
On 2023-06-06 08:06, Konstantin Knizhnik wrote: > 7. It will be hard to support non-multithreaded PL languages (like > python), but for example support of Java will be more natural and > efficient. To this I say ... Hmm. Surely, the current situation with a JVM in each backend process (that calls for one) has been often seen as heavier than desirable. At the same time, I am not sure how manageable one giant process with one giant JVM instance would prove to be, either. It is somewhat nice to be able to tweak JVM settings in a session and see what happens, without disrupting other sessions. There may also exist cases for different JVM settings in per-user or per-database GUCs. Like Python with the GIL, it is documented for JNI_CreateJavaVM that "Creation of multiple VMs in a single process is not supported."[1] And the devs of Java, in their immeasurable wisdom, have announced a "JDK Enhancement Proposal" (that's just what these things are called, don't blame Orwell), JEP 411[2][3], in which all of the Security Manager features that PL/Java relies on for bounds on 'trusted' behavior are deprecated for eventual removal with no functional replacement. I'd be even more leery of using one big shared JVM for everybody's work after that happens. Might the work toward allowing a run-time choice between a process or threaded model also make possible some intermediate models as well? A backend process for connections to a particular database, or with particular authentication credentials? Go through the authentication handshake and then sendfd the connected socket to the appropriate process. (Has every supported platform got something like sendfd?) That way, there could be some flexibility to arrange how many distinct backends (and, for Java purposes, how many JVMs) get fired up, and have each sharing sessions that have something in common. Or, would that just require all the complexity of both approaches to synchronization, with no sufficient benefit? 
Regards, -Chap [1] https://docs.oracle.com/en/java/javase/17/docs/specs/jni/invocation.html#jni_createjavavm [2] https://blogs.apache.org/netbeans/entry/jep-411-deprecate-the-security1 [3] https://github.com/tada/pljava/wiki/JEP-411
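[Editor's note: the "sendfd" Chap mentions is the classic SCM_RIGHTS ancillary-data trick, available on every POSIX platform with UNIX-domain sockets (the BSD `sendfd` name comes from 4.2BSD lore). A minimal sketch, with our own helper names, of handing a connected socket to another process over a pre-established UNIX-domain channel:]

```c
#include <string.h>
#include <sys/socket.h>

/* Pass an open file descriptor across a UNIX-domain socket. The
 * kernel installs a duplicate of fd into the receiver's descriptor
 * table; the one byte of "real" payload exists only because some
 * platforms refuse to send ancillary data with an empty message. */
static int
send_fd(int chan, int fd)
{
    struct msghdr msg = {0};
    char cbuf[CMSG_SPACE(sizeof(int))];
    char dummy = 'x';
    struct iovec io = { .iov_base = &dummy, .iov_len = 1 };

    msg.msg_iov = &io;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(chan, &msg, 0) < 0 ? -1 : 0;
}

/* Receive the descriptor sent by send_fd(); returns -1 on failure. */
static int
recv_fd(int chan)
{
    struct msghdr msg = {0};
    char cbuf[CMSG_SPACE(sizeof(int))];
    char dummy;
    struct iovec io = { .iov_base = &dummy, .iov_len = 1 };

    msg.msg_iov = &io;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    if (recvmsg(chan, &msg, 0) < 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

In the intermediate model sketched above, a listener process would authenticate the client and then send_fd() the connected socket to the backend process serving that database or role. Windows has no SCM_RIGHTS, though WSADuplicateSocket offers roughly equivalent functionality there.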
On 06/06/2023 11:48, chap@anastigmatix.net wrote: > And the devs of Java, in their immeasurable wisdom, have announced > a "JDK Enhancement Proposal" (that's just what these things are > called, don't blame Orwell), JEP 411[2][3], in which all of the > Security Manager features that PL/Java relies on for bounds on > 'trusted' behavior are deprecated for eventual removal with no > functional replacement. I'd be even more leery of using one big > shared JVM for everybody's work after that happens. Ouch. > Might the work toward allowing a run-time choice between a > process or threaded model also make possible some > intermediate models as well? A backend process for > connections to a particular database, or with particular > authentication credentials? Go through the authentication > handshake and then sendfd the connected socket to the > appropriate process. (Has every supported platform got > something like sendfd?) I'm afraid having multiple processes and JVMs doesn't help that. If you can escape the one JVM in one backend process, it's game over. Backend processes are not a security barrier, and you have the same problems with the current multi-process architecture, too. https://github.com/greenplum-db/plcontainer is one approach. It launches a separate process for the PL, separate from the backend process, and sandboxes that. -- Heikki Linnakangas Neon (https://neon.tech)
On 2023-06-06 12:24, Heikki Linnakangas wrote: > I'm afraid having multiple processes and JVMs doesn't help that. > If you can escape the one JVM in one backend process, it's game over. So there's escape and there's escape, right? Java still prioritizes (and has, in fact, strengthened) barriers against breaking module encapsulation, or getting access to arbitrary native memory or code. The features that have been deprecated, to eventually go away, are the ones that offer fine-grained control over operations that there are Java APIs for. Eventually it won't be as easy as it is now to say "ok, your function gets to open these files or these sockets but not those ones." Even for those things, there may yet be solutions. There are Java APIs for virtualizing the view of the file system, for example. It's yet to be seen how things will shake out. Configuration may get trickier, and there may be some incentive to include, say, sepgsql in the picture. Sure, even access to a file API can be game over, depending on what file you open, but that's already the risk for every PL with an 'untrusted' flavor. Regards, -Chap
On 06.06.2023 5:13 PM, Robert Haas wrote: > On Tue, Jun 6, 2023 at 9:40 AM Robert Haas <robertmhaas@gmail.com> wrote: >> I'm not sure that there's a strong consensus, but I do think it's a good idea. > Let me elaborate on this a bit. > > > > Not all databases have this problem, and PostgreSQL isn't going to be > able to stop having it without some kind of major architectural > change. Changing from a process model to a threaded model might be > insufficient, because while I think that threads consume fewer OS > resources than processes, what is really needed, in all likelihood, is > the ability to have idle connections have neither a process nor a > thread associated with them until they cease being idle. That's a huge > project and I'm not volunteering to do it, but if we want to have the > same kind of scalability as some competing products, that is probably > a place to which we ultimately need to go. Getting out of the current > model where every backend has an arbitrarily large amount of state > hanging off of random global variables, not all of which are even > known to any central system, is a critical step in that journey. It looks like a built-in connection pooler, doesn't it? Actually, a built-in connection pooler has a lot in common with multithreaded Postgres. It also needs to keep session context. The main difference is that there is no need to place all Postgres global/static variables there, because the lifetime of most of them is shorter than a transaction. So it is really enough to place all such variables in a single struct. This is how the built-in connection pooler was implemented in PgPro. Reading all the concerns about multithreading Postgres makes me think it may be reasonable to combine the two approaches: still have processes (backends), but be able to spawn multiple threads inside a process (for example, for parallel query execution). 
It might be argued that such an approach only increases implementation complexity and combines the drawbacks of both models. But actually it allows us to: 1. Support old (external, non-reentrant) extensions: they will be executed by dedicated backends. 2. Simplify parallel query execution and make it more efficient. 3. Make the most efficient use of multithreaded PLs (like JVM-based ones). Since there will be no single VM for all connections, but only one per group of them (for example, those belonging to one user), most complaints about sharing a VM between different connections can be avoided. 4. Avoid or minimize problems with OOM and memory fragmentation. 5. Combine with a connection pooler (save inactive connection state without keeping a process or thread for it).
On Tue, Jun 6, 2023 at 11:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Bruce was worried about the loss of isolation that the separate address > spaces gives, and Jeremy shared an anecdote on that. That is an > objection to the idea itself, i.e. even if transition was smooth, > bug-free and effortless, that point remains. I personally think the > isolation we get from separate address spaces is overrated. Yes, it > gives you some protection, but given how much shared memory there is, > the blast radius is large even with separate backend processes. An interesting idea might be to look at the places where we ereport or elog FATAL due to some kind of backend data structure corruption and ask whether there would be an argument for elevating the level to PANIC if we changed this. There are definitely some places where we argue that the only corrupted state is backend-local and thus we don't need to PANIC if it's corrupted. I wonder to what extent this change would undermine that argument. Even if it does, I think it's worth it. Corrupted backend-local data structures aren't that common, thankfully. > I don't think this is worth it, unless we plan to eventually remove the > multi-process mode. We could e.g. make lock table expandable in threaded > mode, and fixed-size in process mode, but the big gains would come from > being able to share things between threads and have variable-length > shared data structures more easily. As long as you need to also support > processes, you need to code to the lowest common denominator and don't > really get the benefits. > > I don't know how long a transition period we need. Maybe 1 release, maybe 5. I think 1 release is wildly optimistic. Even if someone wrote a patch for this and got it committed this release cycle, it's likely that there would be follow-up commits needed over a period of several years before it really worked as well as we'd like. Only after that could we consider deprecating the per-process way. 
But I don't think that's necessarily a huge problem. I originally intended DSM as an optional feature: if you didn't have it, then you couldn't use features that depended on it, but the rest of the system still worked. Eventually, other people liked it enough that we decided to introduce hard dependencies on it. I think that's a good model for a change like this. When the inventor of a new system thinks that we should have a hard dependency on it, MEH. When there's a groundswell of other, unaffiliated hackers making that argument, COOL. I'm also not quite convinced that there's no long-term use case for multi-process mode. Maybe you're right and there isn't, but that amounts to arguing that every extension in the world will be happy to run in a multi-threaded world rather than not. I don't know if I quite believe that. It also amounts to arguing that performance is going to be better for everyone in this new multi-threaded mode, and that it won't cause unforeseen problems for any significant numbers of users, and maybe those things are true, but I think we need to get this new system in place and get some real-world experience before we can judge these kinds of things. I agree that, in theory, it would be nice to get to a place where the multi-process mode is a dinosaur and that we can just rip it out ... but I don't share your confidence that we can get there in any short time period. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 6, 2023 at 2:51 PM Kirk Wolak <wolakk@gmail.com> wrote: > I do wonder if we could add better threading within any given session/process to get a hybrid? > [maybe this gets us closer to solving some of the problems incrementally?] I don't think it helps much -- if anything, I think that would be more complicated. > If I could have anything (today)... I would prefer a Master-Master Implementation leveraging some > of the ultra-fast server-server communication protocols to help sync things. Then I wouldn't care. > I could avoid the O/S Overwhelm caused by excessive processes, via spinning up machines. > [Unfortunately I know that PG leverages the filesystem cache, etc to such a degree that communicating > from one master to another would require a really special architecture there. And the N! communication lines]. I think there's plenty of interesting things to improve in this area, but they're different things than what this thread is about. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, 5 Jun 2023 at 10:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I spoke with some folks at PGCon about making PostgreSQL multi-threaded, > so that the whole server runs in a single process, with multiple > threads. It has been discussed many times in the past, last thread on > pgsql-hackers was back in 2017 when Konstantin made some experiments [0]. > > I feel that there is now pretty strong consensus that it would be a good > thing, more so than before. Lots of work to get there, and lots of > details to be hashed out, but no objections to the idea at a high level. > > The purpose of this email is to make that silent consensus explicit. If > you have objections to switching from the current multi-process > architecture to a single-process, multi-threaded architecture, please > speak up. I suppose I should reiterate my comments that I gave at the time. I'm not sure they qualify as "objections" but they're some kind of general concern. I think of processes and threads as fundamentally the same things, just a slightly different API -- namely that in one memory is by default unshared and needs to be explicitly shared and in the other it's default shared and needs to be explicitly unshared. There are obvious practical API differences too like how signals are handled but those are just implementation details. So the question is whether defaulting to shared memory or defaulting to unshared memory is better -- and whether the implementation details are significant enough to override that. And my general concern was that in my experience default shared memory leads to hugely complex and chaotic shared data structures with often very loose rules for ownership of shared data and who is responsible for making updates, handling errors, or releasing resources. So all else equal I feel like having a good infrastructure for explicitly allocating shared memory segments and managing them is superior. However all else is not equal. 
The discussion in the hallway turned to whether we could just use pthread primitives like mutexes and condition variables instead of our own locks -- and the point was raised that those libraries assume these objects will be in threads of one process not shared across completely different processes. And that's probably not the only library we're stuck reimplementing because of this. So the question is are these things worth taking the risk of having data structures shared implicitly and having unclear ownership rules? I was going to say supporting both modes relieves that fear since it would force that extra discipline and allow testing under the more restrictive rule. However I don't think that will actually work. As long as we support both modes we lose all the advantages of threads. We still wouldn't be able to use pthreads and would still need to provide and maintain our homegrown replacement infrastructure. -- greg
On Tue, Jun 6, 2023 at 6:52 AM Andrew Dunstan <andrew@dunslane.net> wrote: > If we were starting out today we would probably choose a threaded implementation. But moving to threaded now seems to me like a multi-year, multi-person project with the prospect of years to come chasing bugs and the prospect of fairly modest advantages. The risk to reward doesn't look great. > > That's my initial reaction. I could be convinced otherwise. Here is one thing I often think about when contemplating threads. Take a look at dsa.c. It calls itself a shared memory allocator, but really it has two jobs, the second being to provide software emulation of virtual memory. That’s behind dshash.c and now the stats system, and various parts of the parallel executor code. It’s slow and complicated, and far from the state of the art. I wrote that code (building on allocator code from Robert) with the expectation that it was a transitional solution to unblock a bunch of projects. I always expected that we'd eventually be deleting it. When I explain that subsystem to people who are not steeped in the lore of PostgreSQL, it sounds completely absurd. I mean, ... it is, right? My point is that we’re doing pretty unreasonable and inefficient contortions to develop new features -- we're not just happily chugging along without threads at no cost.
Thomas Munro <thomas.munro@gmail.com> writes: > ... My point is > that we’re doing pretty unreasonable and inefficient contortions to > develop new features -- we're not just happily chugging along without > threads at no cost. Sure, but it's not like chugging along *with* threads would be no-cost. Others have already pointed out the permanent downsides of that, such as loss of isolation between sessions leading to debugging headaches (and, I predict, more than one security-grade bug). I agree that if we were building this system from scratch today, we'd probably choose thread-per-session not process-per-session. But the costs of getting to that from where we are will be enormous. I seriously doubt that the net benefits could justify that work, no matter how long you want to look forward. It's not really significantly different from "let's rewrite the server in C++/Rust/$latest_hotness". regards, tom lane
On Tue, Jun 6, 2023 at 11:30 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 6, 2023 at 11:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > Bruce was worried about the loss of isolation that the separate address > > spaces gives, and Jeremy shared an anecdote on that. That is an > > objection to the idea itself, i.e. even if transition was smooth, > > bug-free and effortless, that point remains. I personally think the > > isolation we get from separate address spaces is overrated. Yes, it > > gives you some protection, but given how much shared memory there is, > > the blast radius is large even with separate backend processes. > > An interesting idea might be to look at the places where we ereport or > elog FATAL due to some kind of backend data structure corruption and > ask whether there would be an argument for elevating the level to > PANIC if we changed this. There are definitely some places where we > argue that the only corrupted state is backend-local and thus we don't > need to PANIC if it's corrupted. I wonder to what extent this change > would undermine that argument. With the threaded model, that shouldn't change, right? Even though all memory space is now shared across threads, we can maintain the same rules for modifying critical shared data structures, i.e. modifying such memory should still fall under the CRITICAL SECTION, so I guess the rules for promoting error level to PANIC will remain the same. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 7, 2023 at 7:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@gmail.com> writes: > > ... My point is > > that we’re doing pretty unreasonable and inefficient contortions to > > develop new features -- we're not just happily chugging along without > > threads at no cost. > > Sure, but it's not like chugging along *with* threads would be no-cost. > Others have already pointed out the permanent downsides of that, such > as loss of isolation between sessions leading to debugging headaches > (and, I predict, more than one security-grade bug). I agree that in some cases debugging would be harder, but I feel there are cases where the thread model will make the debugging experience better. E.g. breaking at the entry point of a new parallel worker or other worker is hard with the process model, but would be very smooth with the thread model, in my experience. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 6/6/23 22:02, Tom Lane wrote: > (and, I predict, more than one security-grade bug). *That* is what worries me the most > I agree that if we were building this system from scratch today, > we'd probably choose thread-per-session not process-per-session. > But the costs of getting to that from where we are will be enormous. > I seriously doubt that the net benefits could justify that work, > no matter how long you want to look forward. It's not really > significantly different from "let's rewrite the server in > C++/Rust/$latest_hotness". Agreed. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Jun 6, 2023 at 10:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I agree that if we were building this system from scratch today, > we'd probably choose thread-per-session not process-per-session. > But the costs of getting to that from where we are will be enormous. > I seriously doubt that the net benefits could justify that work, > no matter how long you want to look forward. It's not really > significantly different from "let's rewrite the server in > C++/Rust/$latest_hotness". Well, I don't know, I think that's a bunch of things that are not all the same. Rewriting the server in a whole different programming language would be a massive effort. I can't really see anyone volunteering to rewrite a million lines of C (or whatever we've got) in Rust, and I'm not sure who would use the result if they did, or why. We could, perhaps, allow new source files to be written in Rust while keeping old ones written in C, but then every hacker has to know two languages, and having code written in both languages manipulating the same data structures would probably be a recipe for confusion and bugs. It's hard to believe that the upsides would be worth the pain. Maybe transition to C++ would be easier, or maybe it wouldn't, I'm not sure. But from my point of view, the issue here is simply that stop-the-world-and-change-everything is not a viable way forward for a project the size of PostgreSQL, but incremental changes are potentially acceptable if the benefits outweigh the drawbacks. So what are the costs, exactly, of transition to a threaded model? It seems to me that there's basically one problem: global variables. Sure, there's a bunch of stuff around process management that would likely have to be revised in some way, but that's not that much code and wouldn't have that much impact on unrelated development. 
However, the project's widespread and often gratuitous use of global variables would have to be addressed in some way, and I think that will pretty much inevitably involve touching all of those global variable declarations in some way. Now, if we can get away with simply marking all of those thread-local, then it's of the same general flavor as PGDLLIMPORT. I am aware that you think that PGDLLIMPORT markings are ugly as sin, and these would be more widespread since they'd have to be applied to literally every global variable, including file-local ones. However, it's hard to imagine that adding such markings would cause PostgreSQL development to grind to a halt. It would cause minor rebasing pain and that's about it. I hope that we'd have some tool that would make the build fail if any markings are missing and everybody would be annoyed until they finished rebasing all of their WIP patches and then that would just be how things are. It's not *lovely* but it doesn't sound that bad either. In my mind, the bigger question is how much further than that do you have to go? I think I remember a previous conversation with Andres where he opined that thread-local variables are "really expensive" (and I apologize in advance if I'm mis-remembering this). Now, Andres is not a man who accepts a tax on performance of any size without a fight, so his "really expensive" might turn out to resemble my "pretty cheap." However, if widespread use of TLS is too expensive and we have to start rewriting code to not depend on global variables, that's going to be more of a problem. If we can get by with doing such rewrites only in performance-critical places, it might still not be too bad. Personally, I think the degree of dependence that PostgreSQL has on global variables is pretty excessive and I don't think that a certain amount of refactoring to reduce it would be a bad thing. 
If it turns into an infinite series of hastily-written patches to rejigger every source file we have, though, then I'm not really on board with that. Heikki mentions the idea of having a central Session object and just passing that around. I have a hard time believing that's going to work out nicely. First, it's not extensible. Right now, if you need a bit of additional session-local state, you just declare a variable and you're all set. That's not a perfect system and does cause some problems, but we can't go from there to a system where it's impossible to add session-local state without hacking core. Second, we will be sad if session.h ends up #including every other header file that defines a data structure anywhere in the backend. Or at least I'll be sad. I'm not actually against the idea of having some kind of session object that we pass around, but I think it either needs to be limited to a relatively small set of well-defined things, or else it needs to be designed in some kind of extensible way that doesn't require it to know the full details of every sort of object that's being used as session-local state anywhere in the system. I haven't really seen any convincing design ideas around this yet. But I think jumping to the conclusion that the migration path here is akin to rewriting the whole code base in Rust is jumping too far. I do see some problems here that I don't know how to solve, but that's nowhere near in the same category as find . -name '*.c' -exec rm {} \; -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jun 5, 2023 at 8:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I spoke with some folks at PGCon about making PostgreSQL multi-threaded, > so that the whole server runs in a single process, with multiple > threads. It has been discussed many times in the past, last thread on > pgsql-hackers was back in 2017 when Konstantin made some experiments [0]. > > I feel that there is now pretty strong consensus that it would be a good > thing, more so than before. Lots of work to get there, and lots of > details to be hashed out, but no objections to the idea at a high level. > > The purpose of this email is to make that silent consensus explicit. If > you have objections to switching from the current multi-process > architecture to a single-process, multi-threaded architecture, please > speak up. > > If there are no major objections, I'm going to update the developer FAQ, > removing the excuses there for why we don't use threads [1]. And we can > start to talk about the path to get there. Below is a list of some > hurdles and proposed high-level solutions. This isn't an exhaustive > list, just some of the most obvious problems: > > # Transition period > > The transition surely cannot be done fully in one release. Even if we > could pull it off in core, extensions will need more time to adapt. > There will be a transition period of at least one release, probably > more, where you can choose multi-process or multi-thread model using a > GUC. Depending on how it goes, we can document it as experimental at first. > > # Thread per connection > > To get started, it's most straightforward to have one thread per > connection, just replacing backend process with a backend thread. In the > future, we might want to have a thread pool with some kind of a > scheduler to assign active queries to worker threads. Or multiple > threads per connection, or spawn additional helper threads for specific > tasks. But that's future work. 
With multiple processes, we can use all the available cores (at least theoretically, if all those processes are independent). But is that guaranteed with a single-process multi-thread model? Google didn't throw up any definitive answer; it usually depends upon the OS and architecture. Maybe a good start is to use threads instead of parallel workers, e.g. for parallel vacuum, parallel query and so on, while leaving processes for connections and leaders. That itself might take significant time. Based on that experience, we could move to a completely threaded model. Based on my experience with other similar products, I think we will settle on a multi-process multi-thread model. -- Best Wishes, Ashutosh Bapat
07.06.2023 15:53, Robert Haas wrote: > Right now, if you need a bit > of additional session-local state, you just declare a variable and > you're all set. That's not a perfect system and does cause some > problems, but we can't go from there to a system where it's impossible > to add session-local state without hacking core. > or else it needs to > be designed in some kind of extensible way that doesn't require it to > know the full details of every sort of object that's being used as > session-local state anywhere in the system. And it is quite possible, although with indirection involved. For example, say we want to add a session variable "my_hello_var". We first declare an "offset variable", register it with the session machinery at load time, and then use a function and/or macro to get the actual address:

/* session.h */
extern size_t RegisterSessionVar(size_t size);
extern void *CurSessionVar(size_t offset);

/* session.c */
typedef struct Session
{
	char	   *vars;		/* sessionVarsSize bytes, allocated at session start */
} Session;

static _Thread_local Session *curSession;
static size_t sessionVarsSize = 0;

size_t
RegisterSessionVar(size_t size)
{
	size_t		off = sessionVarsSize;

	sessionVarsSize += size;
	return off;
}

void *
CurSessionVar(size_t offset)
{
	return curSession->vars + offset;
}

/* module_internal.h */
typedef int my_hello_var_t;
extern size_t my_hello_var_offset;

/* access macro */
#define my_hello_var \
	(*(my_hello_var_t *) CurSessionVar(my_hello_var_offset))

/* module.c */
size_t		my_hello_var_offset = 0;

void
_PG_init(void)
{
	my_hello_var_offset = RegisterSessionVar(sizeof(my_hello_var_t));
}

For security reasons, the offset could be mangled. ------ regards, Yura Sokolov
On 6/5/23 17:33, Heikki Linnakangas wrote: > On 05/06/2023 11:18, Tom Lane wrote: >> Heikki Linnakangas <hlinnaka@iki.fi> writes: >>> I spoke with some folks at PGCon about making PostgreSQL multi-threaded, >>> so that the whole server runs in a single process, with multiple >>> threads. It has been discussed many times in the past, last thread on >>> pgsql-hackers was back in 2017 when Konstantin made some experiments >>> [0]. >> >>> I feel that there is now pretty strong consensus that it would be a good >>> thing, more so than before. Lots of work to get there, and lots of >>> details to be hashed out, but no objections to the idea at a high level. >> >>> The purpose of this email is to make that silent consensus explicit. If >>> you have objections to switching from the current multi-process >>> architecture to a single-process, multi-threaded architecture, please >>> speak up. >> >> For the record, I think this will be a disaster. There is far too much >> code that will get broken, largely silently, and much of it is not >> under our control. > > Noted. Other large projects have gone through this transition. It's not > easy, but it's a lot easier now than it was 10 years ago. The platform > and compiler support is there now, all libraries have thread-safe > interfaces, etc. > Is the platform support really there for all platforms we want/intend to support? I have no problem believing that for modern Linux/BSD systems, but what about the older stuff we currently support. Also, which other projects did this transition? Is there something we could learn from them? Were they restricted to much smaller list of platforms? > I don't expect you or others to buy into any particular code change at > this point, or to contribute time into it. Just to accept that it's a > worthwhile goal. If the implementation turns out to be a disaster, then > it won't be accepted, of course. But I'm optimistic. 
> I personally am not opposed to the effort in principle, but how do you even evaluate cost and benefits for a transition like this? I have no idea how to quantify the costs/benefits for this as a single change. I've seen some benchmarks in the past, but it's hard to say which of these improvements are possible only with threads, and what would be doable with less invasive changes with the process model. IMHO the only way to move this forward is to divide this into smaller changes, each of which gives us some benefit we'd want anyway. For example, this thread already mentioned improving handling of many connections. AFAICS that requires isolating "session state", which seems useful even without a full switch to threads as it makes connection pooling simpler. It should be easier to get a buy-in for these changes, while introducing abstractions simplifying the switch to threads. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 8, 2023 at 7:20 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Is the platform support really there for all platforms we want/intend to > support? I have no problem believing that for modern Linux/BSD systems, > but what about the older stuff we currently support. There is a conversation to be had about whether/when/how to adopt C11/C17 threads (= same API on Windows and Unix, but sadly two straggler systems don't have required OS support yet (macOS, OpenBSD)), but POSIX + NT threads were all worked out in the 90s. We have last-mover advantage here. > Also, which other projects did this transition? Is there something we > could learn from them? Were they restricted to much smaller list of > platforms? Apache may be interesting. Wide ecosystem of extensions.
Hi, On 2023-06-05 17:51:57 +0300, Heikki Linnakangas wrote: > If there are no major objections, I'm going to update the developer FAQ, > removing the excuses there for why we don't use threads [1]. I think we should do this even if there's no consensus to slowly change to threads. There's clearly no consensus on the opposite either. > # Transition period > > The transition surely cannot be done fully in one release. Even if we could > pull it off in core, extensions will need more time to adapt. There will be > a transition period of at least one release, probably more, where you can > choose multi-process or multi-thread model using a GUC. Depending on how it > goes, we can document it as experimental at first. One interesting bit around the transition is what tooling we ought to provide to detect problems. It could e.g. be reasonably feasible to write something checking how many read-write global variables an extension has on linux systems. > # Extensions > > A lot of extensions also contain global variables or other things that break > in a multi-threaded environment. We need a way to label extensions that > support multi-threading. And in the future, also extensions that *require* a > multi-threaded server. > > Let's add flags to the control file to mark if the extension is thread-safe > and/or process-safe. If you try to load an extension that's not compatible > with the server's mode, throw an error. I don't think the control file is the right place - that seems more like something that should be signalled via PG_MODULE_MAGIC. We need to check this not just during CREATE EXTENSION, but also during loading of libraries - think of shared_preload_libraries. > # Restart on crash > > If a backend process crashes, postmaster terminates all other backends and > restarts the system. That's hard (impossible?) to do safely if everything > runs in one process. 
We can continue have a separate postmaster process that > just monitors the main process and restarts it on crash. Yea, we definitely need the supervisor function in a separate process. Presumably that means we need to split off some of the postmaster responsibilities - e.g. I don't think it'd make sense to handle connection establishment in the supervisor process. I wonder if this is something that could end up being beneficial even in the process world. A related issue is that we won't get SIGCHLD in the supervisor process anymore. So we'd need to come up with some design for that. Greetings, Andres Freund
Hi, On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote: > 2. While I wouldn't want to necessarily discourage a moonshot effort, I > would ask if developer time could be better spent on tackling some of the > other problems around vertical scalability? Per some PGCon discussions, > there's still room for improvement in how PostgreSQL can best utilize > resources available very large "commodity" machines (a 448-core / 24TB RAM > instance comes to mind). I think we're starting to hit quite a few limits related to the process model, particularly on bigger machines. The overhead of cross-process context switches is inherently higher than switching between threads in the same process - and my suspicion is that that overhead will continue to increase. Once you have a significant number of connections we end up spending a *lot* of time in TLB misses, and that's inherent to the process model, because you can't share the TLB across processes. The amount of duplicated code we have to deal with due to the process model is quite substantial. We have local memory, statically allocated shared memory and dynamically allocated shared memory variants for some things. And that's just going to continue. > I'm purposely giving a nonanswer on whether it's a worthwhile goal, but > rather I'd be curious where it could stack up against some other efforts to > continue to help PostgreSQL improve performance and handle very large > workloads. There's plenty of things we can do before, but in the end I think tackling the issues you mention and moving to threads are quite tightly linked. Greetings, Andres Freund
On 07.06.23 23:30, Andres Freund wrote: > Yea, we definitely need the supervisor function in a separate > process. Presumably that means we need to split off some of the postmaster > responsibilities - e.g. I don't think it'd make sense to handle connection > establishment in the supervisor process. I wonder if this is something that > could end up being beneficial even in the process world. Something to think about perhaps ... how would that be different from using an existing external supervisor process like systemd or supervisord?
Tomas Vondra schrieb am 07.06.2023 um 21:20: > Also, which other projects did this transition? Is there something we > could learn from them? Were they restricted to much smaller list of > platforms? Firebird did this a while ago, if I'm not mistaken. Not open source, but Oracle was historically multi-threaded on Windows and multi-process on all other platforms. I _think_ starting with 19c you can optionally run it multi-threaded on Linux as well. But I doubt they are willing to share any insights ;)
Hi, On 2023-06-05 20:15:56 -0400, Bruce Momjian wrote: > Yes, sorry, critical sections is what I was remembering. My question is > whether all unexpected backend exits should be treated as critical > sections? Yes. People have argued that the process model is more robust. But it turns out that we have to crash-restart for just about any "bad failure" anyway. It used to be (a long time ago) that we didn't, but that was just broken. There are some advantages in debuggability, because it's a *tad* harder for a bug in one process to cause another to crash, if less state is shared. But that's by far outweighed by most debugging / validation tools not understanding the multi-processes-with-shared-shmem model. Greetings, Andres Freund
Hi, On 2023-06-07 23:39:01 +0200, Peter Eisentraut wrote: > On 07.06.23 23:30, Andres Freund wrote: > > Yea, we definitely need the supervisor function in a separate > > process. Presumably that means we need to split off some of the postmaster > > responsibilities - e.g. I don't think it'd make sense to handle connection > > establishment in the supervisor process. I wonder if this is something that > > could end up being beneficial even in the process world. > > Something to think about perhaps ... how would that be different from using > an existing external supervisor process like systemd or supervisord. I think that's not really comparable. A postgres internal solution can maintain resources like shared memory allocations, listening sockets, etc across crash restarts. With something like systemd that's much harder to make work well. And then there's the fact that you now need to deal with much more drastic cross-platform behavioural differences. Greetings, Andres Freund
Hi, On 2023-06-07 08:53:24 -0400, Robert Haas wrote: > In my mind, the bigger question is how much further than that do you > have to go? I think I remember a previous conversation with Andres > where he opined that thread-local variables are "really expensive" > (and I apologize in advance if I'm mis-remembering this). It really is architecture and OS dependent. I think time has reduced the cost somewhat, due to older architectures / OSs aging out. But yea, it's not free. I suspect that we'd gain *far* more from the higher TLB hit rate, than we'd lose due to using many thread local variables. Even with a stupid search-and-replace approach. But we'd gain more if we reduced the number of thread local variables... > Now, Andres is not a man who accepts a tax on performance of any size > without a fight, so his "really expensive" might turn out to resemble my > "pretty cheap." However, if widespread use of TLS is too expensive and we > have to start rewriting code to not depend on global variables, that's going > to be more of a problem. If we can get by with doing such rewrites only in > performance-critical places, it might still not be too bad. Personally, I > think the degree of dependence that PostgreSQL has on global variables is > pretty excessive and I don't think that a certain amount of refactoring to > reduce it would be a bad thing. If it turns into an infinite series of > hastily-written patches to rejigger every source file we have, though, then > I'm not really on board with that. I think a lot of such rewrites would be a good idea, even if we right now all agree to swear we'll never go to threads. Not having any sort of grouping of global variables makes it IMO considerably harder to debug. I can easily ask somebody to print out a variable pointing to a struct describing the state of a subsystem. I can't really do that for 50 variables. And once you do that, I think you reduce the TLS cost substantially. 
The variable pointing to the struct is already likely in a register. Whereas each individual variable being in TLS makes the job harder for the compiler. Greetings, Andres Freund
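[Editor's sketch of the grouping being discussed: per-backend state gathered into one struct behind a single thread-local pointer, instead of many scattered thread-local variables. The names (SessionState, CurrentSession, begin_subxact) are invented for illustration, not actual PostgreSQL identifiers.]

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical grouping of per-backend globals into one struct that is
 * reached through a single thread-local pointer. */
typedef struct SessionState
{
	int		xact_nesting_level;
	bool	in_recovery;
} SessionState;

static __thread SessionState *CurrentSession;

static void
session_init(void)
{
	CurrentSession = calloc(1, sizeof(SessionState));
}

/* Each access costs one TLS load of the base pointer, which the compiler
 * can keep in a register across several field accesses, rather than one
 * TLS access per variable. */
static void
begin_subxact(void)
{
	CurrentSession->xact_nesting_level++;
}
```

This is also the debuggability win described above: printing `*CurrentSession` shows the whole subsystem state at once.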
Hi,

On 2023-06-06 16:14:41 -0400, Greg Stark wrote:
> I think of processes and threads as fundamentally the same things,
> just a slightly different API -- namely that in one memory is by
> default unshared and needs to be explicitly shared and in the other
> it's default shared and needs to be explicitly unshared.

In theory that's true, in practice it's entirely wrong.

For one, the amount of complexity you need to deal with to share state across processes, post fork, is *substantial*. You can share file descriptors across processes, but it's extremely platform dependent and requires cooperation between both processes. You can share memory allocations made after the processes forked, but you're typically not going to be able to guarantee they're at the same pointer values. Etc.

But more importantly, there are crucial performance differences between threads and processes. Having the same memory mapping between threads allows the hardware to share the TLB (on x86 via process context identifiers), which isn't realistically possible with different processes.

> However all else is not equal. The discussion in the hallway turned to
> whether we could just use pthread primitives like mutexes and
> condition variables instead of our own locks -- and the point was
> raised that those libraries assume these objects will be in threads of
> one process not shared across completely different processes.

Independent of threads vs processes, I am -many on using pthread mutexes and condition variables. From experiments, that *loses* performance, and we lose a lot of control and increase cross-platform behavioural differences. I also don't see any benefit in going in that direction.

> And that's probably not the only library we're stuck reimplementing
> because of this. So the question is are these things worth taking the
> risk of having data structures shared implicitly and having unclear
> ownership rules?
>
> I was going to say supporting both modes relieves that fear since it
> would force that extra discipline and allow testing under the more
> restrictive rule. However I don't think that will actually work. As
> long as we support both modes we lose all the advantages of threads.

I don't think that has to be true. We could e.g. eventually decide that we don't support parallel query without threading support - which would allow us to get rid of a very significant amount of code and runtime overhead.

Greetings,

Andres Freund
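[Editor's note: the file-descriptor sharing called "extremely platform dependent" above looks roughly like this on POSIX systems - a descriptor shipped between cooperating processes over a Unix-domain socket with SCM_RIGHTS ancillary data. A minimal sketch, not PostgreSQL code.]

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open file descriptor over a Unix-domain socket. */
static int
send_fd(int sock, int fd)
{
	char		dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
						  .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor; the kernel installs a duplicate in the
 * receiver's descriptor table. */
static int
recv_fd(int sock)
{
	char		dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
						  .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
	struct cmsghdr *cmsg;
	int			fd = -1;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg != NULL && cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

With threads none of this machinery is needed: a descriptor is just an int visible to every thread.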
On 6/7/23 2:39 PM, Thomas Kellerer wrote:
> Tomas Vondra schrieb am 07.06.2023 um 21:20:
>> Also, which other projects did this transition? Is there something we
>> could learn from them? Were they restricted to much smaller list of
>> platforms?
>
> Not open source, but Oracle was historically multi-threaded on Windows
> and multi-process on all other platforms.
> I _think_ starting with 19c you can optionally run it multi-threaded on
> Linux as well.

Looks like it actually became publicly available in 12c. AFAICT Oracle supports both modes today, with a config parameter to switch between them. This is a very interesting case study.

Concepts Manual: https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/process-architecture.html#GUID-4B460E97-18A0-4F5A-A62F-9608FFD43664

Reference: https://docs.oracle.com/en/database/oracle/oracle-database/23/refrn/THREADED_EXECUTION.html#GUID-7A668A49-9FC5-4245-AD27-10D90E5AE8A8

List of Oracle process types, which ones can run as threads and which ones always run as processes: https://docs.oracle.com/en/database/oracle/oracle-database/23/refrn/background-processes.html#GUID-86184690-5531-405F-AA05-BB935F57B76D

Looks like they have four processes that will never run in threads:
* dbwriter (writes dirty blocks in background)
* process monitor (cleanup after process crash to avoid full server restarts) <jealous>
* process spawner (like postmaster)
* time keeper process

Per Tim Hall's oracle-base, it seems that plenty of people are sticking with the process model, and that one use case for threads was: "consolidating lots of instances onto a single server without using the multitenant option. Without the multithreaded model, the number of OS processes could get very high." https://oracle-base.com/articles/12c/multithreaded-model-using-threaded_execution_12cr1

I did a google search for "oracle threaded_execution" and browsed a bit; didn't see anything that seems earth-shattering so far.
Ludovico Caldara and Martin Bach published blogs when it was first released, which just introduced but didn't test or hammer on it. The feature has existed for 10 years now and I don't see any blog posts saying that "everyone should use this because it doubles your performance" or anything like that. I think if there were really significant performance gains then there would be many interesting blog posts on the internet by now from the independent Oracle professional community - I know many of these people. In fact, there's an interesting blog by Kamil Stawiarski from 2015 where he actually observed one case of /slower/ performance with threads. That blog post ends with: "So I raise the question: why and when use threaded execution? If ever?" https://blog.ora-600.pl/2015/12/17/oracle-12c-internals-of-threaded-execution/ I'm not sure if he ever got an answer -Jeremy -- http://about.me/jeremy_schneider
On Thu, Jun 8, 2023 at 10:37 AM Jeremy Schneider <schneider@ardentperf.com> wrote: > On 6/7/23 2:39 PM, Thomas Kellerer wrote: > > Tomas Vondra schrieb am 07.06.2023 um 21:20: > >> Also, which other projects did this transition? Is there something we > >> could learn from them? Were they restricted to much smaller list of > >> platforms? > > > > Not open source, but Oracle was historically multi-threaded on Windows > > and multi-process on all other platforms. > > I _think_ starting with 19c you can optionally run it multi-threaded on > > Linux as well. > Looks like it actually became publicly available in 12c. AFAICT Oracle > supports both modes today, with a config parameter to switch between them. It's old, but this describes the 4 main models and which well known RDBMSes use them in section 2.3: https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf TL;DR DB2 is the winner, it can do process-per-connection, thread-per-connection, process-pool or thread-pool. I understand this thread to be about thread-per-connection (= backend, session, socket) for now.
On Thu, Jun 8, 2023 at 3:00 AM Andres Freund <andres@anarazel.de> wrote:
> Yea, we definitely need the supervisor function in a separate
> process. Presumably that means we need to split off some of the postmaster
> responsibilities - e.g. I don't think it'd make sense to handle connection
> establishment in the supervisor process. I wonder if this is something that
> could end up being beneficial even in the process world.
>
> A related issue is that we won't get SIGCHLD in the supervisor process
> anymore. So we'd need to come up with some design for that.

If we fork the main Postgres process from the supervisor process, then whenever the main process exits, SIGCHLD will be delivered to the supervisor process, right?

I agree we can handle all connection establishment and other thread-related stuff in the main Postgres process. But I assume this main process should be forked out of the supervisor process.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi, On 6/8/23 12:37 AM, Jeremy Schneider wrote: > On 6/7/23 2:39 PM, Thomas Kellerer wrote: >> Tomas Vondra schrieb am 07.06.2023 um 21:20: > > I did google search for "oracle threaded_execution" and browsed a bit; > didn't see anything that seems earth shattering so far. FWIW, I recall Karl Arao's wiki page: https://karlarao.github.io/karlaraowiki/#%2212c%20threaded_execution%22 where some performance and memory consumption studies have been done. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Jun 7, 2023 at 11:37 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote:
> > 2. While I wouldn't want to necessarily discourage a moonshot effort, I
> > would ask if developer time could be better spent on tackling some of the
> > other problems around vertical scalability? Per some PGCon discussions,
> > there's still room for improvement in how PostgreSQL can best utilize
> > resources available very large "commodity" machines (a 448-core / 24TB RAM
> > instance comes to mind).
>
> I think we're starting to hit quite a few limits related to the process model,
> particularly on bigger machines. The overhead of cross-process context
> switches is inherently higher than switching between threads in the same
> process - and my suspicion is that that overhead will continue to
> increase. Once you have a significant number of connections we end up spending
> a *lot* of time in TLB misses, and that's inherent to the process model,
> because you can't share the TLB across processes.

This part was touched in the "AMA with a Linux Kernel Hacker" Unconference session, where he mentioned that he had proposed an 'mshare' syscall for this.

So maybe a more fruitful way of fixing the perceived issues with the process model is to push for small changes in Linux to overcome these, avoiding a wholesale rewrite?

> The amount of duplicated code we have to deal with due to the process model
> is quite substantial. We have local memory, statically allocated shared memory
> and dynamically allocated shared memory variants for some things. And that's
> just going to continue.

Maybe we can already remove the distinction between static and dynamic shared memory?

Though I already heard some complaints at the conference discussions that having the dynamic version available has made some developers sloppy in using it, resulting in wastefulness.

> > I'm purposely giving a nonanswer on whether it's a worthwhile goal, but
> > rather I'd be curious where it could stack up against some other efforts to
> > continue to help PostgreSQL improve performance and handle very large
> > workloads.
>
> There's plenty of things we can do before, but in the end I think tackling the
> issues you mention and moving to threads are quite tightly linked.

Still, we should be focusing our attention on solving the issues and not on "moving to threads" and hoping this will fix the issues by itself.

Cheers
Hannu
I think I remember that in the early days of development somebody did send a patch-set for making PostgreSQL threaded on Solaris. I don't remember why this did not catch on. On Wed, Jun 7, 2023 at 11:40 PM Thomas Kellerer <shammat@gmx.net> wrote: > > Tomas Vondra schrieb am 07.06.2023 um 21:20: > > Also, which other projects did this transition? Is there something we > > could learn from them? Were they restricted to much smaller list of > > platforms? > > Firebird did this a while ago if I'm not mistaken. > > Not open source, but Oracle was historically multi-threaded on Windows and multi-process on all other platforms. > I _think_ starting with 19c you can optionally run it multi-threaded on Linux as well. > > But I doubt, they are willing to share any insights ;) > > >
On Thu, Jun 8, 2023 at 12:09 AM Andres Freund <andres@anarazel.de> wrote: ... > We could e.g. eventually decide that we > don't support parallel query without threading support - which would allow us > to get rid of a very significant amount of code and runtime overhead. Here I was hoping to go in the opposite direction and support parallel query across replicas. This looks much more doable based on the process model than the single process / multiple threads model. --- Cheers Hannu
On Thu, Jun 8, 2023 at 11:54 AM Hannu Krosing <hannuk@google.com> wrote: > > On Wed, Jun 7, 2023 at 11:37 PM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote: > > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I > > > would ask if developer time could be better spent on tackling some of the > > > other problems around vertical scalability? Per some PGCon discussions, > > > there's still room for improvement in how PostgreSQL can best utilize > > > resources available very large "commodity" machines (a 448-core / 24TB RAM > > > instance comes to mind). > > > > I think we're starting to hit quite a few limits related to the process model, > > particularly on bigger machines. The overhead of cross-process context > > switches is inherently higher than switching between threads in the same > > process - and my suspicion is that that overhead will continue to > > increase. Once you have a significant number of connections we end up spending > > a *lot* of time in TLB misses, and that's inherent to the process model, > > because you can't share the TLB across processes. > > > This part was touched in the "AMA with a Linux Kernale Hacker" > Unconference session where he mentioned that the had proposed a > 'mshare' syscall for this. Also, the *static* huge pages already let you solve this problem now by sharing the page tables Cheers Hannu
On 6/8/23 01:37, Thomas Munro wrote: > On Thu, Jun 8, 2023 at 10:37 AM Jeremy Schneider > <schneider@ardentperf.com> wrote: >> On 6/7/23 2:39 PM, Thomas Kellerer wrote: >>> Tomas Vondra schrieb am 07.06.2023 um 21:20: >>>> Also, which other projects did this transition? Is there something we >>>> could learn from them? Were they restricted to much smaller list of >>>> platforms? >>> >>> Not open source, but Oracle was historically multi-threaded on Windows >>> and multi-process on all other platforms. >>> I _think_ starting with 19c you can optionally run it multi-threaded on >>> Linux as well. >> Looks like it actually became publicly available in 12c. AFAICT Oracle >> supports both modes today, with a config parameter to switch between them. > > It's old, but this describes the 4 main models and which well known > RDBMSes use them in section 2.3: > > https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf > > TL;DR DB2 is the winner, it can do process-per-connection, > thread-per-connection, process-pool or thread-pool. > I think the basic architectures are known, especially from the user perspective. I'm more interested in challenges the projects faced while moving from one architecture to the other, or how / why they support more than just one, etc. In [1] Heikki argued that: I don't think this is worth it, unless we plan to eventually remove the multi-process mode. ... As long as you need to also support processes, you need to code to the lowest common denominator and don't really get the benefits. But these projects clearly support multiple architectures, and have no intention to ditch some of them. So how did they do that? Surely they think there are benefits. One option would be to just have separate code paths for processes and threads, but the effort required to maintain and improve that would be deadly. So the only feasible option seems to be they managed to abstract the subsystems enough for the "regular" code to not care about model. 
[1] https://www.postgresql.org/message-id/6e3082dc-ff29-9cbf-847e-5f570828b46b@iki.fi > I understand this thread to be about thread-per-connection (= backend, > session, socket) for now. Maybe, although people also proposed to switch the parallel query to threads (so that'd be multiple threads per session). But I don't think it really matters, the concerns are mostly about moving from one architecture to another and/or supporting both. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Hi,
>
> On 2023-06-07 08:53:24 -0400, Robert Haas wrote:
> > Now, Andres is not a man who accepts a tax on performance of any size
> > without a fight, so his "really expensive" might turn out to resemble my
> > "pretty cheap." However, if widespread use of TLS is too expensive and we
> > have to start rewriting code to not depend on global variables, that's going
> > to be more of a problem. If we can get by with doing such rewrites only in
> > performance-critical places, it might not still be too bad. Personally, I
> > think the degree of dependence that PostgreSQL has on global variables is
> > pretty excessive and I don't think that a certain amount of refactoring to
> > reduce it would be a bad thing. If it turns into an infinite series of
> > hastily-written patches to rejigger every source file we have, though, then
> > I'm not really on board with that.
>
> I think a lot of such rewrites would be a good idea, even if we right now all
> agree to swear we'll never go to threads. Not having any sort of grouping of
> global variables makes it IMO considerably harder to debug. I can easily ask
> somebody to print out a variable pointing to a struct describing the state of
> a subsystem. I can't really do that for 50 variables. And once you do that, I
> think you reduce the TLS cost substantially. The variable pointing to the
> struct is already likely in a register. Whereas each individual variable
> being in TLS makes the job harder for the compiler.
I could certainly get on board with a project to tame the use of global variables.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
[snip]

> I think we're starting to hit quite a few limits related to the process model,
> particularly on bigger machines. The overhead of cross-process context
> switches is inherently higher than switching between threads in the same
> process - and my suspicion is that that overhead will continue to increase.
> Once you have a significant number of connections we end up spending a *lot*
> of time in TLB misses, and that's inherent to the process model, because you
> can't share the TLB across processes.
IMHO, as one sysadmin who has previously played with Postgres on "quite large" machines, I'd propose what most would call a "hybrid model"....
* Threads are a very valuable addition for the "frontend" of the server. Most would call this a built-in session-aware connection pooler :)
Heikki's (and others') efforts towards separating connection state into discrete structs are clearly a prerequisite for this; implementation-wise, just toss the connState into a TLS [thread-local storage] variable and many problems just vanish.
Postgres wouldn't be the first to adopt this approach, either...
* For "heavyweight" queries, the scalability of "almost independent" processes w.r.t. NUMA is just _impossible to achieve_ (locality of reference!) with a pure threaded system. When CPU+mem-bound (bandwidth-wise), threads add nothing IMO.
Indeed a separate postmaster is very much needed in order to control the processes / guard overall integrity.
Hence, my humble suggestion is to consider a hybrid architecture which benefits from each model's strengths. I am quite convinced that the transition would be much safer and simpler (I do share most of Tom's and others' concerns...)
Other projects to draw inspiration from:
* Postfix -- multi-process, postfix's master guards processes and performs privileged operations; unprivileged "subsystems". Interesting IPC solutions
* Apache -- MPMs provide flexibility and support for e.g. non-threaded workloads (PHP is the most popular; cfr. "prefork" multi-process MPM)
* NginX is actually multi-process (one per CPU) + event-based (multiplexing) ...
* PowerDNS is internally threaded, but has a "guardian" process. Seems to be evolving to a more hybrid model.
I would suggest something along the lines of :
* postmaster -- process supervision and (potentially privileged) operations; process coordination (i.e descriptor passing); mostly as-is
* frontend -- connection/session handling; possibly even event-driven
* backends -- process heavyweight queries as independently as possible. Can span worker threads AND processes when needed
* dispatcher -- takes care of cached/lightweight queries (cached catalog / full snapshot visibility+processing)
* utility processes can be left "as is" mostly, except to be made multi-threaded for heavy-sync ones (e.g. vacuum workers, stat workers)
For fixed-size buffers, i.e. pages / chunks, I'd say mmapped (anonymous) shared memory isn't that bad... but I haven't read the actual code in years.
For message queues / invalidation messages, I guess that shmem-based sync is really a nuisance. My understanding is that Linux-specific mechanisms (e.g. eventfd) aren't quite considered... or are they?
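[Editor's sketch of the Linux-specific mechanism mentioned above: an eventfd is a kernel counter usable as a latch-like wakeup primitive between threads, or between processes that inherit the descriptor. The latch_set/latch_wait names are invented for illustration.]

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Wake up whoever is blocked on the eventfd by adding 1 to its counter. */
static int
latch_set(int efd)
{
	uint64_t	one = 1;

	return write(efd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}

/* Block until the counter is nonzero; the read returns the accumulated
 * count and resets it to zero (non-semaphore mode). */
static uint64_t
latch_wait(int efd)
{
	uint64_t	count = 0;

	if (read(efd, &count, sizeof(count)) != sizeof(count))
		return 0;
	return count;
}
```

The descriptor can also be handed to poll/epoll, which is what makes it attractive as a signal-free replacement for SIGURG-style latch wakeups.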
> The amount of duplicated code we have to deal with due to the process model
> is quite substantial. We have local memory, statically allocated shared memory
> and dynamically allocated shared memory variants for some things. And that's
> just going to continue.
Code duplication is indeed a problem... but I wouldn't call "different approaches/solution for very similar problems depending on context/requirement" a duplicate. I might well be wrong / lack detail, though... (again: haven't read PG's code for some years already).
Just my two cents.
Thanks,
J.L.
-- Parkinson's Law: Work expands to fill the time alloted to it.
On Thu, 8 Jun 2023 at 11:54, Hannu Krosing <hannuk@google.com> wrote:
>
> On Wed, Jun 7, 2023 at 11:37 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote:
> > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I
> > > would ask if developer time could be better spent on tackling some of the
> > > other problems around vertical scalability? Per some PGCon discussions,
> > > there's still room for improvement in how PostgreSQL can best utilize
> > > resources available very large "commodity" machines (a 448-core / 24TB RAM
> > > instance comes to mind).
> >
> > I think we're starting to hit quite a few limits related to the process model,
> > particularly on bigger machines. The overhead of cross-process context
> > switches is inherently higher than switching between threads in the same
> > process - and my suspicion is that that overhead will continue to
> > increase. Once you have a significant number of connections we end up spending
> > a *lot* of time in TLB misses, and that's inherent to the process model,
> > because you can't share the TLB across processes.
>
> This part was touched in the "AMA with a Linux Kernel Hacker"
> Unconference session where he mentioned that he had proposed a
> 'mshare' syscall for this.
>
> So maybe a more fruitful way to fixing the perceived issues with
> process model is to push for small changes in Linux to overcome these
> avoiding a wholesale rewrite ?

We support not just Linux, but also Windows and several (?) BSDs. I'm not against pushing Linux to make things easier for us, but Linux is an open source project, too, where someone needs to put in time to get the shiny things that you want. And I'd rather see our time spent in PostgreSQL, as Linux is only used by a part of our user base.

> > The amount of duplicated code we have to deal with due to the process model
> > is quite substantial. We have local memory, statically allocated shared memory
> > and dynamically allocated shared memory variants for some things. And that's
> > just going to continue.
>
> Maybe we can already remove the distinction between static and dynamic
> shared memory ?

That sounds like a bad idea: dynamic shared memory is more expensive to maintain than our static shared memory systems, not least because DSM is not guaranteed to share the same addresses in each process' address space.

> Though I already heard some complaints at the conference discussions
> that having the dynamic version available has made some developers
> sloppy in using it resulting in wastefulness.

Do you know any examples of this wastefulness?

> > > I'm purposely giving a nonanswer on whether it's a worthwhile goal, but
> > > rather I'd be curious where it could stack up against some other efforts to
> > > continue to help PostgreSQL improve performance and handle very large
> > > workloads.
> >
> > There's plenty of things we can do before, but in the end I think tackling the
> > issues you mention and moving to threads are quite tightly linked.
>
> Still we should be focusing our attention at solving the issues and
> not at "moving to threads" and hoping this will fix the issues by
> itself.

I suspect that it is much easier to solve some of the issues when working in a shared address space.

E.g. resizing shared_buffers is difficult right now due to the use of a static allocation of shared memory, but if we had access to a single shared address space, it'd be easier to do any cleanup necessary for dynamically increasing/decreasing its size.

Same with parallel workers - if we have a shared address space, the workers can pass any sized objects around without being required to move the tuples through DSM and waiting for the leader process to empty that buffer when it gets full.
Sure, most of that is probably possible with DSM as well, it's just that I see a lot more issues that you need to take care of when you don't have a shared address space (such as the pointer translation we do in dsa_get_address). Kind regards, Matthias van de Meent Neon, Inc.
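[Editor's sketch of the pointer-translation cost mentioned above: when a shared segment maps at a different address in each process, "pointers" stored inside it must be offsets from a per-process base, translated on every use (the idea behind PostgreSQL's dsa_get_address). With a single shared address space, a plain pointer would do. A simplified, single-segment illustration.]

```c
#include <stddef.h>
#include <stdlib.h>

typedef size_t relptr;			/* offset from the segment base */

static char *segment_base;		/* mapped at a different address per process */

/* Convert a real pointer inside the segment to a relocatable offset. */
static relptr
make_relptr(void *p)
{
	return (size_t) ((char *) p - segment_base);
}

/* Every dereference pays for this extra translation step. */
static void *
resolve(relptr rp)
{
	return segment_base + rp;
}
```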
On Thu, Jun 8, 2023 at 2:15 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Thu, 8 Jun 2023 at 11:54, Hannu Krosing <hannuk@google.com> wrote: > > > > On Wed, Jun 7, 2023 at 11:37 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > Hi, > > > > > > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote: > > > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I > > > > would ask if developer time could be better spent on tackling some of the > > > > other problems around vertical scalability? Per some PGCon discussions, > > > > there's still room for improvement in how PostgreSQL can best utilize > > > > resources available very large "commodity" machines (a 448-core / 24TB RAM > > > > instance comes to mind). > > > > > > I think we're starting to hit quite a few limits related to the process model, > > > particularly on bigger machines. The overhead of cross-process context > > > switches is inherently higher than switching between threads in the same > > > process - and my suspicion is that that overhead will continue to > > > increase. Once you have a significant number of connections we end up spending > > > a *lot* of time in TLB misses, and that's inherent to the process model, > > > because you can't share the TLB across processes. > > > > > > This part was touched in the "AMA with a Linux Kernale Hacker" > > Unconference session where he mentioned that the had proposed a > > 'mshare' syscall for this. > > > > So maybe a more fruitful way to fixing the perceived issues with > > process model is to push for small changes in Linux to overcome these > > avoiding a wholesale rewrite ? > > We support not just Linux, but also Windows and several (?) BSDs. I'm > not against pushing Linux to make things easier for us, but Linux is > an open source project, too, where someone need to put in time to get > the shiny things that you want. 
> And I'd rather see our time spent in PostgreSQL, as Linux is only used
> by a part of our user base.

Do we have any statistics for the distribution of our user base? My gut feeling says that for performance-critical use, the non-Linux share is in the low single digits at best.

My fascination with Open Source started with the realisation that instead of workarounds you can actually fix the problem at the source. So if the specific problem is that the TLB is not shared, then the proper fix is making it shared instead of rewriting everything else to get around it.

None of us is limited to writing code in PostgreSQL only. If the easiest and most generic fix can be done in Linux then so be it. It is also possible that Windows and *BSD already have a similar feature.

> > > The amount of duplicated code we have to deal with due to the process model
> > > is quite substantial. We have local memory, statically allocated shared memory
> > > and dynamically allocated shared memory variants for some things. And that's
> > > just going to continue.
> >
> > Maybe we can already remove the distinction between static and dynamic
> > shared memory ?
>
> That sounds like a bad idea, dynamic shared memory is more expensive
> to maintain than our static shared memory systems, not least
> because DSM is not guaranteed to share the same addresses in each
> process' address space.

Then this too needs to be fixed.

> > Though I already heard some complaints at the conference discussions
> > that having the dynamic version available has made some developers
> > sloppy in using it resulting in wastefulness.
>
> Do you know any examples of this wastefulness?

No.
Just somebody mentioned it in a hallway conversation and the rest of the developers present mumbled approvingly :) > > > > I'm purposely giving a nonanswer on whether it's a worthwhile goal, but > > > > rather I'd be curious where it could stack up against some other efforts to > > > > continue to help PostgreSQL improve performance and handle very large > > > > workloads. > > > > > > There's plenty of things we can do before, but in the end I think tackling the > > > issues you mention and moving to threads are quite tightly linked. > > > > Still we should be focusing our attention at solving the issues and > > not at "moving to threads" and hoping this will fix the issues by > > itself. > > I suspect that it is much easier to solve some of the issues when > working in a shared address space. Probably. But it would come at the cost of needing to change a lot of other parts of PostgreSQL. I am not against making code cleaner for potential threaded model support. I am just a bit sceptical about the actual switch being easy, or doable in the next 10-15 years. > E.g. resizing shared_buffers is difficult right now due to the use of > a static allocation of shared memory, but if we had access to a single > shared address space, it'd be easier to do any cleanup necessary for > dynamically increasing/decreasing its size. This again could be done with shared memory mapping + dynamic shared memory. > Same with parallel workers - if we have a shared address space, the > workers can pass any sized objects around without being required to > move the tuples through DSM and waiting for the leader process to > empty that buffer when it gets full. Larger shared memory :) Same for shared plan cache and shared schema cache. > Sure, most of that is probably possible with DSM as well, it's just > that I see a lot more issues that you need to take care of when you > don't have a shared address space (such as the pointer translation we > do in dsa_get_address). 
All of the above seem to point to the need for a single thing: having an option for shared memory mappings. So let's focus on fixing things with the minimal required change.

And this would not have an adverse effect on systems that cannot share mappings; they just won't become faster. And they are all welcome to add the option for shared mappings too if they see enough value in it.

It could sound like the same thing as the threaded model, but it should need far fewer changes, and likely no changes for most out-of-tree extensions.

---
Cheers
Hannu
On Thu, Jun 8, 2023 at 6:04 AM Hannu Krosing <hannuk@google.com> wrote: > Here I was hoping to go in the opposite direction and support parallel > query across replicas. > > This looks much more doable based on the process model than the single > process / multiple threads model. I don't think this is any more or less difficult to support in one model vs. the other. The problems seem pretty much unrelated. -- Robert Haas EDB: http://www.enterprisedb.com
On 07.06.2023 3:53 PM, Robert Haas wrote: > I think I remember a previous conversation with Andres > where he opined that thread-local variables are "really expensive" > (and I apologize in advance if I'm mis-remembering this). Now, Andres > is not a man who accepts a tax on performance of any size without a > fight, so his "really expensive" might turn out to resemble my "pretty > cheap." However, if widespread use of TLS is too expensive and we have > to start rewriting code to not depend on global variables, that's > going to be more of a problem. If we can get by with doing such > rewrites only in performance-critical places, it might not still be > too bad. Personally, I think the degree of dependence that PostgreSQL > has on global variables is pretty excessive and I don't think that a > certain amount of refactoring to reduce it would be a bad thing. If it > turns into an infinite series of hastily-written patches to rejigger > every source file we have, though, then I'm not really on board with > that. Actually TLS not not more expensive then accessing struct fields (at least at x86 platform), consider the following program: typedef struct { int a; int b; int c; } ABC; __thread int a; __thread int b; __thread int c; void use_struct(ABC* abc) { abc->a += 1; abc->b += 1; abc->c += 1; } void use_tls(ABC* abc) { a += 1; b += 1; c += 1; } Now look at the generated assembler: use_struct: addl $1, (%rdi) addl $1, 4(%rdi) addl $1, 8(%rdi) ret use_tls: addl $1, %fs:a@tpoff addl $1, %fs:b@tpoff addl $1, %fs:c@tpoff ret > Heikki mentions the idea of having a central Session object and just > passing that around. I have a hard time believing that's going to work > out nicely. First, it's not extensible. Right now, if you need a bit > of additional session-local state, you just declare a variable and > you're all set. 
That's not a perfect system and does cause some > problems, but we can't go from there to a system where it's impossible > to add session-local state without hacking core. Second, we will be > sad if session.h ends up #including every other header file that > defines a data structure anywhere in the backend. Or at least I'll be > sad. I'm not actually against the idea of having some kind of session > object that we pass around, but I think it either needs to be limited > to a relatively small set of well-defined things, or else it needs to > be designed in some kind of extensible way that doesn't require it to > know the full details of every sort of object that's being used as > session-local state anywhere in the system. I haven't really seen any > convincing design ideas around this yet. There are about 2k static/global variables in Postgres. It is almost impossible to maintain such a struct. But a session context may still be needed for other purposes - if we want to support a built-in connection pool. If we are using threads, then all variables need to be either thread-local, or access to them has to be synchronized. But if we want to save session context, then there is no need to save/restore all those 2k variables. We only need to capture the variables whose lifetime exceeds transaction boundaries. There are not so many such variables - tens, not hundreds. The question is how best to handle this "session context". There are two alternatives: 1. Save/restore this context from/to normal TLS variables. 2. Replace such variables with access through the session context struct. I prefer 1) because it requires fewer changes in the code. And the performance overhead of session context save/restore is negligible when the number of such variables is ~10.
On Wed, Jun 7, 2023 at 5:30 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-06-05 17:51:57 +0300, Heikki Linnakangas wrote: > > If there are no major objections, I'm going to update the developer FAQ, > > removing the excuses there for why we don't use threads [1]. > > I think we should do this even if there's no consensus to slowly change to > threads. There's clearly no consensus on the opposite either. This is a very fair point. > One interesting bit around the transition is what tooling we ought to provide > to detect problems. It could e.g. be reasonably feasible to write something > checking how many read-write global variables an extension has on linux > systems. Yes, this would be great. > I don't think the control file is the right place - that seems more like > something that should be signalled via PG_MODULE_MAGIC. We need to check this > not just during CREATE EXTENSION, but also during loading of libraries - think > of shared_preload_libraries. +1. > Yea, we definitely need the supervisor function in a separate > process. Presumably that means we need to split off some of the postmaster > responsibilities - e.g. I don't think it'd make sense to handle connection > establishment in the supervisor process. I wonder if this is something that > could end up being beneficial even in the process world. Yeah, I've had similar thoughts. I'm not exactly sure what the advantages of such a refactoring might be, but the current structure feels pretty limiting. It works OK because we don't do anything in the postmaster other than fork a new backend, but I'm not sure if that's the best strategy. It means, for example, that if there's a ton of new connection requests, we're spawning a ton of new processes, which means that you can put a lot of load on a PostgreSQL instance even if you can't authenticate. Maybe we'd be better off with a pool of processes accepting connections; if authentication fails, that connection goes back into the pool and tries again.
If authentication succeeds, either that process transitions to being a regular backend, leaving the authentication pool, or perhaps hands off the connection to a "real backend" at that point and loops around to accept() the next request. Whether that's a good idea in detail or not, the point remains that having the postmaster handle this task is quite limiting. It forces us to hand off the connection to a new process at the earliest possible stage, so that the postmaster remains free to handle other duties. Giving the responsibility to another process would let us make decisions about where to perform the hand-off based on real architectural thought rather than being forced to do it a certain way because nothing else will work. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jun 7, 2023 at 5:37 PM Andres Freund <andres@anarazel.de> wrote: > I think we're starting to hit quite a few limits related to the process model, > particularly on bigger machines. The overhead of cross-process context > switches is inherently higher than switching between threads in the same > process - and my suspicion is that that overhead will continue to > increase. Once you have a significant number of connections we end up spending > a *lot* of time in TLB misses, and that's inherent to the process model, > because you can't share the TLB across processes. This is a very good point. Our default posture on this mailing list is to try to maximize use of OS facilities rather than reimplementing things - well and good. But if a user writes a query with FOO JOIN BAR ON FOO.X = BAR.X OR FOO.Y = BAR.Y and then complains that the resulting query plan sucks, we don't slink off in embarrassment: we tell the user that there's not really any fast plan for that query and that if they write queries like that they have to live with the consequences. But the same thing applies here. To the extent that context switching between more processes is more expensive than context switching between threads for hardware-related reasons, that's not something that the OS can fix for us. If we choose to do the expensive thing then we pay the overhead. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jun 7, 2023 at 5:39 PM Peter Eisentraut <peter@eisentraut.org> wrote: > On 07.06.23 23:30, Andres Freund wrote: > > Yea, we definitely need the supervisor function in a separate > > process. Presumably that means we need to split off some of the postmaster > > responsibilities - e.g. I don't think it'd make sense to handle connection > > establishment in the supervisor process. I wonder if this is something that > > could end up being beneficial even in the process world. > > Something to think about perhaps ... how would that be different from > using an existing external supervisor process like systemd or supervisord. systemd wouldn't start individual PostgreSQL processes, right? If we want a checkpointer and a wal writer and a background writer and whatever we have to have our own supervisor process to spawn all those and keep them running. We could remove the logic to do a full system reset without a postmaster exit in favor of letting systemd restart everything from scratch, if we wanted to do that. But we'd still need our own supervisor to start up all of the individual threads/processes that we need. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jun 7, 2023 at 5:45 PM Andres Freund <andres@anarazel.de> wrote: > People have argued that the process model is more robust. But it turns out > that we have to crash-restart for just about any "bad failure" anyway. It used > to be (a long time ago) that we didn't, but that was just broken. How hard have you thought about memory leaks as a failure mode? Or file descriptor leaks? Right now, a process needs to release all of its shared resources before exiting, or trigger a crash-and-restart cycle. But it doesn't need to release any process-local resources, because the OS will take care of that. But that wouldn't be true any more, and that seems like it might require fixing quite a few things. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 7 Jun 2023 at 18:09, Andres Freund <andres@anarazel.de> wrote: > Having the same memory mapping between threads allows the > hardware to share the TLB (on x86 via process context identifiers), which > isn't realistically possible with different processes. As a matter of historical interest Solaris actually did implement this across different processes. It was called by the somewhat unfortunate name "Intimate Shared Memory". I don't think Linux ever implemented anything like it but I'm not sure. I think this was not so much about cache hit rate but about just sheer wasted memory in page mappings. So I guess hugepages more or less target the same issues. But I find it interesting that they were already running into issues like this 20 years ago -- presumably those issues have only grown. -- greg
On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing <hannuk@google.com> wrote: > > That sounds like a bad idea, dynamic shared memory is more expensive > > to maintain than our static shared memory systems, not in the least > > because DSM is not guaranteed to share the same addresses in each > > process' address space. > > Then this too needs to be fixed Honestly, I'm struggling to respond to this non-sarcastically. I mean, I was the one who implemented DSM. Do you think it works the way that it works because I considered doing something smart and decided to do something dumb instead? Suppose you have two PostgreSQL backends A and B. If we're not running on Windows, each of these was forked from the postmaster, so things like the text and data segments and the main shared memory segment are going to be mapped at the same address in both processes, because they inherit those mappings from the postmaster. However, additional things can get mapped into the address space of either process later. This can happen in a variety of ways. For instance, a shared library can get loaded into one process and not the other. Or it can get loaded into both processes but at different addresses - keep in mind that it's the OS, not PostgreSQL, that decides what address to use when loading a shared library. Or, if one process allocates a bunch of memory, then new address space will have to be mapped into that process to handle those memory allocations and, again, it is the OS that decides where to put those mappings. So over time the memory mappings of these two processes can diverge arbitrarily. That means that if the same DSM has to be mapped into both processes, there is no guarantee that it can be placed at the same address in both processes. The address that gets used in one process might not be available in the other process. It's worth pointing out here that there are no portable primitives available for a process to examine what memory segments are mapped into its address space. 
I think it's probably possible on every OS, but it works differently on different ones. Linux exposes such details through /proc, for example, but macOS doesn't have /proc. So if we're using standard, portable primitives, we can't even TRY to put the DSM at the same address in every process that maps it. But even if we used non-portable primitives to examine what's mapped into the address space of every process, it wouldn't solve the problem. Suppose 4 processes want to share a DSM, so they all run around and use non-portable OS-specific interfaces to figure out where there's a free chunk of address space large enough to accommodate that DSM and they all map it there. Hooray! But then say a fifth process comes along and it ALSO wants to map that DSM, but in that fifth process the address space that was available in the other four processes has already been used by something else. Well, now we're screwed. The fact that DSM is expensive and awkward to use isn't a defect in the implementation of DSM. It's a consequence of the fact that the address space mappings in one PostgreSQL backend can be almost arbitrarily different from the address space mappings in another PostgreSQL backend. If only there were some kind of OS feature available that would allow us to set things up so that all of the PostgreSQL backends shared the same address space mappings! Oh, right, there is: THREADS. The fact that we don't use threads is the reason why DSM sucks and has to suck. In fact it's the reason why DSM has to exist at all. Saying "fix DSM instead of using threads" is roughly in the same category as saying "if the peasants are revolting because they have no bread, then let them eat cake." Both statements evince a complete failure to understand the actual source of the problem. With apologies for my grumpiness, -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jun 8, 2023 at 4:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing <hannuk@google.com> wrote: > > > That sounds like a bad idea, dynamic shared memory is more expensive > > > to maintain than our static shared memory systems, not in the least > > > because DSM is not guaranteed to share the same addresses in each > > > process' address space. > > > > Then this too needs to be fixed > > Honestly, I'm struggling to respond to this non-sarcastically. I mean, > I was the one who implemented DSM. Do you think it works the way that > it works because I considered doing something smart and decided to do > something dumb instead? No, I meant that this needs to be fixed at the OS level, by being able to use the same mapping. We should not shy away from asking the OS people to add the useful features still missing. It was mentioned in the Unconference Kernel Hacker AMA talk, and said kernel hacker works for Oracle, and they also seemed to need this :)
On Thu, 8 Jun 2023 at 14:44, Hannu Krosing <hannuk@google.com> wrote: > > On Thu, Jun 8, 2023 at 2:15 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > > > On Thu, 8 Jun 2023 at 11:54, Hannu Krosing <hannuk@google.com> wrote: > > > > > > This part was touched in the "AMA with a Linux Kernel Hacker" > > > Unconference session where he mentioned that he had proposed a > > > 'mshare' syscall for this. > > > > > > So maybe a more fruitful way to fixing the perceived issues with > > > process model is to push for small changes in Linux to overcome these > > > avoiding a wholesale rewrite? > > > > We support not just Linux, but also Windows and several (?) BSDs. I'm > > not against pushing Linux to make things easier for us, but Linux is > > an open source project, too, where someone needs to put in time to get > > the shiny things that you want. And I'd rather see our time spent in > > PostgreSQL, as Linux is only used by a part of our user base. > > Do we have any statistics for the distribution of our user base? > > My gut feeling says that for performance-critical use the non-Linux is > in low single digits at best. > > My fascination for OpenSource started with the realisation that instead of > workarounds you can actually fix the problem at the source. So if the > specific problem is that the TLB is not shared then the proper fix is > making it shared instead of rewriting everything else to get around > it. None of us is limited to writing code in PostgreSQL only. If the > easiest and most generic fix can be done in Linux then so be it. TLB is a CPU hardware facility, not something that the OS can decide to share between processes. While sharing (some) OS memory management facilities across threads might be possible (as you mention, that mshare syscall would be an example), that doesn't solve the issue of the hardware not supporting sharing TLB entries across processes.
We'd use less kernel memory for memory management, but the CPU would still stall on TLB misses every time we switch processes on the CPU (unless we somehow were able to use non-process-namespaced TLB entries, which would make our processes not meaningfully different from threads w.r.t. address space). > > > > > > Maybe we can already remove the distinction between static and dynamic > > > shared memory ? > > > > That sounds like a bad idea, dynamic shared memory is more expensive > > to maintain than our static shared memory systems, not in the least > > because DSM is not guaranteed to share the same addresses in each > > process' address space. > > Then this too needs to be fixed That needs kernel facilities in all (most?) supported OSes, and I think that's much more work than moving to threads: Allocations from the kernel are arbitrarily random across the available address space, so a DSM segment that is allocated in one backend might overlap with unshared allocations of a different backend, making those backends have conflicting memory address spaces. The only way to make that work is to have a shared memory addressing space, but some backends just not having the allocation mapped into their local address space; which seems only slightly more isolated than threads and much more effort to maintain. > > > Though I already heard some complaints at the conference discussions > > > that having the dynamic version available has made some developers > > > sloppy in using it resulting in wastefulness. > > > > Do you know any examples of this wastefulness? > > No. Just somebody mentioned it in a hallway conversation and the rest > of the developers present mumbled approvingly :) The only "wastefulness" that I know of in our use of DSM is the queue, and that's by design: We need to move data from a backend's private memory to memory that's accessible to other backends; i.e. shared memory. You can't do that without copying or exposing your private memory. 
> > > Still we should be focusing our attention at solving the issues and > > > not at "moving to threads" and hoping this will fix the issues by > > > itself. > > > > I suspect that it is much easier to solve some of the issues when > > working in a shared address space. > > Probably. But it would come at the cost of needing to change a lot of > other parts of PostgreSQL. > > I am not against making code cleaner for potential threaded model > support. I am just a bit sceptical about the actual switch being easy, > or doable in the next 10-15 years. PostgreSQL only has a support cycle of 5 years. 5 years after the last release of un-threaded PostgreSQL we could drop support for "legacy" extension models that don't support threading. > > E.g. resizing shared_buffers is difficult right now due to the use of > > a static allocation of shared memory, but if we had access to a single > > shared address space, it'd be easier to do any cleanup necessary for > > dynamically increasing/decreasing its size. > > This again could be done with shared memory mapping + dynamic shared memory. Yes, but as I said, that's much more difficult than lock and/or atomic operations on shared-between-backends static variables, because if these variables aren't in shared memory you need to pass the messages to update the variables to all backends. > > Same with parallel workers - if we have a shared address space, the > > workers can pass any sized objects around without being required to > > move the tuples through DSM and waiting for the leader process to > > empty that buffer when it gets full. > > Larger shared memory :) > > Same for shared plan cache and shared schema cache. Shared memory in processes is not free, if only because the TLB gets saturated much faster. 
> > Sure, most of that is probably possible with DSM as well, it's just > > that I see a lot more issues that you need to take care of when you > > don't have a shared address space (such as the pointer translation we > > do in dsa_get_address). > > All of the above seem to point to the need of a single thing - having > an option for shared memory mappings . > > So let's focus on fixing things with minimal required change. That seems logical, but not all kernels support dynamic shared memory mappings. And, as for your suggested solution, I couldn't find much info on this mshare syscall (or its successor mmap/VM_SHARED_PT), nor on whether it would actually fix the TLB issue. > And this would not have an adverse effect on systems that can not > share mappings, they just won't become faster. And they are all welcome > to add the option for shared mappings too if they see enough value in > it. > > It could sound like the same thing as the threaded model, but should need > much fewer changes and likely no changes for most out-of-tree > extensions We can't expect the kernel to fix everything for us - that's what we build PostgreSQL for. Where possible, we do want to rely on OS primitives, but I'm not sure that it would be easy to share memory address mappings across backends, for reasons including the above ("That needs kernel facilities in all [...] more effort to maintain"). Kind regards, Matthias van de Meent Neon, Inc.
On 2023-06-08 10:33:26 -0400, Greg Stark wrote: > On Wed, 7 Jun 2023 at 18:09, Andres Freund <andres@anarazel.de> wrote: > > Having the same memory mapping between threads allows the > > hardware to share the TLB (on x86 via process context identifiers), which > > isn't realistically possible with different processes. > > As a matter of historical interest Solaris actually did implement this > across different processes. It was called by the somewhat unfortunate > name "Intimate Shared Memory". I don't think Linux ever implemented > anything like it but I'm not sure. I don't think it shared the TLB - it did share page tables though.
On Thu, 8 Jun 2023 at 17:02, Hannu Krosing <hannuk@google.com> wrote: > > On Thu, Jun 8, 2023 at 4:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing <hannuk@google.com> wrote: > > > > That sounds like a bad idea, dynamic shared memory is more expensive > > > > to maintain than our static shared memory systems, not in the least > > > > because DSM is not guaranteed to share the same addresses in each > > > > process' address space. > > > > > > Then this too needs to be fixed > > > > Honestly, I'm struggling to respond to this non-sarcastically. I mean, > > I was the one who implemented DSM. Do you think it works the way that > > it works because I considered doing something smart and decided to do > > something dumb instead? > > No, I meant that this needs to be fixed at OS level, by being able to > use the same mapping. > > We should not shy away from asking the OS people for adding the useful > features still missing. While I agree that "sharing page tables across processes" is useful, it looks like it'd be much more effort to correctly implement for e.g. DSM than implementing threading. Konstantin's diff is "only" 20.1k lines [0] added and/or modified, which is a lot, but it's manageable (13k+ of which are from files that were auto-generated and then committed, likely accidentally). > It was mentioned in the Unconference Kernel Hacker AMA talk and said > kernel hacker works for Oracle, andf they also seemed to be needing > this :) Though these new kernel features allowing for better performance (mostly in kernel memory usage, probably) would be nice to have, we wouldn't get performance benefits for older kernels, benefits which we would get if we were to implement threading. I'm not on board with a policy of us twiddling thumbs and waiting for the OS to fix our architectural performance issues. 
Sure, the kernel could optimize for our usage pattern, but I think that's not something we can (or should) rely on for performance ^1. Kind regards, Matthias van de Meent [0] https://github.com/postgrespro/postgresql.pthreads/compare/801386af...d5933309?w=1 ^1 OT: I think the same about us (ab)using the OS page cache, but that's a tale for a different time and thread.
On Thu, Jun 8, 2023 at 11:02 AM Hannu Krosing <hannuk@google.com> wrote: > No, I meant that this needs to be fixed at OS level, by being able to > use the same mapping. > > We should not shy away from asking the OS people for adding the useful > features still missing. > > It was mentioned in the Unconference Kernel Hacker AMA talk and said > kernel hacker works for Oracle, andf they also seemed to be needing > this :) Fair enough, but we aspire to work on a bunch of different operating systems. To make use of an OS facility, we need something that works on at least Linux, Windows, macOS, and a few different BSD flavors. It's not as if when the PostgreSQL project asks for a new operating system facility everyone springs into action to provide it immediately. And even if they did, and even if they all released an implementation of whatever we requested next year, it would still be at least five, more realistically ten, years before systems with those facilities were ubiquitous. And unless we have truly obscene amounts of clout in the OS community, it's likely that all of those different operating systems would implement different things to meet the stated need, and then we'd have to have a complex bunch of platform-dependent code in order to keep working on all of those systems. To me, this is a road to nowhere. I have no problem at all with us expressing our needs to the OS community, but realistically, any PostgreSQL feature that depends on an OS feature less than twenty years old is going to have to be optional, which means that if we want to do anything about sharing address space mappings in the next few years, it's going to need to be based on threads. -- Robert Haas EDB: http://www.enterprisedb.com
On 2023-06-08 14:01:16 +0200, Jose Luis Tallon wrote: > * For "heavyweight" queries, the scalability of "almost independent" > processes w.r.t. NUMA is just _impossible to achieve_ (locality of > reference!) with a pure threaded system. When CPU+mem-bound > (bandwidth-wise), threads add nothing IMO. I don't think this is true in any sort of way.
Hi, On 2023-06-08 12:15:58 +0200, Hannu Krosing wrote: > On Thu, Jun 8, 2023 at 11:54 AM Hannu Krosing <hannuk@google.com> wrote: > > > > On Wed, Jun 7, 2023 at 11:37 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > Hi, > > > > > > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote: > > > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I > > > > would ask if developer time could be better spent on tackling some of the > > > > other problems around vertical scalability? Per some PGCon discussions, > > > > there's still room for improvement in how PostgreSQL can best utilize > > > > resources available very large "commodity" machines (a 448-core / 24TB RAM > > > > instance comes to mind). > > > > > > I think we're starting to hit quite a few limits related to the process model, > > > particularly on bigger machines. The overhead of cross-process context > > > switches is inherently higher than switching between threads in the same > > > process - and my suspicion is that that overhead will continue to > > > increase. Once you have a significant number of connections we end up spending > > > a *lot* of time in TLB misses, and that's inherent to the process model, > > > because you can't share the TLB across processes. > > > > > > This part was touched in the "AMA with a Linux Kernel Hacker" > > Unconference session where he mentioned that he had proposed a > > 'mshare' syscall for this. As-is that'd just lead to sharing the page table, not the TLB. I don't think you can currently share the TLB for parts of your address space on x86 hardware. It's possible that something like that gets added to future hardware, but ... > Also, the *static* huge pages already let you solve this problem now > by sharing the page tables You don't share the page tables with huge pages on linux. - Andres
Hi, On 2023-06-08 16:47:48 +0300, Konstantin Knizhnik wrote: > Actually TLS is not more expensive than accessing struct fields (at least > on the x86 platform), consider the following program: It really depends on the OS as well as the architecture. And even on x86-64 Linux, the fact that you're using the segment offset in the address calculation means you can't use the more complicated addressing modes for other reasons. And plenty of instructions, e.g. most (?) SSE instructions, won't be able to use that kind of addressing directly. Even just compiling your example, you can see that with gcc -O2 you get considerably faster code with the non-TLS version. As a fairly extreme example, here's the mingw -O3 compiled code:

use_struct:
    movq    xmm1, QWORD PTR .LC0[rip]
    movq    xmm0, QWORD PTR [rcx]
    add     DWORD PTR 8[rcx], 1
    paddd   xmm0, xmm1
    movq    QWORD PTR [rcx], xmm0
    ret

use_tls:
    sub     rsp, 40
    lea     rcx, __emutls_v.a[rip]
    call    __emutls_get_address
    lea     rcx, __emutls_v.b[rip]
    add     DWORD PTR [rax], 1
    call    __emutls_get_address
    lea     rcx, __emutls_v.c[rip]
    add     DWORD PTR [rax], 1
    call    __emutls_get_address
    add     DWORD PTR [rax], 1
    add     rsp, 40
    ret

Greetings, Andres Freund
On Wed, Jun 07, 2023 at 10:26:07AM +1200, Thomas Munro wrote: > On Tue, Jun 6, 2023 at 6:52 AM Andrew Dunstan <andrew@dunslane.net> wrote: > > If we were starting out today we would probably choose a threaded implementation. But moving to threaded now seems to me like a multi-year-multi-person project with the prospect of years to come chasing bugs and the prospect of fairly modest advantages. The risk to reward doesn't look great. > > > > That's my initial reaction. I could be convinced otherwise. > > Here is one thing I often think about when contemplating threads. > Take a look at dsa.c. It calls itself a shared memory allocator, but > really it has two jobs, the second being to provide software emulation > of virtual memory. That's behind dshash.c and now the stats system, > and various parts of the parallel executor code. It's slow and > complicated, and far from the state of the art. I wrote that code > (building on allocator code from Robert) with the expectation that it > was a transitional solution to unblock a bunch of projects. I always > expected that we'd eventually be deleting it. When I explain that > subsystem to people who are not steeped in the lore of PostgreSQL, it > sounds completely absurd. I mean, ... it is, right? My point is Wouldn't all the memory operations require nearly the same shared memory allocators if someone switched to a threaded implementation? > that we're doing pretty unreasonable and inefficient contortions to > develop new features -- we're not just happily chugging along without > threads at no cost. >
I discovered this thread from a Twitter post "PostgreSQL will finally be rewritten in Rust" :) On Mon, Jun 5, 2023 at 5:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > I spoke with some folks at PGCon about making PostgreSQL multi-threaded, > > so that the whole server runs in a single process, with multiple > > threads. It has been discussed many times in the past, last thread on > > pgsql-hackers was back in 2017 when Konstantin made some experiments [0]. > > > I feel that there is now pretty strong consensus that it would be a good > > thing, more so than before. Lots of work to get there, and lots of > > details to be hashed out, but no objections to the idea at a high level. > > > The purpose of this email is to make that silent consensus explicit. If > > you have objections to switching from the current multi-process > > architecture to a single-process, multi-threaded architecture, please > > speak up. > > For the record, I think this will be a disaster. There is far too much > code that will get broken, largely silently, and much of it is not > under our control. > > regards, tom lane > >
Hi, On 2023-06-08 17:02:08 +0200, Hannu Krosing wrote: > On Thu, Jun 8, 2023 at 4:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing <hannuk@google.com> wrote: > > > > That sounds like a bad idea, dynamic shared memory is more expensive > > > > to maintain than our static shared memory systems, not in the least > > > > because DSM is not guaranteed to share the same addresses in each > > > > process' address space. > > > > > > Then this too needs to be fixed > > > > Honestly, I'm struggling to respond to this non-sarcastically. I mean, > > I was the one who implemented DSM. Do you think it works the way that > > it works because I considered doing something smart and decided to do > > something dumb instead? > > No, I meant that this needs to be fixed at OS level, by being able to > use the same mapping. > > We should not shy away from asking the OS people for adding the useful > features still missing. There's a large part of this that is about hardware, not software. And honestly, for most of the problems the answer is to just use threads. Adding complexity to operating systems to make odd architectures like postgres' better is a pretty dubious proposition. I don't think we have even remotely enough influence on CPU design to make e.g. *partial* TLB sharing across processes a thing. > It was mentioned in the Unconference Kernel Hacker AMA talk and said > kernel hacker works for Oracle, andf they also seemed to be needing > this :) The proposals around that don't really help us all that much. Sharing the page table will be a bit more efficient, but it won't really change anything dramatically. From what I understand they are primarily interested in changing properties of a memory mapping across multiple processes, e.g. making some memory executable and have that reflected in all processes. I don't think this will help us much. Greetings, Andres Freund
Hi, On 2023-06-08 17:55:57 +0200, Matthias van de Meent wrote: > While I agree that "sharing page tables across processes" is useful, > it looks like it'd be much more effort to correctly implement for e.g. > DSM than implementing threading. > Konstantin's diff is "only" 20.1k lines [0] added and/or modified, > which is a lot, but it's manageable (13k+ of which are from files that > were auto-generated and then committed, likely accidentally). Honestly, I don't think this patch is in a good enough state to allow a realistic estimation of the overall work. Making global variables TLS is the *easy* part. Redesigning the postmaster, defining how to deal with extension libraries, extension compatibility, developing tools to make developing a threaded postgres feasible, dealing with freeing session-lifetime memory allocations that previously were freed via process exit, making the change realistically reviewable, and portability are all much harder. Greetings, Andres Freund
Hi, On 2023-06-08 11:56:13 -0400, Robert Haas wrote: > On Thu, Jun 8, 2023 at 11:02 AM Hannu Krosing <hannuk@google.com> wrote: > > No, I meant that this needs to be fixed at OS level, by being able to > > use the same mapping. > > > > We should not shy away from asking the OS people for adding the useful > > features still missing. > > > > It was mentioned in the Unconference Kernel Hacker AMA talk and said > > kernel hacker works for Oracle, andf they also seemed to be needing > > this :) > > Fair enough, but we aspire to work on a bunch of different operating > systems. To make use of an OS facility, we need something that works > on at least Linux, Windows, macOS, and a few different BSD flavors. > It's not as if when the PostgreSQL project asks for a new operating > system facility everyone springs into action to provide it > immediately. And even if they did, and even if they all released an > implementation of whatever we requested next year, it would still be > at least five, more realistically ten, years before systems with those > facilities were ubiquitous. I'm less concerned about this aspect - most won't have upgraded to a version of postgres that benefit from threaded postgres in a similar timeframe. And if the benefits are large enough, people will move. But: > And unless we have truly obscene amounts of clout in the OS community, it's > likely that all of those different operating systems would implement > different things to meet the stated need, and then we'd have to have a > complex bunch of platform-dependent code in order to keep working on all of > those systems. And even more likely, they just won't do anything, because it's a model that large parts of the industry have decided isn't going anywhere. It'd be one thing if we had 5 kernel devs that we could deploy to work on this, but we don't. So we have to convince kernel devs employed by others that somehow this is an urgent enough thing that they should work on it. 
The likely, imo justified, answer is just going to be: Fix your architecture, then we can talk. Greetings, Andres Freund
On Fri, Jun 9, 2023 at 5:02 AM Ilya Anfimov <ilan@tzirechnoy.com> wrote: > Isn't all the memory operations would require nearly the same > shared memory allocators if someone switches to a threaded imple- > mentation? It's true that we'd need concurrency-aware MemoryContext implementations (details can be debated), but we wouldn't need that address translation layer, which adds a measurable cost at every access.
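To make that translation cost concrete, here is a minimal sketch of the difference being described. The names (dsa_offset, translate) are invented for illustration and are not PostgreSQL's actual dsa_* API: with per-process mappings, shared structures must store offsets relative to the local segment base, because the segment can map at a different address in each process, and every dereference pays for the addition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration, not PostgreSQL's real DSM/DSA code:
 * with per-process mappings, shared structures store offsets and
 * translate them against the local segment base on every access. */
typedef uintptr_t dsa_offset;

static inline void *
translate(char *segment_base, dsa_offset off)
{
    return segment_base + off;      /* extra arithmetic on every dereference */
}

/* With threads there is a single address space, so a shared structure
 * can hold a plain pointer that is valid in every thread, untranslated. */
```

Under threads, the translate() step (and the bookkeeping that makes it safe) simply disappears.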
On 8/6/23 15:56, Robert Haas wrote: > Yeah, I've had similar thoughts. I'm not exactly sure what the > advantages of such a refactoring might be, but the current structure > feels pretty limiting. It works OK because we don't do anything in the > postmaster other than fork a new backend, but I'm not sure if that's > the best strategy. It means, for example, that if there's a ton of new > connection requests, we're spawning a ton of new processes, which > means that you can put a lot of load on a PostgreSQL instance even if > you can't authenticate. Maybe we'd be better off with a pool of > processes accepting connections; if authentication fails, that > connection goes back into the pool and tries again. This. It's limited by connection I/O, hence a perfect use for threads (minimize per-connection overhead). IMV, "session state" would be best stored/managed here. Would need a way to convey it efficiently, though. > If authentication > succeeds, either that process transitions to being a regular backend, > leaving the authentication pool, or perhaps hands off the connection > to a "real backend" at that point and loops around to accept() the > next request. Nicely done by passing the FD around.... But at this point, we'd just get a nice reimplementation of a threaded connection pool inside Postgres :\ > Whether that's a good ideal in detail or not, the point remains that > having the postmaster handle this task is quite limiting. It forces us > to hand off the connection to a new process at the earliest possible > stage, so that the postmaster remains free to handle other duties. > Giving the responsibility to another process would let us make > decisions about where to perform the hand-off based on real > architectural thought rather than being forced to do a certain way > because nothing else will work. At least "tcop" surely feels like belonging in a separate process .... J.L.
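The "passing the FD around" hand-off mentioned above can be sketched with the standard SCM_RIGHTS mechanism. This is a minimal POSIX example with error handling trimmed, not how a hypothetical Postgres authentication pool would actually be structured:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sketch of handing a connected socket from an authentication pool to a
 * "real backend": POSIX allows sending an open file descriptor to
 * another process over a Unix-domain socket via SCM_RIGHTS. */
static int
send_fd(int sock, int fd)
{
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    char byte = 'F';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof(u.buf) };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));   /* fd travels as ancillary data */
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

static int
recv_fd(int sock)
{
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    char byte;
    int fd = -1;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof(u.buf) };

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));  /* kernel dup'd the fd for us */
    return fd;
}
```

The receiving process gets its own descriptor referring to the same open connection, which is exactly what a pool-to-backend hand-off needs; the equivalent on Windows would go through WSADuplicateSocket instead.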
On Fri, Jun 9, 2023 at 4:00 AM Andres Freund <andres@anarazel.de> wrote: > On 2023-06-08 12:15:58 +0200, Hannu Krosing wrote: > > > This part was touched in the "AMA with a Linux Kernale Hacker" > > > Unconference session where he mentioned that the had proposed a > > > 'mshare' syscall for this. > > As-is that'd just lead to sharing page table, not the TLB. I don't think you > currently do sharing of the TLB for parts of your address space on x86 > hardware. It's possible that something like that gets added to future > hardware, but ... I wasn't in Mathew Wilcox's unconference in Ottawa but I found an older article on LWN: https://lwn.net/Articles/895217/ For what it's worth, FreeBSD hackers have studied this topic too (and it's been done in Android and no doubt other systems before): https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf I've shared that paper on this list before in the context of super/huge pages and their benefits (to executable code, and to the buffer pool), but a second topic in that paper is the idea of a shared page table: "We find that sharing PTPs across different processes can reduce execution cycles by as much as 6.9%. Moreover, the combined effects of using superpages to map the main executable and sharing PTPs for the small shared libraries can reduce execution cycles up to 18.2%." And that's just part of it, because those guys are more interested in shared code/libraries and such so that's probably not even getting to the stuff like buffer pool and DSMs that we might tend to think of first. I'm pretty sure PostgreSQL (along with another fork-based RDBMSs mentioned in this thread) must be one of the worst offenders for page table bloat, simply because we can have a lot of processes and touch a lot of memory. 
I'm no expert in this stuff, but it seems to me that with shared page table schemes you can avoid wasting huge amounts of RAM on duplicated page table entries (pages * processes), and with huge/super pages you can reduce the number of pages, but AFAIK you still can't escape the TLB shootdown cost, which is all-or-nothing (PCID level at best). The only way to avoid TLB shootdowns on context switches is to have *exactly the same memory map*. Or, as Robert succinctly shouted, "THREADS".
> On Mon, Jun 05, 2023 at 06:43:54PM +0300, Heikki Linnakangas wrote: > On 05/06/2023 11:28, Tristan Partin wrote: > > > # Exposed PIDs > > > > > > We expose backend process PIDs to users in a few places. > > > pg_stat_activity.pid and pg_terminate_backend(), for example. They need > > > to be replaced, or we can assign a fake PID to each connection when > > > running in multi-threaded mode. > > > > Would it be possible to just transparently slot in the thread ID > > instead? > > Perhaps. It might break applications that use the PID directly with e.g. > 'kill <PID>', though. I think things get more interesting if some external resource accounting like cgroups is taking place. From what I know, cgroup v2 has only a few controllers that allow threaded granularity, and the memory and io controllers are not on that list. Since Postgres does quite a lot of different things, sometimes it makes sense to put different limitations on different types of activity, e.g. to give more priority to a certain critical internal job at the cost of slowing down backends. In the end it might be complicated or impossible to do that for individual threads. Such cases are probably not very important from a high-level point of view, but could become an argument when deciding what should be a process and what should be a thread.
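The fake-PID idea quoted above can be sketched very simply: an atomic counter hands each new connection a stable virtual identifier, decoupled from any OS pid or thread ID. All names here are invented for illustration, not proposed code:

```c
#include <stdatomic.h>

/* Hypothetical sketch: in multi-threaded mode, every thread shares one
 * real OS pid, so pg_stat_activity.pid / pg_terminate_backend() could
 * instead report a per-connection "virtual PID" drawn from an atomic
 * counter at connection startup. */
static atomic_int next_virtual_pid = 1;

static int
assign_virtual_pid(void)
{
    /* atomic_fetch_add returns the previous value, so IDs start at 1 */
    return atomic_fetch_add(&next_virtual_pid, 1);
}
```

A lookup table from virtual PID to thread/session would then replace today's kill()-based signalling targets.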
Hi, On 2023-06-09 07:34:49 +1200, Thomas Munro wrote: > I wasn't in Mathew Wilcox's unconference in Ottawa but I found an > older article on LWN: > > https://lwn.net/Articles/895217/ > > For what it's worth, FreeBSD hackers have studied this topic too (and > it's been done in Android and no doubt other systems before): > > https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf > > I've shared that paper on this list before in the context of > super/huge pages and their benefits (to executable code, and to the > buffer pool), but a second topic in that paper is the idea of a shared > page table: "We find that sharing PTPs across different processes can > reduce execution cycles by as much as 6.9%. Moreover, the combined > effects of using superpages to map the main executable and sharing > PTPs for the small shared libraries can reduce execution cycles up to > 18.2%." And that's just part of it, because those guys are more > interested in shared code/libraries and such so that's probably not > even getting to the stuff like buffer pool and DSMs that we might tend > to think of first. I've experimented with using huge pages for executable code on linux, and the benefits are quite noticable: https://www.postgresql.org/message-id/20221104212126.qfh3yzi7luvyy5d6%40awork3.anarazel.de I'm a bit dubious that sharing the page table for executable code increase the benefit that much further in real workloads. I suspect the reason it was different for the authors of the paper is: > A fixed number of back-to-back > transactions are performed on a 5GB database, and we use the > -C option of pgbench to toggle between reconnecting after > each transaction (reconnect mode) and using one persistent > connection per client (persistent connection mode). We use > the reconnect mode by default unless stated otherwise. Using -C explains why you'd see a lot of benefit from sharing page tables for executable code. 
But I don't think -C is a particularly interesting workload to optimize for. > I'm no expert in this stuff, but it seems to be that with shared page > table schemes you can avoid wasting huge amounts of RAM on duplicated > page table entries (pages * processes), and with huge/super pages you > can reduce the number of pages, but AFAIK you still can't escape the > TLB shootdown cost, which is all-or-nothing (PCID level at best). Pretty much that. While you can avoid some TLB shootdowns via PCIDs, that only avoids flushing the TLB, it doesn't help with the TLB hit rate being much lower due to the number of "redundant" mappings with different PCIDs. > The only way to avoid TLB shootdowns on context switches is to have *exactly > the same memory map*. Or, as Robert succinctly shouted, "THREADS". +1 Greetings, Andres Freund
On Fri, 9 Jun 2023 at 17:20, Dave Cramer <davecramer@postgres.rocks> wrote: > > This is somewhat orthogonal to the topic of threading but relevant to the use of resources. > > If we are going to undertake some hard problems perhaps we should be looking at other problems that solve other long term issues before we commit to spending resources on changing the process model. -1. This and that are orthogonal and effort in one does not need to block the other. If someone is willing to put in the effort, let them. Last time I checked we, as a project, are not blocking bugfixes for new features in MAIN either (or vice versa). > One thing I can think of is upgrading. AFAIK dump and restore is the only way to change the on disk format. > Presuming that eventually we will be forced to change the on disk format it would be nice to be able to do so in a manner which does not force long down times I agree that we should improve our upgrade process (and we had a great discussion on the topic at the PGCon Unconference last week), but in my view that's not relevant to this discussion. Kind regards, Matthias van de Meent Neon, Inc.
Greetings, * Dave Cramer (davecramer@postgres.rocks) wrote: > One thing I can think of is upgrading. AFAIK dump and restore is the only > way to change the on disk format. > Presuming that eventually we will be forced to change the on disk format it > would be nice to be able to do so in a manner which does not force long > down times There is an ongoing effort moving in this direction. The $subject isn't great, but this patch set (which we are currently working on updating...): https://commitfest.postgresql.org/43/3986/ attempts changing a lot of currently compile-time block-size pieces to be run-time which would open up the possibility to have a different page format for, eg, different tablespaces. Possibly even different block sizes. We'd certainly welcome discussion from others who are interested. Thanks, Stephen
On Wed, Jun 7, 2023 at 06:38:38PM +0530, Ashutosh Bapat wrote: > With multiple processes, we can use all the available cores (at least > theoretically if all those processes are independent). But is that > guaranteed with single process multi-thread model? Google didn't throw > any definitive answer to that. Usually it depends upon the OS and > architecture. > > Maybe a good start is to start using threads instead of parallel > workers e.g. for parallel vacuum, parallel query and so on while > leaving the processes for connections and leaders. that itself might > take significant time. Based on that experience move to a completely > threaded model. Based on my experience with other similar products, I > think we will settle on a multi-process multi-thread model. I think we have a few known problem that we might be able to solve without threads, but can help us eventually move to threads if we find it useful: 1) Use threads for background workers rather than processes 2) Allow sessions to be stopped and started by saving their state Ideally we would solve the problem of making shared structures resizable, but I am not sure how that can be easily done without threads. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On Thu, Jun 8, 2023 at 11:37:00AM +1200, Thomas Munro wrote: > It's old, but this describes the 4 main models and which well known > RDBMSes use them in section 2.3: > > https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf > > TL;DR DB2 is the winner, it can do process-per-connection, > thread-per-connection, process-pool or thread-pool. > > I understand this thread to be about thread-per-connection (= backend, > session, socket) for now. I am quite confused that few people seem to care about which model, processes or threads, is better for Oracle, and how having both methods available can be a reasonable solution to maintain. Someone suggested they abstracted the differences so the maintenance burden was minor, but that seems very hard to me. Did these vendors start with processes, add threads, and then find that threads had downsides so they had to keep both? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On Mon, Jun 5, 2023 at 4:52 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > If there are no major objections, I'm going to update the developer FAQ, > removing the excuses there for why we don't use threads [1]. I think it is not wise to start the wholesale removal of the objections there. But I think it is worthwhile to revisit the section about threads and maybe split out the historic part which is no longer true, and provide both pros and cons for these. I started with this short summary from the discussion in this thread, feel free to expand, argue, fix :) * is current excuse -- is counterargument or ack ---------------- As an example, threads are not yet used instead of multiple processes for backends because: * Historically, threads were poorly supported and buggy. -- yes they were, not relevant now when threads are well-supported and non-buggy * An error in one backend can corrupt other backends if they're threads within a single process -- still valid for silent corruption -- for detected crash - yes, but we are restarting all backends in case of crash anyway. * Speed improvements using threads are small compared to the remaining backend startup time. -- we now have some measurements that show significant performance improvements not related to startup time * The backend code would be more complex.
-- this is still the case -- even more worrisome is that all extensions also need to be rewritten -- and many incompatibilities will be silent and take potentially years to find * Terminating backend processes allows the OS to cleanly and quickly free all resources, protecting against memory and file descriptor leaks and making backend shutdown cheaper and faster -- still true * Debugging threaded programs is much harder than debugging worker processes, and core dumps are much less useful -- this was countered by claiming that -- by now we have reasonable debugger support for threads -- there is no direct debugger support for debugging the exact system set up like PostgreSQL processes + shared memory * Sharing of read-only executable mappings and the use of shared_buffers means processes, like threads, are very memory efficient -- this seems to say that the current process model is as good as threads ? -- there were a few counterarguments -- per-backend virtual memory mapping can add up to significant amount of extra RAM usage -- the discussion did not yet touch various per-backend caches (pg_catalog cache, statement cache) which are arguably easier to implement in threaded model -- TLB reload at each process switch is expensive and would be mostly avoided in case of threads * Regular creation and destruction of processes helps protect against memory fragmentation, which can be hard to manage in long-running processes -- probably still true -------------------------------------
I don't have an objection, but I do wonder: can one (or perhaps a few) queries/workloads be provided where threading would be significantly beneficial? (some material there could help get people on-board with the idea and potentially guide many of the smaller questions that arise along the way) On Mon, 5 Jun 2023 at 15:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I spoke with some folks at PGCon about making PostgreSQL multi-threaded, > so that the whole server runs in a single process, with multiple > threads. It has been discussed many times in the past, last thread on > pgsql-hackers was back in 2017 when Konstantin made some experiments [0]. > > I feel that there is now pretty strong consensus that it would be a good > thing, more so than before. Lots of work to get there, and lots of > details to be hashed out, but no objections to the idea at a high level. > > The purpose of this email is to make that silent consensus explicit. If > you have objections to switching from the current multi-process > architecture to a single-process, multi-threaded architecture, please > speak up. > > If there are no major objections, I'm going to update the developer FAQ, > removing the excuses there for why we don't use threads [1]. And we can > start to talk about the path to get there. Below is a list of some > hurdles and proposed high-level solutions. This isn't an exhaustive > list, just some of the most obvious problems: > > # Transition period > > The transition surely cannot be done fully in one release. Even if we > could pull it off in core, extensions will need more time to adapt. > There will be a transition period of at least one release, probably > more, where you can choose multi-process or multi-thread model using a > GUC. Depending on how it goes, we can document it as experimental at first. > > # Thread per connection > > To get started, it's most straightforward to have one thread per > connection, just replacing backend process with a backend thread. 
In the > future, we might want to have a thread pool with some kind of a > scheduler to assign active queries to worker threads. Or multiple > threads per connection, or spawn additional helper threads for specific > tasks. But that's future work. > > # Global variables > > We have a lot of global and static variables: > > $ objdump -t bin/postgres | grep -e "\.data" -e "\.bss" | grep -v > "data.rel.ro" | wc -l > 1666 > > Some of them are pointers to shared memory structures and can stay as > they are. But many of them are per-connection state. The most > straightforward conversion for those is to turn them into thread-local > variables, like Konstantin did in [0]. > > It might be good to have some kind of a Session context struct that we > pass everywhere, or maybe have a single thread-local variable to hold > it. Many of the global variables would become fields in the Session. But > that's future work. > > # Extensions > > A lot of extensions also contain global variables or other things that > break in a multi-threaded environment. We need a way to label extensions > that support multi-threading. And in the future, also extensions that > *require* a multi-threaded server. > > Let's add flags to the control file to mark if the extension is > thread-safe and/or process-safe. If you try to load an extension that's > not compatible with the server's mode, throw an error. > > We might need new functions in addition _PG_init, called at connection > startup and shutdown. And background worker API probably needs some changes. > > # Exposed PIDs > > We expose backend process PIDs to users in a few places. > pg_stat_activity.pid and pg_terminate_backend(), for example. They need > to be replaced, or we can assign a fake PID to each connection when > running in multi-threaded mode. > > # Signals > > We use signals for communication between backends. SIGURG in latches, > and SIGUSR1 in procsignal, for example. 
Those primitives need to be > rewritten with some other signalling mechanism in multi-threaded mode. > In principle, it's possible to set per-thread signal handlers, and send > a signal to a particular thread (pthread_kill), but I think it's better > to just rewrite them. > > We also document that you can send SIGINT, SIGTERM or SIGHUP to an > individual backend process. I think we need to deprecate that, and maybe > come up with some convenient replacement. E.g. send a message with > backend ID to a unix domain socket, and a new pg_kill executable to send > those messages. > > # Restart on crash > > If a backend process crashes, postmaster terminates all other backends > and restarts the system. That's hard (impossible?) to do safely if > everything runs in one process. We can continue have a separate > postmaster process that just monitors the main process and restarts it > on crash. > > # Thread-safe libraries > > Need to switch to thread-safe versions of library functions, e.g. > uselocale() instead of setlocale(). > > The Python interpreter has a Global Interpreter Lock. It's not possible > to create two completely independent Python interpreters in the same > process, there will be some lock contention on the GIL. Fortunately, the > python community just accepted https://peps.python.org/pep-0684/. That's > exactly what we need: it makes it possible for separate interpreters to > have their own GILs. It's not clear to me if that's in Python 3.12 > already, or under development for some future version, but by the time > we make the switch in Postgres, there probably will be a solution in > cpython. > > At a quick glance, I think perl and TCL are fine, you can have multiple > interpreters in one process. Need to check any other libraries we use. 
> > > [0] > https://www.postgresql.org/message-id/flat/9defcb14-a918-13fe-4b80-a0b02ff85527%40postgrespro.ru > > [1] > https://wiki.postgresql.org/wiki/Developer_FAQ#Why_don.27t_you_use_raw_devices.2C_async-I.2FO.2C_.3Cinsert_your_favorite_wizz-bang_feature_here.3E.3F > > -- > Heikki Linnakangas > Neon (https://neon.tech) > >
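The global-variable conversion described in the quoted mail can be sketched as follows. MyDatabaseId is a real Postgres global, but treating it this way is only an illustration of the thread-local approach (Konstantin's experiment in [0]), not a patch:

```c
/* Sketch of the "turn them into thread-local variables" conversion:
 * a per-backend global becomes thread-local (__thread here; C11 offers
 * _Thread_local), so each connection thread sees its own copy.
 * Illustration only; the real conversion touches ~1666 such variables. */
static __thread int MyDatabaseId = 0;

static void *
backend_main(void *arg)
{
    /* each thread writes only its own copy of the variable */
    MyDatabaseId = *(int *) arg;
    return (void *) (long) MyDatabaseId;
}
```

The alternative mentioned in the original mail, a Session struct passed everywhere (or reached via a single thread-local pointer), trades this per-variable annotation for a larger mechanical refactoring of function signatures.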
On Sat, Jun 10, 2023 at 11:32 PM Hannu Krosing <hannuk@google.com> wrote: > > On Mon, Jun 5, 2023 at 4:52 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > If there are no major objections, I'm going to update the developer FAQ, > > removing the excuses there for why we don't use threads [1]. > > I think it is not wise to start the wholesale removal of the objections there. > > But I think it is worthwhile to revisit the section about threads and > maybe split out the historic part which is no more true, and provide > both pros and cons for these. > > I started with this short summary from the discussion in this thread, > feel free to expand, argue, fix :) > * is current excuse > -- is counterargument or ack > ---------------- > As an example, threads are not yet used instead of multiple processes > for backends because: > * Historically, threads were poorly supported and buggy. > -- yes they were, not relevant now when threads are well-supported and non-buggy > > * An error in one backend can corrupt other backends if they're > threads within a single process > -- still valid for silent corruption > -- for detected crash - yes, but we are restarting all backends in > case of crash anyway. > > * Speed improvements using threads are small compared to the remaining > backend startup time. > -- we now have some measurements that show significant performance > improvements not related to startup time > > * The backend code would be more complex. 
> -- this is still the case > -- even more worrisome is that all extensions also need to be rewritten > -- and many incompatibilities will be silent and take potentially years to find > > * Terminating backend processes allows the OS to cleanly and quickly > free all resources, protecting against memory and file descriptor > leaks and making backend shutdown cheaper and faster > -- still true > > * Debugging threaded programs is much harder than debugging worker > processes, and core dumps are much less useful > -- this was countered by claiming that > -- by now we have reasonable debugger support for threads > -- there is no direct debugger support for debugging the exact > system set up like PostgreSQL processes + shared memory > > * Sharing of read-only executable mappings and the use of > shared_buffers means processes, like threads, are very memory > efficient > -- this seems to say that the current process model is as good as threads ? > -- there were a few counterarguments > -- per-backend virtual memory mapping can add up to significant > amount of extra RAM usage > -- the discussion did not yet touch various per-backend caches > (pg_catalog cache, statement cache) which are arguably easier to > implement in threaded model > -- TLB reload at each process switch is expensive and would be > mostly avoided in case of threads I think it is worth mentioning that parallel worker infrastructure will be simplified with threaded models e.g. 'parallel query', and 'parallel vacuum'. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 6/10/23 13:20, Dave Cramer wrote: > > > On Fri, 9 Jun 2023 at 18:29, Stephen Frost <sfrost@snowman.net > <mailto:sfrost@snowman.net>> wrote: > > Greetings, > > * Dave Cramer (davecramer@postgres.rocks) wrote: > > One thing I can think of is upgrading. AFAIK dump and restore is > the only > > way to change the on disk format. > > Presuming that eventually we will be forced to change the on disk > format it > > would be nice to be able to do so in a manner which does not force > long > > down times > > There is an ongoing effort moving in this direction. The $subject isn't > great, but this patch set (which we are currently working on > updating...): https://commitfest.postgresql.org/43/3986/ > <https://commitfest.postgresql.org/43/3986/> attempts > changing a lot of currently compile-time block-size pieces to be > run-time which would open up the possibility to have a different page > format for, eg, different tablespaces. Possibly even different block > sizes. We'd certainly welcome discussion from others who are > interested. > > Thanks, > > Stephen > > > Upgrading was just one example of difficult problems that need to be > addressed. My thought was that before we commit to something as > potentially resource intensive as changing the threading model we > compile a list of other "big issues" and prioritize. > I doubt anyone expects the community to commit to the threading switch in this sense - drop everything else and just start working on this (pretty massive) change. Not going to happen. > I realize open source is more of a scratch your itch kind of development > model, but I'm not convinced the random walk that entails is the > appropriate way to move forward. At the very least I'd like us to > question it. I may be missing something, but it's not clear to me whether you argue for the open source approach or against it. 
I personally think it's perfectly fine for people to work on scratching their itch and focus on stuff that yields value to them (or their customers). And I think the only way to succeed at the threading switch is within this very framework - split it into (much) smaller steps that are beneficial on their own and scratch some other itch. For example, we have issues with large numbers of connections and we've discussed stuff like built-in connection pooling etc. for a very long time (including this thread). But we have session state in various places in process private memory, which makes it borderline impossible and thus we don't have anything built-in. IIUC the threading work would need to isolate/define the session state anyway, so perhaps it could do it in a way that'd also work for the connection pooling (with processes)? Which would mean this particular change is immediately beneficial even without the threading switch (which I'd expect to take a considerable amount of time). In a way, I think this "split into independently beneficial steps" strategy is the only option with a meaningful chance of success. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 12, 2023, at 13:53, Tomas Vondra wrote: > In a way, I think this "split into independently beneficial steps" > strategy is the only option with a meaningful chance of success. +1 /Joel
Is the following true or not? 1. If we switch processes to threads but leave the amount of session local variables unchanged, there would be hardly any performance gain. 2. If we move some backend's local variables into shared memory then the performance gain would be very near to what we get with threads having equal amount of session-local variables. In other words, the overall goal in principle is to gain from less memory copying wherever it doesn't add the burden of locks for concurrent variables access? Regards, Pavel Borisov, Supabase
Hi, On 2023-06-12 16:23:14 +0400, Pavel Borisov wrote: > Is the following true or not? > > 1. If we switch processes to threads but leave the amount of session > local variables unchanged, there would be hardly any performance gain. False. > 2. If we move some backend's local variables into shared memory then > the performance gain would be very near to what we get with threads > having equal amount of session-local variables. False. > > In other words, the overall goal in principle is to gain from less > memory copying wherever it doesn't add the burden of locks for > concurrent variables access? False. Those points seem pretty much unrelated to the potential gains from switching to a threading model. The main advantages are: 1) We'd gain from being able to share state more efficiently (using normal pointers) and more dynamically (not needing to pre-allocate). That'd remove a good amount of complexity. As an example, consider the work we need to do to ferry tuples from one process to another. Even if we just continue to use shm_mq, in a threading world we could just put a pointer in the queue, but have the tuple data be shared between the processes etc. Eventually this could include removing the 1:1 connection<->process/thread model. That's possible to do with processes as well, but considerably harder. 2) Making context switches cheaper / sharing more resources at the OS and hardware level. Greetings, Andres Freund
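Andres' shm_mq point can be illustrated with a toy sketch (this is not PostgreSQL's actual shm_mq API; PtrQueue, Tuple, and run_demo are made-up names for the illustration): once producer and consumer live in one address space, the queue only needs to carry a pointer, and the tuple payload itself is never copied.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tuple: in a multi-process world its bytes must be copied
 * into a shared-memory queue; with threads, passing the pointer suffices. */
typedef struct { int len; char data[64]; } Tuple;

typedef struct {
    Tuple *slot;              /* one-element "queue" holding a pointer */
    pthread_mutex_t lock;
    pthread_cond_t filled;
    int has_item;
} PtrQueue;

static void queue_init(PtrQueue *q) {
    q->slot = NULL;
    q->has_item = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->filled, NULL);
}

static void queue_put(PtrQueue *q, Tuple *t) {
    pthread_mutex_lock(&q->lock);
    q->slot = t;              /* no memcpy of the tuple payload */
    q->has_item = 1;
    pthread_cond_signal(&q->filled);
    pthread_mutex_unlock(&q->lock);
}

static Tuple *queue_get(PtrQueue *q) {
    pthread_mutex_lock(&q->lock);
    while (!q->has_item)
        pthread_cond_wait(&q->filled, &q->lock);
    Tuple *t = q->slot;
    q->has_item = 0;
    pthread_mutex_unlock(&q->lock);
    return t;
}

static void *producer(void *arg) {
    PtrQueue *q = arg;
    Tuple *t = malloc(sizeof(Tuple));
    t->len = (int) strlen("hello");
    memcpy(t->data, "hello", t->len + 1);
    queue_put(q, t);          /* worker hands over just the pointer */
    return NULL;
}

/* Returns the received tuple; caller frees it. */
static Tuple *run_demo(void) {
    PtrQueue q;
    pthread_t th;
    queue_init(&q);
    pthread_create(&th, NULL, producer, &q);
    Tuple *t = queue_get(&q); /* consumer sees the same heap object */
    pthread_join(th, NULL);
    return t;
}
```

In the process model the equivalent step needs the payload serialized into a preallocated shared-memory ring; here the handoff is a single pointer store under a lock.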
On 10/06/2023 21:01, Hannu Krosing wrote: > On Mon, Jun 5, 2023 at 4:52 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> >> If there are no major objections, I'm going to update the developer FAQ, >> removing the excuses there for why we don't use threads [1]. > > I think it is not wise to start the wholesale removal of the objections there. > > But I think it is worthwhile to revisit the section about threads and > maybe split out the historic part which is no more true, and provide > both pros and cons for these. > I started with this short summary from the discussion in this thread, > feel free to expand, argue, fix :) > * is current excuse > -- is counterargument or ack Thanks, that's a good idea. > * Speed improvements using threads are small compared to the remaining > backend startup time. > -- we now have some measurements that show significant performance > improvements not related to startup time Also, I don't expect much performance gain directly from switching to threads. The point is that switching to a multi-threaded model makes possible, or at least greatly simplifies, a lot of other development. Which can then help with the backend startup time, among other things. For example, a shared catalog cache. > * The backend code would be more complex. > -- this is still the case I don't quite buy that. A multi-threaded model isn't inherently more complex than a multi-process model. Just different. Sure, the transition period will be more complex, when we need to support both models. But in the long run, if we can remove the multi-process mode, we can make a lot of things *simpler*. > -- even more worrisome is that all extensions also need to be rewritten "rewritten" is an exaggeration. Yes, extensions will need to adapt, similar to the core code. But I hope it will be pretty mechanical work, marking global variables as thread-local and such. Many extensions will work with little to no changes. 
> -- and many incompatibilities will be silent and take potentially years to find IMO this is the most scary part of all this. I'm optimistic that we can have enough compiler support and tooling to catch most issues. But we don't know for sure at this point. > * Terminating backend processes allows the OS to cleanly and quickly > free all resources, protecting against memory and file descriptor > leaks and making backend shutdown cheaper and faster > -- still true Yep. I'm not too worried about PostgreSQL code, our memory contexts and resource owners are very good at stopping leaks. But 3rd party libraries could pose hard problems. IIRC we still have a leak with the LLVM JIT code, for example. We should fix that anyway, of course, but the multi-process model is more forgiving with leaks like that. -- Heikki Linnakangas Neon (https://neon.tech)
On Mon, Jun 12, 2023 at 12:24:30PM -0700, Andres Freund wrote: > Those points seems pretty much unrelated to the potential gains from switching > to a threading model. The main advantages are: > > 1) We'd gain from being able to share state more efficiently (using normal > pointers) and more dynamically (not needing to pre-allocate). That'd remove > a good amount of complexity. As an example, consider the work we need to do > to ferry tuples from one process to another. Even if we just continue to > use shm_mq, in a threading world we could just put a pointer in the queue, > but have the tuple data be shared between the processes etc. > > Eventually this could include removing the 1:1 connection<->process/thread > model. That's possible to do with processes as well, but considerably > harder. > > 2) Making context switches cheaper / sharing more resources at the OS and > hardware level. Yes. FWIW, while reading the thread, parallel workers struck me as the first area that would benefit from all that. Could it be easier to figure out the incremental pieces if working on a new node doing a Gather based on threads, for instance? -- Michael
On 12.06.2023 3:23 PM, Pavel Borisov wrote: > Is the following true or not? > > 1. If we switch processes to threads but leave the amount of session > local variables unchanged, there would be hardly any performance gain. > 2. If we move some backend's local variables into shared memory then > the performance gain would be very near to what we get with threads > having equal amount of session-local variables. > > In other words, the overall goal in principle is to gain from less > memory copying wherever it doesn't add the burden of locks for > concurrent variables access? > > Regards, > Pavel Borisov, > Supabase > > IMHO both statements are not true. Switching to threads will cause less context-switch overhead (because all threads share the same memory space and so preserve the TLB). How big will this advantage be? In my prototype I got ~10%, but maybe it is possible to find workloads where it is larger. A Postgres backend is "thick" not because of a large number of local variables, but because of its local caches: catalog cache, relation cache, prepared statements cache,... If they are not rewritten, then a backend may still consume a lot of memory even if it is a thread rather than a process. But threads simplify development of global caches, although that can also be done with DSM.
At Tue, 13 Jun 2023 09:55:36 +0300, Konstantin Knizhnik <knizhnik@garret.ru> wrote in > Postgres backend is "thick" not because of large number of local > variables. > It is because of local caches: catalog cache, relation cache, prepared > statements cache,... > If they are not rewritten, then backend still may consume a lot of > memory even if it will be thread rather then process. > But threads simplify development of global caches, although it can be > done with DSM. With the process model, that local stuff is flushed out upon reconnection. If we switch to the thread model, we will need an expiration mechanism for that stuff. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 13.06.2023 10:55 AM, Kyotaro Horiguchi wrote: > At Tue, 13 Jun 2023 09:55:36 +0300, Konstantin Knizhnik <knizhnik@garret.ru> wrote in >> Postgres backend is "thick" not because of large number of local >> variables. >> It is because of local caches: catalog cache, relation cache, prepared >> statements cache,... >> If they are not rewritten, then backend still may consume a lot of >> memory even if it will be thread rather then process. >> But threads simplify development of global caches, although it can be >> done with DSM. > With the process model, that local stuff are flushed out upon > reconnection. If we switch to the thread model, we will need an > expiration mechanism for those stuff. We already have an invalidation mechanism. It will also be used in case of a shared cache, but we do not need to send invalidations to all backends. I do not completely understand your point. Right now caches (for example the catalog cache) are not limited at all. So if you have a very large database schema, then this cache will consume a lot of memory (multiplied by the number of backends). The fact that it is flushed out upon reconnection cannot help much: what if backends are not going to disconnect? In case of a shared cache we will have to address the same problem: whether this cache should be limited (with some replacement discipline such as LRU) or unlimited. In case of a shared cache, the size of the cache is less critical because it is not multiplied by the number of backends. So we can assume that the catalog and relation caches should always fit in memory (otherwise significant rewriting of all Postgres code working with relations will be needed). But Postgres also has temporary tables. For them we may need a local backend cache in any case. The global temp table patch was not approved, so we still have to deal with these awful temp tables. In any case I do not understand why we need an expiration mechanism for these caches. 
If there is some relation, then information about this relation should be kept in the cache as long as this relation is alive. If there is not enough memory to cache information about all relations, then we may need some replacement algorithm. But I do not think that there is any sense in removing some item from the cache just because it is too old.
At Tue, 13 Jun 2023 11:20:56 +0300, Konstantin Knizhnik <knizhnik@garret.ru> wrote in > > > On 13.06.2023 10:55 AM, Kyotaro Horiguchi wrote: > > At Tue, 13 Jun 2023 09:55:36 +0300, Konstantin Knizhnik > > <knizhnik@garret.ru> wrote in > >> Postgres backend is "thick" not because of large number of local > >> variables. > >> It is because of local caches: catalog cache, relation cache, prepared > >> statements cache,... > >> If they are not rewritten, then backend still may consume a lot of > >> memory even if it will be thread rather then process. > >> But threads simplify development of global caches, although it can be > >> done with DSM. > > With the process model, that local stuff are flushed out upon > > reconnection. If we switch to the thread model, we will need an > > expiration mechanism for those stuff. > > We already have invalidation mechanism. It will be also used in case > of shared cache, but we do not need to send invalidations to all > backends. Invalidation is not expiration. > I do not completely understand your point. > Right now caches (for example catalog cache) is not limited at all. > So if you have very large database schema, then this cache will > consume a lot of memory (multiplied by number of > backends). The fact that it is flushed out upon reconnection can not > help much: what if backends are not going to disconnect? Right now, if one out of many backends creates a huge system catalog cache, it can be cleared upon disconnection. The same client can repeat this process, but users can ensure such situations don't persist. However, with the thread model, we won't be able to clear parts of the cache that aren't required by the active backends anymore. (Of course with threads, we can avoid duplications, though.) > In case of shared cache we will have to address the same problem: > whether this cache should be limited (with some replacement discipline > as LRU). > Or it is unlimited. 
> In case of shared cache, size of the cache is less > critical because it is not multiplied by number of backends. Yes. > So we can assume that catalog and relation cache should always fir in > memory (otherwise significant rewriting of all Postgtres code working > with relations will be needed). I'm not sure that is true... but likely to be? > But Postgres also have temporary tables. For them we may need local > backend cache in any case. > Global temp table patch was not approved so we still have to deal with > this awful temp tables. > > In any case I do not understand why do we need some expiration > mechanism for this caches. I don't think it is efficient for PostgreSQL to consume a large amount of memory for seldom-used content. While we may not need an expiration mechanism for moderate use cases, I have observed instances where a single process hogs a significant amount of memory, particularly for intermittent tasks. > If there is some relation than information about this relation should > be kept in the cache as long as this relation is alive. > If there is not enough memory to cache information about all > relations, then we may need some replacement algorithm. > But I do not think that there is any sense to remove some item fro the > cache just because it is too old. Ah, I see. I am fine with a replacement mechanism, but the eviction algorithm seems almost identical to the expiration algorithm. The algorithm will not be simply driven by object age, but I'm not sure we need more than access frequency. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
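The replacement policy being discussed here can be made concrete with a small sketch (illustrative only, not PostgreSQL code; Cache, Entry, and the three-slot capacity are invented for the example): a bounded cache that evicts the least recently used entry, so recency of access rather than raw age picks the victim.

```c
#include <assert.h>
#include <string.h>

/* Toy bounded cache with least-recently-used replacement; integer keys
 * stand in for catalog object identifiers. */
#define CACHE_SIZE 3

typedef struct { int key; int valid; unsigned long last_used; } Entry;

typedef struct { Entry slot[CACHE_SIZE]; unsigned long tick; } Cache;

static void cache_init(Cache *c) { memset(c, 0, sizeof(*c)); }

/* Returns 1 on hit; on miss returns 0 and installs the key,
 * evicting the least recently used entry if the cache is full. */
static int cache_lookup(Cache *c, int key) {
    int victim = 0;
    c->tick++;
    for (int i = 0; i < CACHE_SIZE; i++) {
        if (c->slot[i].valid && c->slot[i].key == key) {
            c->slot[i].last_used = c->tick;   /* refresh recency on access */
            return 1;
        }
    }
    for (int i = 1; i < CACHE_SIZE; i++)      /* prefer a free slot, else LRU */
        if (!c->slot[i].valid ||
            (c->slot[victim].valid &&
             c->slot[i].last_used < c->slot[victim].last_used))
            victim = i;
    c->slot[victim] = (Entry){ key, 1, c->tick };
    return 0;
}
```

A frequency-weighted variant, as Kyotaro suggests, would only change which counter the victim scan compares; the bounded-size structure stays the same.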
On 6/13/23 10:20, Konstantin Knizhnik wrote: > The fact that it is flushed out upon reconnection can not > help much: what if backends are not going to disconnect? This is why many connection pools have a maximum connection lifetime which can be configured. So in practice flushing all caches on disconnect helps a lot. The nice proper solution might very well be adding maximum cache sizes and replacement, but it obviously makes the cache more complex and adds a new GUC. Probably worth it, but flushing caches on disconnect is a simple solution which works well in practice for many, but not all, workloads. Andreas
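For illustration, this is exactly the knob external poolers expose today; for example, PgBouncer's server_lifetime setting caps how long a server connection (and therefore its backend's private caches) can live. The value below is an arbitrary example, not a recommendation:

```ini
[pgbouncer]
; Close a server connection after it has been in use for this many
; seconds, forcing a fresh backend (and thus fresh caches) afterwards.
server_lifetime = 1800
```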
On 13.06.2023 11:46 AM, Kyotaro Horiguchi wrote: >> So we can assume that catalog and relation cache should always fit in >> memory (otherwise significant rewriting of all Postgres code working >> with relations will be needed). > I'm not sure that is true... but likely to be? Sorry, looks like I was wrong. Right now access to the sys/cat/rel caches is protected by a reference counter, so we can easily add some replacement algorithm for these caches. > I don't think it is efficient for PostgreSQL to consume a large > amount of memory for seldom-used content. While we may not need an > expiration mechanism for moderate use cases, I have observed instances > where a single process hogs a significant amount of memory, > particularly for intermittent tasks. Usually the system catalog is small enough and does not cause any problems with memory consumption. But partitioned and temporary tables can cause bloat of the catalog. In such cases some eviction mechanism will be really useful. But I do not think that this is somehow related to using threads instead of processes: the question of whether to use a private or shared cache is not directly tied to the threads vs. processes choice. Yes, threads make implementation of a shared cache much easier, but it can also be done using dynamic memory segments. Definitely a shared cache has its pros and cons; first of all, it requires synchronization, which may have a negative impact on performance. I have made an attempt to combine both caches: use a relatively small per-backend local cache and a large shared cache. I wonder what people think about the idea of making backends less thick by using a shared cache.
At Wed, 14 Jun 2023 08:46:05 +0300, Konstantin Knizhnik <knizhnik@garret.ru> wrote in > But I do not think that it is somehow related with using threads > instead of process. > The question whether to use private or shared cache is not directly > related to threads vs. process choice. Yeah, I unconsciously conflated the two things. We can use a per-thread cache with multithreading. > Yes, threads makes implementation of shared cache much easier. But it > can be also done using dynamic > memory segments, Definitely shared cache has its pros and cons, first > if all it requires sycnhronization > which may have negative impact o performance. True. > I have made an attempt to combine both caches: use relatively small > per-backend local cache > and large shared cache. > I wonder what people think about the idea to make backends less thick > by using shared cache. I remember a relatively old thread about that: https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04 regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 6/14/23 09:01, Kyotaro Horiguchi wrote: > At Wed, 14 Jun 2023 08:46:05 +0300, Konstantin Knizhnik <knizhnik@garret.ru> wrote in >> But I do not think that it is somehow related with using threads >> instead of process. >> The question whether to use private or shared cache is not directly >> related to threads vs. process choice. > > Yeah, I unconsciously conflated the two things. We can use per-thread > cache on multithreading. For sure, and we can drop the cache when dropping the memory context. And in the first versions of an imagined threaded PostgreSQL I am sure that is how things will work. Then later someone will have to investigate which caches are worth making shared and what the eviction/expiration strategy should be. Andreas
On Mon, 12 Jun 2023 at 20:24, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2023-06-12 16:23:14 +0400, Pavel Borisov wrote: > > Is the following true or not? > > > > 1. If we switch processes to threads but leave the amount of session > > local variables unchanged, there would be hardly any performance gain. > > False. > > > > 2. If we move some backend's local variables into shared memory then > > the performance gain would be very near to what we get with threads > > having equal amount of session-local variables. > > False. > > > > In other words, the overall goal in principle is to gain from less > > memory copying wherever it doesn't add the burden of locks for > > concurrent variables access? > > False. > > Those points seems pretty much unrelated to the potential gains from switching > to a threading model. The main advantages are: I think that they're practical performance-related questions about the benefits of performing a technical migration that could involve significant development time, take years to complete, and uncover problems that cause reliability issues for a stable, proven database management system. > 1) We'd gain from being able to share state more efficiently (using normal > pointers) and more dynamically (not needing to pre-allocate). That'd remove > a good amount of complexity. As an example, consider the work we need to do > to ferry tuples from one process to another. Even if we just continue to > use shm_mq, in a threading world we could just put a pointer in the queue, > but have the tuple data be shared between the processes etc. > > Eventually this could include removing the 1:1 connection<->process/thread > model. That's possible to do with processes as well, but considerably > harder. This reads like a code quality argument: that's worthwhile, but I don't see how it supports your 'False' assertions. 
Do two queries running in separate processes spend much time allocating and waiting on resources that could be shared within a single thread? > 2) Making context switches cheaper / sharing more resources at the OS and > hardware level. That seems valid. Even so, I would expect that for many queries, I/O access and row processing time is the bulk of the work, and that context-switches to/from other query processes is relatively negligible.
On Tue, Jun 13, 2023 at 9:55 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 13 Jun 2023 09:55:36 +0300, Konstantin Knizhnik <knizhnik@garret.ru> wrote in > > Postgres backend is "thick" not because of large number of local > > variables. > > It is because of local caches: catalog cache, relation cache, prepared > > statements cache,... > > If they are not rewritten, then backend still may consume a lot of > > memory even if it will be thread rather then process. > > But threads simplify development of global caches, although it can be > > done with DSM. > > With the process model, that local stuff are flushed out upon > reconnection. If we switch to the thread model, we will need an > expiration mechanism for those stuff. The part that cannot be so easily solved is that "the local stuff" can include some leakage that is not directly controlled by us. I remember a few times when memory leaks in some PostGIS packages caused slow memory exhaustion, and the simple fix was limiting connection lifetime to something between 15 minutes and an hour. The main problem here is that PostGIS uses a few tens of other GPL GIS-related packages which are all changing independently, and thus it is quite hard to be sure that none of these have developed a leak. You also likely cannot just stop upgrading these, as they also contain security fixes. I have no idea what the fix could be in the case of a threaded server.
On Wed, Jun 14, 2023 at 3:16 PM James Addison <jay@jp-hosting.net> wrote: > I think that they're practical performance-related questions about the > benefits of performing a technical migration that could involve > significant development time, take years to complete, and uncover > problems that cause reliability issues for a stable, proven database > management system. I don't. I think they're reflecting confusion about what the actual, practical path forward is. For a first cut at this, all of our global variables become thread-local. Every single last one of them. So there's no savings of the type described in that email. We do each and every thing just as we do it today, except that it's all in different parts of a single address space instead of different address spaces with a chunk of shared memory mapped into each one. Syscaches don't change, catcaches don't change, memory copying is not reduced, literally nothing changes. The coding model is just as it is today. Except for decorating global variables, virtually no backend code needs to notice or care about the transition. There are a few exceptions. For instance, TopMemoryContext would need to be deleted explicitly, and the FD caching stuff would have to be revised, because it uses up all the FDs that the process can open, and having many threads doing that in a single process isn't going to work. There are probably some other things that I'm forgetting, but the typical effect on the average bit of backend code should be very, very low. If it isn't, we're doing it wrong. So, I think saying "oh, this is going to destabilize PostgreSQL for years" is just fear-mongering. If someone proposes a patch that we think is going to have that effect, we should (and certainly will) reject it. But I see no reason why we can't have a good patch for this where most code changes only in mechanical ways that are easy to validate. 
> This reads like a code quality argument: that's worthwhile, but I > don't see how it supports your 'False' assertions. Do two queries > running in separate processes spend much time allocating and waiting > on resources that could be shared within a single thread? I don't have any idea what this has to do with what Andres was talking about, honestly. However, there certainly are cases of the thing you're talking about here. Having many backends separately open the same file means we've got a whole bunch of different file descriptors accessing the same file instead of just one. That does have a meaningful cost on some workloads. Passing tuples between cooperating processes that are jointly executing a parallel query is costly in the current scheme, too. There might be ways to improve on that somewhat even without threads, but if you don't think that the process model made getting parallel query working harder and less efficient, I'm here as the guy who wrote a lot of that code to tell you that it very much did. > That seems valid. Even so, I would expect that for many queries, I/O > access and row processing time is the bulk of the work, and that > context-switches to/from other query processes is relatively > negligible. That's completely true, but there are ALSO many OTHER situations in which the overhead of frequent context switching is absolutely crushing. You might as well argue that umbrellas don't need to exist because there are lots of sunny days. -- Robert Haas EDB: http://www.enterprisedb.com
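The "decorating global variables" step Robert describes above is mechanical, and can be sketched outside PostgreSQL like this (MyCounter is a made-up variable for the illustration, not an actual backend global): with C11 _Thread_local, each backend thread keeps its own copy of the variable even though all threads share one address space.

```c
#include <assert.h>
#include <pthread.h>

/* Before: one copy per process, implicitly per-backend state.
 *   static int MyCounter = 0;
 * After: one copy per thread, so per-backend state survives the move
 * into a shared address space unchanged. */
static _Thread_local int MyCounter = 0;

static void *backend_thread(void *arg) {
    int n = *(int *) arg;
    for (int i = 0; i < n; i++)
        MyCounter++;          /* touches this thread's copy only */
    *(int *) arg = MyCounter; /* report the final per-thread value */
    return NULL;
}

/* Runs two "backends"; returns 1 if their counters stayed independent
 * and the main thread's copy was never touched. */
static int counters_are_independent(void) {
    pthread_t a, b;
    int na = 1000, nb = 2000;
    pthread_create(&a, NULL, backend_thread, &na);
    pthread_create(&b, NULL, backend_thread, &nb);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return na == 1000 && nb == 2000 && MyCounter == 0;
}
```

This is why the conversion is expected to be largely mechanical: the code that reads and writes the variable does not change at all, only the declaration does.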
Hi, On 2023-06-13 16:55:12 +0900, Kyotaro Horiguchi wrote: > At Tue, 13 Jun 2023 09:55:36 +0300, Konstantin Knizhnik <knizhnik@garret.ru> wrote in > > Postgres backend is "thick" not because of large number of local > > variables. > > It is because of local caches: catalog cache, relation cache, prepared > > statements cache,... > > If they are not rewritten, then backend still may consume a lot of > > memory even if it will be thread rather then process. > > But threads simplify development of global caches, although it can be > > done with DSM. > > With the process model, that local stuff are flushed out upon > reconnection. If we switch to the thread model, we will need an > expiration mechanism for those stuff. Isn't that just doing something like MemoryContextDelete(TopMemoryContext) at the end of proc_exit() (or its thread equivalent)? Greetings, Andres Freund
On Wed, Jun 14, 2023 at 3:46 PM Hannu Krosing <hannuk@google.com> wrote: > I remember a few times when memory leaks in some PostGIS packages > cause slow memory exhaustion and the simple fix was limiting > connection lifetime to something between 15 min and an hour. > > The main problem here is that PostGIS uses a few tens of other GPL GIS > related packages which are all changing independently and thus it is > quite hard to be sure that none of these have developed a leak. And > you also likely can not just stop upgrading these as they also contain > security fixes. > > I have no idea what the fix could be in case of threaded server. Presumably, when a thread exits, we MemoryContextDelete(TopMemoryContext). If the leak is into any memory context managed by PostgreSQL, this still frees the memory. But it might not be. Right now, if a library does a malloc() that it doesn't free() every once in a while, it's no big deal. If it does it too often, it's a problem now, too. But if it does it only every now and then, process exit will prevent accumulation over time. In a threaded model, that isn't true any longer: those allocations will accumulate until we OOM. And IMHO that's definitely a very significant downside of this direction. I don't think it should be dispositive because such problems are, hopefully, fixable, whereas some of the problems caused by the process model are basically unfixable except by not using it any more. However, if we lived in a world where both models were supported and a particular user said, "hey, I'm sticking with the process model because I don't trust my third-party libraries not to leak," I would be like "yep, I totally get it." -- Robert Haas EDB: http://www.enterprisedb.com
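A toy stand-in for that per-thread cleanup (context_alloc and context_delete_all are invented names, not PostgreSQL's MemoryContext API): allocations routed through the context are reclaimed wholesale when the backend thread finishes, which is the thread-model analogue of MemoryContextDelete(TopMemoryContext), while raw malloc() calls from third-party code would bypass the list and accumulate, exactly the leak Robert describes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Every context allocation is linked into a per-thread list, so thread
 * exit can free the whole list at once even if individual chunks were
 * never freed by the code that requested them. */
typedef struct Chunk { struct Chunk *next; } Chunk;

static _Thread_local Chunk *context_head = NULL;
static _Thread_local long context_live = 0;

static void *context_alloc(size_t size) {
    Chunk *c = malloc(sizeof(Chunk) + size);
    c->next = context_head;
    context_head = c;
    context_live++;
    return c + 1;             /* usable memory follows the header */
}

static void context_delete_all(void) {
    while (context_head) {
        Chunk *next = context_head->next;
        free(context_head);
        context_head = next;
        context_live--;
    }
}

static void *backend_thread(void *arg) {
    for (int i = 0; i < 100; i++)
        context_alloc(64);    /* "leaked" by the backend logic... */
    context_delete_all();     /* ...but reclaimed wholesale at exit */
    *(long *) arg = context_live;
    return NULL;
}

/* Runs one "backend" thread; returns how many chunks survived it. */
static long run_backend(void) {
    pthread_t th;
    long live = -1;
    pthread_create(&th, NULL, backend_thread, &live);
    pthread_join(th, NULL);
    return live;
}
```

A plain malloc() inside backend_thread would not appear in context_live at all, which is the point of the concern: context-managed memory is safe, untracked allocations by external libraries are not.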
On Wed, 14 Jun 2023 at 20:48, Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jun 14, 2023 at 3:16 PM James Addison <jay@jp-hosting.net> wrote: > > I think that they're practical performance-related questions about the > > benefits of performing a technical migration that could involve > > significant development time, take years to complete, and uncover > > problems that cause reliability issues for a stable, proven database > > management system. > > I don't. I think they're reflecting confusion about what the actual, > practical path forward is. Ok. My concern is that the balance between the downstream ecosystem impact (people and processes that use PIDs to identify, monitor and manage query and background processes, for example) and the benefits (performance improvement for some -- but what kind of? -- workloads) seems unclear, and if it's unclear, it's less likely to be compelling. Pavel's message and questions seem to poke at some of the potential limitations of the performance improvements, and Andres' response mentions reduced complexity and reduced context-switching. Elsewhere I also see that TLB (translation lookaside buffer?) lookups in particular should see improvements. Those are good, but somewhat unquantified. The benefits are less of an immediate concern if there's going to be a migration/transition phase where both the process model and the thread model are available. But again, if the benefits of the threading model aren't clear, people are unlikely to want to switch, and I don't think that the cost for people and systems to migrate from tooling and methods built around processes will be zero. That could lead to a bad outcome, where the codebase includes both models and yet is unable to plan to simplify to one.
On Tue, 13 Jun 2023 at 07:55, Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > > > On 12.06.2023 3:23 PM, Pavel Borisov wrote: > > Is the following true or not? > > > > 1. If we switch processes to threads but leave the amount of session > > local variables unchanged, there would be hardly any performance gain. > > 2. If we move some backend's local variables into shared memory then > > the performance gain would be very near to what we get with threads > > having equal amount of session-local variables. > > > > In other words, the overall goal in principle is to gain from less > > memory copying wherever it doesn't add the burden of locks for > > concurrent variables access? > > > > Regards, > > Pavel Borisov, > > Supabase > > > > > IMHO both statements are not true. > Switching to threads will cause less context switch overhead (because > all threads are sharing the same memory space and so preserve TLB. > How big will be this advantage? In my prototype I got ~10%. But may be > it is possible to fin workloads when it is larger. Hi Konstantin - do you have code/links that you can share for the prototype and benchmarks used to gather those results?
Sorry, I have already shared the link:
https://github.com/postgrespro/postgresql.pthreads/
As you can see, the last commit was 6 years ago, when I stopped working on this project.
Why? I already tried to explain it:
- benefits from switching to threads were not so large. Maybe I just failed to find a proper workload, but it was a more or less expected result,
because most of the code was not changed - it uses the same sync primitives, the same local catalog/relation caches, ...
To take full advantage of the multithreaded model it is necessary to rewrite many components, especially those related to interprocess communication.
But maintaining such a fork of Postgres and keeping it synchronized with mainline requires too much effort, and I was not able to do it myself.
There are three different but related directions of improving current Postgres:
1. Replacing processes with threads
2. Builtin connection pooler
3. Lightweight backends (shared catalog/relation/prepared statements caches)
The motivations for such changes are also similar:
1. Increase Postgres scalability
2. Reduce memory consumption
3. Make Postgres better fit cloud and serverless requirements
I am not sure now which one should be addressed first, or whether they can be done together.
Replacing static variables with thread-local is the first and may be the easiest step.
It requires more or less mechanical changes. A more challenging thing is replacing private per-backend data structures
with shared ones (caches, file descriptors, ...)
On Thu, 15 Jun 2023 at 08:12, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>
> On 15.06.2023 1:23 AM, James Addison wrote:
> > On Tue, 13 Jun 2023 at 07:55, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> > > On 12.06.2023 3:23 PM, Pavel Borisov wrote:
> > > > Is the following true or not?
> > > >
> > > > 1. If we switch processes to threads but leave the amount of session
> > > > local variables unchanged, there would be hardly any performance gain.
> > > > 2. If we move some backend's local variables into shared memory then
> > > > the performance gain would be very near to what we get with threads
> > > > having equal amount of session-local variables.
> > > >
> > > > In other words, the overall goal in principle is to gain from less
> > > > memory copying wherever it doesn't add the burden of locks for
> > > > concurrent variables access?
> > > >
> > > > Regards,
> > > > Pavel Borisov,
> > > > Supabase
> > >
> > > IMHO both statements are not true.
> > > Switching to threads will cause less context switch overhead (because
> > > all threads are sharing the same memory space and so preserve the TLB).
> > > How big will this advantage be? In my prototype I got ~10%. But maybe
> > > it is possible to find workloads where it is larger.
> >
> > Hi Konstantin - do you have code/links that you can share for the
> > prototype and benchmarks used to gather those results?
>
> Sorry, I have already shared the link:
> https://github.com/postgrespro/postgresql.pthreads/

Nope, my mistake for not locating the existing link - thank you.

Is there a reason that parser-related files (flex/bison) are added as part of the changeset? (I'm trying to narrow it down to only the changes necessary for the functionality. So far it looks mostly fairly minimal, which is good. The adjustments to progname are another thing that look a bit unusual/maybe unnecessary for the feature.)

> As you can see, the last commit was 6 years ago, when I stopped working on this project.
> Why? I already tried to explain it:
> - benefits from switching to threads were not so large.
> Maybe I just failed to find a proper workload, but it was a more or less expected result,
> because most of the code was not changed - it uses the same sync primitives, the same local catalog/relation caches, ...
> To take full advantage of the multithreaded model it is necessary to rewrite many components, especially those related to interprocess communication.
> But maintaining such a fork of Postgres and keeping it synchronized with mainline requires too much effort, and I was not able to do it myself.

I get the feeling that there are probably certain query types or patterns where a significant, order-of-magnitude speedup is possible with threads - but yep, I haven't seen those described in detail yet on the mailing list (but as hinted by my not noticing the github link previously, maybe I'm not following the list closely enough).

What workloads did you try with your version of the project?

> There are three different but related directions of improving current Postgres:
> 1. Replacing processes with threads
> 2. Builtin connection pooler
> 3. Lightweight backends (shared catalog/relation/prepared statements caches)
>
> The motivations for such changes are also similar:
> 1. Increase Postgres scalability
> 2. Reduce memory consumption
> 3. Make Postgres better fit cloud and serverless requirements
>
> I am not sure now which one should be addressed first, or whether they can be done together.
>
> Replacing static variables with thread-local is the first and may be the easiest step.
> It requires more or less mechanical changes. A more challenging thing is replacing private per-backend data structures
> with shared ones (caches, file descriptors, ...)

Thank you. Personally I think that motivation two (reducing memory consumption) -- as long as it can be done without detrimentally affecting functionality or correctness, and without making the code harder to develop/understand -- could provide benefits for all three of the motivating cases (and, in fact, for non-cloud/serverful use cases too).
This is making me wonder about other performance/scalability areas that might not have been considered due to focus on the details of the existing codebase, but I'll save that for another thread and will try to learn more first.
On Thu, Jun 15, 2023 at 9:12 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> There are three different but related directions of improving current Postgres:
> 1. Replacing processes with threads

Here we could likely start with making parallel query multi-threaded. This would also remove the big blocker for parallelizing things like CREATE TABLE AS SELECT ..., where we are currently held back by the restriction that only the leader process can write.

> 2. Builtin connection pooler

Would definitely be a nice thing to have. And we could even start by integrating a non-threaded pooler like pgbouncer to run as a postgresql worker process (or two).

> 3. Lightweight backends (shared catalog/relation/prepared statements caches)

Shared prepared statement caches (which of course have to be per-user and per-database) would give the additional benefit of lightweight connection poolers not needing to track these. Currently the missing support for named prepared statements is one of the main hindrances to using pgbouncer with JDBC in transaction pooling mode (you can use it, but have to turn off automatic statement preparing).

> The motivations for such changes are also similar:
> 1. Increase Postgres scalability
> 2. Reduce memory consumption
> 3. Make Postgres better fit cloud and serverless requirements

The memory consumption reduction would be a big and clear win for many workloads. Also, just moving more things into shared memory will prepare us for the move to a threaded server (if it eventually happens).

> I am not sure now which one should be addressed first, or whether they can be done together.

Shared caches seem like a guaranteed win, at least on memory usage. There could be performance (and complexity) downsides for specific workloads, but they would be the same as for the threaded model, so this would also be a good learning opportunity.

> Replacing static variables with thread-local is the first and may be the easiest step.
I think we got our first patch doing this (as part of patches for running PG threaded on Solaris) quite early in the OSS development, could have been even in the last century :)

> It requires more or less mechanical changes. A more challenging thing is replacing private per-backend data structures
> with shared ones (caches, file descriptors, ...)

Indeed, sharing caches would also be part of the work that is needed for the threaded model, so anyone feeling strongly about moving to threads could start with this :)

---
Hannu
On Thu, Jun 15, 2023 at 10:41 AM James Addison <jay@jp-hosting.net> wrote:
>
> This is making me wonder about other performance/scalability areas
> that might not have been considered due to focus on the details of the
> existing codebase, but I'll save that for another thread and will try
> to learn more first.

A gradual move to more shared structures seems to be a way forward.

It should get us all the benefits of threading minus the need for TLB reloading and (in some cases) a reduction of per-process virtual memory mapping tables.

In any case we would need to implement all the locking and parallelism management of these shared structures that are not there in the current process architecture.

So a fair bit of work, but also the clearly defined benefits of
1) reduced memory usage
2) no need to rebuild caches for each new connection
3) no need to track PREPARE statements inside connection poolers.

There can be extra complexity when different connections use the same prepared statement name (say "PREP001") for different queries. For this we'll likely need good cooperation with the connection pooler, where it passes some kind of client connection id along at transaction start.
One more unexpected benefit of having shared caches would be easing access to other databases.

If the system caches are there for all databases anyway, then it becomes much easier to make queries using objects from multiple databases.

Note that this does not strictly need threads, just shared caches.

On Thu, Jun 15, 2023 at 11:04 AM Hannu Krosing <hannuk@google.com> wrote:
>
> On Thu, Jun 15, 2023 at 10:41 AM James Addison <jay@jp-hosting.net> wrote:
> >
> > This is making me wonder about other performance/scalability areas
> > that might not have been considered due to focus on the details of the
> > existing codebase, but I'll save that for another thread and will try
> > to learn more first.
>
> A gradual move to more shared structures seems to be a way forward.
>
> It should get us all the benefits of threading minus the need for TLB
> reloading and (in some cases) a reduction of per-process virtual memory
> mapping tables.
>
> In any case we would need to implement all the locking and parallelism
> management of these shared structures that are not there in the
> current process architecture.
>
> So a fair bit of work, but also the clearly defined benefits of
> 1) reduced memory usage
> 2) no need to rebuild caches for each new connection
> 3) no need to track PREPARE statements inside connection poolers.
>
> There can be extra complexity when different connections use the same
> prepared statement name (say "PREP001") for different queries.
> For this we'll likely need good cooperation with the connection
> pooler, where it passes some kind of client connection id along at
> transaction start.
On 15.06.2023 11:41 AM, James Addison wrote:
> On Thu, 15 Jun 2023 at 08:12, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>> On 15.06.2023 1:23 AM, James Addison wrote:
>>> On Tue, 13 Jun 2023 at 07:55, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>>>> On 12.06.2023 3:23 PM, Pavel Borisov wrote:
>>>>> Is the following true or not?
>>>>>
>>>>> 1. If we switch processes to threads but leave the amount of session
>>>>> local variables unchanged, there would be hardly any performance gain.
>>>>> 2. If we move some backend's local variables into shared memory then
>>>>> the performance gain would be very near to what we get with threads
>>>>> having equal amount of session-local variables.
>>>>>
>>>>> In other words, the overall goal in principle is to gain from less
>>>>> memory copying wherever it doesn't add the burden of locks for
>>>>> concurrent variables access?
>>>>>
>>>>> Regards,
>>>>> Pavel Borisov,
>>>>> Supabase
>>>>
>>>> IMHO both statements are not true.
>>>> Switching to threads will cause less context switch overhead (because
>>>> all threads are sharing the same memory space and so preserve the TLB).
>>>> How big will this advantage be? In my prototype I got ~10%. But maybe
>>>> it is possible to find workloads where it is larger.
>>>
>>> Hi Konstantin - do you have code/links that you can share for the
>>> prototype and benchmarks used to gather those results?
>>
>> Sorry, I have already shared the link:
>> https://github.com/postgrespro/postgresql.pthreads/
>
> Nope, my mistake for not locating the existing link - thank you.
>
> Is there a reason that parser-related files (flex/bison) are added as
> part of the changeset? (I'm trying to narrow it down to only the
> changes necessary for the functionality. So far it looks mostly
> fairly minimal, which is good. The adjustments to progname are
> another thing that look a bit unusual/maybe unnecessary for the
> feature.)

Sorry, absolutely no reason - just my fault.
>> As you can see, the last commit was 6 years ago, when I stopped working on this project.
>> Why? I already tried to explain it:
>> - benefits from switching to threads were not so large. Maybe I just failed to find a proper workload, but it was a more or less expected result,
>> because most of the code was not changed - it uses the same sync primitives, the same local catalog/relation caches, ...
>> To take full advantage of the multithreaded model it is necessary to rewrite many components, especially those related to interprocess communication.
>> But maintaining such a fork of Postgres and keeping it synchronized with mainline requires too much effort, and I was not able to do it myself.
>
> I get the feeling that there are probably certain query types or
> patterns where a significant, order-of-magnitude speedup is possible
> with threads - but yep, I haven't seen those described in detail yet
> on the mailing list (but as hinted by my not noticing the github link
> previously, maybe I'm not following the list closely enough).
>
> What workloads did you try with your version of the project?

I do not remember now precisely (6 years have passed). But I definitely tried pgbench, especially read-only pgbench (to be more CPU- rather than disk-bound).
On 15.06.2023 12:04 PM, Hannu Krosing wrote:
> So a fair bit of work, but also the clearly defined benefits of
> 1) reduced memory usage
> 2) no need to rebuild caches for each new connection
> 3) no need to track PREPARE statements inside connection poolers.

A shared plan cache (not only a prepared statement cache) also opens the way to more sophisticated query optimizations. Right now we do not perform some optimizations (like constant expression folding) just because they increase the processing time of normal queries. This is why queries generated by ORMs or wizards, which can contain a lot of dumb stuff, are not well simplified by Postgres. With MS-SQL it is quite frequent that query execution time is much smaller than query optimization time. Having a shared plan cache allows us to spend more time in optimization without the risk of degrading performance.
I think the planner would also benefit from threads. There are many tasks in the planner that are independent and can be scheduled using a dependency graph. They are too small to be parallelized through separate backends but large enough to be performed by threads. Planning queries involving partitioned tables takes a long time (seconds), especially when there are thousands of partitions. That kind of planning would benefit immensely from threading. Of course we could use backends that pull tasks from a queue, but sharing the PlannerInfo and its substructure is easier through the same address space rather than shared memory.

On Sat, Jun 10, 2023 at 5:25 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Wed, Jun 7, 2023 at 06:38:38PM +0530, Ashutosh Bapat wrote:
> > With multiple processes, we can use all the available cores (at least
> > theoretically if all those processes are independent). But is that
> > guaranteed with single process multi-thread model? Google didn't throw
> > any definitive answer to that. Usually it depends upon the OS and
> > architecture.
> >
> > Maybe a good start is to start using threads instead of parallel
> > workers e.g. for parallel vacuum, parallel query and so on while
> > leaving the processes for connections and leaders. That itself might
> > take significant time. Based on that experience move to a completely
> > threaded model. Based on my experience with other similar products, I
> > think we will settle on a multi-process multi-thread model.
>
> I think we have a few known problems that we might be able to solve
> without threads, but that can help us eventually move to threads if we find
> it useful:
>
> 1) Use threads for background workers rather than processes
> 2) Allow sessions to be stopped and started by saving their state
>
> Ideally we would solve the problem of making shared structures
> resizable, but I am not sure how that can be easily done without
> threads.
> --
> Bruce Momjian <bruce@momjian.us> https://momjian.us
> EDB https://enterprisedb.com
>
> Only you can decide what is important to you.

--
Best Wishes,
Ashutosh Bapat
Hi,

On 6/7/23 23:37, Andres Freund wrote:
> I think we're starting to hit quite a few limits related to the process model,
> particularly on bigger machines. The overhead of cross-process context
> switches is inherently higher than switching between threads in the same
> process - and my suspicion is that that overhead will continue to
> increase. Once you have a significant number of connections we end up spending
> a *lot* of time in TLB misses, and that's inherent to the process model,
> because you can't share the TLB across processes.

Another problem I haven't seen mentioned yet is the excessive kernel memory usage because every process has its own set of page table entries (PTEs). Without huge pages, the amount of wasted memory can be huge if shared buffers are big.

For example, with 256 GiB of used shared buffers a single process needs about 256 MiB for the PTEs (for simplicity I ignored the tree structure of the page tables and just took the number of 4k pages times 4 bytes per PTE). With 512 connections, which is not uncommon for machines with many cores, a total of 128 GiB of memory is just spent on page tables.

We used non-transparent huge pages to work around this limitation, but they come with plenty of provisioning challenges, especially in cloud infrastructures where different services run next to each other on the same server. Transparent huge pages have unpredictable performance disadvantages. Also, if some backends only use shared buffers sparsely, memory is wasted for the remaining, unused range inside the huge page.

--
David Geier
(ServiceNow)
On Thu, 15 Jun 2023 at 11:07, Hannu Krosing <hannuk@google.com> wrote:
>
> One more unexpected benefit of having shared caches would be easing
> access to other databases.
>
> If the system caches are there for all databases anyway, then it
> becomes much easier to make queries using objects from multiple
> databases.

We have several optimizations in our visibility code that allow us to remove dead tuples from this database when another database still has a connection with an old snapshot in which the deleting transaction of this database has not yet committed. This is allowed because we can say with confidence that other databases' connections will never be able to see this database's tables. If we were to allow cross-database data access, that would require cross-database snapshot visibility checks, and that would severely hinder these optimizations.

As an example, it would increase the work we need to do for snapshots: for the snapshot data of tables that aren't shared catalogs, we only need to consider our own database's backends for visibility. With cross-database visibility, we would need to consider all active backends for all snapshots, and this can be significantly more work.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)
[sar -r / sar -u output: %memused climbs to ~98% until 15:00, when the server restarts ("<-- reboot!") and memory usage drops to ~35%, then climbs steadily back to ~57% by 16:10; CPU stays roughly 87-91% idle throughout.]
On 10/06/2023 21:01, Hannu Krosing wrote:
> On Mon, Jun 5, 2023 at 4:52 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> * The backend code would be more complex.
> -- this is still the case
I don't quite buy that. A multi-threaded model isn't inherently more
complex than a multi-process model. Just different. Sure, the transition
period will be more complex, when we need to support both models. But in
the long run, if we can remove the multi-process mode, we can make a lot
of things *simpler*.
> -- even more worrisome is that all extensions also need to be rewritten
"rewritten" is an exaggeration. Yes, extensions will need to adapt, similar
to the core code. But I hope it will be pretty mechanical work, marking
global variables as thread-local and such. Many extensions will work
with little to no changes.
> -- and many incompatibilities will be silent and take potentially years to find
IMO this is the most scary part of all this. I'm optimistic that we can
have enough compiler support and tooling to catch most issues. But we
don't know for sure at this point.
> * Terminating backend processes allows the OS to cleanly and quickly
> free all resources, protecting against memory and file descriptor
> leaks and making backend shutdown cheaper and faster
> -- still true
Yep. I'm not too worried about PostgreSQL code, our memory contexts and
resource owners are very good at stopping leaks. But 3rd party libraries
could pose hard problems. IIRC we still have a leak with the LLVM JIT
code, for example. We should fix that anyway, of course, but the
multi-process model is more forgiving with leaks like that.
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,

On 8/11/23 14:05, Merlin Moncure wrote:
> On Thu, Jul 27, 2023 at 8:28 AM David Geier <geidav.pg@gmail.com> wrote:
> > Hi,
> >
> > On 6/7/23 23:37, Andres Freund wrote:
> > > I think we're starting to hit quite a few limits related to the process model,
> > > particularly on bigger machines. The overhead of cross-process context
> > > switches is inherently higher than switching between threads in the same
> > > process - and my suspicion is that that overhead will continue to
> > > increase. Once you have a significant number of connections we end up spending
> > > a *lot* of time in TLB misses, and that's inherent to the process model,
> > > because you can't share the TLB across processes.
> >
> > Another problem I haven't seen mentioned yet is the excessive kernel
> > memory usage because every process has its own set of page table entries
> > (PTEs). Without huge pages the amount of wasted memory can be huge if
> > shared buffers are big.
>
> Hm, noted this upthread, but asking again: does this
> help/benefit interactions with the operating system, making OOM-kill
> situations less likely? These things are the bane of my existence,
> and I'm having a hard time finding a solution that prevents them other
> than running pgbouncer and lowering max_connections, which adds
> complexity. I suspect I'm not the only one dealing with this.
> What's really scary about these situations is they come without
> warning. Here's a pretty typical example per sar -r.
>
> The conjecture here is that lots of idle connections make the server
> appear to have less memory available than it looks, and sudden
> transient demands can cause it to destabilize.

It does, in the sense that your server will have more memory available in case you have many long-living connections around. Every connection has less kernel memory overhead, if you will. Of course, even then a runaway query will be able to invoke the OOM killer.
The unfortunate thing with the OOM killer is that, in my experience, it often kills the checkpointer. That's because the checkpointer will touch all of shared buffers over time, which makes it likely to get selected by the OOM killer. Have you tried disabling memory overcommit?

--
David Geier
(ServiceNow)
Greetings,

* David Geier (geidav.pg@gmail.com) wrote:
> On 8/11/23 14:05, Merlin Moncure wrote:
> > Hm, noted this upthread, but asking again: does this
> > help/benefit interactions with the operating system, making OOM-kill
> > situations less likely? These things are the bane of my existence, and
> > I'm having a hard time finding a solution that prevents them other than
> > running pgbouncer and lowering max_connections, which adds complexity.
> > I suspect I'm not the only one dealing with this. What's really scary
> > about these situations is they come without warning. Here's a pretty
> > typical example per sar -r.
> >
> > The conjecture here is that lots of idle connections make the server
> > appear to have less memory available than it looks, and sudden transient
> > demands can cause it to destabilize.
>
> It does, in the sense that your server will have more memory available in
> case you have many long-living connections around. Every connection has less
> kernel memory overhead, if you will. Of course, even then a runaway query will
> be able to invoke the OOM killer. The unfortunate thing with the OOM killer
> is that, in my experience, it often kills the checkpointer. That's because
> the checkpointer will touch all of shared buffers over time, which makes it
> likely to get selected by the OOM killer. Have you tried disabling memory
> overcommit?

This is getting a bit far afield in terms of this specific thread, but there's an ongoing effort to give PG administrators knobs to be able to control how much actual memory is used, rather than depending on the kernel to tell us when we're "out" of memory. There'll be new patches for the September commitfest posted soon. If you're interested in this issue, it'd be great to get more folks involved in review and testing.

Thanks!

Stephen