Thread: One process per session lack of sharing
Hi Is there any plan to implement "session per thread" or "shared sessions between thread"? We have analyzed the ability to contribute pgSql to jvm bytecode compiler but with current thread model this idea is far from optimal.(Vm can be different of course. But currently we use oracle and jvm is important for us) We have faced with some lack of sharing resources. So in our test memory usage per session: Oracle: about 5M MSSqlServer: about 4M postgreSql: about 160М It's discussed on pgsql-general@postgresql.org: http://www.mail-archive.com/pgsql-general@postgresql.org/msg206452.html >I think the "problem" that he is having is fixable only by changing how >PostgreSQL itself works. His problem is a PL/pgSQL function which is 11K >lines in length. When invoked, this function is "compiled" into a large >tokenized parse tree. This parse tree is only usable in the session which >invoked the the function. Apparently this parse tree takes a lot of memory. >And "n" concurrent users of this, highly used, function will therefore >require "n" times as much memory because the parse tree is _not_ >shareable. This is explained in: >https://www.postgresql.org/docs/9.5/static/plpgsql-implementation.html#PLPGSQL-PLAN-CACHING Next interesting answer(from Karl Czajkowski <karlcz@isi.edu> in private): > But, I search the > archives of the mailing list, and when others have previously > suggested such caching or reuse, it was immediately shot down by core > developers.
AMatveev@bitec.ru writes:
> Is there any plan to implement "session per thread" or "shared
> sessions between threads"?

No, not really.  The amount of overhead that would add --- eg, the need
for locking on what used to be single-use caches --- makes the benefit
highly questionable.  Also, most people who need this find that sticking
a connection pooler in front of the database solves their problem, so
there's not that much motivation to do a ton of work inside the database
to solve it there.

			regards, tom lane
Hi,

> AMatveev@bitec.ru writes:
>> Is there any plan to implement "session per thread" or "shared
>> sessions between threads"?
> No, not really.  The amount of overhead that would add --- eg, the need
> for locking on what used to be single-use caches --- makes the benefit
> highly questionable.

A two-layer cache would be the best answer.

> Also, most people who need this find that sticking
> a connection pooler in front of the database solves their problem

It has some disadvantages. Temporary tables, for example: their practical use behind a connection pooler is highly questionable. And so on.

> so
> there's not that much motivation to do a ton of work inside the database
> to solve it there.

Understood. Thank you.

--
On Tue, Jul 12, 2016 at 9:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> AMatveev@bitec.ru writes:
>> Is there any plan to implement "session per thread" or "shared
>> sessions between threads"?
>
> No, not really.  The amount of overhead that would add --- eg, the need
> for locking on what used to be single-use caches --- makes the benefit
> highly questionable.  Also, most people who need this find that sticking
> a connection pooler in front of the database solves their problem, so
> there's not that much motivation to do a ton of work inside the database
> to solve it there.

I agree that there's not really a plan to implement this, but I don't
agree that connection pooling solves the whole problem.  Most people
can't get by with statement pooling, so in practice you are looking at
transaction pooling or session pooling.  And that means that you can't
really keep the pool size as small as you'd like because backends can
be idle in transaction for long enough to force the pool size to be
pretty large.  Also, pooling causes the same backends to get reused
for different sessions which touch different relations and different
functions so that, for example, the relcache and the PL/pgsql function
caches grow until every one of those sessions has everything cached
that any client needs.  That can cause big problems.

So, I actually think it would be a good idea to think about this.
The problem, of course, is that as long as we allow arbitrary parts of
the code - including extension code - to declare global variables and
store arbitrary stuff in them without any coordination, it's
impossible to imagine hibernating and resuming a session without a
risk of things going severely awry.  This was a major issue for
parallel query, but we've solved it, mostly, by designating the things
that rely on global variables as parallel-restricted, and there
actually aren't a ton of those.
So I think it's imaginable that we can get to a point where we can, at
least in some circumstances, let a backend exit and reconstitute its
state at a later time.  It's not an easy project, but I think it is one
we will eventually need to do.  Insisting that the current model is
working is just sticking our head in the sand.  It's mostly working,
but there are workloads where it fails badly - and competing database
products survive a number of scenarios where we just fall on our face.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 14 July 2016 at 03:59, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that there's not really a plan to implement this, but I don't
> agree that connection pooling solves the whole problem.  Most people
> can't get by with statement pooling, so in practice you are looking at
> transaction pooling or session pooling.  And that means that you can't
> really keep the pool size as small as you'd like because backends can
> be idle in transaction for long enough to force the pool size to be
> pretty large.  Also, pooling causes the same backends to get reused
> for different sessions which touch different relations and different
> functions so that, for example, the relcache and the PL/pgsql function
> caches grow until every one of those sessions has everything cached
> that any client needs.  That can cause big problems.
>
> So, I actually think it would be a good idea to think about this.
I agree. It's been on my mind for a while, but I've been assuming it's likely to involve such architectural upheaval as to be impractical.
Right now PostgreSQL conflates "user session state" and "execution engine" into one process. This means we need an external connection pooler if we want more user connections than we can efficiently handle in terms of number of executors. Current poolers don't do much to keep track of user state; they just arbitrate access to executors and expect applications either to re-establish any needed state (SET vars, LISTEN, etc.) or not to use features that require persistence across the current pooling level.
This leaves users in the hard position of either using very high, inefficient max_connections values to keep track of application<->DB state, or jumping through awkward hoops to use transaction pooling, either at the application level (Java appserver pools, etc.) or through a proxy.
If using the high max_connections approach, the user must also ensure, using some kind of external coordination, that they don't have all those max_connections actually doing work at the same time. Otherwise they'll thrash the server and face out-of-memory issues (especially with our rather simplistic work_mem management, etc.) and poor performance.
The solution, to me, is to separate "user state" and "executor". Sounds nice, but we use global variables _everywhere_ and it's assumed throughout the code that we have one user session for the life of a backend, though with some exceptions for SET SESSION AUTHORIZATION. It's not likely to be fun.
> The problem, of course, is that as long as we allow arbitrary parts of the
> code - including extension code - to declare global variables and
> store arbitrary stuff in them without any coordination, it's
> impossible to imagine hibernating and resuming a session without a
> risk of things going severely awry.
Yeah. We'd definitely need a session state management mechanism with save and restore functionality.
There's also stuff like:
* LISTEN
* advisory locking at the session level
* WITH HOLD cursors
that isn't simple to just save and restore. Those are some of the same things that are painful with transaction pooling right now.
> This was a major issue for
> parallel query, but we've solved it, mostly, by designating the things
> that rely on global variables as parallel-restricted, and there
> actually aren't a ton of those.  So I think it's imaginable that we
> can get to a point where we can, at least in some circumstances, let a
> backend exit and reconstitute its state at a later time.  It's not an
> easy project, but I think it is one we will eventually need to do.
I agree on both points, but I think "not easy" is rather an understatement.
Starting with a narrow scope would help. Save/restore GUCs and the other easy stuff, and disallow sessions that are actively LISTENing, hold advisory locks, have open cursors, etc from being saved and restored.
BTW, I think this would also give us a useful path toward allowing connection poolers to change the active user and re-authenticate on an existing backend. Right now you have to use SET ROLE or SET SESSION AUTHORIZATION (ugh) and can't stop the client you hand the connection to from just RESETing back to the pooler's user and doing whatever it wants.
> Insisting that the current model is working is just sticking our head
> in the sand.  It's mostly working, but there are workloads where it
> fails badly - and competing database products survive a number of
> scenarios where we just fall on our face.
Yep, and like parallel query it's a long path, but it's one we've got to face sooner or later.
On 12 July 2016 at 21:57, <AMatveev@bitec.ru> wrote:
> Hi
> Is there any plan to implement "session per thread" or "shared
> sessions between threads"?
As has been noted by others, there isn't any such plan right now.
PostgreSQL isn't threaded. It uses a multi-processing, shared-nothing-by-default memory model with explicit shared memory, plus copy-on-write forking from the postmaster to set up initial backend state (except on Windows, where we emulate that). It relies on processes being cheap and light-weight.
This is a very poor fit with Java's thread-based shared-by-default model with expensive heavyweight process startup and cheap threads. Process forking doesn't clone all threads and the JVM has lots of worker threads, so we can't start the JVM once in the postmaster then clone it with each forked postgres backend. Plus the JVM just isn't designed to cope with that and would surely get thoroughly confused when its file handles are cloned, its process ID changes, etc. This is one of the reasons PL/Java has never really taken off. We can mitigate the JVM startup costs a bit by preloading the JVM libraries into the postmaster and using the JVM's base class library preloading, but unless you're running trivial Java code you still do a lot of work at each JVM start after the postgres backend forks.
> We have analyzed the ability to contribute a pgSql to jvm bytecode compiler, but with
> the current thread model this idea is far from optimal. (Vm can be different of course.
> But currently we use oracle and jvm is important for us)
Yep, that's a real sticking point for a number of people.
The usual solution at this point is to move most of the work into an application-server mid-layer. That moves work further away from the DB, which has its own costs, and isn't something you're likely to be happy with if you're looking at things like optimising PL/PgSQL with a bytecode compiler. But it's the best we have right now.
> We have faced with some lack of sharing resources.
> So in our test memory usage per session:
> Oracle: about 5M
> MSSqlServer: about 4M
> postgreSql: about 160M
Yep, that sounds about right. Unfortunately.
You may be able to greatly reduce that cost if you can store your cached compiled data in a shared memory segment created by your extension. This will get a bit easier with the new dynamic shared memory infrastructure, but it's going to be no fun at all to make that play with the JVM. You'll probably need a lot of JNI.
Craig>That moves work further away from the DB, which has its own costs, and isn't something you're likely to be happy with if you're looking at things like optimising PL/PgSQL with a bytecode compiler. But it's the best we have right now.
What if JVM was started within a background worker?
Then JVM can spawn several threads that serve PL requests on a "thread per backend" basis.
Craig>You may be able to greatly reduce that cost if you can store your cached compiled data in a shared memory segment created by your extension.
Craig>This will get a bit easier with the new dynamic shared memory infrastructure, but it's going to be no fun at all to make that play with the JVM. You'll probably need a lot of JNI.
There's https://github.com/jnr/jnr-ffi, which lets you call C functions without resorting to writing JNI wrappers.
Vladimir
On 14 July 2016 at 14:28, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> Craig> That moves work further away from the DB, which has its own costs, and isn't something you're likely to be happy with if you're looking at things like optimising PL/PgSQL with a bytecode compiler. But it's the best we have right now.
>
> What if the JVM was started within a background worker?
> Then the JVM can spawn several threads that serve PL requests on a
> "thread per backend" basis.
You can't really execute the plpgsql-compiled-to-bytecode outside the user session, so you need a JVM in it anyway.
You probably could have a bgworker, or a pool of bgworkers, doing your plpgsql compilation and caching. But because your plpgsql might reference uncommitted catalog entries in the local backend, you'd need the bgworker to join an exported snapshot from the backend you're compiling the plpgsql for. If it doesn't let you avoid having a JVM in each backend, it's not likely to be too useful.
> Craig> You may be able to greatly reduce that cost if you can store your cached compiled data in a shared memory segment created by your extension.
> Craig> This will get a bit easier with the new dynamic shared memory infrastructure, but it's going to be no fun at all to make that play with the JVM. You'll probably need a lot of JNI.
>
> There's https://github.com/jnr/jnr-ffi that enables to call C functions without resorting to writing JNI wrappers.
Yes, and JNA as well.
Hi,

> On Tue, Jul 12, 2016 at 9:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> AMatveev@bitec.ru writes:
>>> Is there any plan to implement "session per thread" or "shared
>>> sessions between threads"?
>> ...
>> so
>> there's not that much motivation to do a ton of work inside the database
>> to solve it there.
> I agree that there's not really a plan to implement this, but I don't
> ...
> So, I actually think it would be a good idea to think about this.

I just want to note that converting global variables to thread-specific variables is the biggest part of the refactoring needed for "session per thread". It's a large job, of course, but it does not seem to be a ton of work. Of course that's not all, but it seems to be the main reason given for not doing that work.
Hi,

> It's mostly working, but there are workloads where it
> fails badly - and competing database products survive a number of
> scenarios where we just fall on our face.
> So, I actually think it would be a good idea to think about this.

Just to think: http://www.tiobe.com/tiobe_index
PL/SQL is in 18th position. Where is PL/pgSQL?

I've looked for IDEs for PL/pgSQL. Compared with Oracle, I can say there are no tools that can compete with "PL/SQL Developer", for example (functionality / price):
https://www.allroundautomations.com/plsqldev.html

Why? Maybe because choosing another database is more profitable?

I can't speak for others, but for us: of course we can implement some of our tasks in PostgreSQL, but when I think about a full migration, it's just not realistic. We can contribute something, but we can't work against the PostgreSQL architecture. Our calculations show that it would be cheaper to implement "session per thread" ourselves, for example, but cheaper still to buy Oracle (even if we were writing from scratch). And there are simply no customers who want to pay for that.

Note, we don't have enough skill with PostgreSQL, and what the PostgreSQL core team might do in a month we would do for years. So at our layer we just can't attract resources for that task. On the other side, there are people who have the infrastructure, skills, and experience, but they feel comfortable with things as they are.

> there's not that much motivation to do a ton of work inside the database
> to solve it there.

That's clear; they work on their tasks, and we all work on ours. But it's just a wall, and it's sad. There is a proverb in Russia: "the shine and poverty of open source". Maybe this is that case :)
On 14 July 2016 at 16:41, <AMatveev@bitec.ru> wrote:
Hi
> On Tue, Jul 12, 2016 at 9:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> AMatveev@bitec.ru writes:
>>> Is there any plan to implement "session per thread" or "shared
>>> sessions between thread"?
>>...
>> so
>> there's not that much motivation to do a ton of work inside the database
>> to solve it there.
> I agree that there's not really a plan to implement this, but I don't
> ...
> So, I actually think it would be a good idea to think about this.
> I just want to note that converting global variables to thread-specific variables.
I don't think anyone's considering moving from multi-processing to multi-threading in PostgreSQL. I really, really like the protection that the shared-nothing-by-default process model gives us, among other things.
I'm personally not absolutely opposed to threading, but you'll find it hard to convince anyone it's worth the huge work required to ensure that everything in PostgreSQL is done thread-safely, adapt all our logic to handle thread IDs where we use process IDs, etc. It'd be a massive amount of work for no practical gain for most users, and a huge reliability loss in the short to medium term as we ironed out all the bugs.
Where I agreed with you, and where I think Robert sounded like he was agreeing, was that our current design, where we have one executor per user session and can't suspend/resume sessions, is problematic.
> It's large work offcourse.
> But it's not seemed to be a ton of work.
Er.... yeah, it really is. It's not just the mechanical changes. It's verifying that everything's correct on all the supported platforms. Ensuring that all the C library stuff we do is thread-safe, all the SSL stuff, etc. Getting rid of all the function-static variable use. Lots more.
Hi,

>>>> Is there any plan to implement "session per thread" or "shared
>>>> sessions between threads"?
> I'm personally not absolutely opposed to threading, but you'll find
> it hard to convince anyone it's worth the huge work required to
> ensure that everything in PostgreSQL is done thread-safely

That's clear to me; I understand that organizing that work is really very hard. It's work toward a new segment of the market, with a long-term perspective. For most open source projects this is very difficult, and in some cases it may not be possible at all. But in most cases the proverb applies: "we make the road by walking on it". It's very important just to start.

And maybe the right start is to fix the FAQ:
https://wiki.postgresql.org/wiki/FAQ#Why_does_PostgreSQL_use_so_much_memory.3F

>Why does PostgreSQL use so much memory?
>Despite appearances, this is absolutely normal

It's not normal. It's "as is": you should use pgBouncer. See "Re: [HACKERS] One process per session lack of sharing". And that is why

>there are workloads where it
>fails badly - and competing database products survive a number of
>scenarios where we just fall on our face

> Er.... yeah, it really is.  It's not just the mechanical changes.
> It's verifying that everything's correct on all the supported
> platforms.  Ensuring that all the C library stuff we do is
> thread-safe, all the SSL stuff, etc.  Getting rid of all the
> function-static variable use.  Lots more.

In most cases the work can be done part by part. Maybe there are such parts here; it's not necessary to do everything at once.
2016-07-15 11:29 GMT+02:00 <AMatveev@bitec.ru>:
> Hi
>>>>> Is there any plan to implement "session per thread" or "shared
>>>>> sessions between thread"?
>> I'm personally not absolutely opposed to threading, but you'll find
>> it hard to convince anyone it's worth the huge work required to
>> ensure that everything in PostgreSQL is done thread-safely
> It's clear for me, I understand that organizing that work is really very
> hard. It's work for new segment of market in long perspective.
> For most open source project this is very difficult. In some case
> it may be not possible at all.
> But in the most cases there is proverb: "We make the road by walking on it"
> It's very important just to start.
I disagree - there are lots of possible targets with much higher benefits: column storage, more effective execution (compiled execution), implementation of temporal databases, better support for dynamic structures, better support for XML and JSON, integration of connection pooling, ...
There are only a few use cases - mostly related to Oracle emulation - where multithreading is necessary, and a few of those can be solved better with PL/pgSQL-to-C compilation and similar techniques.
Organizing the work is hard, but doing the work is far harder - especially doing it without impact on the current code base and current users. MySQL is a thread-based database - is it better than Postgres, or have more users migrated to it from Oracle? No.
Regards
Pavel
> And may be the right start is to fix the Faq
> https://wiki.postgresql.org/wiki/FAQ#Why_does_PostgreSQL_use_so_much_memory.3F
>> Why does PostgreSQL use so much memory?
>> Despite appearances, this is absolutely normal
> It's not normal. It's "as is". You should use pgBouncer. See "Re: [HACKERS] One process per session lack of sharing"
> And it is why
>> there are workloads where it
>> fails badly - and competing database products survive a number of
>> scenarios where we just fall on our face
>> Er.... yeah, it really is.  It's not just the mechanical changes.
>> It's verifying that everything's correct on all the supported
>> platforms.  Ensuring that all the C library stuff we do is
>> thread-safe, all the SSL stuff, etc.  Getting rid of all the
>> function-static variable use.  Lots more.
> In the most cases the work can be done part by part.
> May be there is such parts. It's not necessary to do everything at once.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,

> I disagree - there is lot of possible targets with much higher
> benefits - columns storage, effective execution - compiled
> execution, implementation of temporal databases, better support for
> dynamic structures, better support for XML, JSON, integration of connection pooling, ...

Of course - the tasks are different, so the optimal configuration is different too, and the best balance between processes and threads can change. But right now it sits at one extreme.

> There is only few use cases - mostly related to Oracle emulation

What are few cases for some are most cases for others.

> when multi threading is necessary - and few can be solved better -
> PLpgSQL to C compilation and similar techniques.

Again, few cases for some are most cases for others. In our case we would just buy Oracle; it would be cheaper. Of course, if our customers for some reason agreed to pay for that technique, we'd have nothing against it.

> The organization of work is hard, but pretty harder is doing this
> work - and doing it without impact on current code base, current
> users. MySQL is thread based database - is better than Postgres, or
> there is more users migrated from Orace? Not.

We want to solve our tasks with PostgreSQL as easily as with Oracle. So you can say "you should buy Oracle", and you will be right. I'm just interested in whether this is the position of the majority.
2016-07-15 12:20 GMT+02:00 <AMatveev@bitec.ru>:
> Hi
>> I disagree - there is lot of possible targets with much higher
>> benefits - columns storage, effective execution - compiled
>> execution, implementation of temporal databases, better support for
>> dynamic structures, better support for XML, JSON, integration of connection pooling, ...
> Off course the task is different so optimal configuration is different too.
> So the best balance between process per thread can change.
> But now he is in one extreme point.
>> There is only few use cases - mostly related to Oracle emulation
> It's few cases for one and it's most cases for others.
>> when multi threading is necessary - and few can be solved better -
>> PLpgSQL to C compilation and similar techniques.
> It's few cases for one and it's most cases for others.
> In our cases we just buy oracle and it's would be cheeper.
> Off course if our customers for some reason would agree to pay for that
> technique. We have nothing against.
>> The organization of work is hard, but pretty harder is doing this
>> work - and doing it without impact on current code base, current
>> users. MySQL is thread based database - is better than Postgres, or
>> there is more users migrated from Orace? Not.
> We want to decide our task by PostgreSql as easy as by Oracle.
> So you can say You should buy oracle and You will be right.
It would be nice if we could help all Oracle users - but that is not possible in this world :( - there are lots of barriers. Threading is only one; a second is the different design of PL/SQL; next come libraries, Java integration, and lots of others. I believe lots of users can be migrated simply - NTT has statistics that 60% are migrated just by using Orafce. But there will still be 10% where migration is not possible without significant refactoring. I don't believe it is cheaper to modify Postgres to support threads than to modify some Oracle applications.
Threading for Postgres is not a small project - it could require hundreds of man-days.
> I'm just interested if this is the position of the majority.
sure - it is my personal opinion.
Regards
Pavel
On 15 July 2016 at 18:05, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> There is only few use cases - mostly related to Oracle emulation when multi threading is necessary - and few can be solved better - PLpgSQL to C compilation and similar techniques.
Right.
If amatveev (username, unsure of full name) wants to improve PL/PgSQL performance and the ability of a JVM to share resources between backends, then it would be more productive to focus on that than on threading.
As for "fixing" the FAQ... for the great majority of people the FAQ entry on memory use is accurate. Sure, if you load a JVM into each backend and load a bunch of cached data in it, you'll get bad memory use. So don't do that. You're not measuring PostgreSQL, you're measuring PostgreSQL-plus-my-JVM-extension. Why does it use so much memory? 'cos it loads a whole bunch of stuff into each backend.
Now, there are other cases where individual PostgreSQL backends use lots of memory. But that FAQ entry refers to the common misconception that each PostgreSQL process's reported memory use is the actual system memory it uses. That isn't the case because most systems account badly for shared memory, and it confuses a lot of people. The FAQ entry doesn't need fixing.
Maybe the FAQ entry needs rewording to qualify it so it says that "in most cases" it's just shared memory mis-accounting. But that's about it.
Hi,

> Can be nice, if we can help to all Oracle users - but it is not
> possible in this world :( - there is lot of barriers - threading is
> only one, second should be different design of PL/SQL - it is based
> on out processed, next can be libraries, JAVA integration, and lot
> of others. I believe so lot of users can be simple migrated, NTT has
> statistics - 60% is migrated just with using Orafce. But still there
> will be 10% where migration is not possible without significant
> refactoring.

Most of our customers currently use Oracle Enterprise Edition; you know better than I how important that is. But I agree with you that in other cases we can use PostgreSQL. We can use PostgreSQL, with some of pgBouncer's disadvantages, anywhere scalability is not the main risk. (Such customers usually don't buy Enterprise.)

> I don't believe so is cheaper to modify Postgres to
> support threads than modify some Oracle applications.

The key is scaling. Some parallel processing just cannot be divorced from the data without reducing performance, and it is a very difficult question whether it would be possible at all to get comparable performance in an application server for such cases. If we "inject" an application server into PostgreSQL for that scalability and functionality, we need multithreading.

If the customization for a project is not big, it may be tuned, but beyond some point tuning is not profitable. (The database works 24x7, and we need the ability to fix bugs on the fly.)

So if for some reason we started to use PostgreSQL, there would always be the question of choosing functionality or scalability, and usually our customers need both.

> I don't believe so is cheaper

For us it may not be just "not cheaper": it's impossible.
2016-07-15 13:25 GMT+02:00 <AMatveev@bitec.ru>:
> Hi
>> Can be nice, if we can help to all Oracle users - but it is not
>> possible in this world :( - there is lot of barriers - threading is
>> only one, second should be different design of PL/SQL - it is based
>> on out processed, next can be libraries, JAVA integration, and lot
>> of others. I believe so lot of users can be simple migrated, NTT has
>> statistics - 60% is migrated just with using Orafce. But still there
>> will be 10% where migration is not possible without significant
>> refactoring.
> The most of our customers now use oracle enterprise edition.
> You can know better how important this is.
> But I agree with you that in other cases we can use PostgreSql.
> We can use postgreSql with some disadvantages of pgBouncer anywhare
> where the scalability is not main risk.(Such customers usually don't
> buy Enterprise)
>> I don't believe so is cheaper to modify Postgres to
>> support threads than modify some Oracle applications.
> The key is Scaling.
> Some parallels processing just can not be divorced from data without reducing performance.
> It very difficult question would be it possible at all to get
> comparable performance at application server for such cases.
> If we "inject" applications server to postgreSql for that scalability and functionality we need multithreading.
But parallel processing doesn't require threading support - see the PostgreSQL 9.6 features.
I am not sure, but I think Oracle's PL/SQL is based on processes and not on threads too. So maybe this discussion is a little off, because we are using different terms.
Regards
Pavel
> If customization for every project is not big.
> It's may be tuned. But from some point the tuning is not profitable.
> (The database works in 24x7 and we need the ability to fix bugs on the fly)
> So If for some reason we would start to use postgresql.
> There is always a question what to choose funcionality or scalability.
> And usually our customers need both.
>> I don't believe so is cheaper
> For us it's may be not cheaper. It's just imposible.
Hi,

> If amatveev (username, unsure of full name) wants to improve
> PL/PgSQL performance and the ability of a JVM to share resources
> between backends, then it would be more productive to focus on that than on threading.

Note, I started this post with
https://www.postgresql.org/message-id/flat/409604420.20160711111532%40bitec.ru#409604420.20160711111532@bitec.ru

Oracle: about 5MB
MSSqlServer: about 4MB
PostgreSQL: about 160MB

That was for 11K lines of PL/pgSQL, and our code base is more than 4000K lines of code (for PL/pgSQL).
Hi,

> but parallel processing doesn't requires threading support - see PostgreSQL 9.6 features.

Sharing dynamically generated execution code between threads is much easier (if sharing that code between processes is even possible). There are many other interaction techniques available between threads that are absent between processes.
2016-07-15 14:54 GMT+02:00 <AMatveev@bitec.ru>:
> Hi
> > but parallel processing doesn't require threading support - see PostgreSQL 9.6 features.
> Sharing dynamically generated execution code between threads is much easier (if sharing such code between processes is possible at all).
> There are many other interaction techniques between threads that are
> simply absent between processes.
This is true only when the data are immutable and in memory. Elsewhere it is a false idea.
Regards
Pavel
Hi
> This is true, only when data are immutable and in memory. Elsewhere it is a false idea.
Consider the case where the server runs 24x7 and you need the ability to fix bugs in (or update) any application code on the fly. For us that is the usual situation.
On 15/07/2016 09:28, Craig Ringer wrote:
> I don't think anyone's considering moving from multi-processing to
> multi-threading in PostgreSQL. I really, really like the protection
> that the shared-nothing-by-default process model gives us, among other
> things.
As I understand it, the main issue is that it is hard to integrate extensions that use heavyweight runtimes and are focussed on isolation within a virtual machine. Its not just
Perhaps it would be possible for the postmaster (or a delegate process) to host such a runtime, and find a way for a user process that wants to use such a runtime to communicate with it, whether by copying function parameters over RPC or by sharing some of its address space explicitly to the runtime to operate on directly.
Such a host delegate process could be explicitly built with multithread support and not 'infect' the rest of the code with its requirements.
Using granular RPC is nice for isolation but I am concerned that the latencies might be high.
2016-07-15 18:43 GMT+02:00 james <james@mansionfamily.plus.com>:
> On 15/07/2016 09:28, Craig Ringer wrote:
>> I don't think anyone's considering moving from multi-processing to multi-threading in PostgreSQL. I really, really like the protection that the shared-nothing-by-default process model gives us, among other things.
> As I understand it, the main issue is that it is hard to integrate extensions that use heavyweight runtimes and are focussed on isolation within a virtual machine. Its not just
> Perhaps it would be possible for the postmaster (or a delegate process) to host such a runtime, and find a way for a user process that wants to use such a runtime to communicate with it, whether by copying function parameters over RPC or by sharing some of its address space explicitly to the runtime to operate on directly.
> Such a host delegate process could be explicitly built with multithread support and not 'infect' the rest of the code with its requirements.
> Using granular RPC is nice for isolation but I am concerned that the latencies might be high.
From what I know about Oracle, PL/SQL, and Java - it is all executed as out-of-process calls. I am sure PL doesn't share a process with the SQL engine there.
Regards
Pavel
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2016-07-15 19:57 GMT+02:00 Матвеев Алексей <amatveev@global-system.ru>:
> Hi
> > What I know about Oracle, PL/SQL, Java - all is executed as outprocess calls. I am sure, so PL doesn't share process with SQL engine there
> You are quite incorrect.
> Sorry, I will answer in more detail when I get to work.
Regards
Pavel
On 16 July 2016 at 00:43, james <james@mansionfamily.plus.com> wrote:
> On 15/07/2016 09:28, Craig Ringer wrote:
>> I don't think anyone's considering moving from multi-processing to multi-threading in PostgreSQL. I really, really like the protection that the shared-nothing-by-default process model gives us, among other things.
> As I understand it, the main issue is that it is hard to integrate extensions that use heavyweight runtimes and are focussed on isolation within a virtual machine. Its not just
> Perhaps it would be possible for the postmaster (or a delegate process) to host such a runtime, and find a way for a user process that wants to use such a runtime to communicate with it, whether by copying function parameters over RPC or by sharing some of its address space explicitly to the runtime to operate on directly.
It is, and the JVM supports that, but it's so costly that it would eliminate most of the benefits this user is seeking from the kind of sharing they want. It also needs at least a minimal JVM running in the target backends.
The issue here is an architectural mismatch between PostgreSQL and the JVM, made worse by the user's very stored-proc-heavy code. Some other runtime that's designed to co-operate with a multiprocessing environment could well be fine, but the JVM isn't. At least, the Sun/Oracle/OpenJDK JVM isn't.
They could explore doing their bytecode compilation for another runtime that's more friendly toward multiprocessing (maybe Mono? Haven't tried) and/or look at using PostgreSQL's shared memory facilities within their target runtime. It's not like this is insoluble without completely rebuilding PostgreSQL.
For them it'd be great if PostgreSQL used multi-threading instead of multi-processing, but it doesn't, and it's not likely to. So they've got to find other solutions to their difficulties within PostgreSQL or use another product.
> Such a host delegate process could be explicitly built with multithread support and not 'infect' the rest of the code with its requirements.
Not if it lives in the postmaster. It'd have to be forked and communicated with at one remove.
> Using granular RPC is nice for isolation but I am concerned that the latencies might be high.
Yep. When your goal is performance and you're trying to move stuff closer to the DB and out of an appserver, it's likely counterproductive.
On 15 July 2016 at 20:54, <AMatveev@bitec.ru> wrote:
> Hi
> > but parallel processing doesn't require threading support - see PostgreSQL 9.6 features.
> Sharing dynamically generated execution code between threads is much easier (if sharing such code between processes is possible at all).
> There are many other interaction techniques between threads that are
> simply absent between processes.
How do you think the buffer cache works?
Lots more could be shared, too. Cached plans, for example.
It's possible. You just have to change how you think about it, and you might not be able to do it within the framework of the JVM. So it's quite possibly not worth it for you. Nonetheless, don't assume it can't be done just because you can't do it the way you're used to thinking of doing it.
--
On Sun, Jul 17, 2016 at 3:23 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
Lots more could be shared, too. Cached plans, for example.
But the fact that PostgreSQL has transactional DDL complicates things like
a shared plan cache and shared PL/pgSQL execution trees. Those things are
much easier in a trivial database implementation, where an ALTER TABLE is
just trampling over a running transaction.
Regards, Jan
Jan Wieck
Senior Postgres Architect
On 12 July 2016 at 09:57, <AMatveev@bitec.ru> wrote:
> We have faced with some lack of sharing resources.
> So in our test memory usage per session:
> Oracle: about 5M
> MSSqlServer: about 4M
> postgreSql: about 160M
Using shared resources also has significant problems, so care must be taken.
I think it's clear that threading is out, but it is far from being the only way to reduce the memory overhead of sharing.
Analysing the overhead and suggesting practical ways forward may help.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jul 15, 2016 at 4:28 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> I don't think anyone's considering moving from multi-processing to
> multi-threading in PostgreSQL. I really, really like the protection that the
> shared-nothing-by-default process model gives us, among other things.

We get some very important protection by having the postmaster in a separate address space from the user processes, but separating the other backends from each other has no value. If one of the backends dies, we take provisions to make sure they all die, which is little or no different from what would happen if we had the postmaster as one process and all of the other backends as threads within a second process. As far as I can see, running each and every backend in a separate process has downsides but no upsides. It slows down the system and makes it difficult to share data between processes without much in the way of benefits.

> I'm personally not absolutely opposed to threading, but you'll find it hard
> to convince anyone it's worth the huge work required to ensure that
> everything in PostgreSQL is done thread-safely, adapt all our logic to
> handle thread IDs where we use process IDs, etc. It'd be a massive amount of
> work for no practical gain for most users, and a huge reliability loss in
> the short to medium term as we ironed out all the bugs.

It would actually be pretty simple to allow PostgreSQL to be compiled to use either processes or threads, provided that you don't mind using something like GCC's __thread keyword. When compiling with threads, slap __thread on every global variable we have (using some kind of macro trick, no doubt), spawn threads instead of processes wherever you like, and I think you're more or less done. There could be some problems with third-party libraries we use, but I bet there probably wouldn't be all that many problems.
Of course, there's not necessarily a whole lot of benefit to such a minimal transformation, but you could certainly do useful things on top of it. For example, the parallel query code could arrange to pass pointers to existing data structures in some cases instead of copying those data structures as we do currently. Spinning up a new thread and giving it pointers to some of the old thread's data structures is probably a lot faster than spinning up a new process and serializing and deserializing those data structures, so you wouldn't necessarily have to do all that much work before the "thread model" compile started to have noticeable advantages over the "process model" compile. You could pass tuples around directly rather than by copying them, too. A lot of things that we might want to do in this area would expose us to the risk of server-lifespan memory leaks, and we'd need to spend time and energy figuring out how to minimize those risks, but enough other people have written complex, long-running multithreaded programs that I think it is probably possible to do so without unduly compromising reliability. > Where I agreed with you, and where I think Robert sounded like he was > agreeing, was that our current design where we have one executor per user > sessions and can't suspend/resume sessions is problematic. The problems are very closely related. The problem with suspending and resuming sessions is that you need to keep all of the session's global variable contents (except for any caches that are safe to rebuild) until the session is resumed; and we have no way of discovering all of the global variables a process is using and no general mechanism that can be used to serialize and deserialize them. The problem with using threads is that code which uses global variables will not be thread-safe unless all of those variables are thread-local. 
Getting our hands around the uncontrolled use of global variables - doubtless at the risk of breaking third-party code - seems crucial.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/14/16 12:34 AM, Craig Ringer wrote:
> Starting with a narrow scope would help. Save/restore GUCs and the other
> easy stuff, and disallow sessions that are actively LISTENing, hold
> advisory locks, have open cursors, etc from being saved and restored.

Along the lines of narrow scope... I wonder about allowing functions to execute in a separate process that communicates back to the main backend. That would allow unsafe languages to operate under a different OS user that was tightly restricted (ie: nobody/nogroup), but it could also allow for a pool of "function executors". Depending on how it was structured, it might also insulate the database from having to panic if a function crashed its process.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jul 15, 2016 at 4:28 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> I don't think anyone's considering moving from multi-processing to
>> multi-threading in PostgreSQL. I really, really like the protection that the
>> shared-nothing-by-default process model gives us, among other things.

> We get some very important protection by having the postmaster in a
> separate address space from the user processes, but separating the
> other backends from each other has no value.

I do not accept that proposition in the least. For one thing, debugging becomes an order of magnitude harder when you've got multiple threads in the same address space: you have essentially zero guarantees about what one thread might have done to the supposedly-private state of another one.

> ... enough other people have
> written complex, long-running multithreaded programs that I think it
> is probably possible to do so without unduly compromising reliability.

I would bet that every single successful project of that sort has been written with threading in mind from the get-go. Trying to retro-fit threading onto thirty years' worth of single-threaded coding is a recipe for breaking your project; even if you had control of all the code running in the address space, which we assuredly do not.

regards, tom lane
On Sun, Jul 17, 2016 at 3:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jul 15, 2016 at 4:28 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> I don't think anyone's considering moving from multi-processing to
>>> multi-threading in PostgreSQL. I really, really like the protection that the
>>> shared-nothing-by-default process model gives us, among other things.
>
>> We get some very important protection by having the postmaster in a
>> separate address space from the user processes, but separating the
>> other backends from each other has no value.
>
> I do not accept that proposition in the least. For one thing, debugging
> becomes an order of magnitude harder when you've got multiple threads
> in the same address space: you have essentially zero guarantees about
> what one thread might have done to the supposedly-private state of
> another one.

Well, that's true, in theory. In practice, random memory clobbers are a pretty rare type of bug. The chances that thread A crashed because thread B overwrote its supposedly-private state are just not very high. Also, such bugs are extremely hard to troubleshoot even when there is only one process and one thread involved, so it's not like things are a rose garden today. I don't buy the argument that it's worth giving up arbitrary amounts of performance and functionality for this.

>> ... enough other people have
>> written complex, long-running multithreaded programs that I think it
>> is probably possible to do so without unduly compromising reliability.
>
> I would bet that every single successful project of that sort has been
> written with threading in mind from the get-go. Trying to retro-fit
> threading onto thirty years' worth of single-threaded coding is a recipe
> for breaking your project; even if you had control of all the code running
> in the address space, which we assuredly do not.

I admit that it is risky, but I think there are things that could be done to limit the risk. I don't believe we can indefinitely continue to ignore the potential performance benefits of making a switch like this. Breaking a thirty-year old code base irretrievably would be sad, but letting it fade into irrelevance because we're not willing to make the architecture changes that are needed to remain relevant would be sad, too.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jul 17, 2016 at 3:04 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 7/14/16 12:34 AM, Craig Ringer wrote:
>> Starting with a narrow scope would help. Save/restore GUCs and the other
>> easy stuff, and disallow sessions that are actively LISTENing, hold
>> advisory locks, have open cursors, etc from being saved and restored.
>
> Along the lines of narrow scope... I wonder about allowing functions to
> execute in a separate process that communicates back to the main backend.
> That would allow unsafe languages to operate under a different OS user that
> was tightly restricted (ie: nobody/nogroup), but it could also allow for a
> pool of "function executors". Depending on how it was structured, it might
> also insulate the database from having to panic if a function crashed its
> process.

You can do this sort of thing with background workers today. Just create a parallel context and set the entrypoint to code that will execute the guts of the function. Retrieve the results using a shm_mq. With somewhat more work, you could have persistent background workers that get reused for one function call after another instead of being continually spun up and torn down. However, I suspect the IPC costs would make this rather slow.

Thomas Munro mentioned to me an IPC facility called "doors" a while back which, if I understood him correctly, is supposed to let you do a remote procedure call which also transfers the current process's time slice to the process on the other end of the door. Maybe that would be faster, and fast enough; or maybe not.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jul 17, 2016 at 9:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Jul 17, 2016 at 3:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jul 15, 2016 at 4:28 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> I don't think anyone's considering moving from multi-processing to
>>> multi-threading in PostgreSQL. I really, really like the protection that the
>>> shared-nothing-by-default process model gives us, among other things.
>
>> We get some very important protection by having the postmaster in a
>> separate address space from the user processes, but separating the
>> other backends from each other has no value.
>
> I do not accept that proposition in the least. For one thing, debugging
> becomes an order of magnitude harder when you've got multiple threads
> in the same address space: you have essentially zero guarantees about
> what one thread might have done to the supposedly-private state of
> another one.
Well, that's true, in theory. In practice, random memory clobbers are
a pretty rare type of bug. The chances that thread A crashed because
thread B overwrote its supposedly-private state are just not very
high. Also, such bugs are extremely hard to troubleshoot even when
there is only one process and one thread involved, so it's not like things
are a rose garden today. I don't buy the argument that it's worth
giving up arbitrary amounts of performance and functionality for this.
The random memory clobbers are partially rare because they often aren't
triggered. Many of them are of the dangling pointer type and in a single
threaded process, there is less of a chance to allocate and overwrite the
free'd and then used memory.
But you are right, all of them are tricky to hunt and I remember using
hardware watch points and what not in the past. They make your day
though when you finally find them. So you will be sorry when the last
one is hunted down.
>> ... enough other people have
>> written complex, long-running multithreaded programs that I think it
>> is probably possible to do so without unduly compromising reliability.
>
> I would bet that every single successful project of that sort has been
> written with threading in mind from the get-go. Trying to retro-fit
> threading onto thirty years' worth of single-threaded coding is a recipe
> for breaking your project; even if you had control of all the code running
> in the address space, which we assuredly do not.
I admit that it is risky, but I think there are things that could be
done to limit the risk. I don't believe we can indefinitely continue
to ignore the potential performance benefits of making a switch like
this. Breaking a thirty-year old code base irretrievably would be
sad, but letting it fade into irrelevance because we're not willing to
make the architecture changes that are needed to remain relevant would
be sad, too.
I have to agree with Robert on that one. We have been "thinking" about
multi-threading some 16 years ago already. We were aware of the dangers
and yet at least considered doing it some day for things like a parallel
executor. And that would probably be our best bang for the buck still.
The risks of jamming all sessions into a single, multi-threaded process are
huge. Think of snapshot visibility together with catalog cache invalidations.
I'd say no to that one as a first step.
But multi-threading the executor or even certain utility commands at first
should not be rejected purely on the notion that "we don't have multithreading
today."
Regards, Jan
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Jan Wieck
Senior Postgres Architect
Hi
> Such a host delegate process could be explicitly built with
> multithread support and not 'infect' the rest of the code with its requirements.
>
> Using granular RPC is nice for isolation but I am concerned that the latencies might be high.
I agree with you. Moreover, I think some designs make no sense under this "thread model" in any way. For example, embedded Oracle XML DB applications:
http://docs.oracle.com/cd/B28359_01/appdev.111/b28369/xdb23jv1.htm#i1043708
"""
You can run a Java servlet. Servlets work better as the top-level entry point into Oracle Database, and require using HTTP(S) as the protocol to access Oracle Database.
"""
> What I know about Oracle, PL/SQL, Java - all is executed as
> outprocess calls. I am sure, so PL doesn't share process with SQL engine there
It's better to say that Java is executed like out-of-process calls. That is done for wide library support. But it is not a separate process.
"""
The JDBC server-side internal driver, the Oracle JVM, the database, and the SQL engine all run within the same address space, and therefore, the issue of network round-trips is irrelevant.
"""
http://docs.oracle.com/cd/E11882_01/java.112/e16548/overvw.htm#JJDBC28026
Hi
> The issue here is an architectural mismatch between PostgreSQL and
> the JVM, made worse by the user's very stored-proc-heavy code. Some
> other runtime that's designed to co-operate with a multiprocessing
> environment could well be fine, but the JVM isn't. At least, the Sun/Oracle/OpenJDK JVM isn't.
Actually, the lack of threads makes any VM quite limited in some aspects of scalability. The desire to use the JVM is simply the result of having no desire to reinvent the wheel.
Hi
> I admit that it is risky, but I think there are things that could be
> done to limit the risk. I don't believe we can indefinitely continue
> to ignore the potential performance benefits of making a switch like
> this. Breaking a thirty-year old code base irretrievably would be
> sad, but letting it fade into irrelevance because we're not willing to
> make the architecture changes that are needed to remain relevant would
> be sad, too.
I can add that nowadays parallel processing seems to be the only way to scale. We cannot wait for CPU clock speeds to increase in the coming years. I understand that the use of threads has some difficulties. What I cannot understand is how the use of threads can be a disadvantage. Actually, I think parallelizing with threads is much easier than parallelizing with processes.
Hi
> There's https://github.com/jnr/jnr-ffi that enables to call C
> functions without resorting to writing JNI wrappers.
I have not said that you are wrong.
It's the dark side of "like a separate process":
they can cheaply call SQL from the JVM,
but they can't cheaply call the JVM from SQL.
The JVM appeared in Oracle a long time ago, maybe when the Java thread model had many faults (before JVM 1.6, for example). Nowadays it seems to make sense only for compatibility.
On 18 July 2016 at 06:04, <AMatveev@bitec.ru> wrote:
This https://github.com/davecramer/plj-new is a very old project that did work at one time; it attempted to make RPC calls to the JVM to address exactly this problem.
> Hi
> > There's https://github.com/jnr/jnr-ffi that enables to call C
> > functions without resorting to writing JNI wrappers.
> I have not said that you are wrong.
> It's the dark side of "like a separate process":
> they can cheaply call SQL from the JVM,
> but they can't cheaply call the JVM from SQL.
However "cheaply" calling jvm from sql or vice-versa is not really possible.
I do like the idea of the background worker and shared memory though.
Hi
> This https://github.com/davecramer/plj-new is a very old project
> that did work at one time which attempted to do RPC calls to the jvm to address exactly this problem.
> However "cheaply" calling jvm from sql or vice-versa is not really possible.
> I do like the idea of the background worker and shared memory though.
These are not opposing concepts. It's like a two-level cache: some things are best with shared memory, while "sharing an upper layer" is best with a shared process. And there are things that should not be shared at all. Any deviation is always overhead.
But to be honest, I really do not like "sharing". It is against human nature. And I will be really happy when there are processors with infinite performance and memory of infinite size.
:)))
On Mon, Jul 18, 2016 at 2:41 PM, <AMatveev@bitec.ru> wrote:
> And I will be really happy when there are processors with infinite
> performance and memory with infinite size.
> :)))

Well, for what it's worth, there's no theoretical difference between multi-process and multi-threaded. They're just two different APIs to create shared memory and other kernel data structures, and they both allow all the sharing you want. In the API Postgres uses, everything starts out non-shared and we explicitly set up the parts we want shared. In the other, nearly everything starts shared, though it's possible to unshare parts. Once they're set up, the CPU and MMU don't really care what kernel API was used to set them up.

In other words, there's no theoretical reason you couldn't adapt a JVM to create a large shared memory segment using mmap or SysV shared memory and live entirely within that, including the Java stacks and object allocations and so on. The net result of what's running on the CPU can actually end up being exactly equivalent (though I suspect a natural implementation would end up being slightly different) and there are pros and cons to each API.
--
greg
Hi
> In other words, there's no theoretical reason you couldn't have adapt
> a JVM to create a large shared memory segment using mmap or SysV
I think that even if I were a leader in OS development, I could not answer this correctly, so let's just discuss. OK, I agree that there is no "theoretical reason". But in practice, I think the main point is that OS (and VM) developers implement these things differently.
> there's no theoretical reason you couldn't
Then why do OS developers implement threads at all? If there is no reason, is the use of threads just waste? Why does almost every common web server or load balancer use threads? Are they all theoretically wrong? And so on.
But to be more constructive: I just don't know how to do between processes the things that we can easily do between threads on most OSes. I don't know how to share mutable executable code, so I just can't imagine how to implement
http://docs.oracle.com/javase/7/docs/technotes/guides/vm/multiple-language-support.html#invokedynamic
in an optimal way. I don't understand why a mutex has overhead compared to a critical section on Windows:
http://stackoverflow.com/questions/800383/what-is-the-difference-between-mutex-and-critical-section
And so on. So I think that as long as processes and threads serve different functions in the OS, using a process like a thread will have overhead in practice.
Hi
> So I think as long as process and thread have different function in OS,
> use process like thread will have overhead in practice.
But there are other negative things. I think parallel-oriented libraries usually do not work across processes. So the JVM integration problem is not an exception; it is the rule.
On Mon, Jul 18, 2016 at 10:03 AM, Greg Stark <stark@mit.edu> wrote:
> On Mon, Jul 18, 2016 at 2:41 PM, <AMatveev@bitec.ru> wrote:
>> And I will be really happy when there are processors with infinite
>> performance and memory with infinite size.
>> :)))
>
> Well for what it's worth there's no theoretical difference between
> multi-process and multi-threaded. They're just two different APIs to
> create shared memory and other kernel data structures and they both
> allow all the sharing you want. In the API Postgres uses everything
> starts out non-shared and we explicitly set up the parts we want
> shared. In the other nearly everything starts shared though it's
> possible to unshare parts. Once they're set up the CPU and MMU don't
> really care what kernel API was used to set them up.

That's totally true, but allocating additional shared memory after system startup is a pain. It's true that we now have DSM, but the fact that DSM segments can be mapped at different addresses in different processes is a nuisance. We will still get a lot of benefit out of having DSM, but it's not right to imply that threading wouldn't make things easier.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jul 17, 2016 at 10:00 PM, Jan Wieck <jan@wi3ck.info> wrote: >> I admit that it is risky, but I think there are things that could be >> done to limit the risk. I don't believe we can indefinitely continue >> to ignore the potential performance benefits of making a switch like >> this. Breaking a thirty-year old code base irretrievably would be >> sad, but letting it fade into irrelevance because we're not willing to >> make the architecture changes that are needed to remain relevant would >> be sad, too > > I have to agree with Robert on that one. We have been "thinking" about > multi-threading some 16 years ago already. We were aware of the dangers > and yet at least considered doing it some day for things like a parallel > executor. And that would probably be our best bang for the buck still. > > The risks of jamming all sessions into a single, multi-threaded process are > huge. Think of snapshot visibility together with catalog cache > invalidations. > I'd say no to that one as a first step. > > But multi-threading the executor or even certain utility commands at first > should not be rejected purely on the notion that "we don't have > multithreading > today." I think the risk profile is exactly the opposite of what you are suggesting here. If we provide an option to compile the server with all global variables converted to thread-local variables, there's really not a whole lot that can break, AFAICS. We'll technically be multi-threaded but the code need not know or care about the other threads; only in the event of a memory clobber can they affect each other. On the other hand, if we start trying to create multiple threads per processes just for certain purposes - be it the executor or certain utility commands - any C code that is reachable within the secondary threads needs to have specific provisions for thread-safety. That would create all kinds of problems, no doubt. 
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-07-18 15:47:58 -0400, Robert Haas wrote:
> I think the risk profile is exactly the opposite of what you are
> suggesting here. If we provide an option to compile the server with
> all global variables converted to thread-local variables, there's
> really not a whole lot that can break, AFAICS.

Using TLS will slow down things noticeably though. So if we were to go
there, we'd have to make up for some constant slowdown.

> We'll technically be multi-threaded but the code need not know or care
> about the other threads; only in the event of a memory clobber can
> they affect each other.

But that'll make it pretty hard to take advantage of multi-threading to
a meaningful degree. Except for being able to create shared memory after
the fact - quite useful! - there'd not be much point.

- Andres
On 19 July 2016 at 03:53, Andres Freund <andres@anarazel.de> wrote:
> On 2016-07-18 15:47:58 -0400, Robert Haas wrote:
>> I think the risk profile is exactly the opposite of what you are
>> suggesting here. If we provide an option to compile the server with
>> all global variables converted to thread-local variables, there's
>> really not a whole lot that can break, AFAICS.
>
> Using TLS will slow down things noticeably though. So if we were to go
> there, we'd have to make up for some constant slowdown.
Does TLS really have more of a performance impact than process context switches?
Genuine question, I'm clueless in the area.
>> We'll technically be multi-threaded but the code need not know or care
>> about the other threads; only in the event of a memory clobber can
>> they affect each other.
>
> But that'll make it pretty hard to take advantage of multi-threading to
> a meaningful degree. Except for being able to create shared memory after
> the fact - quite useful! - there'd not be much point.
Yeah. TLS is too simplistic. To do any useful parallel work without serialization/deserialization, some threads need to be in groups with other threads, sharing some data structures but having private copies of others.
That said, TLS can be used as a reasonable default state, then individual data structures extracted and shared more formally when needed. So it's a sane starting point.
On 2016-07-19 08:33:20 +0800, Craig Ringer wrote:
> On 19 July 2016 at 03:53, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-07-18 15:47:58 -0400, Robert Haas wrote:
>>> I think the risk profile is exactly the opposite of what you are
>>> suggesting here. If we provide an option to compile the server with
>>> all global variables converted to thread-local variables, there's
>>> really not a whole lot that can break, AFAICS.
>>
>> Using TLS will slow down things noticeably though. So if we were to go
>> there, we'd have to make up for some constant slowdown.
>
> Does TLS really have more of a performance impact than process context
> switches?

Those aren't really alternatives, unless I'm missing something. You
have context switches between threads as well. They're a bit cheaper
(due to fewer TLB invalidations), but generally not a *whole* lot.

What TLS requires is basically for every thread-local variable to go
through one (IIRC sometimes two) additional layer of indirection. For
workloads which are bottlenecked on single-core performance (i.e. most
of pg, regardless of parallel query), that can be painful.

Andres
On 18 July 2016 at 02:27, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jul 15, 2016 at 4:28 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> I don't think anyone's considering moving from multi-processing to
>> multi-threading in PostgreSQL. I really, really like the protection that
>> the shared-nothing-by-default process model gives us, among other things.
>
> We get some very important protection by having the postmaster in a
> separate address space from the user processes, but separating the
> other backends from each other has no value. If one of the backends
> dies, we take provisions to make sure they all die, which is little or
> no different from what would happen if we had the postmaster as one
> process and all of the other backends as threads within a second
> process. As far as I can see, running each and every backend in a
> separate process has downsides but no upsides. It slows down the
> system and makes it difficult to share data between processes without
> much in the way of benefits.
That's a good point, the random memory overwrites that Tom mentioned aside.
I think that memory leaks we currently ignore as insignificant one-time losses that'll get cleaned up on backend exit will become relevant, so some cleanup work would be needed there, but it's not technically difficult, and valgrind is a wonderful thing.
One minor thing to be aware of is that cPython has a horrible threading design with a single massive lock for the whole interpreter. PL/Python will perform appallingly in a threaded context. It looks like it has more recently gained support for separate interpreters (each with their own GIL) within a single process though, if I'm reading the docs correctly:
so maybe it'd just require some plpython tweaks to switch to the right interpreter for the current backend.
I don't think plpython issues are a huge cause for hand-wringing anyway, really. TBH, if it really is practical to move Pg to a threaded model and not as hard as I thought, I do see the advantages. Mainly because I'd _love_ an efficient embedded Java or C# runtime; it's downright embarrassing to tell people they should write procs in Perl or Python (or TCL!) if they can't do it in plpgsql.
If we can stand the macro code pollution, it'd be interesting to do a minimal conversion as an experiment and let some buildfarm members start digesting it once it runs. Find issues slowly over time, make it an experimental build option. We could do things like transparently use threadids whereever we currently expose a PID in the UI, change the bgworker backend to spawn threads rather than procs, etc. (One nice consequence would be the possibility of getting rid of most of EXEC_BACKEND since the postmaster launching the backend proc would be a one-time thing, once it was stable enough to make the thread model the only option on Windows).
It'd be very helpful to find a nice portable library that abstracts platform threading specifics and has a less horrid API than pthreads, rather than having to DIY. (See e.g.: https://software.intel.com/en-us/blogs/2006/10/19/why-windows-threads-are-better-than-posix-threads ). Or use C++11 <thread> :p [dives for fireproof suit]
>> Where I agreed with you, and where I think Robert sounded like he was
>> agreeing, was that our current design where we have one executor per
>> user session and can't suspend/resume sessions is problematic.
>
> The problems are very closely related. The problem with suspending
> and resuming sessions is that you need to keep all of the session's
> global variable contents (except for any caches that are safe to
> rebuild) until the session is resumed; and we have no way of
> discovering all of the global variables a process is using and no
> general mechanism that can be used to serialize and deserialize them.
Right. In our per-process model we'd have to provide a subsystem that serializes/deserializes session state and has callbacks for extensions to register their own save/restore callbacks. We'd probably want it even if the improbable happened and Pg moved to threads - which until this thread I would've considered about as likely as pigs flying or the US getting universal health care - since we'd still want to be able to essentially page out idle sessions, though then it'd be more of a nice-to-have than a necessity.
Individual Pg subsystems would probably register callbacks with the save/restore subsystem, rather than trying to have the save/restore subsystem have the knowledge to reach into all the other subsystems. Since we'd block save/restore when there's an active query of course, it might not actually be that bad. Especially if we started with save/restore only on idle state, not idle-in-transaction, i.e. start with transaction pooling.
Since I got started with Pg, I've taken it as given that PostgreSQL Will Never Use Threads, Don't Even Talk About It. As taboo as query hints or more so. Is this actually a serious option?
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi

> Using TLS will slow down things noticeably though. So if we were to go
> there, we'd have to make up for some constant slowdown.

I cannot understand why. I've read
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686749(v=vs.85).aspx
and
http://david-grs.github.io/tls_performance_overhead_cost_linux/
"""
The results are quite straightforward: no overhead at all.
"""

    0x0000000000404f40 <+0>: inc DWORD PTR [rip+0x202382]
vs
    0x0000000000404f50 <+0>: inc DWORD PTR fs:0xfffffffffffffffc

It's clear.
On Mon, Jul 18, 2016 at 8:56 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Since I got started with Pg, I've taken it as given that PostgreSQL Will
> Never Use Threads, Don't Even Talk About It. As taboo as query hints or
> more so. Is this actually a serious option?

I'm sure that depends on who you ask. But for myself, yes, I think
it's a serious option. I think that the sort of minimal conversion
that I discussed above could be done and made stable enough to label
as "we have this experimental option..." by one competent developer in
the course of one release cycle without otherwise unduly disrupting
development.

From that base, we could consider patches to optimize the thread model
case, and maybe after gaining some experience and letting it shake out
for a release or two we'd decide that the thread model is ready to be
officially supported.

I bet there would be a lot of interest in the thread model from the
user and developer communities. It would probably be a significant win
on Windows - where I understand that the ratio of process creation
cost : thread creation cost is much worse than it is on Linux - and it
would probably open up numerous possible optimizations for parallel
query. It would probably also have some downsides and likely some
horrible bugs, but that's why you start it out as an experimental
feature.

In short, I believe the conventional wisdom on this topic is misguided.
Most of the previous discussions of using threading have assumed that
we'd go through all of the backend-private stuff and make it
thread-safe. That's a bad plan, first because it's not very
well-defined, second because it could slow down parts of the system
that rely on the absence of synchronization primitives in certain code
paths, and third because it requires a single mammoth act of
development on a scale that would be extremely hard to make successful.

However, the method that I'm proposing is a completely different kettle
of fish. It is not, as Tom points out, entirely without danger, but it
allows the first patch to be a mostly mechanical transformation and
then allows incremental development on top of that framework. For that
reason, I believe it's at least an order of magnitude less impractical
than the "go through and make everything thread-safe" approach. Unlike
that approach, it also makes it realistically possible to support both
models.

None of that means that it will necessarily work out, but I'm bullish.
Even though parallel query has already had more bug reports than I
would have liked, it's good evidence that you can make architectural
changes that touch almost every part of the system without necessarily
breaking everything. You just have to go slow, be methodical, and
budget time to fix the bugs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-07-19 14:18:22 +0300, AMatveev@bitec.ru wrote:
> Hi
>
>> Using TLS will slow down things noticeably though. So if we were to go
>> there, we'd have to make up for some constant slowdown.
>
> I can not understand why?
>
> I've read
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms686749(v=vs.85).aspx
> and
> http://david-grs.github.io/tls_performance_overhead_cost_linux/
> """
> The results are quite straightforward: no overhead at all.
> """
>
> 0x0000000000404f40 <+0>: inc DWORD PTR [rip+0x202382]
> vs
> 0x0000000000404f50 <+0>: inc DWORD PTR fs:0xfffffffffffffffc

Not really true IIRC. For one, the segment-offset form is encoded more
widely, and for another, it'll generate more uops on many
microarchitectures. Also, we actually *do* qualify for the exception in
the blog you linked above: we have a fair amount of dynamically linked
code.
> On 12 July 2016 at 09:57, <AMatveev@bitec.ru> wrote:
>> We have faced with some lack of sharing resources.
>> So in our test memory usage per session:
>> Oracle: about 5M
>> MSSqlServer: about 4M
>> postgreSql: about 160M
>
> Using shared resources also has significant problems, so care must be
> taken.

To see how PostgreSQL itself accounts for memory allocation, you can use
https://github.com/postgrespro/memstat

This module (9.6+) collects information about MemoryContext allocations.

-- 
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/