Re: Anyone working on better transaction locking? - Mailing list pgsql-hackers

From Kevin Brown
Subject Re: Anyone working on better transaction locking?
Msg-id 20030413041710.GW1833@filer
In response to Re: Anyone working on better transaction locking?  (Shridhar Daithankar <shridhar_daithankar@persistent.co.in>)
Responses Re: Anyone working on better transaction locking?
List pgsql-hackers

Shridhar Daithankar wrote:
> > There are situations in which a database would have to handle a lot of
> > concurrent requests.  Handling ATM transactions over a large area is
> > one such situation.  A database with current weather information might
> > be another, if it is actively queried by clients all over the country.
> > Acting as a mail store for a large organization is another.  And, of
> > course, acting as a filesystem is definitely another.  :-)
> 
> Well, there is another aspect one should consider. Tuning a database
> engine for a specific workload is a hell of a job, and shifting it
> to the other end of the paradigm altogether must be justified.

Certainly, but that justification comes from the problem being
solved.  If the nature of the problem demands tons of short
transactions (and as I said, a number of problems have such a
requirement), then tuning the database so that it can deal with it is
a requirement if that database is to be used at all.

Now, keep in mind that "tuning the database" here covers a *lot* of
ground and a lot of solutions, including connection-pooling
middleware.

> OK. PostgreSQL is not optimised to handle lots of concurrent
> connections, at least not well enough to give every Apache request
> handler its own connection. Then middleware connection pooling, like
> that done in PHP, might be a simpler solution than redoing the
> PostgreSQL internals. Because it works.

I completely agree.  In fact, I see little reason to change PG's
method of connection handling because I see little reason that a
general-purpose connection pooling frontend can't be developed.
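
To make the pooling idea concrete, here is a minimal sketch of what such a
frontend does at its core: open a fixed number of connections once and lend
them out to short-lived requests.  The names `ConnectionPool` and
`fake_connect` are hypothetical, and the "connection" is a stand-in object
rather than a real PostgreSQL connection; a real pooler would also have to
deal with authentication, transaction state, and broken connections.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Lend out a fixed set of already-open connections to many callers."""

    def __init__(self, connect, size):
        # `connect` is any zero-argument callable that opens one connection.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Wait for an idle connection instead of opening a new one.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Hand the connection back so the next request can reuse it.
            self._idle.put(conn)


opened = []


def fake_connect():
    # Hypothetical stand-in for an expensive database connect call.
    conn = object()
    opened.append(conn)
    return conn


pool = ConnectionPool(fake_connect, size=2)
for _ in range(10):
    with pool.connection() as conn:
        pass  # a short transaction would run on `conn` here
```

Ten requests are served here while `fake_connect` is only ever called twice,
which is the whole point: the expensive backend startup is paid once per
pooled connection, not once per request.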

Another method that could help is to prefork the postmaster.

> > This is true, but whether you choose to limit the use of threads to a
> > few specific situations or use them throughout the database, the
> > dangers and difficulties faced by the developers when using threads
> > will be the same.
> 
> I do not agree. Let's say I put threading functions in postgresql
> that do not touch the shared memory interface at all. They would be a
> hell of a lot simpler to code and maintain than converting postgresql
> to a one-thread-per-connection model.

I think you misunderstand what I'm saying.

There are two approaches we've been talking about thus far:

1.  One thread per connection.  In this instance, every thread shares
    exactly the same memory space.

2.  One process per connection, with each process able to create
    additional worker threads to handle things like concurrent
    sorts.  In this instance, threads that belong to the same process
    all share the same memory space (including the SysV shared memory
    pool that the processes use to communicate with each other), but
    the only memory that *all* the threads will have in common is the
    SysV shared memory pool.

Now, the *scope* of the problems introduced by using threading is
different between the two approaches, but the *nature* of the problems
is the same: for any given process, the introduction of threads will
significantly complicate the debugging of memory corruption issues.
This problem will be there no matter which approach you use; the only
difference will be the scale.

And that's why you're probably better off with the third approach:

3.  One process per connection, with each process able to create
    additional worker subprocesses to handle things like concurrent
    sorts.  IPC between the subprocesses can be handled using a number
    of different mechanisms, perhaps including the already-used SysV
    shared memory pool.

The reason you're probably better off with this third approach is that
by the time you need the concurrency for sorting, etc., the amount of
time you'll spend on the actual process of sorting, etc. will be so
much larger than the amount of time it takes to create, manage, and
destroy the concurrent processes (even on systems that have extremely
heavyweight processes, like Solaris and Windows) that there will be no
discernible difference between using threads and using processes.  It
may take a few milliseconds to create, manage, and destroy the
subprocesses, but the amount of work to be done is likely to represent
at least a couple of *hundred* milliseconds for a concurrent approach
to be worth it at all.  And if that's the case, you may as well save
yourself the problems associated with using threads.
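
The arithmetic behind that claim is simple enough to write down.  The
figures below are assumptions chosen only to match the rough estimates
above ("a few" milliseconds of process handling, "a couple of hundred"
milliseconds of work); they are not measurements, and `overhead_fraction`
is a hypothetical helper.

```python
def overhead_fraction(spawn_ms, work_ms):
    """Share of total elapsed time spent creating and reaping subprocesses."""
    return spawn_ms / (spawn_ms + work_ms)


# Assumed: 3 ms to create, manage, and destroy the subprocesses,
# against 200 ms of actual sorting work.
process_share = overhead_fraction(spawn_ms=3.0, work_ms=200.0)

# Assumed: threads are free to create, which is generous to threads.
thread_share = overhead_fraction(spawn_ms=0.0, work_ms=200.0)

print(f"process overhead: {process_share:.2%}")
print(f"thread overhead:  {thread_share:.2%}")
```

Under those assumptions the process handling accounts for roughly 1.5% of
the elapsed time, even when thread creation is treated as costing nothing
at all.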

Even if you'd gain as much as a 10% speed improvement by using threads
to handle concurrent sorts and such instead of processes (an
improvement that is likely to be very difficult to achieve), I think
you're still going to be better off using processes.  To justify the
dangers of using threads, you'd need to see something like a factor of
two or more gain in overall performance, and I don't see how that's
going to be possible even on systems with very heavyweight processes.


I might add that the case where you're likely to gain significant
benefits from using either threads or subprocesses to handle
concurrent sorts is one in which you probably *won't* get many
concurrent connections...because if you're dealing with a lot of
concurrent connections (no matter how long-lived they may be), you're
probably *already* using all of the CPUs on the machine anyway.  The
situation where doing the concurrent subprocesses or subthreads will
help you is one where the connections in question are relatively
long-lived and are performing big, complex queries -- exactly the
situation in which threads won't help you at all relative to
subprocesses, because the amount of work to do on behalf of the
connection will dwarf (that is, be many orders of magnitude greater
than) the amount of time it takes to create, manage, and tear down a
process.
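
For illustration, here is a rough sketch of the subprocess approach applied
to a sort: fork one child per slice of the input, have each child send its
sorted run back over a pipe, and merge the runs in the parent.  This is not
how PostgreSQL sorts; the function names are hypothetical, pipes are just
one of the possible IPC mechanisms, and `os.fork` limits the sketch to
Unix-like systems.

```python
import heapq
import os
import pickle
import time


def sort_in_subprocesses(data, workers=4):
    """Sort `data` by forking one child per slice and merging the results."""
    step = max(1, -(-len(data) // workers))  # ceiling division
    children = []
    for i in range(0, len(data), step):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: sort one slice, send it back over the pipe, then exit.
            try:
                os.close(read_fd)
                with os.fdopen(write_fd, "wb") as out:
                    pickle.dump(sorted(data[i:i + step]), out)
            finally:
                os._exit(0)
        # Parent: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    runs = []
    for pid, read_fd in children:
        # Read each sorted run in full before reaping the child.
        with os.fdopen(read_fd, "rb") as src:
            runs.append(pickle.load(src))
        os.waitpid(pid, 0)
    return list(heapq.merge(*runs))


def time_empty_fork():
    """Seconds taken to create and reap a child that does no work."""
    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    os.waitpid(pid, 0)
    return time.perf_counter() - start


if __name__ == "__main__":
    import random

    values = [random.random() for _ in range(1_000_000)]
    start = time.perf_counter()
    result = sort_in_subprocesses(values)
    print(f"sort in subprocesses: {time.perf_counter() - start:.3f} s")
    print(f"one empty fork+wait:  {time_empty_fork():.6f} s")
```

Comparing the two printed timings on the target platform is the kind of
quantification I'm talking about: it shows how much of the total is process
handling and how much is the sort itself.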

> > Of course, back here in the real world they *do* have to worry about
> > this stuff, and that's why it's important to quantify the problem.
> > It's not sufficient to say that "processes are slow and threads are
> > fast".  Processes on the target platform may well be slow relative to
> > other systems (and relative to threads).  But the question is: for the
> > problem being solved, how much overhead does process handling
> > represent relative to the total amount of overhead the solution itself
> > incurs?
> 
> That is correct. However, it would be a fair assumption on the part of
> postgresql developers that a process, once set up, does not have much
> processing overhead involved as such, given the state of modern
> server-class OSes and hardware. So postgresql, as it is, fits in that
> model. I mean it is fine that postgresql has heavy
> connections. The simpler solution is to pool them.

I'm in complete agreement here, and it's why I have very little faith
that a threaded approach to any of the concurrency problems will yield
enough benefits to justify the very significant drawbacks that a
threaded approach brings to the table.


-- 
Kevin Brown                          kevin@sysexperts.com


