Thread: [GENERAL] Is it safe to use SPI in multiple threads?
Hi, buddies!
I'm new to PG and want to implement my domain-specific system on top of it. I
wish to arrange my data as several tables in a database and translate my DSL
into SQL statements for querying. Since one DSL statement may map to several
SQL statements, it's better to push the DSL server as close to the PG server
as possible. I found that PG's background worker meets my needs: I can set up
a background worker bound to the PG server that listens on a port for network
requests.
But I've encountered a problem: the Server Programming Interfaces are not
THREAD SAFE. There are global variables such as SPI_processed and
SPI_tuptable. This limits my DSL server to single-threaded mode, which is
quite inefficient (a short sketch of the problem follows the questions). So
my questions are:
1. Is there a way to use SPI in multi-thread style?
2. Another option is to use libpq, like normal clients do. Is libpq as efficient as SPI?
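For context, the pattern that makes this hard looks roughly like the sketch
below: SPI hands results back through process-wide globals, so two threads
calling into SPI would overwrite each other's state. (A minimal sketch only;
the table name is made up, and SPI must be called from a backend inside a
transaction.)

    /* Minimal sketch of the SPI usage pattern: results come back through
     * process-global variables, so concurrent callers in one backend
     * would clobber each other. Table name is hypothetical. */
    #include "postgres.h"
    #include "executor/spi.h"

    static uint64
    count_rows(void)
    {
        uint64 nrows;

        SPI_connect();
        SPI_execute("SELECT * FROM my_table", true, 0); /* read-only, no limit */

        nrows = SPI_processed;  /* global: row count of the last query */
        /* SPI_tuptable (also a global) now points at the result tuples. */

        SPI_finish();
        return nrows;
    }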
Thanks in advance!
Xiafei Qiu
On 2016-12-09 16:52:05 +0800, Qiu Xiafei wrote:
> 1. Is there a way to use SPI in multi-thread style?

No.

> 2. Another option is to use libpq, like normal clients do. Is libpq as
> efficient as SPI?

No.

- Andres
On Fri, Dec 09, 2016 at 02:37:58PM -0800, Andres Freund wrote:
> On 2016-12-09 16:52:05 +0800, Qiu Xiafei wrote:
> > 1. Is there a way to use SPI in multi-thread style?
>
> No.
>
> > 2. Another option is to use libpq, like normal clients do. Is libpq as
> > efficient as SPI?
>
> No.

To give more details here, Postgres relies heavily on sessions that work in
parallel being handled by separate backend processes, for example for
transaction and snapshot handling.
--
Michael
Thanks for your reply.
Because of the one-backend-per-session concept of PG, I think I should bind
each of my DSL sessions to a single bg worker. That seems to work. But is
there a way to launch a bg worker when a new session starts, just like PG's
per-session backends do? Is it possible to have a bg worker accept incoming
sessions and launch a new bg worker to handle each session as it comes?
On Saturday, December 10, 2016, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Dec 09, 2016 at 02:37:58PM -0800, Andres Freund wrote:
> > On 2016-12-09 16:52:05 +0800, Qiu Xiafei wrote:
> > > 1. Is there a way to use SPI in multi-thread style?
> >
> > No.
> >
> > > 2. Another option is to use libpq, like normal clients do. Is libpq as
> > > efficient as SPI?
> >
> > No.
>
> To give more details here, Postgres relies heavily on sessions that work in
> parallel being handled by separate backend processes, for example for
> transaction and snapshot handling.
> --
> Michael
On Sun, Dec 11, 2016 at 2:39 PM, Qiu Xiafei <qiuxiafei@gmail.com> wrote:
> Because of the one-backend-per-session concept of PG, I think I should bind
> each of my DSL sessions to a single bg worker. That seems to work. But is
> there a way to launch a bg worker when a new session starts, just like PG's
> per-session backends do? Is it possible to have a bg worker accept incoming
> sessions and launch a new bg worker to handle each session as it comes?

There is the concept of dynamic background workers in Postgres. That's what
parallel query uses, for example: at planning time a suitable number of
workers is chosen, and they are spawned dynamically at execution time, up to
the limit defined by max_worker_processes. Have a look at worker_spi in the
code tree, which is a module that demonstrates how to spawn workers
dynamically. This infrastructure would, for example, let anybody re-create
what autovacuum does, with a launcher process and workers created depending
on what needs to be cleaned up per database, as a plugin.
--
Michael
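For illustration, the kind of dynamic registration worker_spi performs looks
roughly like this (a minimal sketch: the library name, entry-point name, and
helper function are hypothetical, and error handling is omitted):

    /* Sketch of launching a dynamic background worker, modeled on
     * worker_spi. The library/function names below are placeholders. */
    #include "postgres.h"
    #include "miscadmin.h"
    #include "postmaster/bgworker.h"

    static bool
    launch_dsl_worker(Datum session_arg)    /* hypothetical helper */
    {
        BackgroundWorker worker;
        BackgroundWorkerHandle *handle;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                           BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
        worker.bgw_restart_time = BGW_NEVER_RESTART;
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "dsl_server");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "dsl_worker_main");
        snprintf(worker.bgw_name, BGW_MAXLEN, "DSL session worker");
        worker.bgw_main_arg = session_arg;
        worker.bgw_notify_pid = MyProcPid;  /* lets us wait for startup */

        /* Returns false once all max_worker_processes slots are in use. */
        return RegisterDynamicBackgroundWorker(&worker, &handle);
    }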
On 2016-12-09 16:52:05 +0800, Qiu Xiafei wrote:
> I'm new to PG and want to implement my domain-specific system on top of it. I
> wish to arrange my data as several tables in a database and translate my DSL
> into SQL statements for querying. Since one DSL statement may map to several
> SQL statements, it's better to push the DSL server as close to the PG server
> as possible. I found that PG's background worker meets my needs: I can set up
> a background worker bound to the PG server that listens on a port for network
> requests.
>
> But I've encountered a problem: the Server Programming Interfaces are not
> THREAD SAFE. There are global variables such as SPI_processed and
> SPI_tuptable. This limits my DSL server to single-threaded mode, which is
> quite inefficient.

I had a similar requirement. I solved it by moving the application logic out
of the stored procedures. All the stored procedure does is make an RPC call
(I use ØMQ for that) to a server process and send the result back to the
client. The server process converts the request into multiple SQL queries
which can be processed in parallel.

The downside is of course that the communication overhead is much higher (a
minimum of 4 network messages per request). That's not a problem in my case,
but your mileage may vary.

The advantages in my opinion are:

* A standalone server process is easier to test and debug than a bunch of
  stored procedures.
* I can easily scale out if necessary: Currently my database and server
  process run on the same machine, but I could distribute them over several
  machines with (almost) no change in logic.

hp

--
   _  | Peter J. Holzer    | A coding theorist is someone who doesn't
|_|_) |                    | think Alice is crazy.
| |   | hjp@hjp.at         |    -- John Gordon
__/   | http://www.hjp.at/ |    http://downlode.org/Etext/alicebob.html
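To give an idea of the shape of this, a minimal sketch of the client side of
such an RPC call using plain libzmq follows (the endpoint and the message
payload are made up for illustration; in my setup the sending side lives
inside a stored procedure):

    /* Minimal sketch of the RPC leg: a REQ socket sends the request to
     * the server process and blocks for the reply. Endpoint and payload
     * are hypothetical. */
    #include <stdio.h>
    #include <string.h>
    #include <zmq.h>

    int main(void)
    {
        void *ctx = zmq_ctx_new();
        void *req = zmq_socket(ctx, ZMQ_REQ);
        char reply[256];
        int  n;

        zmq_connect(req, "tcp://127.0.0.1:5555");  /* hypothetical endpoint */

        /* The stored procedure would send the client's request ... */
        zmq_send(req, "dsl-request", strlen("dsl-request"), 0);

        /* ... and wait for the server process to return the result. */
        n = zmq_recv(req, reply, sizeof(reply) - 1, 0);
        if (n >= 0)
        {
            reply[n] = '\0';
            printf("server replied: %s\n", reply);
        }

        zmq_close(req);
        zmq_ctx_term(ctx);
        return 0;
    }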
On 23/12/2016 13:41, Peter J. Holzer wrote:
> On 2016-12-09 16:52:05 +0800, Qiu Xiafei wrote:
>> [...]
>
> I had a similar requirement. I solved it by moving the application logic out
> of the stored procedures. All the stored procedure does is make an RPC call
> (I use ØMQ for that) to a server process and send the result back to the
> client. The server process converts the request into multiple SQL queries
> which can be processed in parallel.
>
> The downside is of course that the communication overhead is much higher (a
> minimum of 4 network messages per request). That's not a problem in my case,
> but your mileage may vary.
>
> The advantages in my opinion are:
>
> * A standalone server process is easier to test and debug than a bunch of
>   stored procedures.
> * I can easily scale out if necessary: Currently my database and server
>   process run on the same machine, but I could distribute them over several
>   machines with (almost) no change in logic.
>
> hp

Sorry to revive such an old topic. I am facing a similar requirement where I
am running multiple queries concurrently. Like Qiu Xiafei, I am looking at
SPI and dynamic background workers. In particular, I am calling
SPI_execq(...) in each dynamic background worker I spawn. What I am
experiencing is that I am not seeing a speedup, and I am beginning to wonder
if I have done something wrong, if the overheads are too big, or if there are
some limitations I am not aware of.

As I see that none of the comments here make much of a reference to
performance/speedup, would you be so kind as to tell me how satisfied you
were with the performance? Any insights would be greatly appreciated.

Thanks,
Tom
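For reference, the usual shape of running a query from inside a dynamic
worker's entry point, following the worker_spi pattern, is roughly the
sketch below (the entry-point name, database, and query are placeholders,
and error handling is omitted):

    /* Sketch of a worker entry point that runs one query via SPI. Each
     * worker is its own process, so SPI_processed/SPI_tuptable are
     * private to it. Names below are hypothetical. */
    #include "postgres.h"
    #include "access/xact.h"
    #include "executor/spi.h"
    #include "postmaster/bgworker.h"
    #include "utils/snapmgr.h"

    void
    dsl_worker_main(Datum main_arg)
    {
        BackgroundWorkerUnblockSignals();
        /* The third (flags) argument exists from PG 11 on. */
        BackgroundWorkerInitializeConnection("postgres", NULL, 0);

        StartTransactionCommand();
        SPI_connect();
        PushActiveSnapshot(GetTransactionSnapshot());

        SPI_execute("SELECT count(*) FROM my_table", true, 0);
        elog(LOG, "worker saw " UINT64_FORMAT " row(s)", SPI_processed);

        PopActiveSnapshot();
        SPI_finish();
        CommitTransactionCommand();
    }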
On 2020-02-18 19:07:44 +0000, Tom Mercha wrote:
> On 23/12/2016 13:41, Peter J. Holzer wrote:
> > [...]
>
> Sorry to revive such an old topic. I am facing a similar requirement where
> I am running multiple queries concurrently. Like Qiu Xiafei, I am looking
> at SPI and dynamic background workers. In particular, I am calling
> SPI_execq(...) in each dynamic background worker I spawn. What I am
> experiencing is that I am not seeing a speedup, and I am beginning to
> wonder if I have done something wrong, if the overheads are too big, or if
> there are some limitations I am not aware of.
>
> As I see that none of the comments here make much of a reference to
> performance/speedup, would you be so kind as to tell me how satisfied you
> were with the performance? Any insights would be greatly appreciated.

The speedup depends very much on how evenly you can divide the workload
between worker processes.

Let's assume we have a query which takes 10 seconds and 8 worker processes.

If you can't split that query, there is no speedup: one worker will be busy
for 10 seconds and the others will be idle.

If you can split it into 20 queries which run for 0.5 seconds each, we will
be finished in 1.5 seconds (run 8 queries in the first 0.5 seconds, another
8 in the second, and then the last 4).

If you can split the query, but the runtimes of the subqueries are very
different (e.g. they run for 10, 5, 2, 1, 0.5 and 15 times 0.1 seconds),
then the total run time will be close to that of the longest-running
subquery - in this case 10 seconds. All the workers start out busy, but at
some point they run out of work and have to wait for that single slow
subquery.

So the speedup really depends a lot on how smartly you can divide up your
workload. And this is unfortunately something which varies a lot.

Firstly, our data is skewed. We have a lot more data about Austria than
Afghanistan, so simply partitioning by country doesn't give us equal-cost
queries.

Secondly, even with SSDs, access to permanent storage is still a lot slower
than access to RAM, so your access times may change unpredictably depending
on whether the data you access is hot or cold.

Which brings me to another blessing and/or curse of this approach: Having a
central server process outside of PostgreSQL makes it easy to cache
(partial) results. So very often we already have the data in our application
cache and can generate a response in milliseconds, which is good. However,
when we don't have the data and have to ask the database, it is quite likely
that it isn't in the cache of the database or even the OS either, so we have
to hit the disks - which is slow.

So in conclusion: We are seeing a substantial speedup, but it is definitely
not linear in the number of worker processes, and the effect of caching is
much greater than that of parallelization. And we pay for that with a large
variation in query times (between milliseconds and tens of seconds).

hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
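To make the scheduling argument above concrete, here is a tiny standalone
simulation (an illustration only): it greedily assigns each subquery to the
least-loaded of 8 workers and reports the wall-clock makespan for the two
splits from the example.

    /* Toy simulation of dividing subqueries among 8 workers: assign each
     * job to the least-loaded worker and report the resulting wall-clock
     * time (makespan). Numbers are the ones from the example above. */
    #include <stdio.h>

    #define NWORKERS 8

    static double
    makespan(const double *jobs, int njobs)
    {
        double load[NWORKERS] = {0};
        double max = 0;

        for (int i = 0; i < njobs; i++)
        {
            int best = 0;

            for (int w = 1; w < NWORKERS; w++)
                if (load[w] < load[best])
                    best = w;
            load[best] += jobs[i];
        }

        for (int w = 0; w < NWORKERS; w++)
            if (load[w] > max)
                max = load[w];
        return max;
    }

    int main(void)
    {
        /* Even split: 20 subqueries of 0.5 s each.
         * Skewed split: 10, 5, 2, 1, 0.5 s plus fifteen 0.1 s queries. */
        double even[20], skewed[20] = {10, 5, 2, 1, 0.5};

        for (int i = 0; i < 20; i++)
            even[i] = 0.5;
        for (int i = 5; i < 20; i++)
            skewed[i] = 0.1;

        printf("even split:   %.1f s\n", makespan(even, 20));   /* 1.5  */
        printf("skewed split: %.1f s\n", makespan(skewed, 20)); /* 10.0 */
        return 0;
    }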