Thread: Parallel Execution of Query
I am doing similarity matching (cosine, euclidean). If I have 4000 entries in a table, the number of comparisons will be 16M. I am running postgres on a virtual machine, so it takes 20-25 minutes to run the query or the system crashes. Can I run the query in parallel? I heard there are tools like PL/Proxy and pgpool, can I use them to create several databases on the same machine and run query in parallel?
While parallel queries is an area of ongoing work, I do not believe that a single connection can currently use more than a single core. You could experiment with turning your 4000x4000 comparisons into 4 queries doing 2000x2000 comparison (assuming you have 4 cores available to the VM, the more cores, the more cores you could split it up) (just remember to use async for each of those so you don't execute them in serial!). Jim On 11/30/2015 12:00 PM, Shmagi Kavtaradze wrote: > I am doing similarity matching (cosine, euclidean). If I have 4000 > entries in a table, the number of comparisons will be 16M. I am running > postgres on a virtual machine, so it takes 20-25 minutes to run the > query or the system crashes. Can I run the query in parallel? I heard > there are tools like PL/Proxy and pgpool, can I use them to create > several databases on the same machine and run query in parallel?
Attachment
Thanks for answer. What about async? can you tell in details?
On Mon, Nov 30, 2015 at 9:56 PM, James Keener <jim@jimkeener.com> wrote:
While parallel queries is an area of ongoing work, I do not believe that
a single connection can currently use more than a single core. You
could experiment with turning your 4000x4000 comparisons into 4 queries
doing 2000x2000 comparison (assuming you have 4 cores available to the
VM, the more cores, the more cores you could split it up) (just remember
to use async for each of those so you don't execute them in serial!).
Jim
On 11/30/2015 12:00 PM, Shmagi Kavtaradze wrote:
> I am doing similarity matching (cosine, euclidean). If I have 4000
> entries in a table, the number of comparisons will be 16M. I am running
> postgres on a virtual machine, so it takes 20-25 minutes to run the
> query or the system crashes. Can I run the query in parallel? I heard
> there are tools like PL/Proxy and pgpool, can I use them to create
> several databases on the same machine and run query in parallel?
async is just an ability some client libraries offer/expose which allow you to resume execution of your application before results are returned from the database. This is similar to async HTTP requests, "promises" (common in modern JavaScript), or async/await (as offered in some languages) if you're familiar with those. It doesn't change the 1-core per query limitation on the server. Jim
Attachment
Shmagi,
First, I would explore creating multiple VM's and trying to run the query in parallel VM's. If you can easily clone your VMs, try creating two VM and running 1/2 the query on each VM.
Then try 4 VM's, then 8 and so on.
An complex approach for a single VM is to write C UDF (User Defined Function). The UDF should do the following
1) Take a select query a the input argument
2) Run the query and store the results in a C collection (a list or array of C structs)
3) Loop over the C collection N by N times computing the similarity matching (cosine, euclidean)
4) Output the result as a set of rows
This is a non-trivial approach, as it requires deep knowledge of PostgreSQL C functions. But it could
speed up calculations like this by orders of magnitude
Regards
Tim
From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Monday, November 30, 2015 9:00:40 AM
Subject: [NOVICE] Parallel Execution of Query
To: pgsql-novice@postgresql.org
Sent: Monday, November 30, 2015 9:00:40 AM
Subject: [NOVICE] Parallel Execution of Query
I am doing similarity matching (cosine, euclidean). If I have 4000 entries in a table, the number of comparisons will be 16M. I am running postgres on a virtual machine, so it takes 20-25 minutes to run the query or the system crashes. Can I run the query in parallel? I heard there are tools like PL/Proxy and pgpool, can I use them to create several databases on the same machine and run query in parallel?