Thread: need help with a query
Hi,
I am updating a big table (90M records) with data from another rather large table (4M entries). Here is my update query:
update links set target_size =
( select size from articles where articles.article_id = links.article_to)
I have built all the indexes one might need, increased shared_buffers to 400MB, and looked at the query plan; it looks reasonable.
But it's taking an eternity to run: I've been running the query for 3 hours now on my new Mac laptop, and looking at Activity Monitor I see that postgres is spending all of this time in disk I/O (the average CPU load of the postgres process is about 4-5%).
However, just looking at the query, postgres could cache the articles table and do a single pass over the links table...
Please let me know if there is a good solution to this.
Thanks!
Pavel Velikhov
Institute of Systems Programming
Russian Academy of Sciences
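[The schema is not shown in the thread, but assuming the column names from the query above, the correlated subquery does one probe into articles for every links row. A minimal sketch of the index that form relies on, and of checking the plan without running the full 90M-row update (the index name is an assumption):]

-- Assumed from the query: links(article_to, target_size), articles(article_id, size).
-- A primary key on articles.article_id would serve the same purpose as this index.
CREATE INDEX articles_article_id_idx ON articles (article_id);

-- EXPLAIN (without ANALYZE) shows the chosen plan without executing the update.
EXPLAIN
UPDATE links SET target_size =
    (SELECT size FROM articles WHERE articles.article_id = links.article_to);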
On 10/19/07, Pavel Velikhov <pvelikhov@yahoo.com> wrote:
>
> Hi,
>
> I am updating a big table (90M records) with data from another rather
> large table (4M entries). Here is my update query:
>
> update links set target_size =
> ( select size from articles where articles.article_id =
> links.article_to)

try:

UPDATE links
SET target_size = size
FROM articles
WHERE articles.article_id = links.article_to;

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/
Thanks for your help!
Got a very different query plan this time, with a hash join between links and articles. At least now postgres is using both shared memory buffers and working mem, but it's still completely I/O bound, only getting 5-6% CPU once in a while. I guess I can't squeeze more out of the laptop, but I also have a machine with 16GB RAM that I'll try this on next. Should I allocate tons of memory into shared buffers or into the working memory?
Thanks in advance!
----- Original Message ----
From: Jonah H. Harris <jonah.harris@gmail.com>
To: Pavel Velikhov <pvelikhov@yahoo.com>
Cc: pgsql-performance@postgresql.org
Sent: Friday, October 19, 2007 7:52:58 PM
Subject: Re: [PERFORM] need help with a query
On 10/19/07, Pavel Velikhov <pvelikhov@yahoo.com> wrote:
>
> Hi,
>
> I am updating a big table (90M records) with data from another rather
> large table (4M entries). Here is my update query:
>
> update links set target_size =
> ( select size from articles where articles.article_id =
> links.article_to)
try:
UPDATE links
SET target_size = size
FROM articles
WHERE articles.article_id = links.article_to;
--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/
Pavel Velikhov <pvelikhov@yahoo.com> writes:
> Got a very different query plan this time, with a hash join between links
> and articles. At least now postgres is using both shared memory buffers and
> working mem, but it's still completely I/O bound, only getting 5-6% CPU once
> in a while. I guess I can't squeeze more out of the laptop, but I also have
> a machine with 16GB RAM that I'll try this on next. Should I allocate tons
> of memory into shared buffers or into the working memory?

For a hash join, I think you want to raise work_mem as high as you can
(without driving the system into swapping). It won't read any block of
the tables more than once, so there's no point in having lots of buffers.

            regards, tom lane
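[A sketch of applying this advice for just the one statement, assuming a session-level setting is acceptable; the 1GB figure is arbitrary and would need to be sized to the actual machine, e.g. the 16GB box mentioned above:]

-- Session-level only; postgresql.conf is untouched.
SET work_mem = '1GB';   -- give the hash join room to keep its hash table in memory

UPDATE links
   SET target_size = size
  FROM articles
 WHERE articles.article_id = links.article_to;

RESET work_mem;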
Thanks a lot folks,
Left the query running for 10+ hours and had to kill it. I guess there really was no need to have lots of
shared buffers (the hope was that postgresql will cache the whole table). I ended up doing this step inside
the application as a pre-processing step. Can't have postgres running with different fsync options since this
will be part of an "easy to install and run" app that should just require a typical PostgreSQL installation.
Pavel Velikhov
----- Original Message ----
From: Kenneth Marshall <ktm@rice.edu>
To: Pavel Velikhov <pvelikhov@yahoo.com>
Cc: Jonah H. Harris <jonah.harris@gmail.com>; pgsql-performance@postgresql.org
Sent: Friday, October 19, 2007 8:17:48 PM
Subject: Re: [PERFORM] need help with a query
On Fri, Oct 19, 2007 at 09:11:36AM -0700, Pavel Velikhov wrote:
> Thanks for your help!
>
> Got a very different query plan this time, with a hash join between links and articles. At least now postgres is using both shared memory buffers and working mem, but it's still completely I/O bound, only getting 5-6% CPU once in a while. I guess I can't squeeze more out of the laptop, but I also have a machine with 16GB RAM that I'll try this on next. Should I allocate tons of memory into shared buffers or into the working memory?
This is an extremely I/O intensive query that must rewrite every
entry in the table. You could speed it up by starting postgresql
with fsync disabled, running the update, and then restarting it
with fsync re-enabled.
Ken
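[Ken's suggestion is a server-wide setting, not something the UPDATE itself can change; a sketch of the workflow, with the usual caveat that running with fsync off risks data corruption if the server crashes mid-run:]

-- In postgresql.conf, temporarily set:  fsync = off
-- then restart the server, run the UPDATE, set fsync = on again, and restart.
SHOW fsync;   -- verify the current setting from a psql session before and after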
On 10/20/07, Pavel Velikhov <pvelikhov@yahoo.com> wrote:
> Left the query running for 10+ hours and had to kill it. I guess there
> really was no need to have lots of shared buffers (the hope was that
> postgresql will cache the whole table). I ended up doing this step inside
> the application as a pre-processing step. Can't have postgres running with
> different fsync options since this will be part of an "easy to install and
> run" app that should just require a typical PostgreSQL installation.

Is the size always different? If not, you could limit the updates:

UPDATE links
   SET target_size = size
  FROM articles
 WHERE articles.article_id = links.article_to
   AND links.target_size != articles.size;

Since this is a huge operation, what about trying:

CREATE TABLE links_new AS
SELECT l.col1, l.col2, a.size as target_size, l.col4, ...
  FROM links l, articles a
 WHERE a.article_id = l.article_to;

Then truncate links, copy the data from links_new. Alternatively, you
could drop links, rename links_new to links, and recreate the
constraints.

I guess the real question is application design. Why doesn't this app
store size at runtime instead of having to batch this huge update?

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/
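[A sketch of the drop-and-rename variant described above; the links columns other than article_to/target_size are not named in the thread, so the ones below are placeholders, as is the index name:]

-- Rebuild links with target_size materialized in one pass.
CREATE TABLE links_new AS
SELECT l.article_from,                 -- placeholder for the other links columns
       l.article_to,
       a.size AS target_size
  FROM links l
  JOIN articles a ON a.article_id = l.article_to;
-- Note: an inner join drops links rows with no matching article;
-- a LEFT JOIN would keep them with a NULL target_size, like the original UPDATE.

BEGIN;
DROP TABLE links;
ALTER TABLE links_new RENAME TO links;
COMMIT;

-- Recreate whatever indexes and constraints existed on the old table.
CREATE INDEX links_article_to_idx ON links (article_to);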
On 10/20/07, Pavel Velikhov <pvelikhov@yahoo.com> wrote:
> Left the query running for 10+ hours and had to kill it. I guess there
> really was no need to have lots of
> shared buffers (the hope was that postgresql will cache the whole table). I
> ended up doing this step inside
> the application as a pre-processing step. Can't have postgres running with
> different fsync options since this
> will be part of an "easy to install and run" app, that should just require a
> typical PostgreSQL installation.
>Is the size always different? If not, you could limit the updates:
>UPDATE links
> SET target_size = size
>FROM articles
>WHERE articles.article_id = links.article_to
> AND links.target_size != articles.size;
Ah, this sounds better for sure! But it's probably only about as good as the plan with an index-scan subquery that I was getting before...
>Since this is a huge operation, what about trying:
>CREATE TABLE links_new AS SELECT l.col1, l.col2, a.size as
>target_size, l.col4, ... FROM links l, articles a WHERE a.article_id =
>l.article_to;
>Then truncate links, copy the data from links_new. Alternatively, you
>could drop links, rename links_new to links, and recreate the
>constraints.
>I guess the real question is application design. Why doesn't this app
>store size at runtime instead of having to batch this huge update?
This is a link-analysis application; I need to materialize all the sizes for target
articles in order to have the runtime part (vs. the loading part) run efficiently, i.e.
I really want to avoid a join with the articles table at runtime.
I have solved the size problem by other means (I compute it in my loader), but
I still have one query that needs to update a pretty large percentage of the links table...
I have previously used MySQL, and for some reason I didn't have a problem with queries
like this (on the other hand, MySQL was crashing when building an index on article_to in the
links relation, so I had to work without a critical index)...
Thanks!
Pavel
--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/