Thread: need help with a query

need help with a query

From: Pavel Velikhov
Hi,

  I am updating a big table (90M records) with data from another rather large table (4M entries). Here is my update query:

    update links set target_size =
    ( select size from articles where articles.article_id = links.article_to)

 I have built all the indexes one might need, increased shared memory buffers to 400MB, and looked at the query plan; it looks reasonable.
But it's taking an eternity to run: I've been running the query for 3 hours now on my new Mac laptop, and looking at Activity Monitor I see that postgres is spending all of this time in disk IO (the average CPU load of the postgres process is about 4-5%).

 However, just from looking at the query, it seems postgres could cache the articles table and do a single pass over the links table...

 Please let me know if there is a good solution to this.

Thanks!
Pavel Velikhov
Institute of Systems Programming
Russian Academy of Sciences


Re: need help with a query

From: "Jonah H. Harris"
On 10/19/07, Pavel Velikhov <pvelikhov@yahoo.com> wrote:
>
> Hi,
>
>   I am updating a big table (90M records) with data from another rather
> large table (4M entries). Here is my update query:
>
>     update links set target_size =
>     ( select size from articles where articles.article_id =
> links.article_to)

try:

UPDATE links
   SET target_size = size
  FROM articles
 WHERE articles.article_id = links.article_to;
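
If the planner's choice is in doubt, it's cheap to check before committing
hours to the run; EXPLAIN only plans the statement without executing it, and
for tables this size you'd hope to see a hash or merge join rather than a
per-row index probe:

EXPLAIN
UPDATE links
   SET target_size = size
  FROM articles
 WHERE articles.article_id = links.article_to;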

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation                | fax: 732.331.1301
499 Thornall Street, 2nd Floor          | jonah.harris@enterprisedb.com
Edison, NJ 08837                        | http://www.enterprisedb.com/

Re: need help with a query

From: Pavel Velikhov
Thanks for your help!

Got a very different query plan this time, with a hash join between links and articles. At least now postgres is using both shared memory buffers and working memory, but it's still completely IO bound, only getting 5-6% CPU once in a while. I guess I can't squeeze more out of the laptop, but I also have a machine with 16GB RAM that I'll try this on next. Should I allocate tons of memory into shared buffers or into working memory?

Thanks in advance!


Re: need help with a query

From: Tom Lane
Pavel Velikhov <pvelikhov@yahoo.com> writes:
> Got a very different query plan this time, with a hash join between links
> and articles. At least now postgres is using both shared memory buffers
> and working memory, but it's still completely IO bound, only getting 5-6%
> CPU once in a while. I guess I can't squeeze more out of the laptop, but I
> also have a machine with 16GB RAM that I'll try this on next. Should I
> allocate tons of memory into shared buffers or into working memory?

For a hash join, I think you want to raise work_mem as high as you can
(without driving the system into swapping).  It won't read any block of
the tables more than once, so there's no point in having lots of
buffers.
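
For instance (just a sketch; the value is illustrative and should be sized
to the RAM you can actually spare, since the hash table alone may use up to
work_mem):

SET work_mem = '1GB';   -- session-local; reverts when you disconnect

UPDATE links
   SET target_size = size
  FROM articles
 WHERE articles.article_id = links.article_to;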

            regards, tom lane

Re: need help with a query

From: Pavel Velikhov
Thanks a lot folks,

Left the query running for 10+ hours and had to kill it. I guess there really was no need to have lots of
shared buffers (the hope was that PostgreSQL would cache the whole table). I ended up doing this step inside
the application as a pre-processing step. I can't have postgres running with different fsync options, since this
will be part of an "easy to install and run" app that should just require a typical PostgreSQL installation.

Pavel Velikhov

----- Original Message ----
From: Kenneth Marshall <ktm@rice.edu>
To: Pavel Velikhov <pvelikhov@yahoo.com>
Cc: Jonah H. Harris <jonah.harris@gmail.com>; pgsql-performance@postgresql.org
Sent: Friday, October 19, 2007 8:17:48 PM
Subject: Re: [PERFORM] need help with a query

On Fri, Oct 19, 2007 at 09:11:36AM -0700, Pavel Velikhov wrote:
> Thanks for your help!
>
> Got a very different query plan this time, with a hash join between links and articles. At least now postgres is using both shared memory buffers and working memory, but it's still completely IO bound, only getting 5-6% CPU once in a while. I guess I can't squeeze more out of the laptop, but I also have a machine with 16GB RAM that I'll try this on next. Should I allocate tons of memory into shared buffers or into working memory?

This is an extremely I/O-intensive query that must rewrite every
entry in the table. You could speed it up by starting PostgreSQL
with fsync disabled, running the update, and then restarting it
with fsync re-enabled.
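
Roughly like this (a sketch only, assuming a stock installation managed
with pg_ctl; the data directory path is a placeholder for your own):

  pg_ctl -D /path/to/data stop
  pg_ctl -D /path/to/data start -o "-c fsync=off"
  # ... run the big UPDATE, then ...
  pg_ctl -D /path/to/data stop
  pg_ctl -D /path/to/data start    # fsync on again, per postgresql.conf

Just be aware that with fsync off, a crash mid-run can leave the cluster
corrupt, so only do this with data you can rebuild from scratch.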

Ken



Re: need help with a query

From: "Jonah H. Harris"
On 10/20/07, Pavel Velikhov <pvelikhov@yahoo.com> wrote:
> Left the query running for 10+ hours and had to kill it. I guess there
> really was no need to have lots of shared buffers (the hope was that
> PostgreSQL would cache the whole table). I ended up doing this step inside
> the application as a pre-processing step. I can't have postgres running
> with different fsync options, since this will be part of an "easy to
> install and run" app that should just require a typical PostgreSQL
> installation.

Is the size always different?  If not, you could limit the updates:

UPDATE links
   SET target_size = size
  FROM articles
 WHERE articles.article_id = links.article_to
   AND links.target_size != articles.size;

Since this is a huge operation, what about trying:

CREATE TABLE links_new AS
SELECT l.col1, l.col2, a.size AS target_size, l.col4, ...
  FROM links l, articles a
 WHERE a.article_id = l.article_to;

Then truncate links and copy the data back from links_new.  Alternatively,
you could drop links, rename links_new to links, and recreate the indexes
and constraints.
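
The swap might look roughly like this (a sketch; the index definition is
hypothetical, and it assumes no views or foreign keys reference links):

BEGIN;
DROP TABLE links;
ALTER TABLE links_new RENAME TO links;
-- recreate whatever indexes/constraints the old table had, e.g.:
CREATE INDEX links_article_to_idx ON links (article_to);
COMMIT;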

I guess the real question is application design.  Why doesn't this app
store the size at runtime instead of having to batch this huge update?

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation                | fax: 732.331.1301
499 Thornall Street, 2nd Floor          | jonah.harris@enterprisedb.com
Edison, NJ 08837                        | http://www.enterprisedb.com/

Re: need help with a query

From: Pavel Velikhov

On 10/20/07, Pavel Velikhov <pvelikhov@yahoo.com> wrote:
>> Left the query running for 10+ hours and had to kill it. I guess there
>> really was no need to have lots of shared buffers (the hope was that
>> PostgreSQL would cache the whole table). I ended up doing this step inside
>> the application as a pre-processing step. I can't have postgres running
>> with different fsync options, since this will be part of an "easy to
>> install and run" app that should just require a typical PostgreSQL
>> installation.

> Is the size always different?  If not, you could limit the updates:
>
> UPDATE links
>    SET target_size = size
>   FROM articles
>  WHERE articles.article_id = links.article_to
>    AND links.target_size != articles.size;

Ah, this sounds better for sure! But it's probably no better than the scan with the index-scan subquery I was getting before...

> Since this is a huge operation, what about trying:
>
> CREATE TABLE links_new AS
> SELECT l.col1, l.col2, a.size AS target_size, l.col4, ...
>   FROM links l, articles a
>  WHERE a.article_id = l.article_to;
>
> Then truncate links and copy the data back from links_new.  Alternatively,
> you could drop links, rename links_new to links, and recreate the indexes
> and constraints.
>
> I guess the real question is application design.  Why doesn't this app
> store the size at runtime instead of having to batch this huge update?

This is a link analysis application; I need to materialize all the sizes for target
articles so that the runtime part (vs. the loading part) runs efficiently, i.e.
I really want to avoid a join with the articles table at runtime.

I have solved the size problem by other means (I compute it in my loader), but
I still have one query that needs to update a pretty large percentage of the links table...
I previously used MySQL, and for some reason I didn't have a problem with queries
like this (on the other hand, MySQL was crashing when building an index on article_to in the
links relation, so I had to work without a critical index)...

Thanks!
Pavel


