Thread: Comparitive UPDATE speed

Comparitive UPDATE speed

From
Josh Berkus
Date:
Relative performance question:

I have 2 UPDATE queires in a function.

table_a:  117,000 records
table_b: 117,000 records
table_c: 1.5 million records

 #1 updates table_a, field_2 from table_b, field_1 based on a joining field_3.
Around 110,000 updates
#2 updates table_a, field_5 from table_c, field_2 joining on field_3.
Around 110,000 updates.

#1 takes 5-7 minutes; #2 takes about 15 seconds.  The only difference I can
discern is that table_a, field_2 is indexed and table_a, field_5 is not.

Is it reasonable that updating the index would actually make the query take
20x longer?  If not, I'll post actual table defs and query statements.

--
Josh Berkus
josh@agliodbs.com
Aglio Database Solutions
San Francisco

Re: Comparitive UPDATE speed

From
"Josh Berkus"
Date:
Randy,

> I'm not sure about 20 times longer but you would have index records
> that
> would need to be changed.  Is field_3 indexed in all 3 tables?  If
> table_b
> does not have an index on field_3 and the other tables do, I'd guess
> that
> would make this take longer too.

Yeah, they're indexed.   I'm going to try the updates without the index
on field_2 tonight.

-Josh Berkus

Re: Comparitive UPDATE speed

From
Ron Johnson
Date:
On Tue, 2002-10-01 at 16:51, Josh Berkus wrote:
> Relative performance question:
>
> I have 2 UPDATE queires in a function.
>
> table_a:  117,000 records
> table_b: 117,000 records
> table_c: 1.5 million records
>
>  #1 updates table_a, field_2 from table_b, field_1 based on a joining field_3.
> Around 110,000 updates
> #2 updates table_a, field_5 from table_c, field_2 joining on field_3.
> Around 110,000 updates.
>
> #1 takes 5-7 minutes; #2 takes about 15 seconds.  The only difference I can
> discern is that table_a, field_2 is indexed and table_a, field_5 is not.
>
> Is it reasonable that updating the index would actually make the query take
> 20x longer?  If not, I'll post actual table defs and query statements.

Absolutely.  You are doing lots of extra work.

For each of the 110,000 updates, you are deleting a leaf node from one
part of the index tree and then inserting it into another part of the
tree.

It will get even worse as you add more rows to table_a, since the
index tree will get deeper, and more work work must be done during
each insert and delete.

--
+------------------------------------------------------------+
| Ron Johnson, Jr.     mailto:ron.l.johnson@cox.net          |
| Jefferson, LA  USA   http://members.cox.net/ron.l.johnson  |
|                                                            |
| "What other evidence do you have that they are terrorists, |
|  other than that they trained in these camps?"             |
|   17-Sep-2002 Katie Couric to an FBI agent regarding the 5 |
|   men arrested near Buffalo NY                             |
+------------------------------------------------------------+


Re: Comparitive UPDATE speed

From
"Josh Berkus"
Date:
Folks,

Sorry for the double-quoting here.   I sent this to just Ron by
accident.  My original question is double-quoted, Ron is quoted, and my
responses are below.  Thanks!

> > Ok, I'm still confused.
> >
> > I'm updating a (not not indexed) field in a 117,000 row table based
> on
> > information in another 117,000 row table.   The update is an
> integer, and the
> > linking fields are indexed.   Yet the two queries are flattening my
>
> > dual-processor, RAID5 database server for up to 11 minutes ...
> using 230mb
> > ram the entire time.   I simply can't believe that these two
> queries are that
> > difficult.
>
> So there's no index on elbs_matter_links.case_id?  From your original
>
> post, I thought that there *is* an index on that field.

I'm now dropping it before the update.   Unfortunately, dropping the
index made no appreciable gain in performance.

> > I've increased the memory available to the update to 256mb, and
> tried forcing
> > an index scan ... to no avail.   Ideas, please?
> >
> > The queries:
> >
> > UPDATE elbs_matter_links SET case_id = case_clients.case_id
> > FROM case_clients
> > WHERE elbs_matter_links.mmatter = case_clients.matter_no;
>
> What happens if you run the query:
> SELECT eml.case_id, cc.case_id, eml.mmatter, cc.matter_no
> FROM elbs_matter_links eml,
>      case_clients cc
> WHERE eml.mmatter = cc.matter_no;
>
> That, for all intents and purposes, is your UPDATE statement, just
> without doing the UPDATE.  How fast does it run?

Slowly.  It takes about 60 seconds to return data.  This may be the
problem.  Thoughts?  Here's EXPLAIN output:

Hash Join  (cost=3076.10..91842.88 rows=108648 width=40)
  ->  Seq Scan on elbs_matter_links eml  (cost=0.00..85641.87
rows=117787 width=20)
  ->  Hash  (cost=2804.48..2804.48 rows=108648 width=20)
        ->  Seq Scan on case_clients cc  (cost=0.00..2804.48
rows=108648 width=20)

According to the parser, using the indexes would be worse:

Merge Join  (cost=0.00..520624.38 rows=108648 width=40)
  ->  Index Scan using idx_eml_mmatter on elbs_matter_links eml
 (cost=0.00..451735.00 rows=117787 width=20)
  ->  Index Scan using idx_caseclients_matter on case_clients cc
 (cost=0.00..66965.20 rows=108648 width=20)

Though in practice, a forced index scan returns rows in about 60
seconds, same as the SeqScan version.

All of this seems very costly for a query that, while it does return a
lot of rows, is essentially a very simple query.

More importantly, on the hardware I'm using, I would expect better
performance that I get on my laptop ... and I'm not seeing it.  I just
can't believe that the simple query above could soak 200mb of RAM for a
full 60 seconds to return a result.    It's like queries over a certain
result size on the system choke postgres.


My reference data below:
==============================================

>
> > UPDATE elbs_matter_links SET case_id = cases.case_id
> > FROM cases
> > WHERE elbs_matter_links.docket = cases.docket
> >  AND elbs_matter_links.case_id IS NULL;
> >
> >
> > EXPLAIN output:
> >
> > Hash Join  (cost=4204.83..39106.77 rows=8473 width=299)
> >   ->  Index Scan using idx_eml_mmatter on elbs_matter_links
> > (cost=0.00..34668.94 rows=8473 width=279)
> >   ->  Hash  (cost=2808.38..2808.38 rows=109038 width=20)
> >         ->  Seq Scan on case_clients  (cost=0.00..2808.38
> rows=109038
> > width=20)
> >
> > Nested Loop  (cost=0.00..32338.47 rows=99 width=300)
> >   ->  Seq Scan on cases  (cost=0.00..9461.97 rows=4297 width=21)
> >   ->  Index Scan using idx_eml_docket on elbs_matter_links
>  (cost=0.00..5.31
> > rows=1 width=279)
> >
> > Table defintions:
> >
> > Table "elbs_matter_links"
> >       Column      |         Type          |       Modifiers
> > ------------------+-----------------------+-----------------------
> >  mmatter          | character varying(15) | not null
> >  case_id          | integer               |
> >  matter_check     | character varying(20) | not null default 'OK'
> >  docket           | character varying(50) |
> >  case_name        | character varying(50) |
> >  practice         | character varying(50) |
> >  opp_counsel_name | character varying(50) |
> >  opp_counsel_id   | integer               |
> >  act_type         | character varying(10) |
> >  lead_case_id     | integer               |
> >  lead_case_docket | character varying(50) |
> >  disease          | character varying(50) |
> >  docket_no        | character varying(25) |
> >  juris_state      | character varying(6)  |
> >  juris_local      | character varying(20) |
> >  status           | smallint              | not null default 1
> >  client_id        | integer               |
> >  office_loc       | character varying(5)  |
> >  date_filed       | date                  |
> >  date_served      | date                  |
> >  date_resolved    | date                  |
> >  case_status      | character varying(5)  |
> >  settle_amount    | numeric(12,2)         | default 0
> >  narrative        | text                  |
> >  comment          | character varying(50) |
> >  client_no        | character varying(10) |
> >  juris_id         | integer               |
> > Indexes: idx_eml_check,
> >          idx_eml_docket,
> >          idx_eml_mmatter
> > Primary key: elbs_matter_links_pkey
> >
> >                                      Table "case_clients"
> >       Column      |         Type          |
>                     Modifiers
> >
>
------------------+-----------------------+----------------------------------------------------
> >  case_client_id   | integer               | not null default
> > nextval('case_clients_seq'::text)
> >  case_id          | integer               | not null
> >  client_id        | integer               | not null
> >  office_loc       | character varying(5)  |
> >  date_filed       | date                  |
> >  date_served      | date                  |
> >  date_resolved    | date                  |
> >  matter_no        | character varying(15) | not null
> >  case_status      | character varying(5)  | not null
> >  settle_amount    | numeric(14,2)         | not null default 0
> >  matter_narrative | text                  |
> >  comment          | character varying(50) |
> > Indexes: idx_case_clients_client,
> >          idx_caseclients_case,
> >          idx_caseclients_matter,
> >          idx_caseclients_resolved,
> >          idx_caseclients_served,
> >          idx_caseclients_status
> > Primary key: case_clients_pkey
> >
> >
> >                                      Table "cases"
> >       Column      |         Type          |
>                  Modifiers
> >
>
------------------+-----------------------+---------------------------------------------
> >  case_id          | integer               | not null default
> > nextval('cases_seq'::text)
> >  docket           | character varying(50) | not null
> >  case_name        | character varying(50) | not null
> >  practice         | character varying(50) | not null
> >  opp_counsel_name | character varying(50) |
> >  opp_counsel_id   | integer               |
> >  act_type         | character varying(10) |
> >  lead_case_id     | integer               |
> >  lead_case_docket | character varying(50) |
> >  disease          | character varying(50) |
> >  docket_no        | character varying(25) | not null
> >  juris_state      | character varying(6)  | not null
> >  juris_local      | character varying(20) |
> >  tgroup_id        | integer               |
> >  status           | smallint              | not null default 1
> >  juris_id         | integer               |
> > Indexes: idx_case_cases_juris,
> >          idx_cases_docket,
> >          idx_cases_lead,
> >          idx_cases_name,
> >          idx_cases_status,
> >          idx_cases_tgroup,
> >          idx_lower_case_name
> >
> >
> >
> > --
> > Josh Berkus
> > josh@agliodbs.com
> > Aglio Database Solutions
> > San Francisco
> --
> +------------------------------------------------------------+
> | Ron Johnson, Jr.     mailto:ron.l.johnson@cox.net          |
> | Jefferson, LA  USA   http://members.cox.net/ron.l.johnson  |
> |                                                            |
> | "What other evidence do you have that they are terrorists, |
> |  other than that they trained in these camps?"             |
> |   17-Sep-2002 Katie Couric to an FBI agent regarding the 5 |
> |   men arrested near Buffalo NY                             |
> +------------------------------------------------------------+
>

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco

Re: Comparitive UPDATE speed

From
Andrew Sullivan
Date:
On Fri, Oct 04, 2002 at 08:54:56AM -0700, Josh Berkus wrote:

> Slowly.  It takes about 60 seconds to return data.  This may be the
> problem.  Thoughts?  Here's EXPLAIN output:

[. . .]

> According to the parser, using the indexes would be worse:

Have you run this with EXPLAIN ANALYSE?  It will actually perform the
necessary steps, so it will reveal if the planner is getting
something wrong.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: Comparitive UPDATE speed

From
Josh Berkus
Date:
Andrew,

> Have you run this with EXPLAIN ANALYSE?  It will actually perform the
> necessary steps, so it will reveal if the planner is getting
> something wrong.

Here it is:

Hash Join  (cost=3076.10..91842.88 rows=108648 width=40) (actual
time=18625.19..22823.39 rows=108546 loops=1)
  ->  Seq Scan on elbs_matter_links eml  (cost=0.00..85641.87 rows=117787
width=20) (actual time=18007.69..19515.63 rows=117787 loops=1)
  ->  Hash  (cost=2804.48..2804.48 rows=108648 width=20) (actual
time=602.12..602.12 rows=0 loops=1)
        ->  Seq Scan on case_clients cc  (cost=0.00..2804.48 rows=108648
width=20) (actual time=5.18..370.68 rows=108648 loops=1)
Total runtime: 22879.26 msec

The above doesn't seem bad, except that this is some serious hardware in this
system and 23 seconds right after VACUUM ANALYZE is too long.  I've a feeling
that I botched one of my postgresql.conf parameters or something.

I'll do an explain for the UPDATE query later, when the users are off the
system.

-Josh Berkus

--
Josh Berkus
josh@agliodbs.com
Aglio Database Solutions
San Francisco

Re: Comparitive UPDATE speed

From
Andrew Sullivan
Date:
On Fri, Oct 04, 2002 at 11:13:09AM -0700, Josh Berkus wrote:
>
> Andrew,
>
> > Have you run this with EXPLAIN ANALYSE?  It will actually perform the
> > necessary steps, so it will reveal if the planner is getting
> > something wrong.
>
> Here it is:

Oops, sorry.  What if you force the index use here?  Just because the
planner thinks that's more expensive doesn't mean that it is.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: Comparitive UPDATE speed

From
Josh Berkus
Date:
Andrew,

> Oops, sorry.  What if you force the index use here?  Just because the
> planner thinks that's more expensive doesn't mean that it is.

Yeah, I tried it ... no faster, no slower, really.

BTW, in case you missed it, the real concern is that an UPDATE query similar
to the SELECT query we are discussing takes over 10 minutes, which on this
hardware is ridiculous.  Robert suggested that we test the SELECT query to
see if there were general performance problems; apparently, there are.

The hardware I'm using is:
dual-processor Athalon 1400mhz motherboard
raid 5 UW SCSI drive array with 3 drives
512mb DDR RAM
SuSE Linux 7.3 (Kernel 2.4.10)
Postgres is on its own LVM partition
PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.3
    (will upgrade to 7.2.3 very soon)
Postgresql.conf has: fdatasync, various chared memory tuned to allocate 256mb
to postgres (which seems to be working correctly).
Debug level 2.

When the UPDATE query takes a long time, I generally can watch the log hover
in the land of "Reaping dead child processes" for 30-90 seconds per
iteration.

--
Josh Berkus
josh@agliodbs.com
Aglio Database Solutions
San Francisco

Re: Comparitive UPDATE speed

From
Andrew Sullivan
Date:
On Fri, Oct 04, 2002 at 12:09:42PM -0700, Josh Berkus wrote:

> BTW, in case you missed it, the real concern is that an UPDATE query similar
> to the SELECT query we are discussing takes over 10 minutes, which on this
> hardware is ridiculous.  Robert suggested that we test the SELECT query to
> see if there were general performance problems; apparently, there are.

Yes, that's my thought, too.

> Postgresql.conf has: fdatasync, various chared memory tuned to allocate 256mb
> to postgres (which seems to be working correctly).

Hmm.  Are you swapping?  Lots of temp files?  (I presume you've been
over all that.)  Half your physical memory seems pretty dangerous to
me.  If oyu reduce that, does it help?

> When the UPDATE query takes a long time, I generally can watch the log hover
> in the land of "Reaping dead child processes" for 30-90 seconds per
> iteration.

Ick.  Hmm.  What sort of numbers do you get from vmstat, iostat, sar,
and friends?

A
--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: Comparitive UPDATE speed

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Hash Join  (cost=3076.10..91842.88 rows=108648 width=40) (actual
> time=18625.19..22823.39 rows=108546 loops=1)
>   ->  Seq Scan on elbs_matter_links eml  (cost=0.00..85641.87 rows=117787
> width=20) (actual time=18007.69..19515.63 rows=117787 loops=1)
>   ->  Hash  (cost=2804.48..2804.48 rows=108648 width=20) (actual
> time=602.12..602.12 rows=0 loops=1)
>         ->  Seq Scan on case_clients cc  (cost=0.00..2804.48 rows=108648
> width=20) (actual time=5.18..370.68 rows=108648 loops=1)
> Total runtime: 22879.26 msec

Hm.  Why does it take 19500 milliseconds to read 117787 rows from
elbs_matter_links, if 108648 rows can be read from case_clients in 370
msec?  And why does the output show that the very first of those rows
was returned only after 18000 msec?

I am suspicious that this table has a huge number of empty pages in it,
mostly at the beginning.  If so, a VACUUM FULL would help.  (Try
"vacuum full verbose elbs_matter_links" and see if it indicates it's
reclaiming any large number of pages.)

If that proves to be the answer, you need to look to your FSM
parameters, and perhaps arrange for more frequent regular vacuums
of this table.

            regards, tom lane

Re: Comparitive UPDATE speed

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> When the UPDATE query takes a long time, I generally can watch the log hover
> in the land of "Reaping dead child processes" for 30-90 seconds per
> iteration.

Uh ... would you translate that observation into English please?  Or
better, provide the log output you're looking at?

            regards, tom lane

Re: Comparitive UPDATE speed

From
Josh Berkus
Date:
Tom,

> I am suspicious that this table has a huge number of empty pages in it,
> mostly at the beginning.  If so, a VACUUM FULL would help.  (Try
> "vacuum full verbose elbs_matter_links" and see if it indicates it's
> reclaiming any large number of pages.)

Thank you.  Aha.

That appears to have been the main problem; apparently, at some time during my
tinkering, I dumped most of the rows from elbs_matter_links a couple of
times. Ooops.

I'll post the new situation when I test the update queries tonight.

--
Josh Berkus
josh@agliodbs.com
Aglio Database Solutions
San Francisco