Thread: Table UPDATE is too slow
We have a web based application with data that is updated daily. The biggest
bottleneck occurs when we try to update one of the tables. This table contains
58,000 rows and 62 columns, and EVERY column is indexed. Every column is
queryable (?) by the users through the web interface so we are reluctant to
remove the indexes (recreating them would be time consuming too). The primary
key is an INT and the rest of the columns are a mix of NUMERIC, TEXT, and
DATEs. A typical update is:

   UPDATE dataTable SET field01=44.5, field02=44.5, field03='Bob',
       field04='foo', ... , field60='2004-08-30', field61='2004-08-29'
       WHERE id = 1234;

Also of note is that the update is run about 10 times per day; we get blocks
of data from 10 different sources, so we pre-process the data and then update
the table. We also run VACUUM FULL ANALYZE on a nightly basis.

Does anyone have some idea on how we can increase speed, either by changing
the updates, designing the database differently, etc, etc? This is currently
a big problem for us.

Other notables:
   The UPDATE is run from within a function:
       FOR rec IN SELECT ... LOOP
           RETURN NEXT rec;
           UPDATE dataTable .....
   Postgres 7.4.3
   debian stable
   2 GB RAM
   80 GB IDE drive (we can't change it)

   shared_buffers = 2048
   sort_mem = 1024
   max_fsm_pages = 40000
   checkpoint_segments = 5
   random_page_cost = 3

Thanks
Ron
What is the datatype of the id column?

-tfo

On Aug 31, 2004, at 1:11 PM, Ron St-Pierre wrote:
> This table contains 58,000 rows and 62 columns, and EVERY column is
> indexed.
> [...]
> The primary key is an INT and the rest of the columns are a mix of
> NUMERIC, TEXT, and DATEs.
On Tue, Aug 31, 2004 at 11:11:02AM -0700, Ron St-Pierre wrote:
> We have a web based application with data that is updated daily. The
> biggest bottleneck occurs when we try to update one of the tables. This
> table contains 58,000 rows and 62 columns, and EVERY column is indexed.

That is usually a very bad idea; for every update, PostgreSQL has to update
62 indexes. Do you really do queries on all those 62 columns?

> A typical update is:
>    UPDATE dataTable SET field01=44.5, field02=44.5, field03='Bob',
>        field04='foo', ... , field60='2004-08-30', field61='2004-08-29'
>        WHERE id = 1234;

That looks like poor database normalization, really. Are you sure you don't
want to split this into multiple tables instead of having 62 columns?

> Other notables:
>    The UPDATE is run from within a function: FOR rec IN SELECT ... LOOP
>    RETURN NEXT rec; UPDATE dataTable .....
>    Postgres 7.4.3
>    debian stable
>    2 GB RAM
>    80 GB IDE drive (we can't change it)

Are you doing all this in multiple transactions, or in a single one?
Wrapping the FOR loop in a transaction might help speed.

/* Steinar */
-- 
Homepage: http://www.sesse.net/
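A minimal sketch of the single-transaction idea, as it would look issued
directly from the client rather than inside the function (the column values
are just placeholders taken from the example UPDATE above):

   BEGIN;
   -- in autocommit mode each UPDATE is its own transaction and forces a
   -- WAL flush; grouping them means a single flush at COMMIT
   UPDATE dataTable SET field01=44.5, field02=44.5, field03='Bob'
       WHERE id = 1234;
   UPDATE dataTable SET field01=17.2, field02=18.0, field03='Alice'
       WHERE id = 1235;
   -- ... one UPDATE per pre-processed row ...
   COMMIT;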
Thomas F. O'Connell wrote:
> What is the datatype of the id column?

The id column is INTEGER.

Ron
Steinar H. Gunderson wrote:
> On Tue, Aug 31, 2004 at 11:11:02AM -0700, Ron St-Pierre wrote:
>> We have a web based application with data that is updated daily. The
>> biggest bottleneck occurs when we try to update one of the tables. This
>> table contains 58,000 rows and 62 columns, and EVERY column is indexed.
>
> That is usually a very bad idea; for every update, PostgreSQL has to
> update 62 indexes. Do you really do queries on all those 62 columns?

Yes, I know that it's not a very good idea, however queries are allowed
against all of those columns. One option is to disable some or all of the
indexes when we update, run the update, and recreate the indexes, however
that may slow down user queries. Because there are so many indexes, it is
time consuming to recreate them after the update.

>> A typical update is:
>>    UPDATE dataTable SET field01=44.5, field02=44.5, field03='Bob',
>>        field04='foo', ... , field60='2004-08-30', field61='2004-08-29'
>>        WHERE id = 1234;
>
> That looks like poor database normalization, really. Are you sure you
> don't want to split this into multiple tables instead of having 62
> columns?

No, it is properly normalized. The data in this table is stock fundamentals,
stuff like 52 week high, ex-dividend date, etc, etc.

>> Other notables:
>>    The UPDATE is run from within a function: FOR rec IN SELECT ... LOOP
>>    RETURN NEXT rec; UPDATE dataTable .....
>>    Postgres 7.4.3
>>    debian stable
>>    2 GB RAM
>>    80 GB IDE drive (we can't change it)
>
> Are you doing all this in multiple transactions, or in a single one?
> Wrapping the FOR loop in a transaction might help speed.

We're doing it in multiple transactions within the function. Could we do
something like this?

   ....
   BEGIN
   FOR rec IN SELECT field01, field02, ... FROM otherTable LOOP
       RETURN NEXT rec;
       UPDATE dataTable SET field01=rec.field01, field02=rec.field02,
           field03=rec.field03, ...
           WHERE id = rec.id;
   COMMIT;
   ....

If we can do it this way, are there any other gotcha's we should be aware of?

Ron
On Tue, 31 Aug 2004 11:11:02 -0700
Ron St-Pierre <rstpierre@syscor.com> wrote:

>    Postgres 7.4.3
>    debian stable
>    2 GB RAM
>    80 GB IDE drive (we can't change it)
>
>    shared_buffers = 2048
>    sort_mem = 1024
>    max_fsm_pages = 40000
>    checkpoint_segments = 5
>    random_page_cost = 3

I agree with all of the follow-ups that having indexes on every column is a
bad idea. I would remove the indexes from the 10-20 least searched upon
columns, as I'm sure this will help your performance.

You mention not removing indexes on some of the columns because doing so
"may slow down user queries". I think you should investigate this and make
sure they are necessary. I've seen many an application, with far more rows
than you're dealing with, only need 1 or 2 indexes, even when all (or most)
columns could be searched.

Also, you should consider increasing your shared_buffers and probably your
sort memory a touch as well. This will help your overall performance.

 ---------------------------------
   Frank Wiles <frank@wiles.org>
   http://www.wiles.org
 ---------------------------------
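A sketch of what Frank describes. The index names are made up, and the config
values are illustrative rather than tuned for this particular box (in 7.4,
shared_buffers is measured in 8 kB buffers and sort_mem in kB; changing
shared_buffers requires a restart):

   -- drop indexes on the least-queried columns (hypothetical names)
   DROP INDEX datatable_field37_idx;
   DROP INDEX datatable_field38_idx;
   -- ...

   -- postgresql.conf: give the server more cache and sort room
   shared_buffers = 15000      -- ~120 MB of the 2 GB RAM, up from 2048 (16 MB)
   sort_mem = 8192             -- 8 MB per sort, up from 1024 (1 MB)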
On Tue, Aug 31, 2004 at 11:35:38AM -0700, Ron St-Pierre wrote:
> We're doing it in multiple transactions within the function. Could we do
> something like this?
>
>    ....
>    BEGIN
>    FOR rec IN SELECT field01, field02, ... FROM otherTable LOOP
>        RETURN NEXT rec;
>        UPDATE dataTable SET field01=rec.field01, field02=rec.field02,
>            field03=rec.field03, ...
>            WHERE id = rec.id;
>    COMMIT;
>    ....
>
> If we can do it this way, are there any other gotcha's we should be
> aware of?

AFAIK you should be able to do this, yes (although I have no experience with
PL/pgSQL); I'm not sure how much it buys you, but you might want to test it,
at least.

/* Steinar */
-- 
Homepage: http://www.sesse.net/
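One caveat not spelled out in the thread: a PL/pgSQL function body executes
entirely inside the caller's transaction, so a COMMIT inside the loop is not
accepted, and all per-row UPDATEs issued by a single call already share one
transaction. A minimal sketch under that assumption, with placeholder column
and function names (the RETURN NEXT part of the original is omitted to keep
it short):

   CREATE OR REPLACE FUNCTION refresh_datatable() RETURNS void AS '
   DECLARE
       rec RECORD;
   BEGIN
       FOR rec IN SELECT id, field01, field02 FROM otherTable LOOP
           UPDATE dataTable
              SET field01 = rec.field01,
                  field02 = rec.field02
            WHERE id = rec.id;
           -- no COMMIT here: the whole function call runs in one transaction
       END LOOP;
       RETURN;
   END;
   ' LANGUAGE 'plpgsql';

   -- from the client; the function call alone is already atomic, the explicit
   -- BEGIN/COMMIT only matters if you want to group it with other statements
   BEGIN;
   SELECT refresh_datatable();
   COMMIT;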
>> That looks like poor database normalization, really. Are you sure you
>> don't want to split this into multiple tables instead of having 62
>> columns?
>
> No, it is properly normalized. The data in this table is stock
> fundamentals, stuff like 52 week high, ex-dividend date, etc, etc.

Hmm, the two examples you gave there are actually ripe for breaking out into
another table. It's not quite 'normalisation', but if you have data that
changes very rarely, why not group it into a separate table? You could have
the highly volatile data in one table, the semi-volatile stuff in another,
and the pretty static stuff in a third.

Looked at another way, if you have sets of fields that tend to change
together, group them into tables together. That way you will radically
reduce the number of indexes that are affected by each update.

But as someone else pointed out, you should at the very least wrap your
updates in a big transaction.

M
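A rough sketch of the kind of split Matt describes. The table names, column
groupings, and the view are all hypothetical; the point is that the sub-tables
share the same primary key and only the volatile one (and its few indexes)
takes the hit from the daily updates:

   -- rarely-changing attributes (few or no extra indexes needed)
   CREATE TABLE dataStatic (
       id            INTEGER PRIMARY KEY,
       company_name  TEXT,
       ex_dividend   DATE
   );

   -- daily-changing attributes; only these indexes are touched by the updates
   CREATE TABLE dataDaily (
       id            INTEGER PRIMARY KEY REFERENCES dataStatic(id),
       price         NUMERIC,
       week52_high   NUMERIC
   );

   -- a view can present the old single-table shape to the web interface
   CREATE VIEW dataAll AS
       SELECT s.id, s.company_name, s.ex_dividend, d.price, d.week52_high
         FROM dataStatic s JOIN dataDaily d ON d.id = s.id;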
Ron St-Pierre wrote:
> Yes, I know that it's not a very good idea, however queries are allowed
> against all of those columns. One option is to disable some or all of the
> indexes when we update, run the update, and recreate the indexes, however
> it may slow down user queries. Because there are so many indexes, it is
> time consuming to recreate them after the update.

Just because a query can run against any column does not mean all columns
should be indexed. Take a good look at the column types and their value
distribution.

Let's say I have a table of addresses but every address I collect is in the
94116 zip code. That would mean indexes on city, state and zip are not only
useless but could decrease performance. Also, if a search always includes a
unique key (or a column with highly unique values), eliminating the other
indexes would force the planner to always use that index first.
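One way to look at that value distribution, assuming the table has been
ANALYZEd (pg_stats is a standard system view; 'datatable' below is just the
lower-cased table name from the thread):

   -- columns with very few distinct values make poor index candidates
   -- (negative n_distinct means a fraction of the row count)
   SELECT attname, n_distinct, null_frac
     FROM pg_stats
    WHERE tablename = 'datatable'
    ORDER BY n_distinct;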
Thanks for everyone's comments (Thomas, Steinar, Frank, Matt, William).

Right now I'm bench-marking the time it takes for each step in the end of day
update process, and then I am going to test a few things:
 - dropping most indexes, then checking the full processing time and whether
   there is any noticeable performance degradation on the web-end
 - wrapping the updates in a transaction, and checking the timing
 - combining the two
 - reviewing my shared_buffers and sort_mem settings

Thanks
Ron
Ron St-Pierre <rstpierre@syscor.com> writes:
> Does anyone have some idea on how we can increase speed, either by
> changing the updates, designing the database differently, etc, etc?
> This is currently a big problem for us.
>
> Other notables:
>    The UPDATE is run from within a function: FOR rec IN SELECT ... LOOP
>    RETURN NEXT rec; UPDATE dataTable .....

One point that I don't think was made before: by doing a collection of
updates in this serial one-at-a-time fashion, you are precluding any
possibility of query optimization over the collection of updates. It might
win to dump the update data into a temp table and do a single UPDATE command
joining to the temp table. Or not --- quite possibly not --- but I think
it's something to think about.

			regards, tom lane
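A minimal sketch of the temp-table variant Tom suggests. UPDATE ... FROM is
ordinary PostgreSQL syntax; the staging table layout, file path, and the COPY
step are placeholders for however the pre-processed blocks actually arrive:

   BEGIN;

   CREATE TEMP TABLE staging (
       id       INTEGER PRIMARY KEY,
       field01  NUMERIC,
       field02  NUMERIC,
       field03  TEXT
   ) ON COMMIT DROP;

   -- load the pre-processed block in bulk
   COPY staging FROM '/tmp/block_01.tsv';

   -- one statement instead of one UPDATE per row
   UPDATE dataTable
      SET field01 = s.field01,
          field02 = s.field02,
          field03 = s.field03
     FROM staging s
    WHERE dataTable.id = s.id;

   COMMIT;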
Ron St-Pierre wrote:
> We have a web based application with data that is updated daily. The
> biggest bottleneck occurs when we try to update one of the tables. This
> table contains 58,000 rows and 62 columns, and EVERY column is indexed.

Have you thought of / tried using 2 separate databases or tables and
switching between them?

Since you seem to be updating all the values, it might be a lot faster to
re-create the table from scratch without indexes and add those later (maybe
followed by a VACUUM ANALYZE) ...

That said, I'm not entirely sure how well postgres' client libraries can
deal with tables being renamed while in use, perhaps someone can shed some
light on this.

Regards,
 Marinos
-- 
Dipl.-Ing. Marinos Yannikos, CEO
Preisvergleich Internet Services AG
Obere Donaustraße 63/2, A-1020 Wien
Tel./Fax: (+431) 5811609-52/-55
Do all of the commands to swap tables in a transaction. The table gets locked
briefly but should have a lot less impact than the update command.

On Mon, 06 Sep 2004 01:28:04 +0200, Marinos J. Yannikos <mjy@geizhals.at>
wrote:
> That said, I'm not entirely sure how well postgres' client libraries can
> deal with tables being renamed while in use, perhaps someone can shed
> some light on this.
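A sketch of the rebuild-and-swap pattern under discussion. Table, index, and
file names are made up; the key detail is that the two RENAMEs happen inside
one transaction, so clients only ever see a complete, indexed table:

   -- build the replacement outside any user-facing path
   CREATE TABLE dataTable_new AS SELECT * FROM dataTable LIMIT 0;  -- same columns
   COPY dataTable_new FROM '/tmp/full_reload.tsv';                 -- bulk load, no indexes yet
   -- recreate the indexes on the new table before exposing it
   CREATE INDEX dataTable_new_field01_idx ON dataTable_new (field01);
   -- ... remaining indexes ...
   ANALYZE dataTable_new;

   -- the swap: a brief exclusive lock, but far cheaper than 58,000 indexed UPDATEs
   BEGIN;
   ALTER TABLE dataTable RENAME TO dataTable_old;
   ALTER TABLE dataTable_new RENAME TO dataTable;
   COMMIT;

   DROP TABLE dataTable_old;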
Hello,

I have this table:

   CREATE TABLE apparts (
       id          SERIAL NOT NULL PRIMARY KEY,
       price       FLOAT NOT NULL,
       surface     INTEGER NOT NULL,
       price_sq    FLOAT NOT NULL,
       rooms       INTEGER NULL,
       vente       BOOL NOT NULL,
       category    TEXT NOT NULL,
       zipcode     INTEGER NOT NULL,
       departement INTEGER NOT NULL
   ) WITHOUT OIDS;

There is a BTREE index on 'departement'. The table fits in RAM.

When I want to SELECT according to my indexed field, postgres chooses a
sequential scan unless the number of rows to be returned is very, very small:

   apparts=> explain analyze select * from apparts where departement=42;
    Seq Scan on apparts  (cost=0.00..853.12 rows=1403 width=47) (actual time=5.094..52.026 rows=1516 loops=1)
      Filter: (departement = 42)
    Total runtime: 52.634 ms

OK, it returns 1516 rows, so maybe the seq scan is right.

   apparts=> SET enable_seqscan = 0;
   apparts=> explain analyze select * from apparts where departement=42;
    Index Scan using apparts_dept on apparts  (cost=0.00..1514.59 rows=1403 width=47) (actual time=0.045..2.770 rows=1516 loops=1)
      Index Cond: (departement = 42)
    Total runtime: 3.404 ms

Um, 15 times faster... Index scan is called only when there are few rows.
With other values for 'departement' where there are few rows, the index is
used automatically. This is logical, even if I should adjust the page costs.
I wish I could tell postgres "this table will fit in RAM and be accessed
often, so for this table, the page seek cost should be very low".

Everything is vacuum full analyze'd.

Now, if I LIMIT the query to 10 rows, the index should be used all the time,
because it will always return few rows... well, it doesn't!

   apparts=> SET enable_seqscan = 1;
   apparts=> explain analyze select * from apparts where departement=42 LIMIT 10;
    Limit  (cost=0.00..6.08 rows=10 width=47) (actual time=5.003..5.023 rows=10 loops=1)
      ->  Seq Scan on apparts  (cost=0.00..853.12 rows=1403 width=47) (actual time=4.998..5.013 rows=10 loops=1)
            Filter: (departement = 42)
    Total runtime: 5.107 ms

Now, let's try:

   apparts=> SET enable_seqscan = 0;
   apparts=> explain analyze select * from apparts where departement=42 LIMIT 10;
    Limit  (cost=0.00..10.80 rows=10 width=47) (actual time=0.047..0.072 rows=10 loops=1)
      ->  Index Scan using apparts_dept on apparts  (cost=0.00..1514.59 rows=1403 width=47) (actual time=0.044..0.061 rows=10 loops=1)
            Index Cond: (departement = 42)
    Total runtime: 0.157 ms

So, by itself, Postgres will select a very bad query plan (32x slower) on a
query which would be executed very fast using indexes.
If I use OFFSET + LIMIT, it only gets worse because the seq scan has to scan
more rows:

   apparts=> SET enable_seqscan = 1;
   apparts=> explain analyze select * from apparts where departement=42 LIMIT 10 OFFSET 85;
    Limit  (cost=51.69..57.77 rows=10 width=47) (actual time=10.224..10.246 rows=10 loops=1)
      ->  Seq Scan on apparts  (cost=0.00..853.12 rows=1403 width=47) (actual time=5.254..10.200 rows=95 loops=1)
            Filter: (departement = 42)
    Total runtime: 10.326 ms

   apparts=> SET enable_seqscan = 1;
   apparts=> explain analyze select * from apparts where departement=42 LIMIT 10 OFFSET 1000;
    Limit  (cost=608.07..614.15 rows=10 width=47) (actual time=43.993..44.047 rows=10 loops=1)
      ->  Seq Scan on apparts  (cost=0.00..853.12 rows=1403 width=47) (actual time=5.328..43.791 rows=1010 loops=1)
            Filter: (departement = 42)
    Total runtime: 44.128 ms

   apparts=> SET enable_seqscan = 0;
   apparts=> explain analyze select * from apparts where departement=42 LIMIT 10 OFFSET 1000;
    Limit  (cost=1079.54..1090.33 rows=10 width=47) (actual time=2.147..2.170 rows=10 loops=1)
      ->  Index Scan using apparts_dept on apparts  (cost=0.00..1514.59 rows=1403 width=47) (actual time=0.044..1.860 rows=1010 loops=1)
            Index Cond: (departement = 42)
    Total runtime: 2.259 ms

Why is it that way? The planner should use the LIMIT values when planning
the query, should it not?
Update : select * from apparts where departement=69 order by departement limit 10; does use an index scan (because of the ORDER BY), even with OFFSET, and it's a lot faster.
On Mon, 6 Sep 2004, Pierre-Frédéric Caillaud wrote:

> Why is it that way? The planner should use the LIMIT values when
> planning the query, should it not?

And it does use the limit values; the estimated cost was lower when you had
the limit. What you need to do is to tune pg for your computer. For example
the following settings:

* effective_cache_size - this setting tells pg how much the os is caching
  (for example use top to find out during a normal work load). You said that
  the table fits in memory, and by telling pg how much is cached it might
  adjust its plans accordingly.

* random_page_cost - how expensive a random access is compared to seq.
  access. This is dependent on the computer and disk system you have. If the
  setting above does not help, maybe you need to lower this variable to 2 or
  something.

And don't forget the shared_buffers setting. But most people usually have it
tuned in my experience (but usually too high).

Here is an article that might help you:

   http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

-- 
/Dennis Björklund
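A sketch of those two knobs. The numbers are illustrative guesses for a
machine where the table is known to be cached by the OS, not measured
recommendations:

   -- try per-session first, then re-run the EXPLAIN ANALYZE queries
   SET effective_cache_size = 100000;  -- 8 kB pages, ~800 MB assumed cached by the OS
   SET random_page_cost = 2;           -- random reads are cheap when data is in RAM

   -- if the plans improve, make it permanent in postgresql.conf:
   --   effective_cache_size = 100000
   --   random_page_cost = 2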
Pierre-Frédéric Caillaud <lists@boutiquenumerique.com> writes:
> Now, if I LIMIT the query to 10 rows, the index should be used all the
> time, because it will always return few rows... well, it doesn't!

Not at all. From the planner's point of view, the LIMIT is going to reduce
the cost by about a factor of 10/1403, since the underlying plan step will
only be run partway through. That's not going to change the decision about
which underlying plan step is cheapest: 10/1403 of a cheaper plan is still
always less than 10/1403 of a more expensive plan.

Later, you note that LIMIT with ORDER BY does affect the plan choice ---
that's because in that situation one plan alternative has a much higher
startup cost than the other (namely the cost of a sort step). A small LIMIT
can allow the fast-startup plan to be chosen even though it would be
estimated to be the loser if run to completion.

			regards, tom lane
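That factor lines up exactly with the plans shown earlier in the thread;
checking the arithmetic against the costs from the EXPLAIN output:

   seq scan:     853.12 * 10/1403 ≈  6.08   (matches Limit cost=0.00..6.08)
   index scan:  1514.59 * 10/1403 ≈ 10.80   (matches Limit cost=0.00..10.80)

So the planner still sees the seq-scan plan as cheaper (6.08 < 10.80), even
though the index plan actually runs about 32x faster on this cached table.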
OK, thanks a lot for your explanations. Knowing how the planner "thinks"
makes it pretty logical. Thank you.

Now another question...

I have a table of records representing forum posts with a primary key (id),
a topic_id, a timestamp, and other fields which I won't detail. I want to
split them into pages (like forums usually do), with N posts per page. In
that case:

   SELECT * FROM table WHERE topic_id=...
       ORDER BY post_timestamp asc LIMIT N OFFSET N*page;

Also it's almost the same to order by id rather than post_timestamp (id
being a serial):

   SELECT * FROM table WHERE topic_id=...
       ORDER BY id asc LIMIT N OFFSET N*page;

This query runs slower and slower as the OFFSET grows, which is a problem
because the most accessed page in a forum is the last one.

So, for the last page, I tried:

   SELECT * FROM table WHERE topic_id=... ORDER BY id desc LIMIT N;

But this does not use the index at all (seq scan + sort + limit).

My solution is simple: build an index on (-id), or on (some
date)-post_timestamp, then:

   SELECT * FROM table WHERE topic_id=... ORDER BY (-id) desc LIMIT N;

Then the last page is the fastest one, but it always shows N posts. That's
not a problem, so I guess I'll use that. I don't like forums which show 1
post on the last page because the number of posts modulo N is 1.

I may store the number of posts in a forum (updated by a trigger) to avoid
costly COUNT queries to count the pages, so I could use ORDER BY id for the
first half of the pages, and ORDER BY (-id) for the rest, so it will always
be fastest.

I could even create a pages table to store the id of the first post on that
page and then:

   SELECT * FROM table WHERE topic_id=... AND id>id_of_first_post_in_page
       ORDER BY id asc LIMIT N;

then all pages would be equally fast.

Or, I could cache the query results for all pages but the last one.

Finally, the question: having a multiple field btree, it is not harder to
scan it in "desc order" than in "asc order". So why does not Postgres do it?
Here is a btree example:

   topic_id    id
          1     1
          1    10
          2     2
          2     5
          2    17
          3     4
          3     6

Suppose I SELECT WHERE topic_id=2 ORDER BY topic_id ASC, id ASC. Postgres
simply finds the first row with topic_id=2 and goes from there.

Suppose I SELECT WHERE topic_id=2 ORDER BY topic_id ASC, id DESC. Postgres
does a seq scan, but it could think a bit more and start at "first index
node which has topic_id>2" (simple to find in a btree) then go backwards in
the index. This can be generalized to any combination of (asc, desc).

I made some more experiments, and saw Postgres does an 'Index Scan' when
ORDER BY clauses are all ASC, and an 'Index Scan Backwards' when all ORDER
BY are DESC. However, it does not handle a combination of ASC and DESC.

What do you think of this?
Pierre-Frédéric Caillaud <lists@boutiquenumerique.com> writes:
> Suppose I SELECT WHERE topic_id=2 ORDER BY topic_id ASC, id DESC. Postgres
> does a seq scan, but it could think a bit more and start at "first index
> node which has topic_id>2" (simple to find in a btree) then go backwards
> in the index.

If you write it as

   SELECT WHERE topic_id=2 ORDER BY topic_id DESC, id DESC

then an index on (topic_id, id) will work fine. The mixed ASC/DESC ordering
is not compatible with the index.

			regards, tom lane
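Putting that together with the pagination question, a minimal sketch; the
table name 'posts', the LIMIT, and the id value are stand-ins for the forum
schema described above:

   CREATE INDEX posts_topic_id_idx ON posts (topic_id, id);

   -- last page: the same index scanned backwards, no sort step
   SELECT *
     FROM posts
    WHERE topic_id = 2
    ORDER BY topic_id DESC, id DESC
    LIMIT 20;

   -- any other page without a growing OFFSET: remember the last id shown
   -- (the id_of_first_post_in_page idea from the earlier message)
   SELECT *
     FROM posts
    WHERE topic_id = 2 AND id > 1234
    ORDER BY topic_id, id
    LIMIT 20;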
Yes, you're right as usual. I had not thought about playing with ORDER BY on
a field which has only one value in the result set.

> If you write it as
>    SELECT WHERE topic_id=2 ORDER BY topic_id DESC, id DESC
> then an index on (topic_id, id) will work fine. The mixed ASC/DESC
> ordering is not compatible with the index.
Pierre-Frédéric Caillaud wrote:
> Yes, you're right as usual.

As usual? Do you think your father can ever be wrong? :-)

Gaetano