Thread: problems maintaining boolean columns in a large table
I've got a database that holds a bunch of articles in a table called 'article'. It has a bunch of columns, and each row might hold a few KB of data, say. I'm maintaining a separate fulltext database, and so I added a boolean flag, 'needs_indexing', to my 'article' table to keep track of which articles have been indexed (and I have some trigger functions on 'article' to automatically set the flag if the article is modified).

It all works fine - except when I want to rebuild my index from scratch. I need to set all those flags, but it takes _ages_ to do "UPDATE article SET needs_indexing=true;" (many hours at least - I've never let it run to completion).

I _think_ the reason it takes so long is that postgresql doesn't modify rows in place - it creates an entry for the modified row and zaps the old one. So by touching _every_ row I'm basically forcing it to rewrite my whole table... I've got about 2 million rows in 'article'. There are a few indexes on columns in 'article' which obviously will slow things down too.

I've had a minor attempt at tuning (increased checkpoint_segments) and I'm sure there are a bunch of other tricks I could use to bulk-set that flag in much less time... But my gut feeling is that the flag would be better off in its own table anyway, eg:

CREATE TABLE needs_indexing (
  article_id integer references article(id)
);

So, if an article is listed in this table, it needs indexing. (Maybe with a constraint to ensure uniqueness - I only need articles entered once in this table.)

Does this sound like a reasonable way to go? Any advice or insight welcome!

Thanks,
Ben.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
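For reference, the queue-table approach Ben sketches sidesteps the expensive full-table UPDATE entirely: a full reindex then only needs to populate the small queue table, not rewrite two million wide rows. A minimal sketch under Ben's schema (the UNIQUE/PRIMARY KEY constraint is the uniqueness guarantee he mentions; `TRUNCATE` is used here on the assumption the queue can be safely emptied before a rebuild):

```sql
CREATE TABLE needs_indexing (
  article_id integer REFERENCES article(id) PRIMARY KEY
);

-- Full rebuild: queue every article. This writes only the narrow
-- queue table, instead of rewriting every row of 'article'.
TRUNCATE needs_indexing;
INSERT INTO needs_indexing (article_id)
SELECT id FROM article;
```

The INSERT ... SELECT touches none of the article rows, so none of the indexes on 'article' are updated and no trigger on 'article' fires.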
On 09/02/10 11:25, Ben Campbell wrote:
[snip]
> I need to set all those flags, but it takes _ages_ to do
> "UPDATE article SET needs_indexing=true;"
[snip]
> I _think_ the reason it takes so long is that postgresql doesn't modify
> rows in place - it creates an entry for the modified row and zaps the
> old one. So by touching _every_ row I'm basically forcing it to rewrite
> my whole table... I've got about 2 million rows in 'article'.
> There are a few indexes on columns in 'article' which obviously will
> slow things down too.

Exactly so.

> I've had a minor attempt at tuning (increased checkpoint_segments) and
> I'm sure there are a bunch of other tricks I could use to bulk-set that
> flag in much less time...

Probably not, actually.

> But my gut feeling is that the flag would be better off in its own
> table anyway, eg:
>
> CREATE TABLE needs_indexing (
>   article_id integer references article(id)
> );
>
> So, if an article is listed in this table, it needs indexing.
> (Maybe with a constraint to ensure uniqueness - I only need articles
> entered once in this table.)

That sounds sensible to me - "needs indexing" depends on two things:
1. Is the article more recent than its FTS index?
2. Have you asked for it to be re-indexed anyway?

The first point concerns two tables, and the second is (from the database point of view) arbitrary. A separate table containing indexing jobs to do seems the sensible approach. Lets you batch up your re-indexing work too.

Oh - you might want to consider how/whether to handle multiple entries for the same article in your queue.

--
Richard Huxton
Archonet Ltd
Richard Huxton wrote:
> On 09/02/10 11:25, Ben Campbell wrote:

[I was talking about moving a "needs_indexing" flag out of a big table into its own table]

>> But my gut feeling is that the flag would be better off in its own
>> table anyway, eg:
>>
>> CREATE TABLE needs_indexing (
>>   article_id integer references article(id)
>> );
>
> That sounds sensible to me

Cool - glad to know I'm not suggesting something totally insane! I never can quite tell when I'm doing database stuff :-)

> Oh - you might want to consider how/whether to handle multiple entries
> for the same article in your queue.

I settled on:

CREATE TABLE needs_indexing (
  article_id integer REFERENCES article(id) PRIMARY KEY
);

The primary key-ness enforces uniqueness, and any time I want to add an article to the queue I just make sure I do a DELETE before the INSERT. Bound to be more efficient ways to do it, but it works.

Thanks,
Ben.
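For concreteness, the delete-then-insert pattern Ben describes might look like the sketch below (the literal id 42 is illustrative). Note that on PostgreSQL 9.5 and later - well after this 2010 thread - a single INSERT ... ON CONFLICT DO NOTHING achieves the same effect without the extra DELETE:

```sql
-- Queue article 42 for re-indexing, tolerating an existing entry.
BEGIN;
DELETE FROM needs_indexing WHERE article_id = 42;
INSERT INTO needs_indexing (article_id) VALUES (42);
COMMIT;

-- Equivalent on PostgreSQL 9.5+, in one statement:
INSERT INTO needs_indexing (article_id) VALUES (42)
ON CONFLICT (article_id) DO NOTHING;
```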
Hi Ben,

could you post your trigger function? When you need to rebuild the index, you could disable the trigger that sets the flag when the article is modified. This could speed up your UPDATE.

Best regards
Timo Klecker

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On behalf of Ben Campbell
Sent: Tuesday, 9 February 2010 12:26
To: pgsql-general@postgresql.org
Subject: [GENERAL] problems maintaining boolean columns in a large table

[Ben's original message quoted in full - snipped]
Timo Klecker wrote:
> could you post your trigger function? When you need to rebuild the index,
> you could disable the trigger that sets the flag when the article is
> modified. This could speed up your UPDATE.

Embarrassingly, when I checked, I found that I'd never gotten around to writing that particular trigger function... (It's just being handled at the app level.) However, there _is_ a trigger function which sets another flag somewhere, and I bet it's responsible for a lot of the time: it sets a "modified" flag on any journalist associated with the article:

-------------------------------
-- article table trigger
CREATE OR REPLACE FUNCTION article_setjournomodified_onupdate()
RETURNS TRIGGER AS $$
BEGIN
  -- whenever an article is modified, set the modified flag
  -- on any attributed journos
  UPDATE journo SET modified=true
    WHERE id IN (SELECT journo_id FROM journo_attr
                 WHERE article_id=NEW.id);
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_update AFTER UPDATE ON article
  FOR EACH ROW EXECUTE PROCEDURE article_setjournomodified_onupdate();
-------------------------------

I bet the subselect in that trigger slows things down. "article_id" in journo_attr is a foreign key:

"journo_attr_article_id_fkey" FOREIGN KEY (article_id) REFERENCES article(id) ON DELETE CASCADE

Can the SELECT use such a foreign key to speed things up? Or do I need to explicitly add another index? (And yes, I know that's a stupid newbie question!)

Either way, I'll have a go at disabling the trigger to see what impact it has on the bulk update of 'article.needs_indexing'!

Actually, I think it's a good argument for moving the needs_indexing flag out of the article table - modifying any other article fields should cause attributed journos to be marked 'modified', but setting 'needs_indexing' doesn't need to do this - it's just a little implementation detail rather than real data... (And the same goes for 'journo.modified'!)

Thanks,
Ben.
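To answer the implicit question: a foreign key constraint in PostgreSQL does not create an index on the referencing column (only the referenced side is indexed, via its primary/unique key), so the trigger's subselect on journo_attr.article_id will seq-scan unless an index is added explicitly. A sketch of the two experiments discussed, using Ben's table and trigger names (the index name is illustrative):

```sql
-- Index the referencing column so the trigger's subselect can use
-- an index scan instead of scanning all of journo_attr.
CREATE INDEX journo_attr_article_id_idx ON journo_attr (article_id);

-- Temporarily silence the journo-flagging trigger for a bulk update.
ALTER TABLE article DISABLE TRIGGER article_update;
UPDATE article SET needs_indexing = true;
ALTER TABLE article ENABLE TRIGGER article_update;
```

Note that DISABLE TRIGGER affects all sessions, not just the current one, so any articles modified by other clients during the bulk update would slip past the trigger.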
2010/2/10 Ben Campbell <ben@scumways.com>
> I settled on:
>
> CREATE TABLE needs_indexing (
>   article_id integer REFERENCES article(id) PRIMARY KEY
> );
>
> The primary key-ness enforces uniqueness, and any time I want to add an
> article to the queue I just make sure I do a DELETE before the INSERT.
> Bound to be more efficient ways to do it, but it works.
Better to use:
INSERT INTO needs_indexing (article_id)
SELECT NEW.id
WHERE NOT EXISTS ( SELECT 42 FROM needs_indexing WHERE article_id = NEW.id );
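Since Filip's statement references NEW, it is meant to run inside a trigger function; a minimal sketch of how it might be wired up (the function and trigger names are illustrative, not from the thread):

```sql
CREATE OR REPLACE FUNCTION queue_for_indexing() RETURNS TRIGGER AS $$
BEGIN
  -- Insert only if the article is not already queued.
  INSERT INTO needs_indexing (article_id)
  SELECT NEW.id
  WHERE NOT EXISTS
    ( SELECT 42 FROM needs_indexing WHERE article_id = NEW.id );
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_queue_indexing AFTER UPDATE ON article
  FOR EACH ROW EXECUTE PROCEDURE queue_for_indexing();
```

One caveat: under concurrency two sessions can both pass the NOT EXISTS check, and the loser then hits a unique_violation from the primary key, so the function may need an exception handler for that case.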
Filip Rembiałkowski
JID/mailto: filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/
Hi Ben,

you can check whether one of your indexes is used within the query by simply using EXPLAIN ANALYZE:

EXPLAIN ANALYZE SELECT * FROM journo WHERE id IN
  (SELECT journo_id FROM journo_attr WHERE article_id=$AnyExistingIdHere$);

Maybe you have another trigger on the journo table that is triggered on update? This would indeed slow everything down. As you mentioned, you should move the needs_indexing flag out of the article table, into one that simply holds all the article_ids that need indexing.

Greetings
Timo Klecker

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On behalf of Ben Campbell
Sent: Thursday, 11 February 2010 14:45
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] problems maintaining boolean columns in a large table

[Ben's message about the journo trigger quoted in full - snipped]
Ben Campbell wrote:
> I _think_ the reason it takes so long is that postgresql doesn't
> modify rows in place - it creates an entry for the modified row and
> zaps the old one. So by touching _every_ row I'm basically forcing it
> to rewrite my whole table... I've got about 2 million rows in
> 'article'.
> There are a few indexes on columns in 'article' which obviously will
> slow things down too.

At the expense of disk space, try setting the fillfactor storage parameter for that table to something like 70 instead of the default 100.
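For concreteness, a sketch of that suggestion. fillfactor is a per-table storage parameter; changing it only affects newly written pages, so existing rows benefit only after the table is rewritten (e.g. by VACUUM FULL or CLUSTER):

```sql
-- Leave ~30% free space in each heap page so updated row versions
-- can often stay on the same page (enabling HOT updates in 8.3+,
-- which skip index maintenance for the updated row).
ALTER TABLE article SET (fillfactor = 70);

-- Rewrite the table so existing pages pick up the new fillfactor:
VACUUM FULL article;
```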