Thread: problems maintaining boolean columns in a large table

problems maintaining boolean columns in a large table

From
Ben Campbell
Date:
I've got a database that holds a bunch of articles in a table called
'article'. It has a bunch of columns, and each row might hold a few KB
of data, say.
I'm maintaining a separate fulltext database, and so I added a boolean
flag, 'needs_indexing' to my 'article' table to keep track of which
articles have been indexed (and I have some trigger functions on
'article' to automatically set the flag if the article is modified).

It all works fine.
Except when I want to rebuild my index from scratch. I need to set all
those flags, but it takes _ages_ to do "UPDATE article SET
needs_indexing=true;" (many hours at least - I've never let it run to
completion).

I _think_ the reason it takes so long is that postgresql doesn't modify
rows in place - it creates an entry for the modified row and zaps the
old one. So by touching _every_ row I'm basically forcing it to rebuild
my whole database... I've got about 2 million rows in 'articles'.
There are a few indexes on columns in 'articles' which obviously will
slow things down too.

I've had a minor attempt at tuning (increased checkpoint_segments) and
I'm sure there are a bunch of other tricks I could use to bulk-set that
flag in much less time...

But my gut feeling is that the flag would be better off in its own
table anyway, e.g.:

CREATE TABLE needs_indexing (
   article_id integer references article(id)
);

So, if an article is listed in this table, it needs indexing.
(maybe with a constraint to ensure uniqueness - I only need articles
entered once in this table)
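One way to get that uniqueness is a constraint on the queue table itself (a sketch of the idea; the constraint name is left to the default):

```sql
CREATE TABLE needs_indexing (
    article_id integer REFERENCES article(id),
    UNIQUE (article_id)  -- each article queued at most once
);
```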

Does this sound like a reasonable way to go?
Any advice or insight welcome!

Thanks,
Ben.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: problems maintaining boolean columns in a large table

From
Richard Huxton
Date:
On 09/02/10 11:25, Ben Campbell wrote:
[snip]
> I need to set all
> those flags, but it takes _ages_ to do "UPDATE article SET
> needs_indexing=true;"
[snip]
> I _think_ the reason it takes so long is that postgresql doesn't modify
> rows in place - it creates an entry for the modified row and zaps the
> old one. So by touching _every_ row I'm basically forcing it to rebuild
> my whole database... I've got about 2 million rows in 'articles'.
> There are a few indexes on columns in 'articles' which obviously will
> slow things down too.

Exactly so.

> I've had a minor attempt at tuning (increased checkpoint_segments) and
> I'm sure there are a bunch of other tricks I could use to bulk-set that
> flag in much less time...

Probably not, actually.

> But my gut feeling is that the flag would be better off in its own
> table anyway, e.g.:
>
> CREATE TABLE needs_indexing (
> article_id integer references article(id)
> );
>
> So, if an article is listed in this table, it needs indexing.
> (maybe with a constraint to ensure uniqueness - I only need articles
> entered once in this table)

That sounds sensible to me - "needs indexing" depends on two things:
  1. Is the article more recent than its FTS index?
  2. Have you asked for it to be re-indexed anyway?
The first point concerns two tables, and the second is (from the
database point of view) arbitrary. A separate table containing indexing
jobs to do seems the sensible approach. It also lets you batch up your
re-indexing work.
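A worker could then drain the queue in batches, for example (a hypothetical sketch against the proposed table; the batch size and the literal ids are placeholders):

```sql
BEGIN;
-- Grab a batch of queued article ids for the application to re-index.
SELECT article_id
  FROM needs_indexing
 ORDER BY article_id
 LIMIT 500;
-- ... application re-indexes those articles ...
-- Then drop the processed batch from the queue
-- (the id list here stands in for the ids fetched above).
DELETE FROM needs_indexing WHERE article_id IN (101, 102, 103);
COMMIT;
```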

Oh - you might want to consider how/whether to handle multiple entries
for the same article in your queue.

--
   Richard Huxton
   Archonet Ltd

Re: problems maintaining boolean columns in a large table

From
Ben Campbell
Date:
Richard Huxton wrote:
> On 09/02/10 11:25, Ben Campbell wrote:
[I was talking about moving a "needs_indexing" flag out of a big table
into its own table]
>> But my gut feeling is that the flag would be better off in its own
>> table anyway, e.g.:
>>
>> CREATE TABLE needs_indexing (
>> article_id integer references article(id)
>> );

> That sounds sensible to me

Cool - glad to know I'm not suggesting something totally insane! I never
can quite tell when I'm doing database stuff :-)

> Oh - you might want to consider how/whether to handle multiple entries
> for the same article in your queue.

I settled on:

CREATE TABLE needs_indexing (
   article_id integer REFERENCES article(id) PRIMARY KEY
);

The primary key-ness enforces uniqueness, and any time I want to add an
article to the queue I just make sure I do a DELETE before the INSERT.
Bound to be more efficient ways to do it, but it works.
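As an aside, on PostgreSQL 9.5 and later the DELETE-before-INSERT dance can be collapsed into one idempotent statement that leans on that primary key (a sketch; the article id is a placeholder):

```sql
-- Enqueue an article; do nothing if it is already queued.
-- Requires PostgreSQL 9.5+ for ON CONFLICT.
INSERT INTO needs_indexing (article_id)
VALUES (42)
ON CONFLICT (article_id) DO NOTHING;
```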

Thanks,
Ben.



Re: problems maintaining boolean columns in a large table

From
"Timo Klecker"
Date:
Hi Ben,

could you post your trigger function? When you need to rebuild the index,
you could disable the trigger that sets the flag when an article is
modified. This could speed up your UPDATE.
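Something along these lines (a sketch; 'article_update' is assumed to be the name of the trigger in question):

```sql
BEGIN;
-- Suspend the per-row trigger for the duration of the bulk update.
ALTER TABLE article DISABLE TRIGGER article_update;
UPDATE article SET needs_indexing = true;
ALTER TABLE article ENABLE TRIGGER article_update;
COMMIT;
```

Note that ALTER TABLE takes an exclusive lock on the table, held until the transaction commits.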



Best regards
Timo Klecker




-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On behalf of Ben Campbell
Sent: Tuesday, 9 February 2010 12:26
To: pgsql-general@postgresql.org
Subject: [GENERAL] problems maintaining boolean columns in a large table



Re: problems maintaining boolean columns in a large table

From
Ben Campbell
Date:
Timo Klecker wrote:
> could you post your trigger function? When you need to rebuild the index,
> you could disable the trigger setting the flag if the article is modified.
> This could speed up your UPDATE.

Embarrassingly, when I checked, I found that I'd never gotten around to
writing that particular trigger function... (It's just being handled at
the app level).
However, there _is_ a trigger function which sets another flag somewhere
which I bet is responsible for a lot of the time... it sets a "modified"
flag on any journalist associated with the article:


-------------------------------

-- article table trigger
CREATE OR REPLACE FUNCTION article_setjournomodified_onupdate()
RETURNS TRIGGER AS $$
BEGIN
    -- whenever an article is modified, set the modified flag on any
    -- attributed journos
    UPDATE journo SET modified = true
     WHERE id IN (SELECT journo_id FROM journo_attr WHERE article_id = NEW.id);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_update AFTER UPDATE ON article
    FOR EACH ROW EXECUTE PROCEDURE article_setjournomodified_onupdate();

-------------------------------

I bet the subselect in that trigger slows things down.
"article_id" in journo_attr is a foreign key:

"journo_attr_article_id_fkey" FOREIGN KEY (article_id) REFERENCES
article(id) ON DELETE CASCADE

Can the SELECT use such a foreign key index to speed things up? Or do I
need to explicitly add another index? (and yes, I know that's a stupid
newbie question!)
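For what it's worth: a foreign key constraint creates no index by itself - PostgreSQL requires an index on the referenced side (article.id, covered by its primary key) but does not automatically index the referencing column. An explicit index (the name here is hypothetical) would let the subselect use an index scan:

```sql
-- Without this, the trigger's subselect scans journo_attr
-- for every updated article row.
CREATE INDEX journo_attr_article_id_idx ON journo_attr (article_id);
```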

Either way, I'll have a go at disabling the trigger to see what impact
it has on the bulk update of 'article.needs_indexing'!

Actually, I think it's a good argument for moving the needs_indexing
flag out of the article table - modifying any other article field
should cause attributed journos to be marked 'modified', but setting
'needs_indexing' doesn't need to - it's just a little implementation
detail rather than real data...
(and the same goes for 'journo.modified'!)

Thanks,
Ben.

Re: problems maintaining boolean columns in a large table

From
Filip Rembiałkowski
Date:


2010/2/10 Ben Campbell <ben@scumways.com>
I settled on:

CREATE TABLE needs_indexing (
 article_id integer REFERENCES article(id) PRIMARY KEY
);

The primary key-ness enforces uniqueness, and any time I want to add an article to the queue I just make sure I do a DELETE before the INSERT. Bound to be more efficient ways to do it, but it works.

Better to use:

INSERT INTO needs_indexing (article_id)
SELECT NEW.id
WHERE NOT EXISTS ( SELECT 42 FROM needs_indexing WHERE article_id = NEW.id );





--
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/

Re: problems maintaining boolean columns in a large table

From
"Timo Klecker"
Date:
Hi Ben,

you can check whether one of your indexes is used within the query by
simply using EXPLAIN ANALYZE:

EXPLAIN ANALYZE SELECT * FROM journo WHERE id IN
    (SELECT journo_id FROM journo_attr WHERE article_id=$AnyExistingIdHere$);

Maybe you have another trigger on the journo table that fires on
update? That would indeed slow everything down. As you mentioned, you
should move the needs_indexing flag out of the article table into its
own table, which could simply hold all the article_ids that need
indexing.


Greetings
Timo Klecker




-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On behalf of Ben Campbell
Sent: Thursday, 11 February 2010 14:45
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] problems maintaining boolean columns in a large table



Re: problems maintaining boolean columns in a large table

From
John R Pierce
Date:
Ben Campbell wrote:
> I _think_ the reason it takes so long is that postgresql doesn't
> modify rows in place - it creates an entry for the modified row and
> zaps the old one. So by touching _every_ row I'm basically forcing it
> to rebuild my whole database... I've got about 2 million rows in
> 'articles'.
> There are a few indexes on columns in 'articles' which obviously will
> slow things down too.

at the expense of disk space, try setting the fillfactor storage
parameter for that table to something like 70 instead of the default 100.
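A sketch of that change (fillfactor only affects pages written after the change; a rewrite such as VACUUM FULL or CLUSTER repacks existing pages):

```sql
-- Leave ~30% of each heap page free so updated row versions can
-- often stay on the same page (HOT updates), trading disk space
-- for cheaper updates.
ALTER TABLE article SET (fillfactor = 70);
```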