Thread: Performance problem with table containing a lot of text (blog)

Performance problem with table containing a lot of text (blog)

From: Kari Lavikka
Hello!

Some background info: we have a blog table that contains about eight
million blog entries. The average length of an entry is 1200 characters.
Because each 8k page can accommodate only a few entries, every query that
involves several entries causes several random disk seeks. We are having
problems with queries like:

1) give me a list of months in which I have written something (see the
   sketch below)
2) give me the IDs of entries I have written in month X of year X
3) give me the number of blog entries my friends have written since last
    time
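
For example, query #1 boils down to something like this (column names are
my guesses; assume a written_at timestamp on blog_entry):

    SELECT DISTINCT date_trunc('month', written_at) AS month
    FROM blog_entry
    WHERE uid = 256979
    ORDER BY month;

Even with an index on uid this has to visit the heap pages, which drags
the full entry texts through the cache.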

Clustering would probably decrease the random seeks, but it is not an
option: it locks the table, and the operation would take "some" time. It
would also have to be repeated periodically to maintain the clustering.
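
(For reference, that would be something like

    CLUSTER blog_entry_uid_beid ON blog_entry;

which holds an exclusive lock on the table for the duration.)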

I guess the file system cache gets filled with the text contents of blog
entries, even though they are totally useless for queries like these. The
contents of individual blog entries are cached in memcached at the
application level anyway; there's rarely any need to fetch them from the
database.

It would be nice if I could flag a column to be toasted always, regardless
of its length.

Because there is no such option, maybe I should create a separate table for
the blog text content. Does anybody have better ideas for this? :)
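
What I have in mind is roughly this (names are just illustrative):

    CREATE TABLE blog_entry_body (
        blog_entry_id  integer PRIMARY KEY
                       REFERENCES blog_entry (blog_entry_id),
        content        text NOT NULL
    );

The blog_entry table itself would then keep only the narrow metadata
columns that the queries above actually need.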

Thanks!


P.S. Here's the plan for query #3. Users can follow several bookmark
groups, and a user can limit the visibility of an entry to some of his/her
bookmark groups. Those are not bottlenecks of any kind anyway...

  Sort  (cost=34112.60..34117.94 rows=2138 width=14)
    Sort Key: count(*), upper((u.nick)::text)
    ->  HashAggregate  (cost=33962.28..33994.35 rows=2138 width=14)
          ->  Nested Loop  (cost=8399.95..33946.24 rows=2138 width=14)
                ->  Nested Loop  (cost=8399.95..9133.16 rows=90 width=22)
                      ->  HashAggregate  (cost=8399.95..8402.32 rows=237 width=8)
                            ->  Nested Loop  (cost=0.00..8395.99 rows=792 width=8)
                                  ->  Index Scan using user_bookmark_uid on user_bookmark ub  (cost=0.00..541.39 rows=2368 width=12)
                                        Index Cond: (uid = 256979)
                                  ->  Index Scan using user_bookmark_group_pkey on user_bookmark_group bg  (cost=0.00..3.30 rows=1 width=4)
                                        Index Cond: ("outer".bookmark_group_id = bg.bookmark_group_id)
                                        Filter: (("type" >= 0) AND ("type" <= 1) AND (trace_blog = 'y'::bpchar))
                      ->  Index Scan using users_uid_accepted_only on users u  (cost=0.00..3.06 rows=1 width=14)
                            Index Cond: (u.uid = "outer".marked_uid)
                ->  Index Scan using blog_entry_uid_beid on blog_entry be  (cost=0.00..275.34 rows=24 width=8)
                      Index Cond: ((be.uid = "outer".marked_uid) AND (COALESCE("outer".last_seen_blog_entry_id, 0) < be.blog_entry_id))
                      Filter: ((visibility = 'p'::bpchar) AND ((status = 'p'::bpchar) OR (status = 'l'::bpchar)) AND ((bookmark_group_id IS NULL) OR (subplan)))
                      SubPlan
                        ->  Index Scan using user_bookmark_pkey on user_bookmark fub  (cost=0.00..3.42 rows=1 width=0)
                              Index Cond: ((bookmark_group_id = $0) AND (marked_uid = 256979))

P.S. That particular user has quite a lot of unread entries, though...

     |\__/|
     ( oo )    Kari Lavikka - tuner@bdb.fi - (050) 380 3808
__ooO(  )Ooo_______ _____ ___ _ _  _   _    _      _                  _
       ""

Re: Performance problem with table containing a lot of text (blog)

From: Dan Harris
Kari Lavikka wrote:
> Hello!
>
> Some background info: we have a blog table that contains about eight
> million blog entries. The average length of an entry is 1200 characters.
> Because each 8k page can accommodate only a few entries, every query
> that involves several entries causes several random disk seeks. We are
> having problems with queries like:
>
> 1) give me a list of months in which I have written something
> 2) give me the IDs of entries I have written in month X of year X
> 3) give me the number of blog entries my friends have written since last
>    time

I didn't see your schema, but couldn't these problems be solved by storing the
article id, owner id, and blog date in a separate table?  It seems that if you
don't actually need the content of the blogs, all of those questions could be
answered by querying a very simple table with minimal I/O overhead.
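
Purely as an illustration (I'm inventing names, since I don't know your
schema):

    CREATE TABLE blog_entry_meta (
        blog_entry_id  integer PRIMARY KEY,
        uid            integer     NOT NULL,
        written_at     timestamptz NOT NULL
    );

    CREATE INDEX blog_entry_meta_uid_date
        ON blog_entry_meta (uid, written_at);

All three of your queries could be answered from a narrow table like that,
and far more of it would fit in cache.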



Re: Performance problem with table containing a lot of text (blog)

From: Kari Lavikka
> I didn't see your schema, but couldn't these problems be solved by storing
> the article id, owner id, and blog date in a separate table?  It seems that
> if you don't actually need the content of the blogs, all of those questions
> could be answered by querying a very simple table with minimal I/O overhead.

Yes, I suggested that as an option, but I'm wondering whether there are
other solutions.

     |\__/|
     ( oo )    Kari Lavikka - tuner@bdb.fi - (050) 380 3808
__ooO(  )Ooo_______ _____ ___ _ _  _   _    _      _                  _
       ""

Re: Performance problem with table containing a lot of text (blog)

From: "Heikki Linnakangas"
Kari Lavikka wrote:
> It would be nice if I could flag a column to be toasted always,
> regardless of its length.

The idea of being able to set the toast threshold per column was
discussed during 8.3 development, but no patch was produced IIRC. We
might do that in the future. If you're willing to compile from source,
you can lower TOAST_TUPLE_THRESHOLD.
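
IIRC it's a #define in src/include/access/tuptoaster.h; the exact
definition varies between versions, but it's roughly:

    /* src/include/access/tuptoaster.h -- exact form varies by version */
    #define TOAST_TUPLE_THRESHOLD   (MaxTupleSize / 4)  /* ~2 kB on 8k pages */

Lower it and recompile, and shorter values will be pushed out to the
toast table.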

You could also use ALTER TABLE ... ALTER COLUMN ... SET STORAGE EXTERNAL
to force the long blog entries to be stored in the toast table instead
of compressing them in the main table. Values smaller than
TOAST_TUPLE_THRESHOLD (2k by default?) still wouldn't be toasted,
though, so it might not make much difference.
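
For example, assuming the text column is called "content":

    ALTER TABLE blog_entry ALTER COLUMN content SET STORAGE EXTERNAL;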

> Because there is no such option, maybe I should create a separate table
> for the blog text content. Does anybody have better ideas for this? :)

That's probably the easiest solution. You can put a view on top of the two
tables to hide the split from the application.
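
Something along these lines (again, names made up):

    CREATE VIEW blog_entry_full AS
        SELECT m.*, b.content
        FROM blog_entry m
        JOIN blog_entry_body b USING (blog_entry_id);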

> P.S. Here's the plan for query #3. Users can follow several bookmark
> groups, and a user can limit the visibility of an entry to some of
> his/her bookmark groups. Those are not bottlenecks of any kind anyway...

If the user_bookmark table is not clustered by uid, I'm surprised the
planner didn't choose a bitmap index scan. Which version of PostgreSQL
is this?

PS. EXPLAIN ANALYZE is much more helpful than plain EXPLAIN.
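
For example (any of the problem queries will do):

    EXPLAIN ANALYZE SELECT count(*) FROM blog_entry WHERE uid = 256979;

That shows the actual row counts and timings next to the planner's
estimates.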

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: Performance problem with table containing a lot of text (blog)

From: Kari Lavikka
On Wed, 29 Aug 2007, Heikki Linnakangas wrote:

> The idea of being able to set the toast threshold per column was
> discussed during 8.3 development, but no patch was produced IIRC. We
> might do that in the future. If you're willing to compile from source,
> you can lower TOAST_TUPLE_THRESHOLD.

We are currently using Postgres 8.1 but have to upgrade to 8.2 shortly.
The new version fixes some vacuum problems.

I always compile Postgres from source. Maybe I have to do some calculations
first, because that setting affects all tables and databases. Most of our
text/varchar columns are quite short, but setting the threshold too low
would cause excessive seeks into the toast tables... right?

>> Because there is no such option, maybe I should create a separate table
>> for the blog text content. Does anybody have better ideas for this? :)
>
> That's probably the easiest solution. You can put a view on top of them
> to hide it from the application.

Yeh.

> If the user_bookmark table is not clustered by uid, I'm surprised the
> planner didn't choose a bitmap index scan.

Drumroll... the indexes are:
     "user_bookmark_pkey" PRIMARY KEY, btree (bookmark_group_id, marked_uid), tablespace "lun3"
     "user_bookmark_marked_uid" btree (marked_uid)
     "user_bookmark_uid" btree (uid) CLUSTER, tablespace "lun3"

Queries are mostly like "Gimme all of my bookmarked friends in all of my
bookmark groups", and rarely the opposite, "Gimme all users who have
bookmarked me".

I have clustered the table using uid to minimize random page fetches.

  - Kari

Re: Performance problem with table containing a lot of text (blog)

From: "Heikki Linnakangas"
Kari Lavikka wrote:
> On Wed, 29 Aug 2007, Heikki Linnakangas wrote:
>
>> The idea of being able to set the toast threshold per column was
>> discussed during 8.3 development, but no patch was produced IIRC. We
>> might do that in the future. If you're willing to compile from source,
>> you can lower TOAST_TUPLE_THRESHOLD.
>
> We are currently using Postgres 8.1 but have to upgrade to 8.2 shortly.
> The new version fixes some vacuum problems.
>
> I always compile Postgres from source. Maybe I have to do some
> calculations first, because that setting affects all tables and
> databases. Most of our text/varchar columns are quite short, but setting
> the threshold too low would cause excessive seeks into the toast
> tables... right?

Right. If you have trouble finding the right balance, you can also use
ALTER TABLE ... ALTER COLUMN ... SET STORAGE PLAIN to force the other
columns not to be toasted.
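
For example (table and column names illustrative):

    ALTER TABLE users ALTER COLUMN nick SET STORAGE PLAIN;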

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com