Creating tons of tables to support a query - Mailing list pgsql-general

From Jan Ploski
Subject Creating tons of tables to support a query
Msg-id 14629117.1031522324792.JavaMail.jpl@remotejava
List pgsql-general
Hello,

I have apparently solved a certain performance problem and would
like to consult with other PostgreSQL users to figure out if I am
doing things the right way.

I have two tables: "section" and "message". Each message has
a "dateSent" timestamp and belongs to exactly one section, so I have
a foreign key sectionID in message. The task is to efficiently
figure out the number of messages that arrived into a section after
any given point in time. In other words, optimize this query:

select count(*) from message where sectionID = ? and dateSent > ?

At present, there are about 11,000 messages and 227 sections,
with messages distributed rather unevenly across sections.
The table "message" also contains a small number of private
messages, whose sectionID is null.
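
For reference, a minimal sketch of the schema as described (exact
column types, the messageID column, and index names are assumptions;
other columns are omitted):

    create table section (
        sectionID integer primary key
    );

    create table message (
        messageID integer primary key,
        sectionID integer references section (sectionID), -- null for private messages
        dateSent  timestamp not null
    );

    create index message_datesent_idx on message (dateSent);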

There is an index on message.dateSent, which PostgreSQL decides not to
use when executing the above query. This gives me a plan like this:

Aggregate  (cost=1419.26..1419.26 rows=1 width=0)
  ->  Seq Scan on message  (cost=0.00..1408.13 rows=4449 width=0)

...with an execution time of 7.90 seconds for 100 queries.

If I set enable_seqscan=no, the same test takes 5.21 seconds,
which is still much too long.
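
For reference, the planner setting can be toggled per session:

    set enable_seqscan to off;
    -- run the timing test ...
    set enable_seqscan to on;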

To improve the performance, I originally decided to add a column
"msgnum" to the table "message", which would be incremented as
each message is inserted. To figure out the number of messages
that arrived between two points in time [t1, t2], I'd find the
lowest message number after t1 and the highest message number
before t2, and compute msgnum_high - msgnum_low + 1. (When messages
are deleted, renumbering would have to occur.)

This alternate approach worked very well on a test table that
included only messages from a single section. I used queries such as:

select msgnum from message where dateSent > ? order by dateSent limit 1

and

select msgnum from message where dateSent < ? order by dateSent desc limit 1
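
Taken together, the count for [t1, t2] can then be computed in one
statement, along these lines (a sketch, assuming both endpoint queries
match at least one message):

    select (select msgnum from message where dateSent < ?  -- t2
                order by dateSent desc limit 1)
         - (select msgnum from message where dateSent > ?  -- t1
                order by dateSent limit 1)
         + 1;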

However, when I added the condition "and sectionID = ?", the performance
dropped below that of the simple count(*) query that I mentioned first.

To work around that, I decided to create a message number tracking
table for each section (227 tables!), with an index on dateSent
for each table (227 indices!), and to use dynamically composed queries.

Querying for a message count that way runs roughly 60 times faster
than the previous approach, which is very good. However, I am concerned
about the need to keep these extra tables up to date (with triggers)
when inserting rows into "message", and I am also worried about the
overhead of creating/dropping a table per section and about the
overall lack of elegance of my solution.
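
For concreteness, a rough sketch of what one such table and its insert
trigger might look like (all names here are hypothetical, and PL/pgSQL
trigger syntax varies somewhat across PostgreSQL versions):

    create table scnt_9 (
        msgnum   integer not null,
        dateSent timestamp not null
    );
    create index scnt_9_datesent_idx on scnt_9 (dateSent);

    -- assign the next per-section msgnum whenever a message
    -- for section 9 is inserted
    create function scnt_track_9() returns trigger as '
    begin
        if new.sectionID = 9 then
            insert into scnt_9 (msgnum, dateSent)
                values ((select coalesce(max(msgnum), 0) + 1 from scnt_9),
                        new.dateSent);
        end if;
        return new;
    end;
    ' language 'plpgsql';

    create trigger scnt_track_9_trig after insert on message
        for each row execute procedure scnt_track_9();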

In particular, I am wondering why an index on message(sectionID, dateSent)
does not make these queries comparably fast:

    select msgnum from message where
        sectionID = ? and
        dateSent > ?
        order by dateSent
        limit 1;

    select msgnum from scnt_9 where
        dateSent > ?
        order by dateSent
        limit 1;

(scnt_9 is a lookup table which only contains msgnums for messages
with sectionID = 9)
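
For reference, the composite index in question would be created as
(index name assumed):

    create index message_section_datesent_idx
        on message (sectionID, dateSent);

One variant that might be worth testing (an untested guess about the
planner, since the sort order then matches the index columns exactly):

    select msgnum from message where
        sectionID = ? and
        dateSent > ?
        order by sectionID, dateSent
        limit 1;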

I would be grateful for your advice. Is anyone else creating hundreds
of lookup tables to cope with performance problems? Is this an issue
which would be nicely solved by partitioning tables (or indices) by
column value in a commercial RDBMS like Oracle?

Best regards -
Jan Ploski

