Creating tons of tables to support a query - Mailing list pgsql-general
| From | Jan Ploski |
| --- | --- |
| Subject | Creating tons of tables to support a query |
| Date | |
| Msg-id | 14629117.1031522324792.JavaMail.jpl@remotejava |
| Responses | Re: Creating tons of tables to support a query |
| | Re: Creating tons of tables to support a query |
| | Re: Creating tons of tables to support a query |
| List | pgsql-general |
Hello,

I have apparently solved a certain performance problem and would like to consult with other PostgreSQL users to figure out whether I am doing things the right way.

I have two tables, "section" and "message". Each message has a "dateSent" timestamp and belongs to exactly one section, so I have a foreign key sectionID in message. The task is to efficiently figure out the number of messages that arrived in a section after any given point in time. In other words, optimize this query:

    select count(*) from message where sectionID = ? and dateSent > ?

At present there are about 11,000 messages and 227 sections, with messages distributed rather unevenly across sections. The table "message" also contains a small number of private messages, whose sectionID is null.

There is an index on message.dateSent, which PostgreSQL decides not to use for the above query. This gives me a plan like this:

    Aggregate  (cost=1419.26..1419.26 rows=1 width=0)
      ->  Seq Scan on message  (cost=0.00..1408.13 rows=4449 width=0)

...with an execution time of 7.90 seconds for 100 queries. If I set enable_seqscan=no, the same test takes 5.21 seconds, which is still much too long.

To improve the performance, I originally decided to add a column "msgnum" to the table "message", which would be incremented as each message is inserted. To figure out the number of messages that arrived between two points in time [t1,t2], I'd find the lowest message number after t1 and the highest message number before t2, and compute msgnum_high - msgnum_low + 1. (When messages are deleted, renumbering would have to occur.)

This alternate approach worked very well on a test table that included only messages from a single section. I used queries such as:

    select msgnum from message where dateSent > ? order by dateSent limit 1

and

    select msgnum from message where dateSent < ? order by dateSent desc limit 1

However, when I added the condition "and sectionID = ?", the performance dropped below that of the simple count(*) query that I mentioned first.

To work around that, I decided to create a message number tracking table for each section (227 tables!), with an index on dateSent for each table (227 indices!), and to use dynamically composed queries. Querying for a message count that way works roughly 60 times faster than the previous approach, i.e. very well. However, I am concerned about the need to keep these extra tables up to date (with triggers) when inserting rows into "message", and I am also worried about the overhead of creating/dropping a table per section and about the overall lack of elegance of my solution.

I am in particular wondering why an index on message(sectionID, dateSent) does not make these queries comparably fast:

    select msgnum from message where sectionID = ? and dateSent > ? order by dateSent limit 1;

    select msgnum from scnt_9 where dateSent > ? order by dateSent limit 1;

(scnt_9 is a lookup table which only contains msgnums for messages with sectionID = 9.)

I would be grateful for your advice. Is anyone else creating hundreds of lookup tables to cope with performance problems? Is this an issue that would be nicely solved by partitioning tables (or indices) by column value in a commercial RDBMS like Oracle?

Best regards -
Jan Ploski
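
PS: In case it helps to reproduce the setup, the schema boils down to roughly this (a sketch, not a dump; "messageID" is only a stand-in for the real primary key, and unrelated columns are omitted):

    create table section (
        sectionID   integer primary key
        -- other columns omitted
    );

    create table message (
        messageID   integer primary key,          -- stand-in name
        sectionID   integer references section,   -- null for private messages
        dateSent    timestamp not null,
        msgnum      integer                       -- the running number described above
    );

    create index message_dateSent on message (dateSent);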
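
The per-section tracking tables and the trigger I have in mind would look something like this (an untested sketch; the function and trigger names are invented, and the delete/renumbering case is omitted):

    -- one such table (and index) per section, e.g. for sectionID = 9:
    create table scnt_9 (
        msgnum      integer,
        dateSent    timestamp
    );
    create index scnt_9_dateSent on scnt_9 (dateSent);

    -- mirror every new non-private message into its section's tracking table
    -- (newer PostgreSQL would spell this "returns trigger")
    create function track_message() returns opaque as '
    begin
        if new.sectionID is not null then
            -- the table name is composed dynamically, hence EXECUTE
            execute ''insert into scnt_'' || new.sectionID
                || '' (msgnum, dateSent) select coalesce(max(msgnum), 0) + 1, ''
                || quote_literal(new.dateSent)
                || '' from scnt_'' || new.sectionID;
        end if;
        return new;
    end;
    ' language 'plpgsql';

    create trigger message_track after insert on message
        for each row execute procedure track_message();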
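
Finally, for concreteness, the composite index my last question refers to, plus a variant I have not actually verified: if the planner only considers a multicolumn index for this kind of query when the ORDER BY lists the index columns in order, the second spelling below might fare better, but that is just a guess on my part.

    create index message_sec_date on message (sectionID, dateSent);

    -- the query as asked above:
    select msgnum from message
     where sectionID = ? and dateSent > ?
     order by dateSent limit 1;

    -- variant with the ORDER BY spelled to match the index column order:
    select msgnum from message
     where sectionID = ? and dateSent > ?
     order by sectionID, dateSent limit 1;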