Thread: Creating tons of tables to support a query
Hello,

I have apparently solved a certain performance problem and would like to consult other PostgreSQL users to find out whether I am doing things the right way.

I have two tables, "section" and "message". Each message has a "dateSent" timestamp and belongs to exactly one section, so "message" carries a foreign key sectionID. The task is to efficiently determine the number of messages that arrived in a section after any given point in time. In other words, optimize this query:

    select count(*) from message where sectionID = ? and dateSent > ?

At present there are about 11,000 messages and 227 sections, with messages distributed rather unevenly across sections. The table "message" also contains a small number of private messages, whose sectionID is null.

There is an index on message.dateSent, which PostgreSQL decides not to use when executing the above query. This gives me a plan like this:

    Aggregate  (cost=1419.26..1419.26 rows=1 width=0)
      ->  Seq Scan on message  (cost=0.00..1408.13 rows=4449 width=0)

...with an execution time of 7.90 seconds for 100 queries. If I set enable_seqscan=no, the same test takes 5.21 seconds, which is still much too long.

To improve performance, I originally decided to add a column "msgnum" to the table "message", to be incremented as each message is inserted. To determine the number of messages that arrived between two points in time [t1,t2], I would find the lowest message number after t1 and the highest message number before t2, and compute msgnum_high - msgnum_low + 1. (When messages are deleted, renumbering has to occur.) This alternate approach worked very well on a test table that contained messages from only a single section. I used queries such as:

    select msgnum from message where dateSent > ? order by dateSent limit 1

    select msgnum from message where dateSent < ? order by dateSent desc limit 1

However, when I added the condition "and sectionID = ?", performance dropped below that of the simple count(*) query I mentioned first.

To work around that, I decided to create a message number tracking table for each section (247 tables!), with an index on dateSent for each table (247 indices!), and to use dynamically composed queries. Querying for a message count that way runs roughly 60 times faster than the previous approach, which is very good. However, I am concerned about the need to keep these extra tables up to date (with triggers) when inserting rows into "message", and also worried about the overhead of creating/dropping a table per section and about the overall lack of elegance of my solution.

I am in particular wondering why an index on message(sectionID, dateSent) does not make these queries comparably fast:

    select msgnum from message where
        sectionID = ? and
        dateSent > ?
    order by dateSent
    limit 1;

    select msgnum from scnt_9 where
        dateSent > ?
    order by dateSent
    limit 1;

(scnt_9 is a lookup table which only tracks msgnums for messages with sectionID = 9.)

I would be grateful for your advice. Is anyone else creating hundreds of lookup tables to cope with performance problems? Is this an issue that would be solved nicely by partitioning tables (or indices) by column value in a commercial RDBMS like Oracle?

Best regards -
Jan Ploski
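For reference, a minimal sketch of the schema described above; the post never shows the DDL, so the column types and constraints here are my assumptions:

```sql
-- Hypothetical reconstruction of the schema in the post.
-- Table and column names come from the post; types are guesses.
CREATE TABLE section (
    sectionID   integer PRIMARY KEY
);

CREATE TABLE message (
    msgnum      integer,
    sectionID   integer REFERENCES section (sectionID),  -- null for private messages
    dateSent    timestamp NOT NULL
);

-- The query to be optimized: count the messages that arrived in
-- one section after a given point in time.
SELECT count(*)
FROM message
WHERE sectionID = 9
  AND dateSent > '2002-09-01 00:00:00';
```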
On Sun, Sep 08, 2002 at 11:58:44PM +0200, Jan Ploski wrote:
> Hello,
>
> I am in particular wondering, why an index on message(sectionID, dateSent)
> does not make these queries comparably fast:
>
> select msgnum from message where
>     sectionID = ? and
>     dateSent > ?
> order by dateSent
> limit 1;
>
> select msgnum from scnt_9 where
>     dateSent > ?
> order by dateSent
> limit 1;
>
> (scnt_9 is a lookup table which only creates msgnums for messages
> with sectionID == 9)

Can you send the results of EXPLAIN ANALYZE for both those queries? Thanks.

-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
On Sun, Sep 08, 2002 at 23:58:44 +0200, Jan Ploski <jpljpl@gmx.de> wrote:
>
> select count(*) from message where sectionID = ? and dateSent > ?

If the dateSent is the same for each query, you should probably be doing it in one query. Something like:

    select sectionID, count(*) from message where dateSent > ? group by sectionID
On Mon, Sep 09, 2002 at 09:36:17AM +1000, Martijn van Oosterhout wrote:
> On Sun, Sep 08, 2002 at 11:58:44PM +0200, Jan Ploski wrote:
> > Hello,
> >
> > I am in particular wondering, why an index on message(sectionID, dateSent)
> > does not make these queries comparably fast:
> >
> > select msgnum from message where
> >     sectionID = ? and
> >     dateSent > ?
> > order by dateSent
> > limit 1;
> >
> > select msgnum from scnt_9 where
> >     dateSent > ?
> > order by dateSent
> > limit 1;
> >
> > (scnt_9 is a lookup table which only creates msgnums for messages
> > with sectionID == 9)
>
> Can you send the results of EXPLAIN ANALYZE for both those queries. Thanks.

Martijn,

I can't run EXPLAIN ANALYZE (I am using 7.1.3), but here are the results of EXPLAIN:

    Limit  (cost=1677.74..1677.74 rows=1 width=10)
      ->  Sort  (cost=1677.74..1677.74 rows=4449 width=10)
            ->  Seq Scan on message  (cost=0.00..1408.13 rows=4449 width=10)

    Limit  (cost=0.00..0.05 rows=1 width=12)
      ->  Index Scan using scnt_idx_9 on scnt_9  (cost=0.00..234.47 rows=4661 width=

The fast query is obviously doing less work. Any ideas why?

Thank you -
JPL
On Sun, Sep 08, 2002 at 06:51:55PM -0500, Bruno Wolff III wrote:
> On Sun, Sep 08, 2002 at 23:58:44 +0200,
>   Jan Ploski <jpljpl@gmx.de> wrote:
> >
> > select count(*) from message where sectionID = ? and dateSent > ?
>
> If the dateSent is the same for each query, you should probably be doing
> it in one query. Something like:
> select sectionID, count(*) from message where dateSent > ? group by sectionID

Bruno,

The dateSent is not the same for each section, though it may be the same for several sections (in my app, sections belong to forums, and dateSent is the same for all sections of the same forum). Anyway, I would like to avoid counting by traversing a huge number of rows, and I think that is what the backend does behind the scenes for such a query.

-JPL
On Mon, Sep 09, 2002 at 01:54:08 +0200, Jan Ploski <jpljpl@gmx.de> wrote:
>
> The dateSent is not the same for each section, though it may be the same
> for several sections (in my app, sections belong to forums, and dateSent
> is the same for all sections of the same forum). Anyway, I would like to
> avoid counting by traversing a huge number of rows, and I think that is
> what the backend does behind the scenes for such a query.

Well, it sounds like GROUP BY won't buy you all that much. However, counts aren't saved anywhere, so you are going to need to look at each row that satisfies the condition of the query.

You might want to double-check that the two timestamps (the one in the table and the constant used for a particular query) are of the same type. If not, you may need a cast to get postgres to use the index.
On Sun, 8 Sep 2002, Jan Ploski wrote:
> I am in particular wondering, why an index on message(sectionID, dateSent)
> does not make these queries comparably fast:
>
> select msgnum from message where
>     sectionID = ? and
>     dateSent > ?
> order by dateSent
> limit 1;

I don't think that'll use an index on (sectionID, dateSent) for the sort step. I think an index on (dateSent, sectionID) might be used, however.
On Sun, Sep 08, 2002 at 07:49:32PM -0700, Stephan Szabo wrote:
> On Sun, 8 Sep 2002, Jan Ploski wrote:
> > I am in particular wondering, why an index on message(sectionID, dateSent)
> > does not make these queries comparably fast:
> >
> > select msgnum from message where
> >     sectionID = ? and
> >     dateSent > ?
> > order by dateSent
> > limit 1;
>
> I don't think that'll use an index on (sectionID, dateSent) for the sort
> step. I think an index on (dateSent,sectionID) might be, however.

Stephan,

Indeed, my mistake. With an index on (dateSent, sectionID), the plan becomes:

    Limit  (cost=0.00..2.36 rows=1 width=10)
      ->  Index Scan using test_idx2 on message  (cost=0.00..10482.08 rows=4449 width=10)

Alas, this does not help me further. I ran two tests:

    Test 1: Section 9 contained 5143 messages.
    Test 2: Section 241 contained 0 messages.

The timing results (for 5000 queries) are:

    1. Using index on message(dateSent, sectionID): 11 seconds
       Using index on scnt_9(dateSent): 17 seconds

    2. Using index on message(dateSent, sectionID): 320 seconds
       Using index on scnt_241(dateSent): 2 seconds

The problem is that (apparently?) the whole (dateSent, sectionID) index must be scanned in the second test, while the scnt_241 index simply contains no values and yields quick results.

Since the auxiliary tables speed things up so much and behave well for sections with few messages, I tend to believe that this is the way to go for me. Two questions remain open. First, what kind of overhead do I incur by creating that many tables (hundreds, maybe thousands in the future)? Second, since there is no support for PL/pgSQL "execute select ... into" in 7.1.3, I need to collect results by inserting them into a temporary table. Is this kind of "execute" statement implemented in the newest version of PostgreSQL yet?

Take care -
JPL
On Monday 09 Sep 2002 11:52 am, Jan Ploski wrote:
> Indeed, my mistake. With an index on (dateSent,sectionID), the plan
> becomes:
>
> Limit  (cost=0.00..2.36 rows=1 width=10)
>   ->  Index Scan using test_idx2 on message  (cost=0.00..10482.08 rows=4449
>       width=10)
>
> Alas, this does not help me further. I did two tests:
>
> Test 1: Section 9 contained 5143 messages.
> Test 2: Section 241 contained 0 messages.
>
> The timing results (for 5000 queries) are:
>
> 1. Using index on message(dateSent, sectionID): 11 seconds
>    Using index on scnt_9(dateSent): 17 seconds
>
> 2. Using index on message(dateSent, sectionID): 320 seconds
>    Using index on scnt_241(dateSent): 2 seconds
>
> The problem is that (apparently?) the whole (dateSent, sectionID) index
> must be scanned in the second test, while the scnt_241 index simply
> contains no values and yields quick results.

Have you considered using partial indexes? You can set up something like:

    CREATE INDEX msg_idx_9 ON message (dateSent) WHERE sectionID=9

for each section you have - this should give you the indexing advantage without the overhead of separate tables. This feature is non-standard AFAIK and is covered in section 7.8 of the manual.

- Richard Huxton
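Richard's suggestion, spelled out a little (the index name and section number are just examples taken from the thread):

```sql
-- One partial index per section; shown here for section 9 only.
-- The index contains only the rows with sectionID = 9, so for a
-- near-empty section it stays tiny.
CREATE INDEX msg_idx_9 ON message (dateSent) WHERE sectionID = 9;

-- A query with the same literal predicate can use the partial index:
SELECT msgnum
FROM message
WHERE sectionID = 9
  AND dateSent > '2000-11-12 02:05:35.94'
ORDER BY dateSent
LIMIT 1;
```

One caveat: the planner only considers a partial index when it can prove that the query's WHERE clause implies the index predicate, which requires the literal 9 rather than a variable - this detail turns out to matter later in the thread.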
On Mon, Sep 09, 2002 at 02:10:47PM +0100, Richard Huxton wrote:
> On Monday 09 Sep 2002 11:52 am, Jan Ploski wrote:
> > The problem is that (apparently?) the whole (dateSent, sectionID) index
> > must be scanned in the second test, while the scnt_241 index simply
> > contains no values and yields quick results.
>
> Have you considered using partial indexes? You can set up something like:
>
> CREATE INDEX msg_idx_9 ON message (dateSent) WHERE sectionID=9
>
> For each section you have - this should allow for the indexing advantage
> without the overhead of separate tables. This feature is non-standard AFAIK
> and is covered in section 7.8 of the manual.

Hello Richard,

Thanks for your hint. I did not know about this feature, and it sounds like a perfect fit for the task. However, having created the partial index for section 241 (the empty one), I ran into another problem.

    EXPLAIN ANALYZE select msgnum from message where sectionID = 241
        order by dateSent desc limit 1;

    Limit  (cost=0.00..56.43 rows=1 width=12) (actual time=0.03..0.03 rows=0 loops=1)
      ->  Index Scan Backward using message_pidx0_241 on message
            (cost=0.00..56.43 rows=0 width=12) (actual time=0.02..0.02 rows=0 loops=1)
    Total runtime: 0.14 msec

Looks fine. But 500 iterations of the following code:

    loop
        select msgnum from message into v_cnt where sectionID = 241
            order by dateSent desc limit 1;
    end loop;

report an execution time of 00:00:00.03179, while 500 iterations of the following (which is closer to what I need):

    v_sid := 241;
    loop
        select msgnum from message into v_cnt where sectionID = v_sid
            order by dateSent desc limit 1;
    end loop;

take a whopping 00:00:31.442402. Needless to say, I don't have this problem with

    select msgnum from scnt_241 into v_cnt order by dateSent desc limit 1;

But I'd really like to drop all these extra tables. How can I find out what is going on behind the scenes?

Regards -
JPL
Stephan Szabo said:
> On Sun, 8 Sep 2002, Jan Ploski wrote:
> > I am in particular wondering, why an index on message(sectionID, dateSent)
> > does not make these queries comparably fast:
> >
> > select msgnum from message where
> >     sectionID = ? and
> >     dateSent > ?
> > order by dateSent
> > limit 1;
>
> I don't think that'll use an index on (sectionID, dateSent) for the sort
> step. I think an index on (dateSent,sectionID) might be, however.

I know I've read this before on the list (probably several times). But either my skull is too thick or the topic too abstract: why is an index on (sectionID, dateSent) not used, while one on (dateSent, sectionID) is? They are the same columns, just reversed. I don't see why that would make a difference. Is there some rule of thumb for determining when an index is used and when it isn't, rather than trial and error using EXPLAIN?

Shane
Jan Ploski <jpljpl@gmx.de> writes:
> On Sun, Sep 08, 2002 at 07:49:32PM -0700, Stephan Szabo wrote:
>> On Sun, 8 Sep 2002, Jan Ploski wrote:
>>> I am in particular wondering, why an index on message(sectionID, dateSent)
>>> does not make these queries comparably fast:
>>>
>>> select msgnum from message where
>>>     sectionID = ? and
>>>     dateSent > ?
>>> order by dateSent
>>> limit 1;
>>
>> I don't think that'll use an index on (sectionID, dateSent) for the sort
>> step. I think an index on (dateSent,sectionID) might be, however.
>
> Alas, this does not help me further. I did two tests:

Yes, it makes sense that for a little-used section that way wouldn't be very efficient. I would suggest that you want to use an index on (sectionID, dateSent), and that the way to make the system do the right thing is

    select msgnum from message where
        sectionID = ? and
        dateSent > ?
    order by sectionID, dateSent
    limit 1;

Without the extra ORDER BY clause, the planner is not smart enough to see that the requested ordering actually matches the index ordering.

Another possible gotcha is that depending on datatype details the planner might be using only one of the two index columns. As far as I noticed, you didn't tell us the exact column datatypes or the exact form in which the comparison values are supplied?

			regards, tom lane
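Tom's fix as a complete sketch (the index name and literal values here are mine):

```sql
-- Composite index: equality column first, range column second.
CREATE INDEX message_sid_date_idx ON message (sectionID, dateSent);

-- Spelling out sectionID in the ORDER BY lets the planner match the
-- requested ordering to the index ordering, so the scan can start at
-- the first entry with (sectionID, dateSent) past the given values
-- and stop after one row - no sort step, no full scan.
SELECT msgnum
FROM message
WHERE sectionID = 9
  AND dateSent > '2000-11-12 02:05:35.94'
ORDER BY sectionID, dateSent
LIMIT 1;
```

For an empty section this degenerates gracefully: the index scan finds no entry in the sectionID = 9 range and returns immediately, which is exactly the case where the (dateSent, sectionID) ordering performed worst.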
On Monday 09 Sep 2002 3:22 pm, Jan Ploski wrote:
[snipped - trying partial indexes]
> But: 500 iterations of the following code
>
> loop
>     select msgnum from message into v_cnt where sectionID = 241
>         order by dateSent desc limit 1;
> end loop;
>
> report execution time 00:00:00.03179, while 500 iterations of the following
> (which is closer to what I need):
>
> v_sid := 241;
>
> loop
>     select msgnum from message into v_cnt where sectionID = v_sid
>         order by dateSent desc limit 1;
> end loop;
>
> take a whopping 00:00:31.442402. Needless to say, I don't have this problem
> with select msgnum from scnt_241 into v_cnt order by dateSent desc limit 1.
> But I'd really like to drop all these extra tables.
>
> How can I find out what is going on behind the scenes?

Look at the section on logging/debugging in your postgresql.conf file (also the SET/SHOW commands and the manuals).

However, if that code is plpgsql, then the query plan is fixed at compile time, which might well mean that it isn't using the partial indexes. Normally you'd use EXECUTE to get around this, but you can't do that with SELECT ... INTO.

There are only two options I can think of at the moment:

1. Rewrite the plpgsql as an sql function/plperl etc (anything dynamic).
2. Take a step back and keep a separate table of messages per minute/hour/whatever, and keep it up to date with triggers.

Something smarter might occur to me later or (more likely) to someone else, but I've got to finish something off at the moment.

HTH -
Richard Huxton
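Richard's second option might look roughly like this. The thread never shows such code, so everything here is a sketch under my own naming; it uses the quoted-function-body style of the 7.x era, and the IF NOT FOUND upsert idiom assumes a release where FOUND is set by UPDATE:

```sql
-- Sketch: per-section, per-hour message counts kept current by a trigger.
-- Counting messages after time t then becomes a small aggregate over the
-- summary table instead of a scan of "message".
CREATE TABLE section_counts (
    sectionID   integer,
    hourStart   timestamp,
    msgCount    integer DEFAULT 0,
    PRIMARY KEY (sectionID, hourStart)
);

CREATE FUNCTION bump_section_count() RETURNS opaque AS '
BEGIN
    UPDATE section_counts
       SET msgCount = msgCount + 1
     WHERE sectionID = NEW.sectionID
       AND hourStart = date_trunc(''hour'', NEW.dateSent);
    IF NOT FOUND THEN
        INSERT INTO section_counts (sectionID, hourStart, msgCount)
        VALUES (NEW.sectionID, date_trunc(''hour'', NEW.dateSent), 1);
    END IF;
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER message_count_trig
    AFTER INSERT ON message
    FOR EACH ROW EXECUTE PROCEDURE bump_section_count();

-- Messages in section 9 since a given hour boundary:
--   SELECT sum(msgCount) FROM section_counts
--   WHERE sectionID = 9 AND hourStart >= '2002-09-01 00:00:00';
```

The trade-offs: the count is only accurate to the chosen bucket size, deletes would need a matching trigger, and two concurrent inserts into a brand-new bucket can race on the INSERT branch.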
On Mon, Sep 09, 2002 at 11:41:38AM -0400, Tom Lane wrote:
> Yes, it makes sense that for a little-used section that way wouldn't be
> very efficient. I would suggest that you want to use an index on
> (sectionID, dateSent), and that the way to make the system do the
> right thing is
>
> select msgnum from message where
>     sectionID = ? and
>     dateSent > ?
> order by sectionID, dateSent
> limit 1;
>
> Without the extra ORDER BY clause, the planner is not smart enough to
> see that the requested ordering actually matches the index ordering.

Tom, thank you for the advice. Now I get performance comparable to that of the partial index with "workarounds", i.e.

    select msgnum from message
        where sectionID=9 and dateSent>'2000-11-12 02:05:35.94'
        order by sectionID, dateSent limit 1;

works equally fast with an index on message(sectionID, dateSent) as through a partial index on message where sectionID = 9. When executed on the empty section 241, the statement is way faster than the partial-index-based solution (0.67 seconds vs 8.5 seconds), probably because I had to resort to trickery to make the latter work reasonably at all.

What I still cannot grasp is why

    select msgnum into v_cnt from message where sectionID = 241
        order by dateSent desc limit 1;

is so much faster than

    v_sid := 241;
    select msgnum into v_cnt from message where sectionID = v_sid
        order by dateSent desc limit 1;

as I mentioned in an earlier message. In fact, to get the partial index solution up to decent performance, I had to write something like

    delete from tmp;
    execute ''insert into tmp select msgnum from message
        where sectionID = '' || v_sid || '' order by dateSent desc limit 1'';
    select max(cnt) into v_cnt from tmp;

Following Richard's suggestion, I turned query debugging on in postgresql.conf. What jumps into my eyes in the postgres usage stats output after executing the query is

    Shared blocks: 132 read     (sectionID = 241)

vs

    Shared blocks: 1517 read    (sectionID = v_sid)

When I wrap the select statement into the awkward execute, I get "152 read", which is still much, much better than "1517 read", if that (apart from execution time) can be taken as an indicator of performance. Do you have any clues?

> Another possible gotcha is that depending on datatype details the
> planner might be using only one of the two index columns. As far
> as I noticed, you didn't tell us the exact column datatypes or the
> exact form in which the comparison values are supplied?

The column types are integer for sectionID and timestamp for dateSent. I am passing parameters of these types into a PL/pgSQL procedure, which then executes a "select into" with these parameters in the where clause.

Take care -
JPL
Jan Ploski <jpljpl@gmx.de> writes:
> What I still cannot grasp is why
>
> select msgnum into v_cnt from message where sectionID = 241
>     order by dateSent desc limit 1;
>
> is so much faster than
>
> v_sid := 241;
> select msgnum into v_cnt from message where sectionID = v_sid
>     order by dateSent desc limit 1;

The latter cannot use a partial index because the sectionID value is a parameter, not a literal constant. The system has no way to know that the SELECT won't be re-executed with a different value of v_sid, so it can't generate a query plan that relies on the specific value of v_sid. Thus, no partial-index-using plan will be produced.

You can get around that by judicious use of EXECUTE, because it doesn't cache a query plan. But I see no need to; the partial-index approach is going to be inferior to a correctly used single index anyway, because the sheer number of indexes will bog things down (especially updates).

>> Another possible gotcha is that depending on datatype details the
>> planner might be using only one of the two index columns. As far
>> as I noticed, you didn't tell us the exact column datatypes or the
>> exact form in which the comparison values are supplied?
>
> The column types are integer for sectionID and timestamp for dateSent.
> I am passing parameters of these types into a PL/pgSQL procedure, which
> then executes a "select into" with these parameters in the where clause.

That should be okay. People tend to get burnt with int2 and int8 columns ...

			regards, tom lane
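A sketch of the EXECUTE workaround Tom mentions. The function name is mine, and since EXECUTE cannot feed a SELECT ... INTO on old releases, it captures the dynamic query's result with a FOR ... IN EXECUTE loop, assuming that form is available in your version:

```sql
-- Sketch: interpolating the section ID as a literal and replanning per
-- call with EXECUTE, so a partial index on that sectionID can be chosen.
CREATE FUNCTION last_msgnum(integer) RETURNS integer AS '
DECLARE
    v_sid ALIAS FOR $1;
    r     RECORD;
BEGIN
    FOR r IN EXECUTE ''SELECT msgnum FROM message ''
                  || ''WHERE sectionID = '' || v_sid
                  || '' ORDER BY dateSent DESC LIMIT 1''
    LOOP
        RETURN r.msgnum;  -- first (and only) row
    END LOOP;
    RETURN NULL;          -- empty section
END;
' LANGUAGE 'plpgsql';
```

As Tom says, though, a single (sectionID, dateSent) index with sectionID spelled out in the ORDER BY avoids both the dynamic SQL and the per-section indexes.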
S Dawalt <shane.dawalt@wright.edu> writes:
>>> select msgnum from message where
>>>     sectionID = ? and
>>>     dateSent > ?
>>> order by dateSent
>>> limit 1;
>>
>> I don't think that'll use an index on (sectionID, dateSent) for the sort
>> step. I think an index on (dateSent,sectionID) might be, however.
>
> I know I've read this before on the list (probably several times). But
> either my skull is too thick or the topic too abstract; why is no index used
> for (sectionID, dateSent) but (dateSent, sectionID) does?

The issue is whether the indexscan satisfies the ORDER BY condition or just the WHERE conditions. If the planner thinks it needs both an indexscan and a subsequent SORT step, it is much less likely to choose the indexscan-based plan --- and rightfully so in this case, since the LIMIT doesn't help if you have to sort the data before you know which is the single output row you should return. That is,

    Limit
      -> Index Scan

can be a very cheap plan, but

    Limit
      -> Sort
           -> Index Scan

is not likely to be cheap, because the LIMIT helps not at all for aborting the indexscan or the sort short of completion.

Now, you know and I know that given the constraint "WHERE sectionID = ?" it would actually be okay to pretend that indexscanning an index on (sectionID, dateSent) yields data ordered simply by dateSent. The planner will not currently make that deduction, however, and so you have to help it along by asking for your data "ORDER BY sectionID, dateSent". The system is able to match that to the sort ordering of the two-column index and realize that it needs no SORT step.

			regards, tom lane
On Mon, Sep 09, 2002 at 11:13:01AM -0400, S Dawalt wrote:
> Stephan Szabo said:
> > On Sun, 8 Sep 2002, Jan Ploski wrote:
> > > I am in particular wondering, why an index on message(sectionID, dateSent)
> > > does not make these queries comparably fast:
> > >
> > > select msgnum from message where
> > >     sectionID = ? and
> > >     dateSent > ?
> > > order by dateSent
> > > limit 1;
> >
> > I don't think that'll use an index on (sectionID, dateSent) for the sort
> > step. I think an index on (dateSent,sectionID) might be, however.
>
> I know I've read this before on the list (probably several times). But
> either my skull is too thick or the topic too abstract; why is no index used
> for (sectionID, dateSent) but (dateSent, sectionID) does? They are the same
> columns, but just reversed. I don't see why that would make a difference.
> Is there some rule-of-thumb for determining when an index is used and when
> it isn't rather than trial and error using EXPLAIN?

Hmm, take out the ORDER BY. How long does it take then? How about trying:

    select * from (select msgnum, datesent from message
        where sectionID = ? and dateSent > ?) as sub
    order by datesent limit 1;

Maybe that will force the plan you want.

-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.