Thread: SQL report
I have the following scenario.

I have a tracking system. The system records the status of an object regularly; all the status records are stored in one table, and it keeps a history of at most 1000 status records for each object it tracks. The maximum number of objects the system will track is 100,000, which means I will potentially have a table of 100 million records.

I have to generate a report on the latest status of all objects being tracked at a particular point in time, and I also have to allow the user to sort and filter on different columns of the status record displayed in the report.

The following is a brief description of the status record (not actual code):

ObjectRecord (
  objectId bigint PrimaryKey
  desc varchar
)

StatusRecord (
  id bigint PrimaryKey
  objectId bigint indexed
  datetime bigint indexed
  capacity double
  reliability double
  efficiency double
)

I have tried the following. It works very well with around 20,000 objects (the query returns in less than 10s), but when I have 100,000 objects it becomes very, very slow. (I don't even have the patience to wait for it to return... I killed it after 30 mins.)

select *
from statusrecord s1
INNER JOIN (
  SELECT objectId, MAX(datetime) AS msdt
  FROM statusrecord
  WHERE datetime <= 1233897527657
  GROUP BY objectId
) AS s2 ON (s1.objectId = s2.objectId AND s1.datetime = s2.msdt)
where (capacity < 10.0)
order by s1.datetime DESC, s1.objectId DESC;

I did try to write a stored procedure like the one below; for 100,000 objects and 1000 status records per object, it returns in around 30 mins.

CREATE OR REPLACE FUNCTION getStatus(pitvalue BIGINT) RETURNS SETOF statusrecord AS $BODY$
DECLARE
  object RECORD;                 -- loop variable (must be declared in PL/pgSQL)
  status statusrecord%ROWTYPE;
BEGIN
  FOR object IN SELECT * FROM objectRecord
  LOOP
    EXECUTE 'SELECT * FROM statusrecord WHERE objectId = ' || quote_literal(object.objectId) ||
            ' AND datetime <= ' || quote_literal(pitvalue) || ' ORDER BY datetime DESC'
    INTO status;                 -- INTO keeps only the first row, i.e. the latest status
    IF FOUND THEN
      RETURN NEXT status;
    END IF;
  END LOOP;
  RETURN;
END
$BODY$ LANGUAGE plpgsql;

I just want to know if anyone has a different approach to my scenario. Thanks a lot.

John
I would be curious to know the performance curve for let's say 20K, 40K, 60K, 80K, 100K records. And what sort of indexing you have, whether or not it's clustered, re-built and so on.

One could envision partitioning the status table such that recent records were grouped together, on the assumption that they will be most frequently "reported" (a rough sketch follows below).

wkipjohn@gmail.com wrote:
> I have the following scenario.
> ...
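For illustration only, such a recency split might look roughly like the sketch below. It assumes PostgreSQL's inheritance-based partitioning; the cut-off timestamp and object names are invented, new rows would still need to be routed into the right child table, and constraint_exclusion must be on for the planner to skip the old partition.

-- Recent rows in one child table, older history in another.
CREATE TABLE statusrecord_recent (
  CHECK (datetime >= 1230000000000)      -- hypothetical cut-off
) INHERITS (statusrecord);

CREATE TABLE statusrecord_old (
  CHECK (datetime < 1230000000000)
) INHERITS (statusrecord);

-- Index the partition the reports actually hit.
CREATE INDEX statusrecord_recent_obj_dt_idx
  ON statusrecord_recent (objectId, datetime);

A periodic job (or an insert trigger on statusrecord) would keep new rows flowing into statusrecord_recent and migrate rows across the cut-off into statusrecord_old.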
wkipjohn@gmail.com wrote:
> I have the following scenario.
> ...
> I have to generate a report on the latest status of all objects being
> tracked at a particular point in time, and also I have to allow the user
> to sort and filter on different columns in the status record displayed
> in the report.
> ...

Not knowing all the details of your system, here are some things you could experiment with:

1. Add a "latest record id" field in your object table (automatically updated with a trigger) that would allow you to do a simple join with the tracking table. I suspect that such a join will be far faster than calculating "max" 100,000 times, at the expense of a slightly larger main table.

2. Add a "current record flag" in the status table that simply flags the most recent record for each object (again, use triggers to keep the flag appropriately updated). This would also eliminate the need for the "max" subquery. You could even create a partial index filtering on the "current record flag", which could speed things up if the reporting query is written correctly (a rough sketch follows at the end of this message).

3. Partition the table into a "current status table" and a "historical status table" (each inheriting from the main table). Use a trigger so that any time a new status record is added, the old "current" record is moved from the "current" to the "historical" table and the new one is added to the "current" table. The latest-status report will then only need a simple join against the "current" table, with a maximum size of 100,000 rows, rather than a more complex query over a 100,000,000-record table.

Cheers,
Steve
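As a sketch only, suggestion 2 might look something like the following; the column, function, trigger, and index names are invented for illustration and the trigger logic is untested.

ALTER TABLE statusrecord ADD COLUMN is_current boolean NOT NULL DEFAULT false;

CREATE OR REPLACE FUNCTION statusrecord_mark_current() RETURNS trigger AS $$
BEGIN
  -- Clear the flag on the previous "current" row for this object,
  -- then flag the incoming row as current.
  UPDATE statusrecord SET is_current = false
   WHERE objectId = NEW.objectId AND is_current;
  NEW.is_current := true;
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER statusrecord_current_trg
  BEFORE INSERT ON statusrecord
  FOR EACH ROW EXECUTE PROCEDURE statusrecord_mark_current();

-- Partial index so the ~100,000 "current" rows can be scanned cheaply.
CREATE INDEX statusrecord_current_idx ON statusrecord (objectId) WHERE is_current;

-- The latest-status report then becomes a plain filtered scan:
SELECT * FROM statusrecord WHERE is_current AND capacity < 10.0
ORDER BY datetime DESC, objectId DESC;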
Did you look at the query plans for the various record counts? That might show which index is missing or misinformed :). I wonder if clustering the status table on objectId would help (a sketch follows below)? That does then require maintenance, so you might only load it at 75%.

wkipjohn@gmail.com wrote:
> Hi Rob,
>
> I have default B-Tree indexes created for each of the indexed columns
> and primary key columns (no multiple-column indexes, NULLS FIRST or
> DESC/ASC). I am using PostgreSQL 8.3 with the autovacuum daemon on.
> ...
> I will probably try to partition the status table to group the more
> recent status records together, to minimize the dataset I am querying.
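Concretely, that could be checked along these lines (a sketch only; statusrecord_objectid_idx is a guess at the existing single-column index name, and CLUSTER takes an exclusive lock and has to be re-run as the data churns):

-- See what the planner is doing at each data size.
EXPLAIN ANALYZE
SELECT s1.*
FROM statusrecord s1
INNER JOIN (SELECT objectId, MAX(datetime) AS msdt
            FROM statusrecord
            WHERE datetime <= 1233897527657
            GROUP BY objectId) s2
        ON s1.objectId = s2.objectId AND s1.datetime = s2.msdt
WHERE s1.capacity < 10.0;

-- Physically order the table so each object's history is contiguous on disk.
CLUSTER statusrecord USING statusrecord_objectid_idx;
ANALYZE statusrecord;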
Hi Steve,

Thanks for your suggestions. In my scenario, what is "current" depends on the user: if the user wants a status report as at 00:00 1st Jan 2009, then 00:00 1st Jan 2009 is current. So it is not possible to flag any record as current unless the user tells us what current is.

Cheers,
John

On Jul 31, 2009 2:41am, Steve Crawford <scrawford@pinpointresearch.com> wrote:
> Not knowing all the details of your system, here are some things you
> could experiment with:
> ...
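For that point-in-time requirement, one alternative to the MAX() subquery is PostgreSQL's DISTINCT ON. This is only a sketch, assuming a composite index on (objectId, datetime) exists to support it; it has not been tried against the real data.

-- Latest status per object at or before the requested point in time,
-- then filter and sort the per-object results.
SELECT *
FROM (
  SELECT DISTINCT ON (objectId) *
  FROM statusrecord
  WHERE datetime <= 1233897527657     -- the report's point in time
  ORDER BY objectId, datetime DESC
) latest
WHERE capacity < 10.0
ORDER BY datetime DESC, objectId DESC;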
Hi Rob,

I have default B-Tree indexes created for each of the indexed columns and the primary key columns (no multiple-column indexes, NULLS FIRST, or DESC/ASC). I am using PostgreSQL 8.3 with the autovacuum daemon on, so I assume ANALYZE will be run automatically to collect statistics for the planner, and that there is no maintenance needed for a B-tree index once it is created. (Please point it out if I am wrong about this.)

I will probably try to partition the status table to group the more recent status records together, to minimize the dataset I am querying.

Thx
John

On Jul 31, 2009 1:16am, Rob Sargent <robjsargent@gmail.com> wrote:
> I would be curious to know the performance curve for let's say 20K,
> 40K, 60K, 80K, 100K records. And what sort of indexing you have,
> whether or not it's clustered, re-built and so on.
>
> One could envision partitioning the status table such that recent
> records were grouped together (on the assumption that they will be
> most frequently "reported").
> ...
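For reference, since only single-column indexes exist so far, a multi-column index matching the "per object, newest first" access pattern would look like the sketch below (the index name is invented; DESC ordering in index definitions is available from 8.3):

CREATE INDEX statusrecord_objectid_datetime_idx
  ON statusrecord (objectId, datetime DESC);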