Thread: SQL report

SQL report

From
wkipjohn@gmail.com
Date:
I have the following scenario.

I have a tracking system. The system records the status of an object
regularly, and all the status records are stored in one table. It keeps a
history of at most 1000 status records for each object it tracks, and the
maximum number of objects the system will track is 100,000, which means I
will potentially have a table of 100 million records.

I have to generate a report on the latest status of all objects being
tracked at a particular point in time, and I also have to allow the user to
sort and filter on different columns of the status record displayed in the
report.

The following is a brief description of the status records (they are not
actual code):

ObjectRecord (
    objectId bigint PrimaryKey
    desc varchar
)

StatusRecord (
    id bigint PrimaryKey
    objectId bigint indexed
    datetime bigint indexed
    capacity double
    reliability double
    efficiency double
)

I have tried the following. It works very well with around 20,000 objects
(the query returns in less than 10s), but when I have 100,000 objects it
becomes very, very slow. (I don't even have the patience to wait for it to
return... I killed it after 30 mins.)

select * from statusrecord s1
INNER JOIN ( SELECT objectId, MAX(datetime) AS msdt
             FROM statusrecord
             WHERE startDatetime <= 1233897527657
             GROUP BY objectId ) AS s2
        ON ( s1.objectId = s2.objectId AND s1.datetime = s2.msdt )
WHERE ( capacity < 10.0 )
ORDER BY s1.datetime DESC, s1.objectId DESC;

I did try to write a stored procedure like the one below; for 100,000
objects and 1000 status records per object, it returns in around 30 mins.

CREATE OR REPLACE FUNCTION getStatus(pitvalue BIGINT) RETURNS SETOF statusrecord AS $BODY$
DECLARE
    id VARCHAR;
    status statusrecord%ROWTYPE;
BEGIN
    FOR object IN SELECT * FROM objectRecord
    LOOP
        EXECUTE 'SELECT * FROM statusrecord WHERE objectId = ' || quote_literal(object.objectId) ||
                ' AND datetime <= ' || quote_literal(pitvalue) || ' ORDER BY datetime DESC'
        INTO status;
        IF FOUND THEN
            RETURN NEXT status;
        END IF;
    END LOOP;
    RETURN;
END
$BODY$ LANGUAGE plpgsql;

I just want to know if anyone has a different approach to my scenario.
Thanks a lot.

John

Re: SQL report

From
Rob Sargent
Date:
I would be curious to know the performance curve for, let's say, 20K, 40K,
60K, 80K, 100K records, and what sort of indexing you have, whether or not
it's clustered, re-built and so on.

One could envision partitioning the status table such that recent 
records were grouped together (on the assumption that they will be most 
frequently "reported").

wkipjohn@gmail.com wrote:
> I have the following scenario.
>
> I have a tracking system. The system records the status of an object
> regularly, and all the status records are stored in one table. It keeps
> a history of at most 1000 status records for each object it tracks. The
> maximum number of objects the system will track is 100,000, which means
> I will potentially have a table of 100 million records.
>
> I have to generate a report on the latest status of all objects being
> tracked at a particular point in time, and I also have to allow the user
> to sort and filter on different columns of the status record displayed
> in the report.
>
> The following is a brief description of the status records (they are
> not actual code):
>
> ObjectRecord(
> objectId bigint PrimaryKey
> desc varchar
> )
>
> StatusRecord (
> id bigint PrimaryKey
> objectId bigint indexed
> datetime bigint indexed
> capacity double
> reliability double
> efficiency double
> )
>
> I have tried the following. It works very well with around 20,000
> objects (the query returns in less than 10s), but when I have 100,000
> objects it becomes very, very slow. (I don't even have the patience to
> wait for it to return... I killed it after 30 mins.)
>
> select * from statusrecord s1 INNER JOIN ( SELECT objectId , 
> MAX(datetime) AS msdt FROM statusrecord WHERE startDatetime <= 
> 1233897527657 GROUP BY objectId ) AS s2 ON ( s1.objectId = s2.objectId 
> AND s1.datetime = s2.msdt ) where ( capacity < 10.0 ) order by 
> s1.datetime DESC, s1.objectId DESC;
>
> I did try to write a stored procedure like the one below; for 100,000
> objects and 1000 status records per object, it returns in around 30 mins.
>
> CREATE OR REPLACE FUNCTION getStatus(pitvalue BIGINT) RETURNS SETOF 
> statusrecord AS $BODY$
> DECLARE
> id VARCHAR;
> status statusrecord%ROWTYPE;
> BEGIN
> FOR object IN SELECT * FROM objectRecord
> LOOP
> EXECUTE 'SELECT * FROM statusrecord WHERE objectId = ' || 
> quote_literal(object.objectId) ||
> ' AND datetime <= ' || quote_literal(pitvalue) || ' ORDER BY datetime 
> DESC'
> INTO status;
> IF FOUND THEN
> RETURN NEXT status;
> END IF;
> END LOOP;
> RETURN;
> END
> $BODY$ LANGUAGE plpgsql;
>
> I just want to know if anyone has a different approach to my scenario.
> Thanks a lot.
>
> John 


Re: SQL report

From
Steve Crawford
Date:
wkipjohn@gmail.com wrote:
> I have the following scenario.
>
> I have a tracking system. The system records the status of an object
> regularly, and all the status records are stored in one table. It keeps
> a history of at most 1000 status records for each object it tracks. The
> maximum number of objects the system will track is 100,000, which means
> I will potentially have a table of 100 million records.
>
> I have to generate a report on the latest status of all objects being
> tracked at a particular point in time, and I also have to allow the user
> to sort and filter on different columns of the status record displayed
> in the report.
...
>
> I just want to know if anyone has a different approach to my scenario.
> Thanks a lot.
>
Not knowing all the details of your system, here are some things you 
could experiment with:

1. Add a "latest record id" field in your object table (automatically 
updated with a trigger) that would allow you to do a simple join with 
the tracking table. I suspect that such a join will be far faster than 
calculating "max" 100,000 times at the expense of a slightly larger main 
table.
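
Untested, and the column, trigger and function names below are made up,
but the idea might look roughly like this:

-- Hypothetical names, just to illustrate suggestion 1; assumes status
-- rows arrive in chronological order per object.
ALTER TABLE objectrecord ADD COLUMN latest_status_id bigint;

CREATE OR REPLACE FUNCTION set_latest_status() RETURNS trigger AS $$
BEGIN
    -- Remember the newest status row for this object as it is inserted.
    UPDATE objectrecord SET latest_status_id = NEW.id
     WHERE objectId = NEW.objectId;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER statusrecord_track_latest
    AFTER INSERT ON statusrecord
    FOR EACH ROW EXECUTE PROCEDURE set_latest_status();

-- The "latest status now" report then becomes a plain join, no MAX() subquery:
SELECT s.*
  FROM objectrecord o
  JOIN statusrecord s ON s.id = o.latest_status_id
 WHERE s.capacity < 10.0
 ORDER BY s.datetime DESC, s.objectId DESC;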

2. Add a "current record flag" in the status table that simply flags the 
most recent record for each object (again, use triggers to keep the flag 
appropriately updated). This would also eliminate the need for the "max" 
subquery. You could even create a partial index filtering on the 
"current record flag" which could speed things up if the reporting query 
is written correctly.
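
Also untested; the flag column, trigger and index names are only
placeholders:

-- Placeholder names, illustrating suggestion 2.
ALTER TABLE statusrecord ADD COLUMN is_current boolean NOT NULL DEFAULT false;

CREATE OR REPLACE FUNCTION flag_current_status() RETURNS trigger AS $$
BEGIN
    -- Clear the flag on the previous "current" row, then flag the new row.
    UPDATE statusrecord SET is_current = false
     WHERE objectId = NEW.objectId AND is_current;
    NEW.is_current := true;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER statusrecord_flag_current
    BEFORE INSERT ON statusrecord
    FOR EACH ROW EXECUTE PROCEDURE flag_current_status();

-- Partial index so the report only ever touches the ~100,000 flagged rows.
CREATE INDEX statusrecord_current_idx ON statusrecord (objectId) WHERE is_current;

SELECT * FROM statusrecord
 WHERE is_current AND capacity < 10.0
 ORDER BY datetime DESC, objectId DESC;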

3. Partition the table into a "current status table" and "historical 
status table" (each inheriting from the main table). Use a trigger so 
that anytime a new status record is added, the old "current" record is 
moved from the "current" to the "historical" table and the new one added 
to the "current" table. The latest status report will only need a simple 
join on the "current" table with a max size of 100,000 rather than a 
more complex query over a 100,000,000 record table.
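
A rough, untested sketch of suggestion 3 (all names invented), using 8.3
table inheritance:

-- Invented table/trigger names, illustrating suggestion 3.
CREATE TABLE statusrecord_current () INHERITS (statusrecord);
CREATE TABLE statusrecord_history () INHERITS (statusrecord);

CREATE OR REPLACE FUNCTION statusrecord_route() RETURNS trigger AS $$
BEGIN
    -- Demote this object's previous "current" row to the history table...
    INSERT INTO statusrecord_history
        SELECT * FROM statusrecord_current WHERE objectId = NEW.objectId;
    DELETE FROM statusrecord_current WHERE objectId = NEW.objectId;
    -- ...and store the new row in the current table instead of the parent.
    INSERT INTO statusrecord_current VALUES (NEW.*);
    RETURN NULL;   -- cancel the insert into the parent table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER statusrecord_insert
    BEFORE INSERT ON statusrecord
    FOR EACH ROW EXECUTE PROCEDURE statusrecord_route();

-- "Latest status now" then joins against at most 100,000 rows:
SELECT s.*
  FROM objectrecord o
  JOIN statusrecord_current s ON s.objectId = o.objectId
 WHERE s.capacity < 10.0;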

Cheers,
Steve



Re: SQL report

From
Rob Sargent
Date:
Did you look at the query plans for the various record counts?  That 
might show which index is missing or misinformed :).  I wonder if 
clustering the status table on objectid would help?  This does then 
require maintenance so you might only load it at 75%. 
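
For example (the index name is hypothetical, and this uses the datetime
column from the schema sketch):

-- Compare the plans at the different table sizes:
EXPLAIN ANALYZE
SELECT objectId, MAX(datetime) AS msdt
  FROM statusrecord
 WHERE datetime <= 1233897527657
 GROUP BY objectId;

-- CLUSTER physically reorders the heap by the chosen index (8.3 syntax);
-- it takes an exclusive lock and has to be redone as the table churns.
CLUSTER statusrecord USING statusrecord_objectid_idx;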

wkipjohn@gmail.com wrote:
> Hi Rob,
>
> I have default B-Tree indexes created for each of the indexed columns
> and the primary key columns (no multi-column indexes, no NULLS FIRST or
> DESC/ASC options). I am using PostgreSQL 8.3 with the autovacuum daemon
> on, so I assume ANALYZE is run automatically to collect statistics for
> the planner, and that there is no maintenance needed for a B-tree index
> once it is created. (Please correct me if I am wrong about this.)
>
> I will probably try to partition the status table to group the more
> recent status records together, to minimize the dataset I am querying.
>
> Thx
> John
>
>
> On Jul 31, 2009 1:16am, Rob Sargent <robjsargent@gmail.com> wrote:
> > I would be curious to know the performance curve for, let's say, 20K,
> > 40K, 60K, 80K, 100K records, and what sort of indexing you have,
> > whether or not it's clustered, re-built and so on.
> >
> > One could envision partitioning the status table such that recent
> > records were grouped together (on the assumption that they will be
> > most frequently "reported").
> >
> > wkipjohn@gmail.com wrote:
> > ...


Re: SQL report

From
wkipjohn@gmail.com
Date:
Hi Steve,

Thanks for your suggestions. In my scenario, what is "current" depends on
the user: if the user wants a status report as of 00:00 1st Jan 2009, then
00:00 1st Jan 2009 is current. So it is not possible to flag any record as
current unless the user tells us what point in time is current.

cheers
John

On Jul 31, 2009 2:41am, Steve Crawford <scrawford@pinpointresearch.com> wrote:
> Not knowing all the details of your system, here are some things you
> could experiment with:
>
> 1. Add a "latest record id" field in your object table ...
> 2. Add a "current record flag" in the status table ...
> 3. Partition the table into a "current status table" and "historical
>    status table" ...
>
> Cheers,
> Steve
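
One other way to write the per-object "newest record at or before the
chosen time" lookup, not mentioned elsewhere in the thread, is
PostgreSQL's DISTINCT ON; an untested sketch against the schema from the
original post:

-- 1233897527657 is the example point-in-time from the original query.
SELECT *
  FROM (SELECT DISTINCT ON (objectId) *
          FROM statusrecord
         WHERE datetime <= 1233897527657
         ORDER BY objectId, datetime DESC) latest
 WHERE capacity < 10.0
 ORDER BY datetime DESC, objectId DESC;

Whether it beats the MAX() join would still depend on the indexing, so it
would need testing at the full table size.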

Re: SQL report

From
wkipjohn@gmail.com
Date:
Hi Rob,

I have default B-Tree indexes created for each of the indexed columns and
the primary key columns (no multi-column indexes, no NULLS FIRST or
DESC/ASC options). I am using PostgreSQL 8.3 with the autovacuum daemon on,
so I assume ANALYZE is run automatically to collect statistics for the
planner, and that there is no maintenance needed for a B-tree index once it
is created. (Please correct me if I am wrong about this.)

I will probably try to partition the status table to group the more recent
status records together, to minimize the dataset I am querying.

Thx
John


On Jul 31, 2009 1:16am, Rob Sargent <robjsargent@gmail.com> wrote:
> I would be curious to know the performance curve for, let's say, 20K,
> 40K, 60K, 80K, 100K records, and what sort of indexing you have, whether
> or not it's clustered, re-built and so on.
>
> One could envision partitioning the status table such that recent
> records were grouped together (on the assumption that they will be most
> frequently "reported").
>
> wkipjohn@gmail.com wrote:
> > I have the following scenario.
> > ...
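
Since the setup described above has only single-column indexes, one thing
worth trying (the index name is made up) is a two-column index that
matches the per-object, newest-first access pattern:

-- Hypothetical name; covers the "GROUP BY objectId / MAX(datetime)"
-- lookup as well as per-object "newest row before a timestamp" scans.
CREATE INDEX statusrecord_obj_dt_idx ON statusrecord (objectId, datetime);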