Thread: Read performance on Large Table
Hello Admin,
I am looking to build a database to log activities on my system.
Currently I have the following table:
CREATE TABLE table
(
attr1 bigint,
attr2 text,
attr3 timestamp without time zone,
attr4 integer,
attr5 character varying(50),
attr6 character varying(50),
attr7 character varying(50),
attr8 character varying(50),
attr9 character varying(50),
attr10 character varying(50)
)
Over 3 months the table has grown to 600+ million rows.
We mainly do inserts; we almost never do updates or deletes.
However, the table is not very useful as it stands, because any select we run takes a very long time (hours) to complete.
Question 1: Is slowness to be expected with such a big table?
Question 2: Is there anything I am not doing that I should do to get better performance?
I am thinking about 2 solutions, but I wanted to ask the more experienced people before implementing them:
1: Break the table into multiple small tables, one for each week.
2: Create an index on the timestamp column since it is the one we use the most in queries.
Thank you in advance for your help,
--Kido K.
Note:
explain analyse select count(*) from table where attr5='value' and attr3 > '05/19/2015 00:00:00' and attr3 < '05/21/2015 00:00:00';
Has been running for 30 minutes and counting....
Hi Kido,

Create indexes on the fields that you use to compare and join on.

Regards,
Gilberto Castillo
ETECSA, La Habana, Cuba
Looks like the table currently has no index, so every query has to do a full table scan. Please post the typical queries and we will suggest which columns to index.
On Thu, May 21, 2015 at 8:54 AM, Kido Kouassi <jjkido@gmail.com> wrote:
> I am thinking about 2 solutions, but I wanted to ask the more experienced
> people before implementing them:
>
> 1: Break the table into multiple small tables, one for each week.
> 2: Create an index on the timestamp column since it is the one we use the
> most in queries.

These are both good ideas. You can use partitioning to do step 1 and still have all your data in "one place", so to speak. Here's the standard page on partitioning:

http://www.postgresql.org/docs/9.4/static/ddl-partitioning.html

I've done a lot of partitioning of big data sets in PostgreSQL, and if there's some common field, like date, that makes sense to partition on, it can be a huge win.
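For reference, the 9.4-era inheritance setup from that page looks roughly like this. This is only a sketch: "activity_log" stands in for the real table name (the original post uses a literal "table"), and the week boundaries are just an example.

-- Parent table stays (nearly) empty; one child per week holds the data.
CREATE TABLE activity_log_2015_w21 (
    CHECK (attr3 >= '2015-05-18' AND attr3 < '2015-05-25')
) INHERITS (activity_log);

CREATE INDEX ON activity_log_2015_w21 (attr3);

-- Route incoming rows to the right child.
CREATE OR REPLACE FUNCTION activity_log_insert_trigger()
RETURNS trigger AS $$
BEGIN
    IF NEW.attr3 >= '2015-05-18' AND NEW.attr3 < '2015-05-25' THEN
        INSERT INTO activity_log_2015_w21 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'activity_log: no partition for timestamp %', NEW.attr3;
    END IF;
    RETURN NULL;  -- keep the row out of the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER activity_log_insert
    BEFORE INSERT ON activity_log
    FOR EACH ROW EXECUTE PROCEDURE activity_log_insert_trigger();

With constraint_exclusion left at its default of 'partition', queries with a WHERE clause on attr3 then only scan the children whose CHECK range can match.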
On May 21, 2015, at 8:54 AM, Kido Kouassi <jjkido@gmail.com> wrote:
>
> 2: Create an index on the timestamp column since it is the one we use the most in queries.

Absolutely. Go ahead right now.

Note that you can use the CONCURRENTLY option to CREATE INDEX, and it will not require a lock on the table, so you can do it while the db is being used. It takes longer, sometimes a lot longer, than the regular create, but who cares? Start it and leave it.

After you've done that and seen how much benefit you get just from indexing your most commonly used column, post a range of common queries back here for further advice.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice
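Against the schema from the original post, that could look like the following (a sketch only; "activity_log" is a stand-in for the real table name, and the composite index is just a guess based on the count(*) query in the note):

-- Build without blocking writers; note CONCURRENTLY can't run inside a transaction block.
CREATE INDEX CONCURRENTLY activity_log_attr3_idx ON activity_log (attr3);

-- If queries like the one in the note (equality on attr5 plus a range on attr3)
-- are typical, a composite index may serve them better:
CREATE INDEX CONCURRENTLY activity_log_attr5_attr3_idx ON activity_log (attr5, attr3);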
On May 21, 2015, at 9:05 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
> I've done a lot of partitioning of big data sets in PostgreSQL, and if
> there's some common field, like date, that makes sense to partition
> on, it can be a huge win.

Indeed. I recently did it on exactly this kind of thing, a log of activity. And the common queries weren't slow at all.

But if I wanted to upgrade via dump/restore with minimal downtime, rather than set up Slony or try my luck with pg_upgrade, I could dump the historical partitions, drop those tables, then dump/restore, then restore the historical partitions at my convenience. (In this particular db, history is unusually huge compared to the live data.)

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice
On Thu, May 21, 2015 at 9:18 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> Indeed. I recently did it on exactly this kind of thing, a log of activity. And the common queries weren't slow at all.
>
> But if I wanted to upgrade via dump/restore with minimal downtime, rather than set up Slony or try my luck with pg_upgrade, I could dump the historical partitions, drop those tables, then dump/restore, then restore the historical partitions at my convenience. (In this particular db, history is unusually huge compared to the live data.)

I use an interesting method to set up partitioning. I set up my triggers, then insert the data in chunks from the master table to itself:

insert into master_table select * from only master_table limit 10000;

and run that over and over. The data is all in the same "table" to the application, but it's slowly moving to the partitions without interruption.

Note: ALWAYS use triggers for partitioning. Rules are way too slow.
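A sketch of one way to do that chunked move (hypothetical "activity_log" names again). Note the DELETE here is an addition, not part of Scott's one-liner: pairing the move with a delete of the source rows keeps the parent from also holding copies of rows that have already been routed to the partitions.

-- Run repeatedly until SELECT count(*) FROM ONLY activity_log reaches 0.
-- Up to 10k rows per pass are removed from the (pre-partitioning) parent;
-- the BEFORE INSERT trigger redirects them into the child tables.
WITH moved AS (
    DELETE FROM ONLY activity_log
    WHERE ctid IN (SELECT ctid FROM ONLY activity_log LIMIT 10000)
    RETURNING *
)
INSERT INTO activity_log SELECT * FROM moved;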
On May 21, 2015, at 9:21 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
> I use an interesting method to set up partitioning. I set up my
> triggers, then insert the data in chunks from the master table to
> itself:
>
> insert into master_table select * from only master_table limit 10000;
>
> and run that over and over. The data is all in the same "table" to the
> application, but it's slowly moving to the partitions without
> interruption.

Ah, good idea—I'll have to remember that.

In my case the table is write-mostly, read-rarely, and update-never, so insert into partition select * from master where… with millions of rows at a time was not a problem.

The one thing I did differently from the documentation is to write a pgSQL procedure that creates a partition for a particular date range, plus the trigger function and trigger, where the trigger function checks the date against the current "head" partition and, if the date is newer, calls the function which creates the partition and replaces the trigger function. I hate having to remember those things which only have to be done once a year ;-)

And yes, that particular implementation depends on the fact that the criteria on which I'm partitioning increases monotonically. I expect it would create a performance nightmare if I tried to insert old data into the master table while the application was actively inserting new data. The same basic technique can be adapted to other situations, but the trigger function is slightly more complicated, and the function to create the trigger function would be kind of ugly and hard to follow.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice
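Not Scott's exact implementation (he regenerates the trigger function itself when the head partition advances), but a simplified sketch of the same on-the-fly idea, reusing the hypothetical activity_log names from the earlier sketch: the trigger function creates the week's child table if it isn't there yet, then routes the row into it dynamically.

CREATE OR REPLACE FUNCTION activity_log_insert_trigger()
RETURNS trigger AS $$
DECLARE
    wk_start date := date_trunc('week', NEW.attr3)::date;
    child    text := 'activity_log_' || to_char(wk_start, 'IYYY"_w"IW');
BEGIN
    -- Assumes the table lives in the public schema.
    IF NOT EXISTS (SELECT 1 FROM pg_class c
                   JOIN pg_namespace n ON n.oid = c.relnamespace
                   WHERE c.relname = child AND n.nspname = 'public') THEN
        BEGIN
            EXECUTE format(
                'CREATE TABLE %I (CHECK (attr3 >= %L AND attr3 < %L)) INHERITS (activity_log)',
                child, wk_start, wk_start + 7);
        EXCEPTION WHEN duplicate_table THEN
            NULL;  -- another session created it first (the contention point raised later in this thread)
        END;
    END IF;
    EXECUTE format('INSERT INTO %I SELECT ($1).*', child) USING NEW;
    RETURN NULL;  -- keep the row out of the parent
END;
$$ LANGUAGE plpgsql;

-- Attached with the same BEFORE INSERT ... FOR EACH ROW trigger as in the earlier sketch.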
If you're just going to do basic time series partitioning, I've written a tool that manages most of it for you. It also does retention management and can dump out the old partitions automatically.
https://github.com/keithf4/pg_partman
I'm in the process of getting v2.0.0 out, which has a lot of new work done but will only be compatible with Postgres 9.4 (since it uses background workers to have scheduling built in). So if you want to wait, I should have that out soon.
On May 21, 2015, at 9:45 AM, Keith <keith@keithf4.com> wrote:
>
> If you're just going to do basic time series partitioning, I've written a tool that manages most of it for you. It also does retention management and can dump out the old partitions automatically.

Yep, make some partitions ahead of time, plus cron to periodically run in order to stay ahead, is an alternative approach.

> I'm in the process of getting v2.0.0 out, which has a lot of new work done but will only be compatible with Postgres 9.4 (since it uses background workers to have scheduling built in). So if you want to wait, I should have that out soon.

I wouldn't mind that—I did it the way I did partly because I wanted it all self-contained in PG without external (cron or launchd or…) configuration dependencies.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice
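A sketch of that cron-based alternative, again with the hypothetical activity_log names ("mydb" stands for the database name): pre-create next week's child table ahead of time, so the dynamic routing trigger from the earlier sketch never has to create anything at insert time.

CREATE OR REPLACE FUNCTION activity_log_create_next_partition()
RETURNS void AS $$
DECLARE
    wk_start date := date_trunc('week', now() + interval '7 days')::date;
    child    text := 'activity_log_' || to_char(wk_start, 'IYYY"_w"IW');
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I (CHECK (attr3 >= %L AND attr3 < %L)) INHERITS (activity_log)',
        child, wk_start, wk_start + 7);
END;
$$ LANGUAGE plpgsql;

# crontab entry: every Sunday evening, make sure next week's partition exists
0 23 * * 0  psql -d mydb -c "SELECT activity_log_create_next_partition();"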
On Thu, May 21, 2015 at 12:01 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> On May 21, 2015, at 9:45 AM, Keith <keith@keithf4.com> wrote:
>>
>> If you're just going to do basic time series partitioning, I've written a tool that manages most of it for you. It also does retention management and can dump out the old partitions automatically.
>
> Yep, make some partitions ahead of time, plus cron to periodically run in order to stay ahead, is an alternative approach.

This is exactly what the extension does.

>> I'm in the process of getting v2.0.0 out, which has a lot of new work done but will only be compatible with Postgres 9.4 (since it uses background workers to have scheduling built in). So if you want to wait, I should have that out soon.
>
> I wouldn't mind that—I did it the way I did partly because I wanted it all self-contained in PG without external (cron or launchd or…) configuration dependencies.
The problem with creating partitions on the fly via triggers, even ahead of when they're needed, is that you run into BIG contention issues when you have high insert rates. One session sees it needs to create a new table; the next one comes in before the table is made and either gets put into a wait state or outright fails when it tries to create the new table that's already there. The serial partitioning method in pg_partman does do this by default (it creates new future partitions when the current one is at 50%), because for serial partitioning, scheduling maintenance can be tricky since you may not be able to predict how often it needs to run. But several people brought up the contention problems this caused, so I made it optional to use the scheduled maintenance run instead.
It may not be an issue for you, but it's definitely something to keep in mind if you notice performance issues. :)
On May 21, 2015, at 10:25 AM, Keith <keith@keithf4.com> wrote:
>
> The problem with creating partitions on the fly via triggers, even ahead of when they're needed, is that you run into BIG contention issues when you have high insert rates. One session sees it needs to create a new table; the next one comes in before the table is made and either gets put into a wait state or outright fails when it tries to create the new table that's already there…

No failure, I handle the exception.

> It may not be an issue for you, but it's definitely something to keep in mind if you notice performance issues. :)

Right. Around midnight this db has very low activity.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice