Thread: huge price database question..

huge price database question..

From
Jim Green
Date:
Greetings list!
I am pretty new to PostgreSQL, coming from MySQL. I did a fairly extensive
search of the list and came up with a few good threads, but didn't find the
exact same situation as I have now, so I am venturing to ask here.

I have daily minute stock price data from 2005 on, with columns timestamp,
open, high, low, close, volume and a few more. Each day's data is about 1.2
million rows. I want to import all the data into PostgreSQL and analyze it
using R with the help of RPostgreSQL.

Right now I have about 7000 tables, one per individual stock, and I use Perl
to do the inserts; it's very slow. I would like to use COPY or another bulk
loading tool to load the daily raw gz data, but I would need to split the file
into per-stock files first before bulk loading, which I consider a bit messy.

I would like advice on the following idea:
store everything in one big table, partition by month (this gives a
reasonable number of partitions) and bulk load the daily file. My queries
would consist mostly of selects on a particular symbol on a particular day.

Also, going forward, I will import the new daily data into the db every day.

My hardware is 16 GB RAM and 4x5400rpm RAID 10 with enough space.

Thanks!

Jim.

Re: huge price database question..

From
Michael Nolan
Date:


right now I am having about 7000 tables for individual stock and I use
perl to do inserts, it's very slow. I would like to use copy or other
bulk loading tool to load the daily raw gz data. but I need the split
the file to per stock files first before I do bulk loading. I consider
this a bit messy.

Are you committing each insert separately or doing them in batches using 'begin transaction' and 'commit'?

I have a database that I do inserts into from a text file. Doing a commit every 1000 inserts cut the time by over 90%.
--
Mike Nolan
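
A minimal sketch of that batching with Perl DBI, for reference; the connection
string, file name and table/column names below are placeholders, not anything
from this thread:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# AutoCommit off, so rows only become durable when a batch is committed.
my $dbh = DBI->connect('dbi:Pg:dbname=stocks', 'jim', '',
                       { AutoCommit => 0, RaiseError => 1 });

my $sth = $dbh->prepare(
    'INSERT INTO minute_bars (symbol, ts, open, high, low, close, volume)
     VALUES (?, ?, ?, ?, ?, ?, ?)');

open my $fh, '<', 'minute_bars.tsv' or die $!;   # one tab-separated row per line
my $n = 0;
while (my $line = <$fh>) {
    chomp $line;
    $sth->execute(split /\t/, $line);            # assumes 7 fields per line
    $dbh->commit if ++$n % 1000 == 0;            # one commit per 1000 rows
}
$dbh->commit;                                    # commit the final partial batch
$dbh->disconnect;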

Re: huge price database question..

From
Steve Crawford
Date:
On 03/20/2012 04:27 PM, Jim Green wrote:
> Greetings list!
> I am pretty new to postgresql from mysql
Welcome.
> I have daily minute stock price data from 2005 on and each day with
> columns timestamp, open,high,low,close,volume and a few more. each
> day's data is about 1.2million rows. I want import all the data to
> postgresql and analyze using R with the help of Rpostgresql.
>
> right now I am having about 7000 tables for individual stock and I use
> perl to do inserts, it's very slow.
I'm not sure I understand - please expand and clarify.

If you have 7000 stocks, 1.2 million rows/day is only 171
observations/stock/day or a little under 3-hours of data at 60
samples/hour. Are there holes in the data or am I completely missing the
setup?

For the NYSE you have 52-weeks/year, 5 trading-days/week (except for up
to 8 trading holidays), 6.5 trading-hours/day giving 100,000
rows/stock/year give or take depending on holiday schedule. Over your
time range, that will be less than 1-million rows per stock.

Without holes in the data, it seems you will be adding 2.7 million rows
per trading day.

Is perl doing individual record inserts? Short of reconnecting for each
insert, this is the slowest option. Each insert is its own transaction
and, unless you are using a raid card with writeback enabled (and
hopefully with battery-backed cache if it is) you will hit some
limitations imposed by your disk's rotational speed. If you can ensure
the data is sorted by stock, you can start a transaction, write all the
records for that stock, then commit the transaction which should speed
the operation.

Only because you are bulk adding historical data and can rebuild your
database if it is destroyed, you could consider turning off fsync while
you are importing historical data. Dropping indexes while you play
catch-up can help as well.

>   I would like to use copy or other
> bulk loading tool to load the daily raw gz data. but I need the split
> the file to per stock files first before I do bulk loading. I consider
> this a bit messy.
>
> I would seek advise on the following idea:
> store everything in a big table, partition by month(this gives a
> reasonable number of partitions) and do bulk loading on the daily
> file. my queries would consist mostly select on a particular symbol on
> a particular day.
If you already have daily files, you might want to work out the process
of importing a day and apply that to your backlog. There are lots of
options - you might try experimenting.

Given the type of queries you plan, simple partitioning by time period
as you described is a reasonable approach.

You could import a day's data as a bulk copy, then select from that table
into the various individual stock tables. BTW, you have plenty of RAM, so
that daily bulk file will be cached and queries on each individual stock
will probably be quite fast. But you will need to wrap things in a
transaction or otherwise ensure that you can recover if things fail
part-way through that distribution of data.
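
A rough sketch of that distribute-inside-a-transaction step with Perl DBI,
assuming the day's rows have already been bulk-copied into a staging table;
every object name here is a placeholder:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=stocks', 'jim', '',
                       { AutoCommit => 0, RaiseError => 1 });

eval {
    # Fan the staging rows out into the per-stock tables, all in one transaction.
    my $symbols = $dbh->selectcol_arrayref(
        'SELECT DISTINCT symbol FROM daily_staging');
    for my $sym (@$symbols) {
        my $tbl = $dbh->quote_identifier('minute_bars_' . lc $sym);
        $dbh->do("INSERT INTO $tbl (ts, open, high, low, close, volume)
                  SELECT ts, open, high, low, close, volume
                  FROM daily_staging WHERE symbol = ?", undef, $sym);
    }
    $dbh->do('TRUNCATE daily_staging');
    $dbh->commit;   # all-or-nothing: either every stock table got its rows or none did
};
if ($@) {
    warn "distribution failed, rolling back: $@";
    $dbh->rollback;
}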

You could partition your data by stock symbol and use a trigger on the
parent to put the data into the correct table on insert. I am unsure how
fast this will run - try it and test. One benefit is that your daily
import will fully succeed or fully fail. But you will need to update
your partitioning and triggers to deal with new stocks. You can, of
course, have the trigger choose which table to use by constructing the
table name from the incoming data. This eliminates the need to alter the
trigger code, but it is still recommended to make the child tables in advance.
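
For illustration, a minimal 9.1-style inheritance-partitioning sketch with a
routing trigger, issued from Perl. The children here are monthly (as Jim
proposed); per-symbol children work the same way, and the build-the-table-name
variant mentioned above would compute the child name in the trigger and use
EXECUTE instead of the hard-coded IF. All names are illustrative:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=stocks', 'jim', '', { RaiseError => 1 });

# Parent table; the children inherit its columns.
$dbh->do(q{
    CREATE TABLE minute_bars (
        symbol  text        NOT NULL,
        ts      timestamptz NOT NULL,
        open    numeric, high numeric, low numeric, close numeric,
        volume  bigint
    )
});

# One child per month; the CHECK constraint lets constraint_exclusion
# skip the other months at query time.
$dbh->do(q{
    CREATE TABLE minute_bars_2012_03 (
        CHECK (ts >= DATE '2012-03-01' AND ts < DATE '2012-04-01')
    ) INHERITS (minute_bars)
});

# Routing trigger on the parent: the row actually lands in the child.
$dbh->do(q{
    CREATE OR REPLACE FUNCTION minute_bars_insert() RETURNS trigger AS $$
    BEGIN
        IF NEW.ts >= DATE '2012-03-01' AND NEW.ts < DATE '2012-04-01' THEN
            INSERT INTO minute_bars_2012_03 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for timestamp %', NEW.ts;
        END IF;
        RETURN NULL;    -- the row has already gone to the child table
    END
    $$ LANGUAGE plpgsql
});
$dbh->do(q{
    CREATE TRIGGER minute_bars_route BEFORE INSERT ON minute_bars
    FOR EACH ROW EXECUTE PROCEDURE minute_bars_insert()
});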


> my hardware is 16G Ram, 4x5400rpm raid10 with enough space.
Have you started with basic tuning? It is unlikely that whatever stock
PostgreSQL configuration you have installed is optimal. (What PG version
and OS are you using?)

Cheers,
Steve

Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 19:45, Michael Nolan <htfoot@gmail.com> wrote:
>
>>
>> right now I am having about 7000 tables for individual stock and I use
>> perl to do inserts, it's very slow. I would like to use copy or other
>> bulk loading tool to load the daily raw gz data. but I need the split
>> the file to per stock files first before I do bulk loading. I consider
>> this a bit messy.
>
>
> Are you committing each insert separately or doing them in batches using
> 'begin transaction' and 'commit'?
>
> I have a database that I do inserts in from a text file. Doing a commit
> every 1000 transactions cut the time by over 90%.

I use Perl DBI and a prepared statement. I also set
shared_buffers = 4GB
work_mem = 1GB
synchronous_commit = off
effective_cache_size = 8GB
fsync=off
full_page_writes = off

when I do the insert.

Thanks!


> --
> Mike Nolan

Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 20:19, Steve Crawford <scrawford@pinpointresearch.com> wrote:
> On 03/20/2012 04:27 PM, Jim Green wrote:
>>
>> Greetings list!
>> I am pretty new to postgresql from mysql
>
> Welcome.
>
>> I have daily minute stock price data from 2005 on and each day with
>> columns timestamp, open,high,low,close,volume and a few more. each
>> day's data is about 1.2million rows. I want import all the data to
>> postgresql and analyze using R with the help of Rpostgresql.
>>
>> right now I am having about 7000 tables for individual stock and I use
>> perl to do inserts, it's very slow.
>
> I'm not sure I understand - please expand and clarify.
>
> If you have 7000 stocks, 1.2 million rows/day is only 171
> observations/stock/day or a little under 3-hours of data at 60 samples/hour.
> Are there holes in the data or am I completely missing the setup?

Hi:
Stocks are being delisted and added, so I don't have 7000 stocks every day;
7000 is just a rough number. Also lots of them are not liquid, so holes are
possible.

>
> For the NYSE you have 52-weeks/year, 5 trading-days/week (except for up to 8
> trading holidays), 6.5 trading-hours/day giving 100,000 rows/stock/year give
> or take depending on holiday schedule. Over your time range, that will be
> less than 1-million rows per stock.
>
> Without holes in the data, it seems you will be adding 2.7 million rows per
> trading day.

I spot checked 03/02's data; it has 1.2 million rows in it, and it would be
similar for other dates.

>
> Is perl doing individual record inserts? Short of reconnecting for each
> insert, this is the slowest option. Each insert is its own transaction and,
> unless you are using a raid card with writeback enabled (and hopefully with
> battery-backed cache if it is) you will hit some limitations imposed by your
> disk's rotational speed. If you can ensure the data is sorted by stock, you
> can start a transaction, write all the records for that stock, then commit
> the transaction which should speed the operation.

I use Perl DBI and a prepared statement to insert into the table for each symbol.

>
> Only because you are bulk adding historical data and can rebuild your
> database if it is destroyed, you could consider turning off fsync while you
> are importing historical data. Dropping indexes while you play catch-up can
> help as well.

I already turned fsync off, but autovacuum uses lots of IO and my 8-core
CPU is really busy.

>
>
>>  I would like to use copy or other
>> bulk loading tool to load the daily raw gz data. but I need the split
>> the file to per stock files first before I do bulk loading. I consider
>> this a bit messy.
>>
>> I would seek advise on the following idea:
>> store everything in a big table, partition by month(this gives a
>> reasonable number of partitions) and do bulk loading on the daily
>> file. my queries would consist mostly select on a particular symbol on
>> a particular day.
>
> If you already have daily files, you might want to work out the process of
> importing a day and apply that to your backlog. There are lots of options -
> you might try experimenting.
>
> Given the type of queries you plan, simple partitioning by time period as
> you described is a reasonable approach.
>
> You could import a day's data as a bulk copy then select from that table
> into the various individual stock tables. BTW, you have plenty of RAM that
> that daily bulk file will be cached and queries on each individual stock
> will probably be quite fast. But you will need to wrap things in a
> transaction or otherwise ensure that you can recover if things fail part-way
> through that distribution of data.
>
> You could partition your data by stock symbol and use a trigger on the
> parent to put the data into the correct table on insert. I am unsure how

I've read that thousands of partitions might be too many.

> fast this will run - try it and test. One benefit is that your daily import
> will fully succeed or fully fail. But you will need to update your
> partitioning and triggers to deal with new stocks. You can, of course, have
> the trigger choose which table to use based on the table name. This
> eliminates the need to alter the trigger code but it is still recommended to
> make the child-table in advance.
>
>
>
>> my hardware is 16G Ram, 4x5400rpm raid10 with enough space.
>
> Have you started with basic tuning. It is unlikely that whatever stock
> PostgreSQL you have installed is suboptimal. (What PG version and OS are you
> using?)

Yes I have; I posted my settings in my last reply. I run 64-bit Debian
Squeeze and PG version 9.1.3.


The alternatives look kind of complex to me. Right now my approach (Perl DBI
and prepared inserts) takes about 8-9 minutes to insert a day's data. I think
I'll probably just stick with it and wait.

The autovacuum processes do a lot of IO and make my PC unusable while I do
the data inserts, and I tested with autovacuum off without much success,
because the workers are still launched due to the transaction ID wraparound
issue.


Thanks!

Jim.

>
> Cheers,
> Steve

Re: huge price database question..

From
David Kerr
Date:
On 03/20/2012 04:27 PM, Jim Green wrote:
Greetings list!
I am pretty new to postgresql from mysql and did a fairly extensive
search of the list and came up with a few good ones but didn't find
the exact same situation as I have now. so I am venturing asking here.

I have daily minute stock price data from 2005 on and each day with
columns timestamp, open,high,low,close,volume and a few more. each
day's data is about 1.2million rows. I want import all the data to
postgresql and analyze using R with the help of Rpostgresql.

right now I am having about 7000 tables for individual stock and I use
perl to do inserts, it's very slow. I would like to use copy or other
bulk loading tool to load the daily raw gz data. but I need the split
the file to per stock files first before I do bulk loading. I consider
this a bit messy.

I would seek advise on the following idea:
store everything in a big table, partition by month(this gives a
reasonable number of partitions) and do bulk loading on the daily
file. my queries would consist mostly select on a particular symbol on
a particular day.

Also in the future, I will import daily data to the db every day.

my hardware is 16G Ram, 4x5400rpm raid10 with enough space.

Thanks!

Jim.


Seems like you'd want to do this?
http://search.cpan.org/~turnstep/DBD-Pg-2.19.2/Pg.pm#pg_putcopydata
COPY support

DBD::Pg allows for quick (bulk) reading and storing of data by using the COPY command. The basic process is to use $dbh->do to issue a COPY command, and then to either add rows using "pg_putcopydata", or to read them by using "pg_getcopydata".
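
A minimal sketch of that basic process from Perl; the connection string,
staging table and file name are placeholders, and the input is assumed to
already be tab-separated in the table's column order:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=stocks', 'jim', '',
                       { AutoCommit => 0, RaiseError => 1 });

# daily_staging is assumed to exist already and match the file's columns.
open my $fh, '-|', 'zcat', '2012-03-02.gz' or die $!;
$dbh->do('COPY daily_staging FROM STDIN');
while (my $line = <$fh>) {
    # reformat $line here if the raw file is not already tab-separated
    $dbh->pg_putcopydata($line);
}
$dbh->pg_putcopyend;
$dbh->commit;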



Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 21:40, David Kerr <dmk@mr-paradox.net> wrote:
> On 03/20/2012 04:27 PM, Jim Green wrote:
>
> Greetings list!
> I am pretty new to postgresql from mysql and did a fairly extensive
> search of the list and came up with a few good ones but didn't find
> the exact same situation as I have now. so I am venturing asking here.
>
> I have daily minute stock price data from 2005 on and each day with
> columns timestamp, open,high,low,close,volume and a few more. each
> day's data is about 1.2million rows. I want import all the data to
> postgresql and analyze using R with the help of Rpostgresql.
>
> right now I am having about 7000 tables for individual stock and I use
> perl to do inserts, it's very slow. I would like to use copy or other
> bulk loading tool to load the daily raw gz data. but I need the split
> the file to per stock files first before I do bulk loading. I consider
> this a bit messy.
>
> I would seek advise on the following idea:
> store everything in a big table, partition by month(this gives a
> reasonable number of partitions) and do bulk loading on the daily
> file. my queries would consist mostly select on a particular symbol on
> a particular day.
>
> Also in the future, I will import daily data to the db every day.
>
> my hardware is 16G Ram, 4x5400rpm raid10 with enough space.
>
> Thanks!
>
> Jim.
>
>
> Seems like you'd want to do this?
> http://search.cpan.org/~turnstep/DBD-Pg-2.19.2/Pg.pm#pg_putcopydata
> COPY support
>
> DBD::Pg allows for quick (bulk) reading and storing of data by using the
> COPY command. The basic process is to use $dbh->do to issue a COPY command,
> and then to either add rows using "pg_putcopydata", or to read them by using
> "pg_getcopydata".

Thanks! would you comment on the table setup as well?

Jim.
>
>
>

Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 21:54, Brent Wood <Brent.Wood@niwa.co.nz> wrote:
>
> Also look at a clustered index on timestamp

Thanks, this looks very helpful. What do you think about the thousands of
tables vs. one table partitioned by month? I guess if I go with one table,
the index would be too big to fit in RAM?

Jim.

Re: huge price database question..

From
David Kerr
Date:
On 03/20/2012 06:50 PM, Jim Green wrote:
> On 20 March 2012 21:40, David Kerr<dmk@mr-paradox.net>  wrote:
>> On 03/20/2012 04:27 PM, Jim Green wrote:
>>
>> Greetings list!
>> I am pretty new to postgresql from mysql and did a fairly extensive
>> search of the list and came up with a few good ones but didn't find
>> the exact same situation as I have now. so I am venturing asking here.
>>
>> I have daily minute stock price data from 2005 on and each day with
>> columns timestamp, open,high,low,close,volume and a few more. each
>> day's data is about 1.2million rows. I want import all the data to
>> postgresql and analyze using R with the help of Rpostgresql.
>>
>> right now I am having about 7000 tables for individual stock and I use
>> perl to do inserts, it's very slow. I would like to use copy or other
>> bulk loading tool to load the daily raw gz data. but I need the split
>> the file to per stock files first before I do bulk loading. I consider
>> this a bit messy.
>>
>> I would seek advise on the following idea:
>> store everything in a big table, partition by month(this gives a
>> reasonable number of partitions) and do bulk loading on the daily
>> file. my queries would consist mostly select on a particular symbol on
>> a particular day.
>>
>> Also in the future, I will import daily data to the db every day.
>>
>> my hardware is 16G Ram, 4x5400rpm raid10 with enough space.
>>
>> Thanks!
>>
>> Jim.
>>
>>
>> Seems like you'd want to do this?
>> http://search.cpan.org/~turnstep/DBD-Pg-2.19.2/Pg.pm#pg_putcopydata
>> COPY support
>>
>> DBD::Pg allows for quick (bulk) reading and storing of data by using the
>> COPY command. The basic process is to use $dbh->do to issue a COPY command,
>> and then to either add rows using "pg_putcopydata", or to read them by using
>> "pg_getcopydata".
> Thanks! would you comment on the table setup as well?
>
> Jim.
>

\copy on 1.2million rows should only take a minute or two, you could
make that table "unlogged"
as well to speed it up more.  If you could truncate / drop / create /
load / then index the table each
time then you'll get the best throughput.

Dave



Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 22:03, David Kerr <dmk@mr-paradox.net> wrote:

> \copy on 1.2million rows should only take a minute or two, you could make
> that table "unlogged"
> as well to speed it up more.  If you could truncate / drop / create / load /
> then index the table each
> time then you'll get the best throughput.

Thanks, Could you explain on the "runcate / drop / create / load /
then index the table each time then you'll get the best throughput."
part.. or point me to some docs?..

Jim
>
> Dave
>
>
>
>

Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 22:08, Jim Green <student.northwestern@gmail.com> wrote:
> On 20 March 2012 22:03, David Kerr <dmk@mr-paradox.net> wrote:
>
>> \copy on 1.2million rows should only take a minute or two, you could make
>> that table "unlogged"
>> as well to speed it up more.  If you could truncate / drop / create / load /
>> then index the table each
>> time then you'll get the best throughput.
>
> Thanks, Could you explain on the "runcate / drop / create / load /
> then index the table each time then you'll get the best throughput."
> part.. or point me to some docs?..

Also, if I use COPY, I would be tempted to go the one-table route; otherwise
I need to parse my raw daily file, separate it into individual per-symbol
files and COPY into the individual table for each symbol (this does not sound
very efficient).

>
> Jim
>>
>> Dave
>>
>>
>>
>>

Re: huge price database question..

From
David Kerr
Date:
On 03/20/2012 07:08 PM, Jim Green wrote:
> On 20 March 2012 22:03, David Kerr<dmk@mr-paradox.net>  wrote:
>
>> \copy on 1.2million rows should only take a minute or two, you could make
>> that table "unlogged"
>> as well to speed it up more.  If you could truncate / drop / create / load /
>> then index the table each
>> time then you'll get the best throughput.
> Thanks, Could you explain on the "runcate / drop / create / load /
> then index the table each time then you'll get the best throughput."
> part.. or point me to some docs?..
>
> Jim

I'm imagining that you're loading the raw file into a temporary table that
you're going to use to process / slice new data into your 7000+ actual
per-stock tables.

So that table probably doesn't need to be around once you've processed your
stocks through it, so you could just truncate/drop it after you're done.

When you create it, avoid indexes so the inserts will be faster (it doesn't
have to update the index on every insert); then, once the table is loaded,
you create the indexes (so they're actually useful) and then process the
data into the various stock tables.

Dave
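
A sketch of that cycle, using the UNLOGGED staging table Dave suggested
earlier; all object and file names are placeholders:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=stocks', 'jim', '',
                       { AutoCommit => 0, RaiseError => 1 });

# Throwaway staging table: recreate it for each daily file.
$dbh->do('DROP TABLE IF EXISTS daily_staging');
$dbh->do(q{
    CREATE UNLOGGED TABLE daily_staging (
        symbol text, ts timestamptz,
        open numeric, high numeric, low numeric, close numeric,
        volume bigint
    )
});

# Load first, with no indexes in the way...
open my $fh, '-|', 'zcat', '2012-03-02.gz' or die $!;
$dbh->do('COPY daily_staging FROM STDIN');
while (my $line = <$fh>) {
    $dbh->pg_putcopydata($line);
}
$dbh->pg_putcopyend;

# ...then build the index once the data is in, and only then
# distribute rows into the per-stock (or partition) tables.
$dbh->do('CREATE INDEX daily_staging_symbol_idx ON daily_staging (symbol)');
$dbh->commit;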




Re: huge price database question..

From
John R Pierce
Date:
On 03/20/12 7:12 PM, Jim Green wrote:
> Also if I use copy, I would be tempted to go the one table route, or
> else I need to parse my raw daily file, separate to individual symbol
> file and copy to individual table for each symbol(this sounds like not
> very efficient)..

Your 7000 tables all contain the exact same information, with the only
difference being the stock ticker symbol, right? Then really, the single
table, perhaps partitioned by month or whatever, is the right way to go.
Any schema that makes you have to build SQL statements from strings for
each query is designed wrong.



--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast
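
To make that concrete, a single-table query with bind parameters instead of
SQL glued together from strings; the table, symbol and times are placeholders:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=stocks', 'jim', '', { RaiseError => 1 });

# One prepared statement works for every symbol and time range.
my $sth = $dbh->prepare(q{
    SELECT ts, open, high, low, close, volume
    FROM minute_bars
    WHERE symbol = ? AND ts >= ? AND ts < ?
    ORDER BY ts
});
$sth->execute('IBM', '2012-03-02 09:30', '2012-03-02 16:00');
while (my $row = $sth->fetchrow_hashref) {
    print "$row->{ts}  $row->{close}\n";
}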


Re: huge price database question..

From
Andy Colson
Date:
On 03/20/2012 09:12 PM, Jim Green wrote:
> On 20 March 2012 22:08, Jim Green<student.northwestern@gmail.com>  wrote:
>> On 20 March 2012 22:03, David Kerr<dmk@mr-paradox.net>  wrote:
>>
>>> \copy on 1.2million rows should only take a minute or two, you could make
>>> that table "unlogged"
>>> as well to speed it up more.  If you could truncate / drop / create / load /
>>> then index the table each
>>> time then you'll get the best throughput.
>>
>> Thanks, Could you explain on the "runcate / drop / create / load /
>> then index the table each time then you'll get the best throughput."
>> part.. or point me to some docs?..
>
> Also if I use copy, I would be tempted to go the one table route, or
> else I need to parse my raw daily file, separate to individual symbol
> file and copy to individual table for each symbol(this sounds like not
> very efficient)..
>
>>
>> Jim
>>>
>>> Dave
>>>
>>>
>>>
>>>
>


I think the decisions:

1) one big table
2) one big partitioned table
3) many little tables

would probably depend on how you want to read the data.  Writing would be very similar.

I tried to read through the thread but didn't see how you're going to read it.

I have Apache logs in a database.  Single table, about 18 million rows.  I have an index on hittime (it's a timestamp),
and I can pull a few hundred records based on a time, very fast.  On the other hand, a count(*) on the entire table
takes a while.  If you are going to hit lots and lots of records, I think the multi-table approach (which includes partitioning)
would be faster.  If you can pull out records based on an index, and be very selective, then one big table works fine.



On the Perl side, use COPY.  I have code in Perl that uses it (and reads from .gz as well), and it's very fast.  I can
post some if you'd like.

-Andy


Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 22:21, David Kerr <dmk@mr-paradox.net> wrote:

> I'm imagining that you're loading the raw file into a temporary table that
> you're going to use to
> process / slice new data data into your 7000+ actual tables per stock.

Thanks! would "slice new data data into your 7000+ actual tables per
stock." be a relatively quick operation?

>
> So that table doesn't probably need to be around once you've processed your
> stocks through
> that table. so you could just truncate/drop it after you're done.
>
> When you create it, if you avoid indexes the inserts will be faster (it
> doesn't have to rebuild the index every
> insert) so then once the table is loaded, you create the indexes (So it's
> actually useful) and then process the
> data into the various stock tables.
>
> Dave
>
>
>

Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 22:22, John R Pierce <pierce@hogranch.com> wrote:
> your 7000 tables all contain the exact same information, with the only
> difference being the stock ticker symbol, right?    then really, the single
> table, perhaps partitioned by month or whatever, is the right way to go.
>  Any schema that makes you have to build SQL statements from strings for
> each query is designed wrong.

Each table contains 1-minute price data for that symbol, so each table has
the same schema (open, high, low, close, volume, etc.), but not the same
data.

Thanks.

>
>
>
> --
> john r pierce                            N 37, W 122
> santa cruz ca                         mid-left coast
>
>
>

Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 22:25, Andy Colson <andy@squeakycode.net> wrote:
> I think the decisions:
>
> 1) one big table
> 2) one big partitioned table
> 3) many little tables
>
> would probably depend on how you want to read the data.  Writing would be
> very similar.
>
> I tried to read through the thread but didnt see how you're going to read.
>
> I have apache logs in a database.  Single table, about 18 million rows.  I
> have an index on hittime (its a timestamp), and I can pull a few hundred
> records based on a time, very fast.  On the other hand, a count(*) on the
> entire table takes a while.  If you are going to hit lots and lots of
> records, I think the multi-table (which include partitioning) would be
> faster.  If you can pull out records based on index, and be very selective,
> then one big table works fine.
> On the perl side, use copy.  I have code in perl that uses it (and reads
> from .gz as well), and its very fast.  I can post some if you'd like.

My queries would mostly consist of selects for one symbol on one particular
day, or a few hours in a particular day; occasionally I would do a select on
multiple symbols for some timestamp range. Your code sample would be
appreciated, thanks!

Jim.

>
> -Andy
>

Re: huge price database question..

From
Andy Colson
Date:
On 03/20/2012 09:35 PM, Jim Green wrote:
> On 20 March 2012 22:25, Andy Colson<andy@squeakycode.net>  wrote:
>> I think the decisions:
>>
>> 1) one big table
>> 2) one big partitioned table
>> 3) many little tables
>>
>> would probably depend on how you want to read the data.  Writing would be
>> very similar.
>>
>> I tried to read through the thread but didnt see how you're going to read.
>>
>> I have apache logs in a database.  Single table, about 18 million rows.  I
>> have an index on hittime (its a timestamp), and I can pull a few hundred
>> records based on a time, very fast.  On the other hand, a count(*) on the
>> entire table takes a while.  If you are going to hit lots and lots of
>> records, I think the multi-table (which include partitioning) would be
>> faster.  If you can pull out records based on index, and be very selective,
>> then one big table works fine.
>> On the perl side, use copy.  I have code in perl that uses it (and reads
>> from .gz as well), and its very fast.  I can post some if you'd like.
>
> my queries would mostly consider select for one symbol for one
> particular day or a few hours in a particular day, occasionally I
> would do select on multiple symbols for some timestamp range. you code
> sample would be appreciated, Thanks!
>
> Jim.
>
>>
>> -Andy
>>

Here are some copy/pasted parts:

my @list = glob('*.gz');
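# $db is assumed to be a DBD::Pg handle opened elsewhere with AutoCommit => 0,
# which is why the explicit $db->commit below is needed.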
for my $fname (@list)
{
    $db->do('copy access from stdin');
    open my $fh, "-|", "/usr/bin/zcat $fname" or die "$fname: $!";
    while (<$fh>)
    {
        # bunch of stuff to format snipped here
        # if you have comma separated or something you might be able
        # to just feed it in
        $db->pg_putcopydata("$county\t$ip\t$time\t$status\t$size\t$url\t$ua\n");
    }
    $db->pg_endcopy;
    $db->commit;
}


Do you ever plan on batch deleting a BUNCH of records?

Do you ever want to read all of one symbol (like, select avg(high) from stocks where symbol = 'bob')?

-Andy

Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 22:43, Andy Colson <andy@squeakycode.net> wrote:
> Here is some copy/pasted parts:
>
> my @list = glob('*.gz');
> for my $fname (@list)
> {
>        $db->do('copy access from stdin');
>        open my $fh, "-|", "/usr/bin/zcat $fname" or die "$fname: $!";
>        while (<$fh>)
>        {
>                # bunch of stuff to format sniped here
>                # if you have comma separated or something you might be able
>                # to just feed it in
>
>  $db->pg_putcopydata("$county\t$ip\t$time\t$status\t$size\t$url\t$ua\n");
>        }
>        $db->pg_endcopy;
>        $db->commit;
> }
>
>
> Do you ever plan on batch deleted a BUNCH of records?

No, after the historical data is populated, I'll only add data daily. No deletes.

>
> Do you ever want to do read all of one symbol (like, select avg(high) from
> stocks where symbol = 'bob')?

Yes, it's possible, but I would more likely grab the data into R and get
the avg in R.

Thanks,
Jim.

>
> -Andy

Re: huge price database question..

From
John R Pierce
Date:
On 03/20/12 7:49 PM, Jim Green wrote:
> yes its possible but I would more likely grab the data to R and get
> the avg in R..

avg() in the database is going to be a lot faster than copying the data
into memory for an application to process.

Also, you know there's PL/R for Postgres, which lets you embed R
functions in the database server and invoke them in SQL statements?





--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast
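
For example, pushing the aggregate into the database from Perl instead of
fetching every row (names and values are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=stocks', 'jim', '', { RaiseError => 1 });

# The server scans and aggregates; only one number comes back to the client.
my ($avg_high) = $dbh->selectrow_array(q{
    SELECT avg(high)
    FROM minute_bars
    WHERE symbol = ? AND ts >= ? AND ts < ?
}, undef, 'IBM', '2012-03-02', '2012-03-03');

print "average high: $avg_high\n";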


Re: huge price database question..

From
Andy Colson
Date:
On 03/20/2012 09:49 PM, Jim Green wrote:
> On 20 March 2012 22:43, Andy Colson<andy@squeakycode.net>  wrote:
>>
>> Do you ever plan on batch deleted a BUNCH of records?
>
> no, after historical data is populated, I'll only add data daily. no delete..
>
>>
>> Do you ever want to do read all of one symbol (like, select avg(high) from
>> stocks where symbol = 'bob')?
>
> yes its possible but I would more likely grab the data to R and get
> the avg in R..
>
> Thanks,
> Jim.
>
>>
>> -Andy


Based on your answers:

> my queries would mostly consider select for one symbol for one
> particular day or a few hours in a particular day, occasionally I
> would do select on multiple symbols for some timestamp range

One big table would probably be about the same speed as multiple smaller tables.  Either way you'll hit an index first
for the above usage.

> no, after historical data is populated, I'll only add data daily. no delete..

Truncating/dropping a table is much faster than a huge delete... but if you'll never delete then it really doesn't
matter.


> yes its possible but I would more likely grab the data to R and get
> the avg in R..

but... to get the data to R you still have to step thru the entire table.

If you have a partition per symbol (which is the same as having a separate table per symbol) then I believe you can
step thru it faster (just a table scan) than if you had one big table (index lookups on symbol).  So in this case,
partitioned would be better (I think).


So the score is:
   One big table = 1
   Doesn't matter = 1
   Partitioned = 1

Of course, there are probably other usage patterns I'm not aware of.  And I am also assuming some things based on what
I've heard -- not on actual experience.

I'm not sure this was really helpful :-)

-Andy

Re: huge price database question..

From
Brent Wood
Date:
Also look at a clustered index on timestamp


Brent Wood

GIS/DBA consultant
NIWA
+64 (4) 4 386-0300
________________________________________
From: pgsql-general-owner@postgresql.org [pgsql-general-owner@postgresql.org] on behalf of Jim Green
[student.northwestern@gmail.com]
Sent: Wednesday, March 21, 2012 2:50 PM
To: David Kerr
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] huge price database question..

On 20 March 2012 21:40, David Kerr <dmk@mr-paradox.net> wrote:
> On 03/20/2012 04:27 PM, Jim Green wrote:
>
> Greetings list!
> I am pretty new to postgresql from mysql and did a fairly extensive
> search of the list and came up with a few good ones but didn't find
> the exact same situation as I have now. so I am venturing asking here.
>
> I have daily minute stock price data from 2005 on and each day with
> columns timestamp, open,high,low,close,volume and a few more. each
> day's data is about 1.2million rows. I want import all the data to
> postgresql and analyze using R with the help of Rpostgresql.
>
> right now I am having about 7000 tables for individual stock and I use
> perl to do inserts, it's very slow. I would like to use copy or other
> bulk loading tool to load the daily raw gz data. but I need the split
> the file to per stock files first before I do bulk loading. I consider
> this a bit messy.
>
> I would seek advise on the following idea:
> store everything in a big table, partition by month(this gives a
> reasonable number of partitions) and do bulk loading on the daily
> file. my queries would consist mostly select on a particular symbol on
> a particular day.
>
> Also in the future, I will import daily data to the db every day.
>
> my hardware is 16G Ram, 4x5400rpm raid10 with enough space.
>
> Thanks!
>
> Jim.
>
>
> Seems like you'd want to do this?
> http://search.cpan.org/~turnstep/DBD-Pg-2.19.2/Pg.pm#pg_putcopydata
> COPY support
>
> DBD::Pg allows for quick (bulk) reading and storing of data by using the
> COPY command. The basic process is to use $dbh->do to issue a COPY command,
> and then to either add rows using "pg_putcopydata", or to read them by using
> "pg_getcopydata".

Thanks! would you comment on the table setup as well?

Jim.
>
>
>


Re: huge price database question..

From
Andy Colson
Date:
On 03/20/2012 08:54 PM, Brent Wood wrote:
>
> Also look at a clustered index on timestamp
>
>
> Brent Wood
>
> GIS/DBA consultant
> NIWA
> +64 (4) 4 386-0300

A clustered index is only "clustered" at the point in time you run the command.  It wont remain that way, and with a
reallybig table, you don't wanna spend the time re-clustering it every time you import more data. 

-Andy
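
So, as a sketch, clustering would be something to do once per partition right
after its bulk load, not on every import; the partition and index names are
placeholders, and clustering on (symbol, ts) rather than plain timestamp is
just one possible choice:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=stocks', 'jim', '', { RaiseError => 1 });

# CLUSTER takes an exclusive lock and rewrites the table, so do it once,
# after the month's data is fully loaded.
$dbh->do(q{CREATE INDEX minute_bars_2012_03_sym_ts
           ON minute_bars_2012_03 (symbol, ts)});
$dbh->do('CLUSTER minute_bars_2012_03 USING minute_bars_2012_03_sym_ts');
$dbh->do('ANALYZE minute_bars_2012_03');
$dbh->disconnect;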

Re: huge price database question..

From
David Kerr
Date:
On 03/20/2012 07:26 PM, Jim Green wrote:
> On 20 March 2012 22:21, David Kerr<dmk@mr-paradox.net>  wrote:
>
>> I'm imagining that you're loading the raw file into a temporary table that
>> you're going to use to
>> process / slice new data data into your 7000+ actual tables per stock.
>
> Thanks! would "slice new data data into your 7000+ actual tables per
> stock." be a relatively quick operation?

Well, it solves the problem of having to split up the raw file by stock
symbol. From there you can run multiple jobs in parallel to load
individual stocks into their individual tables, which is probably faster
than what you've got going now.

It would probably be faster to load the individual stocks directly from
the file but then, as you said, you have to split it up first, so that
may take time.
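
A very rough sketch of that kind of parallel load with plain fork(); each
child must open its own connection (DBI handles can't be shared across a
fork), and all names are placeholders:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my @symbols  = qw(IBM MSFT ORCL AAPL);    # in reality, the full symbol list
my $parallel = 4;

# Round-robin the symbols into $parallel buckets.
my @chunks;
push @{ $chunks[ $_ % $parallel ] }, $symbols[$_] for 0 .. $#symbols;

my @pids;
for my $chunk (@chunks) {
    next unless $chunk;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                      # child: its own connection, its own symbols
        my $dbh = DBI->connect('dbi:Pg:dbname=stocks', 'jim', '',
                               { AutoCommit => 0, RaiseError => 1 });
        for my $sym (@$chunk) {
            my $tbl = $dbh->quote_identifier('minute_bars_' . lc $sym);
            $dbh->do("INSERT INTO $tbl (ts, open, high, low, close, volume)
                      SELECT ts, open, high, low, close, volume
                      FROM daily_staging WHERE symbol = ?", undef, $sym);
        }
        $dbh->commit;
        $dbh->disconnect;
        exit 0;
    }
    push @pids, $pid;                     # parent just keeps forking
}
waitpid($_, 0) for @pids;                 # wait for all loaders to finish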



Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 22:57, John R Pierce <pierce@hogranch.com> wrote:

> avg() in the database is going to be a lot faster than copying the data into
> memory for an application to process.

I see..

>
> Also, you know there's a plR for postgres that lets you embed R functions in
> the database server and invoke them in SQL statements??

Thanks for this tip!

Jim.

Re: huge price database question..

From
Jim Green
Date:
On 20 March 2012 23:01, Andy Colson <andy@squeakycode.net> wrote:
> Of course, there are probably other usage patters I'm not aware of.  And I
> also am assuming some things based on what I've heard -- not of actual
> experience.

I am not an expert in SQL, so what I get out of PostgreSQL is probably
mostly selects, but as you hinted there may be faster ways to do things
natively in SQL rather than selecting stuff out and doing it in R. I
need to learn more SQL :)

>
> I'm not sure this was really helpful :-)

it's very helpful, Thanks!

Jim.

>
> -Andy

Re: huge price database question..

From
Steve Crawford
Date:
On 03/20/2012 06:16 PM, Jim Green wrote:
>
> It looks like alternatives are kind of complex to me, right now my
> approach(perl dbi and prepared insert) would take about 8/9 mins to
> insert  a day's data.  I think I'll probably just stick with it and
> wait.
>
> the autovacuum processes does a lot of io and make my pc unusable
> while I do the data inserts.. and I tested autovacuum off with not
> much success because of they are launched due to the transaction id
> wrap around issue.
>
Something sounds wrong here. XID wraparound happens somewhere around
4 billion transactions, which is a substantial multiple of the entire
number of records you are trying to insert. Do you have any unusual
vacuum settings?
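
A quick way to see how far the cluster actually is from the wraparound
threshold (note that anti-wraparound vacuums are forced once a table's age
approaches autovacuum_freeze_max_age, even with autovacuum = off, which would
explain the workers Jim is seeing):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=stocks', 'jim', '', { RaiseError => 1 });

# age(datfrozenxid) is how many transactions old the oldest unfrozen XID is.
my $rows = $dbh->selectall_arrayref(
    'SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC');
printf "%-20s %12d\n", @$_ for @$rows;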

As mentioned by others and myself, anything that batches lots of inserts
together into a single transaction, instead of each record being its own
transaction, will help, whether that is COPY or BEGIN; INSERT ...; COMMIT.

Cheers,
Steve

Re: huge price database question..

From
Jim Green
Date:
On 21 March 2012 11:01, Steve Crawford <scrawford@pinpointresearch.com> wrote:
> Something sounds wrong, here. XID wraps around somewhere around 4-billion
> transactions which is a substantial multiple of the entire number of records
> you are trying to insert. Do you have any unusual vacuum settings?

I have autovacuum=off in my config and I still get autovacuum processes.
But yes, I haven't done 4 billion inserts so far. I have attached the
config I used for bulk loading at the end.

>
> As mentioned by others and myself, anything that batches lots of inserts
> together into a single transaction instead of each record being its own
> transaction will help whether this is COPY or BEGIN, INSERT......,

I thought I had already turned synchronous_commit off in the config, but
that is probably not the same thing as what you said.

Thanks!

Jim.
>
> Cheers,
> Steve


# -----------------------------
# PostgreSQL configuration file
# -----------------------------
#
# This file consists of lines of the form:
#
#   name = value
#
# (The "=" is optional.)  Whitespace may be used.  Comments are introduced with
# "#" anywhere on a line.  The complete list of parameter names and allowed
# values can be found in the PostgreSQL documentation.
#
# The commented-out settings shown in this file represent the default values.
# Re-commenting a setting is NOT sufficient to revert it to the default value;
# you need to reload the server.
#
# This file is read on server startup and when the server receives a SIGHUP
# signal.  If you edit the file on a running system, you have to SIGHUP the
# server for the changes to take effect, or use "pg_ctl reload".  Some
# parameters, which are marked below, require a server shutdown and restart to
# take effect.
#
# Any parameter can also be given as a command-line option to the server, e.g.,
# "postgres -c log_connections=on".  Some parameters can be changed at run time
# with the "SET" SQL command.
#
# Memory units:  kB = kilobytes        Time units:  ms  = milliseconds
#                MB = megabytes                     s   = seconds
#                GB = gigabytes                     min = minutes
#                                                   h   = hours
#                                                   d   = days


#------------------------------------------------------------------------------
# FILE LOCATIONS
#------------------------------------------------------------------------------

# The default values of these variables are driven from the -D command-line
# option or PGDATA environment variable, represented here as ConfigDir.

data_directory = '/var/lib/postgresql/9.1/main'        # use data in another directory
                    # (change requires restart)
hba_file = '/etc/postgresql/9.1/main/pg_hba.conf'    # host-based authentication file
                    # (change requires restart)
ident_file = '/etc/postgresql/9.1/main/pg_ident.conf'    # ident configuration file
                    # (change requires restart)

# If external_pid_file is not explicitly set, no extra PID file is written.
external_pid_file = '/var/run/postgresql/9.1-main.pid'        # write an extra PID file
                    # (change requires restart)


#------------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
#------------------------------------------------------------------------------

# - Connection Settings -

#listen_addresses = 'localhost'        # what IP address(es) to listen on;
                    # comma-separated list of addresses;
                    # defaults to 'localhost', '*' = all
                    # (change requires restart)
port = 5433                # (change requires restart)
max_connections = 100            # (change requires restart)
# Note:  Increasing max_connections costs ~400 bytes of shared memory per
# connection slot, plus lock space (see max_locks_per_transaction).
#superuser_reserved_connections = 3    # (change requires restart)
unix_socket_directory = '/var/run/postgresql'        # (change requires restart)
#unix_socket_group = ''            # (change requires restart)
#unix_socket_permissions = 0777        # begin with 0 to use octal notation
                    # (change requires restart)
#bonjour = off                # advertise server via Bonjour
                    # (change requires restart)
#bonjour_name = ''            # defaults to the computer name
                    # (change requires restart)

# - Security and Authentication -

#authentication_timeout = 1min        # 1s-600s
ssl = true                # (change requires restart)
#ssl_ciphers = 'ALL:!ADH:!LOW:!EXP:!MD5:@STRENGTH'    # allowed SSL ciphers
                    # (change requires restart)
#ssl_renegotiation_limit = 512MB    # amount of data between renegotiations
#password_encryption = on
#db_user_namespace = off

# Kerberos and GSSAPI
#krb_server_keyfile = ''
#krb_srvname = 'postgres'        # (Kerberos only)
#krb_caseins_users = off

# - TCP Keepalives -
# see "man 7 tcp" for details

#tcp_keepalives_idle = 0        # TCP_KEEPIDLE, in seconds;
                    # 0 selects the system default
#tcp_keepalives_interval = 0        # TCP_KEEPINTVL, in seconds;
                    # 0 selects the system default
#tcp_keepalives_count = 0        # TCP_KEEPCNT;
                    # 0 selects the system default


#------------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#------------------------------------------------------------------------------

# - Memory -

shared_buffers = 4GB            # min 128kB
                    # (change requires restart)
#temp_buffers = 8MB            # min 800kB
#max_prepared_transactions = 0        # zero disables the feature
                    # (change requires restart)
# Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
# It is not advisable to set max_prepared_transactions nonzero unless you
# actively intend to use prepared transactions.
work_mem = 1GB                # min 64kB
maintenance_work_mem = 256MB        # min 1MB
#max_stack_depth = 2MB            # min 100kB

# - Kernel Resource Usage -

#max_files_per_process = 1000        # min 25
                    # (change requires restart)
#shared_preload_libraries = ''        # (change requires restart)

# - Cost-Based Vacuum Delay -

vacuum_cost_delay = 10ms        # 0-100 milliseconds
#vacuum_cost_page_hit = 1        # 0-10000 credits
#vacuum_cost_page_miss = 10        # 0-10000 credits
#vacuum_cost_page_dirty = 20        # 0-10000 credits
#vacuum_cost_limit = 200        # 1-10000 credits

# - Background Writer -

#bgwriter_delay = 200ms            # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100        # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0        # 0-10.0 multipler on buffers scanned/round

# - Asynchronous Behavior -

effective_io_concurrency = 2        # 1-1000. 0 disables prefetching


#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------

# - Settings -

wal_level = minimal            # minimal, archive, or hot_standby
                    # (change requires restart)
fsync = off                # turns forced synchronization on or off
synchronous_commit = off        # synchronization level; on, off, or local
#wal_sync_method = fsync        # the default is the first option
                    # supported by the operating system:
                    #   open_datasync
                    #   fdatasync (default on Linux)
                    #   fsync
                    #   fsync_writethrough
                    #   open_sync
full_page_writes = off            # recover from partial page writes
#wal_buffers = -1            # min 32kB, -1 sets based on shared_buffers
                    # (change requires restart)
#wal_writer_delay = 200ms        # 1-10000 milliseconds

#commit_delay = 0            # range 0-100000, in microseconds
#commit_siblings = 5            # range 1-1000

# - Checkpoints -

checkpoint_segments = 32        # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min        # range 30s-1h
checkpoint_completion_target = 0.9    # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s        # 0 disables

# - Archiving -

#archive_mode = off        # allows archiving to be done
                # (change requires restart)
#archive_command = ''        # command to use to archive a logfile segment
#archive_timeout = 0        # force a logfile segment switch after this
                # number of seconds; 0 disables


#------------------------------------------------------------------------------
# REPLICATION
#------------------------------------------------------------------------------

# - Master Server -

# These settings are ignored on a standby server

#max_wal_senders = 0        # max number of walsender processes
                # (change requires restart)
#wal_sender_delay = 1s        # walsender cycle time, 1-10000 milliseconds
#wal_keep_segments = 0        # in logfile segments, 16MB each; 0 disables
#vacuum_defer_cleanup_age = 0    # number of xacts by which cleanup is delayed
#replication_timeout = 60s    # in milliseconds; 0 disables
#synchronous_standby_names = ''    # standby servers that provide sync rep
                # comma-separated list of application_name
                # from standby(s); '*' = all

# - Standby Servers -

# These settings are ignored on a master server

#hot_standby = off            # "on" allows queries during recovery
                    # (change requires restart)
#max_standby_archive_delay = 30s    # max delay before canceling queries
                    # when reading WAL from archive;
                    # -1 allows indefinite delay
#max_standby_streaming_delay = 30s    # max delay before canceling queries
                    # when reading streaming WAL;
                    # -1 allows indefinite delay
#wal_receiver_status_interval = 10s    # send replies at least this often
                    # 0 disables
#hot_standby_feedback = off        # send info from standby to prevent
                    # query conflicts


#------------------------------------------------------------------------------
# QUERY TUNING
#------------------------------------------------------------------------------

# - Planner Method Configuration -

#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

# - Planner Cost Constants -

#seq_page_cost = 1.0            # measured on an arbitrary scale
#random_page_cost = 4.0            # same scale as above
#cpu_tuple_cost = 0.01            # same scale as above
#cpu_index_tuple_cost = 0.005        # same scale as above
#cpu_operator_cost = 0.0025        # same scale as above
effective_cache_size = 8GB

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5            # range 1-10
#geqo_pool_size = 0            # selects default based on effort
#geqo_generations = 0            # selects default based on effort
#geqo_selection_bias = 2.0        # range 1.5-2.0
#geqo_seed = 0.0            # range 0.0-1.0

# - Other Planner Options -

#default_statistics_target = 100    # range 1-10000
#constraint_exclusion = partition    # on, off, or partition
#cursor_tuple_fraction = 0.1        # range 0.0-1.0
#from_collapse_limit = 8
#join_collapse_limit = 8        # 1 disables collapsing of explicit
                    # JOIN clauses


#------------------------------------------------------------------------------
# ERROR REPORTING AND LOGGING
#------------------------------------------------------------------------------

# - Where to Log -

#log_destination = 'stderr'        # Valid values are combinations of
                    # stderr, csvlog, syslog, and eventlog,
                    # depending on platform.  csvlog
                    # requires logging_collector to be on.

# This is used when logging to stderr:
logging_collector = off        # Enable capturing of stderr and csvlog
                    # into log files. Required to be on for
                    # csvlogs.
                    # (change requires restart)

# These are only used if logging_collector is on:
#log_directory = 'pg_log'        # directory where log files are written,
                    # can be absolute or relative to PGDATA
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'    # log file name pattern,
                    # can include strftime() escapes
#log_file_mode = 0600            # creation mode for log files,
                    # begin with 0 to use octal notation
#log_truncate_on_rotation = off        # If on, an existing log file with the
                    # same name as the new log file will be
                    # truncated rather than appended to.
                    # But such truncation only occurs on
                    # time-driven rotation, not on restarts
                    # or size-driven rotation.  Default is
                    # off, meaning append to existing files
                    # in all cases.
#log_rotation_age = 1d            # Automatic rotation of logfiles will
                    # happen after that time.  0 disables.
#log_rotation_size = 10MB        # Automatic rotation of logfiles will
                    # happen after that much log output.
                    # 0 disables.

# These are relevant when logging to syslog:
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'

#silent_mode = off            # Run server silently.
                    # DO NOT USE without syslog or
                    # logging_collector
                    # (change requires restart)


# - When to Log -

#client_min_messages = notice        # values in order of decreasing detail:
                    #   debug5
                    #   debug4
                    #   debug3
                    #   debug2
                    #   debug1
                    #   log
                    #   notice
                    #   warning
                    #   error

#log_min_messages = warning        # values in order of decreasing detail:
                    #   debug5
                    #   debug4
                    #   debug3
                    #   debug2
                    #   debug1
                    #   info
                    #   notice
                    #   warning
                    #   error
                    #   log
                    #   fatal
                    #   panic

#log_min_error_statement = error    # values in order of decreasing detail:
                     #   debug5
                    #   debug4
                    #   debug3
                    #   debug2
                    #   debug1
                     #   info
                    #   notice
                    #   warning
                    #   error
                    #   log
                    #   fatal
                    #   panic (effectively off)

#log_min_duration_statement = -1    # -1 is disabled, 0 logs all statements
                    # and their durations, > 0 logs only
                    # statements running at least this number
                    # of milliseconds


# - What to Log -

#debug_print_parse = off
#debug_print_rewritten = off
#debug_print_plan = off
#debug_pretty_print = on
#log_checkpoints = off
#log_connections = off
#log_disconnections = off
#log_duration = off
#log_error_verbosity = default        # terse, default, or verbose messages
#log_hostname = off
log_line_prefix='%t:%r:%u@%d:[%p]: '    # special values:
                    #   %a = application name
                    #   %u = user name
                    #   %d = database name
                    #   %r = remote host and port
                    #   %h = remote host
                    #   %p = process ID
                    #   %t = timestamp without milliseconds
                    #   %m = timestamp with milliseconds
                    #   %i = command tag
                    #   %e = SQL state
                    #   %c = session ID
                    #   %l = session line number
                    #   %s = session start timestamp
                    #   %v = virtual transaction ID
                    #   %x = transaction ID (0 if none)
                    #   %q = stop here in non-session
                    #        processes
                    #   %% = '%'
                    # e.g. '<%u%%%d> '
#log_lock_waits = off            # log lock waits >= deadlock_timeout
#log_statement = 'none'            # none, ddl, mod, all
#log_temp_files = -1            # log temporary files equal or larger
                    # than the specified size in kilobytes;
                    # -1 disables, 0 logs all temp files
#log_timezone = '(defaults to server environment setting)'


#------------------------------------------------------------------------------
# RUNTIME STATISTICS
#------------------------------------------------------------------------------

# - Query/Index Statistics Collector -

#track_activities = on
#track_counts = on
#track_functions = none            # none, pl, all
#track_activity_query_size = 1024     # (change requires restart)
#update_process_title = on
#stats_temp_directory = 'pg_stat_tmp'


# - Statistics Monitoring -

#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off


#------------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#------------------------------------------------------------------------------

autovacuum = off            # Enable autovacuum subprocess?  'on'
                    # requires track_counts to also be on.
#log_autovacuum_min_duration = -1    # -1 disables, 0 logs all actions and
                    # their durations, > 0 logs only
                    # actions running at least this number
                    # of milliseconds.
#autovacuum_max_workers = 3        # max number of autovacuum subprocesses
                    # (change requires restart)
autovacuum_max_workers = 3        # max number of autovacuum subprocesses
#autovacuum_naptime = 1min        # time between autovacuum runs
#autovacuum_vacuum_threshold = 50    # min number of row updates before
                    # vacuum
#autovacuum_analyze_threshold = 50    # min number of row updates before
                    # analyze
#autovacuum_vacuum_scale_factor = 0.2    # fraction of table size before vacuum
#autovacuum_analyze_scale_factor = 0.1    # fraction of table size before analyze
#autovacuum_freeze_max_age = 200000000    # maximum XID age before forced vacuum
                    # (change requires restart)
#autovacuum_vacuum_cost_delay = 20ms    # default vacuum cost delay for
                    # autovacuum, in milliseconds;
                    # -1 means use vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1    # default vacuum cost limit for
                    # autovacuum, -1 means use
                    # vacuum_cost_limit


#------------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#------------------------------------------------------------------------------

# - Statement Behavior -

#search_path = '"$user",public'        # schema names
#default_tablespace = ''        # a tablespace name, '' uses the default
#temp_tablespaces = ''            # a list of tablespace names, '' uses
                    # only default tablespace
#check_function_bodies = on
#default_transaction_isolation = 'read committed'
#default_transaction_read_only = off
#default_transaction_deferrable = off
#session_replication_role = 'origin'
#statement_timeout = 0            # in milliseconds, 0 is disabled
#vacuum_freeze_min_age = 50000000
#vacuum_freeze_table_age = 150000000
#bytea_output = 'hex'            # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'

# - Locale and Formatting -

datestyle = 'iso, mdy'
#intervalstyle = 'postgres'
#timezone = '(defaults to server environment setting)'
#timezone_abbreviations = 'Default'     # Select the set of available time zone
                    # abbreviations.  Currently, there are
                    #   Default
                    #   Australia
                    #   India
                    # You can create your own file in
                    # share/timezonesets/.
#extra_float_digits = 0            # min -15, max 3
#client_encoding = sql_ascii        # actually, defaults to database
                    # encoding

# These settings are initialized by initdb, but they can be changed.
lc_messages = 'en_US.UTF-8'            # locale for system error message
                    # strings
lc_monetary = 'en_US.UTF-8'            # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'            # locale for number formatting
lc_time = 'en_US.UTF-8'                # locale for time formatting

# default configuration for text search
default_text_search_config = 'pg_catalog.english'

# - Other Defaults -

#dynamic_library_path = '$libdir'
#local_preload_libraries = ''


#------------------------------------------------------------------------------
# LOCK MANAGEMENT
#------------------------------------------------------------------------------

#deadlock_timeout = 1s
#max_locks_per_transaction = 64        # min 10
                    # (change requires restart)
# Note:  Each lock table slot uses ~270 bytes of shared memory, and there are
# max_locks_per_transaction * (max_connections + max_prepared_transactions)
# lock table slots.
#max_pred_locks_per_transaction = 64    # min 10
                    # (change requires restart)

#------------------------------------------------------------------------------
# VERSION/PLATFORM COMPATIBILITY
#------------------------------------------------------------------------------

# - Previous PostgreSQL Versions -

#array_nulls = on
#backslash_quote = safe_encoding    # on, off, or safe_encoding
#default_with_oids = off
#escape_string_warning = on
#lo_compat_privileges = off
#quote_all_identifiers = off
#sql_inheritance = on
#standard_conforming_strings = on
#synchronize_seqscans = on

# - Other Platforms and Clients -

#transform_null_equals = off


#------------------------------------------------------------------------------
# ERROR HANDLING
#------------------------------------------------------------------------------

#exit_on_error = off                # terminate session on any error?
#restart_after_crash = on            # reinitialize after backend crash?


#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------

#custom_variable_classes = ''        # list of custom variable class names

Re: huge price database question..

From
Lee Hachadoorian
Date:


On Tue, Mar 20, 2012 at 11:28 PM, Jim Green <student.northwestern@gmail.com> wrote:
On 20 March 2012 22:57, John R Pierce <pierce@hogranch.com> wrote:

> avg() in the database is going to be a lot faster than copying the data into
> memory for an application to process.

I see..

As an example, I ran average on a 700,000-row table with 231 census variables reported by state. Running average on all 231 columns grouping by state inside Postgres beat running it in R by a factor of 130, NOT COUNTING an additional minute or so to pull the table from Postgres to R. To be fair, these numbers are not strictly comparable, because the two are running on different hardware. But the setup is not atypical: Postgres is running on a heavy-hitting server while R is running on my desktop.

SELECT state, avg(col1), avg(col2), [...] avg(col231)
FROM some_table
GROUP BY state;

5741 ms

aggregate(dfSomeTable, by = list(dfSomeTable$state), FUN = mean, na.rm = TRUE)

754746 ms
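
If anyone wants to reproduce that kind of comparison from a single R session, a minimal sketch is below; the connection parameters, table name, and column list are placeholders rather than my actual setup.

library(RPostgreSQL)
conn <- dbConnect(PostgreSQL(), dbname = "census", host = "dbserver")  # placeholders

# Aggregate inside Postgres, pulling back only one row per state
system.time(
  dfGrouped <- dbGetQuery(conn,
    "SELECT state, avg(col1), avg(col2) FROM some_table GROUP BY state;")
)

# Pull the whole table over and aggregate it in R instead
system.time(dfSomeTable <- dbGetQuery(conn, "SELECT * FROM some_table"))
system.time(
  aggregate(dfSomeTable, by = list(dfSomeTable$state), FUN = mean, na.rm = TRUE)
)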
 
--Lee

Re: huge price database question..

From
Adrian Klaver
Date:
On 03/21/2012 09:34 AM, Jim Green wrote:
> On 21 March 2012 11:01, Steve Crawford<scrawford@pinpointresearch.com>  wrote:
>> Something sounds wrong, here. XID wraps around somewhere around 4-billion
>> transactions which is a substantial multiple of the entire number of records
>> you are trying to insert. Do you have any unusual vacuum settings?
>
> I have autovacuum=off in my config and I still get autovacuum processes..
> but yes, I don't have 4 billion inserts till now.. I have attached my
> config used to do bulk loading at the end..


Basically it runs as a safety check. For more info see below:

http://www.postgresql.org/docs/9.1/interactive/runtime-config-autovacuum.html

autovacuum (boolean)

     Controls whether the server should run the autovacuum launcher
daemon. This is on by default; however, track_counts must also be
enabled for autovacuum to work. This parameter can only be set in the
postgresql.conf file or on the server command line.

     Note that even when this parameter is disabled, the system will
launch autovacuum processes if necessary to prevent transaction ID
wraparound. See Section 23.1.4 for more information.

Section 23.1.4 is where you can find all the details:

http://www.postgresql.org/docs/9.1/interactive/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
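
If you want to see how close each database is to the point where that forced anti-wraparound vacuum kicks in, a query along these lines (sketched here through RPostgreSQL; the connection details are placeholders) compares the transaction ID age to autovacuum_freeze_max_age:

library(RPostgreSQL)
conn <- dbConnect(PostgreSQL(), dbname = "prices")   # placeholder database name
dbGetQuery(conn, "
  SELECT datname,
         age(datfrozenxid) AS xid_age,
         current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
  FROM pg_database
  ORDER BY xid_age DESC;")
# an anti-wraparound autovacuum is forced as xid_age approaches freeze_max_age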


--
Adrian Klaver
adrian.klaver@gmail.com

Re: huge price database question..

From
Andy Colson
Date:
On 3/21/2012 11:45 AM, Lee Hachadoorian wrote:
>
>
> On Tue, Mar 20, 2012 at 11:28 PM, Jim Green
> <student.northwestern@gmail.com <mailto:student.northwestern@gmail.com>>
> wrote:
>
>     On 20 March 2012 22:57, John R Pierce <pierce@hogranch.com
>     <mailto:pierce@hogranch.com>> wrote:
>
>      > avg() in the database is going to be a lot faster than copying
>     the data into
>      > memory for an application to process.
>
>     I see..
>
>
> As an example, I ran average on a 700,000 row table with 231 census
> variables reported by state. Running average on all 231 columns grouping
> by state inside Postgres beat running it by R by a factor of 130 NOT
> COUNTING an additional minute or so to pull the table from Postgres to
> R. To be fair, these numbers are not strictly comparable, because it's
> running on different hardware. But the setup is not atypical: Postgres
> is running on a heavy hitting server while R is running on my desktop.
>
> SELECT state, avg(col1), avg(col2), [...] avg(col231)
> FROM some_table
> GROUP BY state;
>
> 5741 ms
>
> aggregate(dfSomeTable, by = list(dfSomeTable$state), FUN = mean, na.rm =
> TRUE)
>
> 754746 ms
>
> --Lee

avg() might not be a good example though.  If you just want average,
great, but what about when you want to do lots of different stats'y
functions that PG does not have?  I'll bet R (not pl-R) can read a text
file (a csv probably) pretty fast.

If you were doing a really complex analysis I bet dumping a dataset out
to csv first and then running R scripts over it would be fast and useful.
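
A rough sketch of that workflow (the table, column, and file names here are invented; note that a server-side COPY TO a file needs superuser rights, and psql's \copy is the client-side alternative):

library(RPostgreSQL)
conn <- dbConnect(PostgreSQL(), dbname = "prices")   # placeholder
dbGetQuery(conn, "COPY (SELECT * FROM minute_bars WHERE symbol = 'AAPL')
                  TO '/tmp/aapl.csv' WITH CSV HEADER;")   # server writes the file
bars <- read.csv("/tmp/aapl.csv")   # then do the heavy stats work in R
summary(bars$close)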

 > aggregate(dfSomeTable, by = list(dfSomeTable$state), FUN = mean, na.rm =
 > TRUE)

Is this a Postgres stored proc (pl-R)?  Or is that R itself?  If it's pl-R
then I wonder if it's stepping through the recordset twice.  Depending on
how the function is written, I can see the function firing off a query,
PG collects the entire recordset, and hands it back to the function,
which then iterates thru it again.  (vs. having the function called for
each row as PG steps thru the recordset only once).

-Andy

Re: huge price database question..

From
Steve Crawford
Date:
On 03/21/2012 09:34 AM, Jim Green wrote:
> On 21 March 2012 11:01, Steve Crawford<scrawford@pinpointresearch.com>  wrote:
>> Something sounds wrong, here. XID wraps around somewhere around 4-billion
>> transactions which is a substantial multiple of the entire number of records
>> you are trying to insert. Do you have any unusual vacuum settings?
> I have autovacuum=off in my config and I still get autovacuum processes..
> but yes, I don't have 4 billion inserts till now.. I have attached my
> config used to do bulk loading at the end..
I should have noted that autovacuum_freeze_max_age defaults to
200 million transactions, which you are sure to hit in less than a year's
worth of data if you are doing individual inserts (each insert is its own
transaction).
>> As mentioned by others and myself, anything that batches lots of inserts
>> together into a single transaction instead of each record being its own
>> transaction will help whether this is COPY or BEGIN, INSERT......,
> I thought I already turned synchronous commit off in config, but
> probably it is not the same thing as you said.
It is not.

Let's look at inserting one day's data into a single table as a COPY vs.
individual INSERTs. A COPY is *one* transaction. Individual INSERTs are
1,200,000 transactions, which moves your transaction ID ahead an
unnecessary 1,199,999 times. There is plenty of other overhead consumed
in each transaction as well.

Turning synchronous_commit off will help mask some of the inefficiency
of doing individual transactions, but it will not cure it. Note that even
when it is turned off, the maximum delay before writing WAL is three times
wal_writer_delay, which is typically 200ms, so you are only looking at
about 0.6 seconds before the WAL is written.
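
As a concrete illustration (not your actual loader; the table and file names below are invented), loading a whole daily file through RPostgreSQL in one shot rather than row by row looks roughly like this:

library(RPostgreSQL)
conn <- dbConnect(PostgreSQL(), dbname = "prices")      # placeholder connection
day  <- read.csv("minute_bars_2012-03-21.csv")          # one day's ~1.2M rows
# dbWriteTable with append = TRUE pushes the whole data frame in one load
# (as far as I know RPostgreSQL does this with COPY ... FROM STDIN), so the
# day goes in as a single transaction instead of 1.2 million of them.
dbWriteTable(conn, "minute_bars", day, append = TRUE, row.names = FALSE)
# If you stay with plain INSERTs, at least batch them between an explicit
# BEGIN and COMMIT so each row is not its own transaction.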

Cheers,
Steve


Re: huge price database question..

From
Lee Hachadoorian
Date:


On Wed, Mar 21, 2012 at 12:57 PM, Andy Colson <andy@squeakycode.net> wrote:
On 3/21/2012 11:45 AM, Lee Hachadoorian wrote:


   On 20 March 2012 22:57, John R Pierce <pierce@hogranch.com
   <mailto:pierce@hogranch.com>> wrote:

    > avg() in the database is going to be a lot faster than copying
   the data into
    > memory for an application to process.


As an example, I ran average on a 700,000 row table with 231 census
variables reported by state. Running average on all 231 columns grouping
by state inside Postgres beat running it by R by a factor of 130 NOT
COUNTING an additional minute or so to pull the table from Postgres to
R. To be fair, these numbers are not strictly comparable, because it's
running on different hardware. But the setup is not atypical: Postgres
is running on a heavy hitting server while R is running on my desktop.

SELECT state, avg(col1), avg(col2), [...] avg(col231)
FROM some_table
GROUP BY state;

5741 ms

aggregate(dfSomeTable, by = list(dfSomeTable$state), FUN = mean, na.rm =
TRUE)

754746 ms

--Lee

avg() might not be a good example though.  If you just want average, great, but what about when you want to do lots of different stats'y functions that PG does not have?  I'll bet R (not pl-R) can read a text file (a csv probably) pretty fast.


Don't get me wrong. I DO use R for all kinds of stats and graphics stuff. I just ran avg() here to test the claim made above.

I did some garbage collection and reran the aggregate() on half the data and this time it only took 80 seconds. Much faster, though still an order of magnitude slower than running it inside Postgres. The 700,000-row data frame in an "unclean" R session swamped the 6 GB RAM on my desktop. Maybe the original test was unfair. On the other hand, it does directly speak to John's comment that "avg() in the database is going to be a lot faster than copying the data into memory for an application to process."
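
For what it's worth, the sort of cleanup I mean looks like this (the object names are just examples):

print(object.size(dfSomeTable), units = "Mb")    # how big the data frame itself is
rm(list = setdiff(ls(), "dfSomeTable"))          # drop leftover objects from earlier work
gc()                                             # reclaim memory before re-timing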

If you were doing a really complex analysis I bet dumping a dataset out to csv first and then running R scripts over it would be fast and useful.

Interesting:

COPY statement (same table, 700,000 rows): 7761 ms
SFTP file copy from Postgres server to local machine: ~21 seconds (a rough estimate; I didn't time it and my FTP client didn't report it)
read.csv (R command): 62 seconds

Total ~ 91 seconds

R commands:
library(RPostgreSQL)
<declare connection object>
dfSomeTable = dbGetQuery(conn, "SELECT * FROM some_table")

~ 85 seconds

So whether you export to CSV first or use RPostgreSQL, it's roughly equivalent, though I would tend to favor RPostgreSQL just because it's one step instead of three.

On the other hand:

sql = "SELECT state, avg(col1), avg(col2), [...] avg(col231) FROM some_table GROUP BY state;"
dfGrouped = dbGetQuery(conn, sql)

~ 5 seconds

So aggregating inside Postgres, then transferring the smaller result set via RPostgreSQL, wins on two counts: speed of aggregation and speed of transfer.
 


> aggregate(dfSomeTable, by = list(dfSomeTable$state), FUN = mean, na.rm =
> TRUE)

Is this a Postgres stored proc (pl-R)?  Or is that R itself?  If its plR then I wonder if its stepping through the recordset twice.  Depending on how the function is written, I can see the function firing off a query, PG collects the entire recordset, and hands it back to the function, which then iterates thru it again.  (vs. having the function called for each row as PG steps thru the recordset only once).


This is run in R. The data frame (dfSomeTable in this code snippet) is already in R, so no trips to Postgres are taking up time in this process. The real reason for the inflated time is pointed out above.

--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/