Thread: best db schema for time series data?
Hi,

I have to collect lots of prices from web sites and keep track of their
changes. What is the best option?

1) one 'price' row per price change:

    create table price (
        id_price primary key,
        id_product integer references product,
        price integer
    );

2) a single 'price' row containing all the changes:

    create table price (
        id_price primary key,
        id_product integer references product,
        price integer[] -- prices are 'pushed' on this array as they change
    );

Which is bound to give the best performance, knowing I will often need
to access the latest and next-to-latest prices?

Thanks,
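For reference, "latest and next-to-latest" lookups would look roughly like
this under each option (a sketch only; it assumes id_price is a serial that
reflects insertion order, and that new prices are appended to the end of the
array):

    -- option 1: two most recent rows for a product
    select price from price
     where id_product = 1
     order by id_price desc limit 2;

    -- option 2: last and next-to-last array elements
    select price[array_upper(price, 1)],
           price[array_upper(price, 1) - 1]
      from price
     where id_product = 1;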
Hello,

my opinion:

@1 can be faster for access to the last items with an index
@2 can be more effective about data files length allocation

@1 or @2 - it depends on the number of prices per product. For a small
number (less than 100) I am strongly for @2 (if speed is important).
Personally I prefer @2.

Pavel

2010/11/16 Louis-David Mitterrand <vindex+lists-pgsql-performance@apartia.org>:
> Hi,
>
> I have to collect lots of prices from web sites and keep track of their
> changes. What is the best option?
> [...]
> Which is bound to give the best performance, knowing I will often need
> to access the latest and next-to-latest prices?
On Tue, Nov 16, 2010 at 12:03:29PM +0100, Pavel Stehule wrote:
> Hello
>
> my opinion:
>
> @1 can be faster for access to the last items with an index
> @2 can be more effective about data files length allocation

Hi Pavel,

What is "data files length allocation"?
2010/11/16 Louis-David Mitterrand <vindex+lists-pgsql-performance@apartia.org>:
> On Tue, Nov 16, 2010 at 12:03:29PM +0100, Pavel Stehule wrote:
>> @1 can be faster for access to the last items with an index
>> @2 can be more effective about data files length allocation
>
> Hi Pavel,
>
> What is "data files length allocation"?

The size of the data files on disc :) PostgreSQL needs some bytes of
header for every row - so if you use an array, that per-row overhead is
shared by all the prices of a product. Also, varlena types (like arrays)
can be compressed.

Pavel
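A quick way to see both effects (a sketch; the table names price_row and
price_array are illustrative and assume the two designs exist side by side):

    -- per-value size of the array column as stored (after possible TOAST compression)
    select pg_column_size(price) from price_array limit 1;

    -- total on-disk size of each design, including per-row header overhead
    select pg_total_relation_size('price_array'),
           pg_total_relation_size('price_row');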
On 16-11-2010 11:50, Louis-David Mitterrand wrote:
> I have to collect lots of prices from web sites and keep track of their
> changes. What is the best option?
>
> 1) one 'price' row per price change:
> [...]
> 2) a single 'price' row containing all the changes:
> [...]
> Which is bound to give the best performance, knowing I will often need
> to access the latest and next-to-latest prices?

If you mostly need the last few prices, I'd definitely go with the
first approach; it's much cleaner. Besides, you can store a date/time
per price, so you know when it changed. With the array approach that's
a bit harder to do.

If you're concerned with performance, introduce some form of a
materialized view for the most recent price of a product. Or reverse
the entire process and make a "current price" table and a "price
history" table.

Best regards,

Arjen
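As an aside, the "materialized view for the most recent price" can be
approximated with a plain view over the first approach (a sketch; it assumes
a changed_at timestamp column has been added to the price table):

    create view current_price as
    select distinct on (id_product)
           id_product, price, changed_at
      from price
     order by id_product, changed_at desc;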
On Tue, Nov 16, 2010 at 12:18:35PM +0100, Arjen van der Meijden wrote:
> If you mostly need the last few prices, I'd definitely go with the
> first approach; it's much cleaner. Besides, you can store a date/time
> per price, so you know when it changed. With the array approach
> that's a bit harder to do.
>
> If you're concerned with performance, introduce some form of a
> materialized view for the most recent price of a product. Or reverse
> the entire process and make a "current price" table and a "price
> history" table.

That's exactly my current 'modus operandi'. So it's nice to have
confirmation that I'm not using the worst schema out there :)
Hi,

> > If you mostly need the last few prices, I'd definitely go with the
> > first approach; it's much cleaner. Besides, you can store a date/time
> > per price, so you know when it changed.

We too were using such an approach for 'soft deletes'. Soon we realized
that using a one-char valid flag to mark the latest records was better.
It was easier to filter on that. An index on the modified-date column
was not being used consistently for some reason or the other. The VALID
records form a small portion of the big table, and an index on the
column helps fetch the data pretty fast. Of course, you could also
partition on the flag (we did not have to). A slight processing
overhead of updating the VALID flag column is the penalty. This was an
Oracle database.

Regards,

Jayadevan
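In PostgreSQL terms the same idea could look roughly like this (a sketch;
the column and index names are illustrative, not from the thread):

    alter table price add column is_current boolean not null default true;

    -- partial unique index: current-price lookups only scan current rows
    create unique index price_current_idx
        on price (id_product) where is_current;

    -- when a new price arrives: retire the old row first,
    -- then insert the new one (it gets is_current = true by default)
    update price set is_current = false
     where id_product = 123 and is_current;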
In article <4CE2688B.2050000@tweakers.net>,
Arjen van der Meijden <acmmailing@tweakers.net> writes:

> On 16-11-2010 11:50, Louis-David Mitterrand wrote:
>> I have to collect lots of prices from web sites and keep track of their
>> changes. What is the best option?
>> [...]
> If you mostly need the last few prices, I'd definitely go with the
> first approach; it's much cleaner. Besides, you can store a date/time
> per price, so you know when it changed. With the array approach that's
> a bit harder to do.

I'd probably use a variant of this:

    CREATE TABLE prices (
        pid int NOT NULL REFERENCES products,
        validTil timestamp(0) NULL,
        price int NOT NULL,
        UNIQUE (pid, validTil)
    );

The current price of a product is always the row with validTil IS NULL.
The lookup should be pretty fast because it can use the index of the
UNIQUE constraint.
vindex+lists-pgsql-performance@apartia.org (Louis-David Mitterrand) writes:
> I have to collect lots of prices from web sites and keep track of their
> changes. What is the best option?
> [...]
> Which is bound to give the best performance, knowing I will often need
> to access the latest and next-to-latest prices?

I'd definitely bias towards #1, but with a bit of a change...

    create table product (
        id_product serial primary key
    );

    create table price (
        id_product integer references product,
        as_at timestamptz default now(),
        primary key (id_product, as_at),
        price integer
    );

The query to get the last 5 prices for a product should be splendidly
efficient:

    select price, as_at from price
     where id_product = 17
     order by as_at desc limit 5;

(That'll use the PK index perfectly nicely.)

If you needed higher performance, for "latest price," then I'd add a
secondary table, and use triggers to copy the latest price into place:

    create table latest_prices (
        id_product integer primary key references product,
        price integer
    );

    create or replace function capture_latest_price () returns trigger as $$
    begin
        delete from latest_prices where id_product = NEW.id_product;
        insert into latest_prices (id_product, price)
            values (NEW.id_product, NEW.price);
        return NEW;
    end
    $$ language plpgsql;

    -- needs to be a row-level trigger so NEW is available
    create trigger price_capture
        after insert on price
        for each row execute procedure capture_latest_price();

This captures *just* the latest price for each product. (There's a bit
of a race condition - if there are two concurrent price updates, one
will fail, which wouldn't happen without this trigger in place.)

-- 
"... Turns out that JPG was in fact using his brain... and I am inclined
to encourage him to continue the practice even if it isn't exactly what
I would have done myself." -- Alan Bawden (way out of context)
On Tue, Nov 16, 2010 at 11:35:24AM -0500, Chris Browne wrote:
> I'd definitely bias towards #1, but with a bit of a change...
>
>     create table product (
>         id_product serial primary key
>     );
>
>     create table price (
>         id_product integer references product,
>         as_at timestamptz default now(),
>         primary key (id_product, as_at),
>         price integer
>     );

Hi Chris,

So an "id_price serial" on the price table is not necessary in your
opinion? I am using "order by id_price limit X" or "max(id_price)" to
get at the most recent prices.

> The query to get the last 5 prices for a product should be splendidly
> efficient:
>
>     select price, as_at from price
>      where id_product = 17
>      order by as_at desc limit 5;
>
> (That'll use the PK index perfectly nicely.)
>
> If you needed higher performance, for "latest price," then I'd add a
> secondary table, and use triggers to copy the latest price into place:
>
>     create table latest_prices (
>         id_product integer primary key references product,
>         price integer
>     );

I did the same thing with a 'price_dispatch' trigger and partitioned
tables (inheritance). It's definitely needed when the price database
grows into the millions.

Thanks,
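For readers unfamiliar with the partitioning approach mentioned here: at the
time this was done with table inheritance plus a dispatch trigger, roughly
like the sketch below (child table, constraint, and function names are made
up, and it assumes the price table has the as_at column from Chris's sketch):

    create table price_2010 (
        check (as_at >= '2010-01-01' and as_at < '2011-01-01')
    ) inherits (price);

    create or replace function price_dispatch() returns trigger as $$
    begin
        if NEW.as_at >= '2010-01-01' and NEW.as_at < '2011-01-01' then
            insert into price_2010 values (NEW.*);
        end if;
        return null;  -- the row has been routed to a child table
    end
    $$ language plpgsql;

    create trigger price_dispatch
        before insert on price
        for each row execute procedure price_dispatch();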
On Tue, Nov 16, 2010 at 05:28:19PM +0100, Harald Fuchs wrote:
> I'd probably use a variant of this:
>
>     CREATE TABLE prices (
>         pid int NOT NULL REFERENCES products,
>         validTil timestamp(0) NULL,
>         price int NOT NULL,
>         UNIQUE (pid, validTil)
>     );
>
> The current price of a product is always the row with validTil IS NULL.
> The lookup should be pretty fast because it can use the index of the
> UNIQUE constraint.

Hi,

The validTil idea is nice, but you have to manage that field with a
trigger, right?
vindex+lists-pgsql-performance@apartia.org (Louis-David Mitterrand) writes:
> On Tue, Nov 16, 2010 at 11:35:24AM -0500, Chris Browne wrote:
>> I'd definitely bias towards #1, but with a bit of a change...
>> [...]
> Hi Chris,
>
> So an "id_price serial" on the price table is not necessary in your
> opinion? I am using "order by id_price limit X" or "max(id_price)" to
> get at the most recent prices.

It (id_price) is an extra piece of information that doesn't reveal an
important fact, namely when the price was added.

I'm uncomfortable with adding data that doesn't provide much more
information, and it troubles me when people put a lot of interpretation
into the meanings of SERIAL columns.

I'd like to set up some schemas (for experiment, if not necessarily to
get deployed to production) where I'd use DCE UUID values rather than
sequences, so that people wouldn't make the error of imagining meanings
in the values that aren't really there.

And I suppose that therein lies a way to think about it... If you used
UUIDs rather than SERIAL, how would your application break? And of the
ways in which it would break, which of those are errors that fall out of:

a) Ignorant usage, assuming order that isn't really there? (e.g. - a
   SERIAL might capture some order information, but UUID won't!)

b) Inadequate data capture, where you're using the implicit data
   collection from SERIAL to capture, poorly, information that should
   be expressly captured?

When I added the timestamp to the "price" table, that was intended to
address b), capturing the time that the price was added.

>> The query to get the last 5 prices for a product should be splendidly
>> efficient:
>>
>>     select price, as_at from price
>>      where id_product = 17
>>      order by as_at desc limit 5;
>>
>> If you needed higher performance, for "latest price," then I'd add a
>> secondary table, and use triggers to copy the latest price into place.
>
> I did the same thing with a 'price_dispatch' trigger and partitioned
> tables (inheritance). It's definitely needed when the price database
> grows into the millions.

The conversations are always interesting! Cheers!

-- 
output = ("cbbrowne" "@" "gmail.com")
http://www3.sympatico.ca/cbbrowne/x.html
FLORIDA: If you think we can't vote, wait till you see us drive.
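For anyone who wants to try the experiment Chris describes, a sketch
(it assumes the contrib uuid-ossp module is installed, which provides
uuid_generate_v4(); the as_at column then carries all the ordering
information, since the id implies no order at all):

    create table price (
        id_price   uuid primary key default uuid_generate_v4(),
        id_product integer references product,
        as_at      timestamptz not null default now(),
        price      integer
    );

    -- recency must come from as_at, never from the id
    select price, as_at from price
     where id_product = 17
     order by as_at desc limit 2;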
On Fri, Nov 19, 2010 at 10:50 AM, Louis-David Mitterrand
<vindex+lists-pgsql-performance@apartia.org> wrote:
> On Tue, Nov 16, 2010 at 05:28:19PM +0100, Harald Fuchs wrote:
>> I'd probably use a variant of this:
>>
>>     CREATE TABLE prices (
>>         pid int NOT NULL REFERENCES products,
>>         validTil timestamp(0) NULL,
>>         price int NOT NULL,
>>         UNIQUE (pid, validTil)
>>     );
>>
>> The current price of a product is always the row with validTil IS NULL.
>> The lookup should be pretty fast because it can use the index of the
>> UNIQUE constraint.

Even better: with a partial index the lookup should be more efficient,
and it will probably stay that way even when the number of prices
increases (and the number of products stays the same). With

    CREATE UNIQUE INDEX current_prices
        ON prices ( pid )
        WHERE validTil IS NULL;

I get

    robert=> explain select price from prices
             where pid = 12344 and validTil is null;
                                       QUERY PLAN
    -----------------------------------------------------------------------------
     Index Scan using current_prices on prices  (cost=0.00..8.28 rows=1 width=4)
       Index Cond: (pid = 12344)
    (2 rows)

The index can actually be used here. (see attachment)

> The validTil idea is nice, but you have to manage that field with a
> trigger, right?

Well, you don't need to. You can always do

    begin;
    update prices set validTil = current_timestamp
     where pid = 123 and validTil is NULL;
    insert into prices values ( 123, null, 94 );
    commit;

But with a trigger it would be more convenient of course.

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
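For completeness, such a trigger could look roughly like this (a sketch
against Harald's prices table; the function and trigger names are made up):

    create or replace function close_current_price() returns trigger as $$
    begin
        -- retire the previous current price before the new row is inserted
        update prices set validTil = current_timestamp
         where pid = NEW.pid and validTil is null;
        return NEW;
    end
    $$ language plpgsql;

    create trigger close_current_price
        before insert on prices
        for each row execute procedure close_current_price();

With that in place a plain "insert into prices values ( 123, null, 94 );"
is enough; the old current row is closed out automatically.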
--- On Fri, 11/19/10, Robert Klemme <shortcutter@googlemail.com> wrote:

> From: Robert Klemme <shortcutter@googlemail.com>
> Subject: Re: [PERFORM] best db schema for time series data?
> To: pgsql-performance@postgresql.org
> Date: Friday, November 19, 2010, 7:16 PM
>
> Even better: with a partial index the lookup should be more efficient,
> and it will probably stay that way even when the number of prices
> increases (and the number of products stays the same).
> [...]
> Well, you don't need to. You can always do
>
>     begin;
>     update prices set validTil = current_timestamp
>      where pid = 123 and validTil is NULL;
>     insert into prices values ( 123, null, 94 );
>     commit;
>
> But with a trigger it would be more convenient of course.

Louis,

Someday, as sure as Codd made little relational databases, someone will
put an incorrect price in that table, and it will have to be changed,
and that change will ripple throughout your system. You have a unique
chance here, at the beginning, to foresee that inevitability and plan
for it.

Take a look at

    http://en.wikipedia.org/wiki/Temporal_database

and

    http://pgfoundry.org/projects/temporal/

and anything Snodgrass ever wrote about temporal databases. It's a
fascinating schema design subject, one that comes in very handy in
dealing with time-influenced data.

Good luck!

Bob Lunney