Thread: optimizing daily data storage in Pg

optimizing daily data storage in Pg

From: P Kishor

I have been struggling with this for a while now, have even gone down
a few paths but struck out, so I turn now to the community for ideas.
First, the problem: Store six daily variables for ~ 25 years for cells
in a grid.

 * Number of vars = 6
 * Number of cells ~ 13 million
 * Number of days ~ 9125 (25 * 365)

Optimize the store for two different kinds of queries:

Query one: Retrieve the value of a single var for all or a portion of
the cells for a single day. This is analogous to an image where every
pixel is the value of a single var.

Query two: Retrieve values for all the days or a duration of days for
a single var for a single cell. This is like grabbing a column out of
a table in which each row holds all the vars for a single day.

So, I set about designing the db. The "grid" is in a table with 13 million rows

    CREATE TABLE cells (
        cell_id      INTEGER,
        other_data ..
    )
    WITH (
        OIDS=FALSE
    )
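
Presumably cell_id would at least carry a primary key; a minimal sketch, assuming cell_id uniquely identifies a grid cell:

    -- assumption: cell_id uniquely identifies a grid cell
    ALTER TABLE cells ADD PRIMARY KEY (cell_id);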


A single table *where every row is one day's values for one cell* looks like so

    CREATE TABLE d (
        yr      SMALLINT,
        yday    SMALLINT,
        a       SMALLINT,
        b       SMALLINT,
        d       SMALLINT,
        e       SMALLINT,
        f       SMALLINT,
        g       SMALLINT,
        cell_id INTEGER
    )
    WITH (
        OIDS=FALSE
    )

The data would look like so

    yr      yday    a   b   d   e   f   g   cell_id
    ------------------------------------------------
    1980    1       x   x   x   x   x   x   1
    ..
    1980    365     x   x   x   x   x   x   1
    ...
    1981    1       x   x   x   x   x   x   1
    ..
    1981    365     x   x   x   x   x   x   1
      ...
      ...
    2005    1       x   x   x   x   x   x   1
    ..
    2005    365     x   x   x   x   x   x   1
    ......
    1980    1       x   x   x   x   x   x   2
    ..
    1980    365     x   x   x   x   x   x   2
      ...

I could now (theoretically) conduct my queries like so:

Query 1a: Retrieve the value of a single var for all the cells for a
single day. This is analogous to an image where every pixel is the
value of a single var.

    SELECT <var> FROM d WHERE yr = ? AND yday = ?;

I assume I would need indexes on yr and yday, or perhaps even a
compound index on (yr, yday).
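
A minimal sketch of that compound index (the index name is just illustrative):

    -- compound index to serve the (yr, yday) lookups; name is illustrative
    CREATE INDEX d_yr_yday_idx ON d (yr, yday);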

Query 1b: Retrieve the value of a single var for a portion of the
cells for a single day. This is analogous to an image where every
pixel is the value of a single var.

    SELECT <var> FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);

I assume I would need indexes on yr and yday, or perhaps even a
compound index on (yr, yday), as well as an index on cell_id.

Query 2: Retrieve values for all the days or a duration of days for a
single var for a single cell. This is like grabbing a column out of a
table in which each row holds all the vars for a single day.

    SELECT <var> FROM d WHERE cell_id = ?;
    SELECT <var> FROM d WHERE cell_id IN (?,?,?...);

Once again, an index on cell_id would assist in the above.
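
A sketch of that index too (name illustrative); leading with cell_id and tacking on (yr, yday) would also cover the date-range-for-one-cell case:

    -- per-cell lookups; the trailing (yr, yday) also covers date ranges
    -- restricted to a single cell. Name is illustrative.
    CREATE INDEX d_cell_yr_yday_idx ON d (cell_id, yr, yday);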

The problem: The above table would have 13 M * 9125 rows ~ 118 billion
rows. Huge indexes, slow queries, etc. In fact, major issues loading
the data in the first place. Since I am loading data in batches, I
drop the indexes (takes time), COPY data into the table (takes time),
build the indexes (takes time), experiment with improving the
performance (takes time), fail, rinse, lather, repeat. I actually
tried the above with a subset of data (around 100 M rows) and
experienced all of the above. I don't remember the query times, but
they were awful.
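
The load cycle looks roughly like this (the file path is a placeholder, and the index names are the illustrative ones from the sketches above):

    -- drop indexes, bulk-load, rebuild
    DROP INDEX IF EXISTS d_yr_yday_idx;
    DROP INDEX IF EXISTS d_cell_yr_yday_idx;

    COPY d (yr, yday, a, b, d, e, f, g, cell_id)
        FROM '/path/to/batch.csv' WITH CSV;

    CREATE INDEX d_yr_yday_idx ON d (yr, yday);
    CREATE INDEX d_cell_yr_yday_idx ON d (cell_id, yr, yday);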

So, I partitioned the table into years like so

    CREATE TABLE d_<yyyy> (
        CHECK ( yr = <yyyy> )
    ) INHERITS (d)
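
Spelled out for one year (index name illustrative), with constraint exclusion turned on so the planner can skip non-matching years:

    -- one concrete child table; the CHECK constraint lets the planner
    -- prune other years when constraint_exclusion is enabled
    CREATE TABLE d_1980 (
        CHECK ( yr = 1980 )
    ) INHERITS (d);

    CREATE INDEX d_1980_cell_idx ON d_1980 (cell_id, yday);

    SET constraint_exclusion = on;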

Hmmm... still no satisfaction. I ended up with 1 master table + 25
inherited tables. Each of the year tables now had ~ 4.75 billion rows
(13 M * 365), and the queries were still very slow.

So, I partitioned it all by years and days like so

    CREATE TABLE d_<yyyy>_<yday> (
        CHECK ( yr = <yyyy> AND yday = <yday> )
    ) INHERITS (d)

Each table now has 13 million rows, and is reasonably fast (although
still not satisfactorily fast), but now I have 9K tables. That has its
own problems. I can't query the master table anymore, as Pg tries to
lock all the tables and runs out of memory. Additionally, I can no
longer run query two above. I could do something like

    SELECT a FROM d_1980_1 WHERE cell_id = 1
    UNION ALL
    SELECT a FROM d_1980_2 WHERE cell_id = 1
    UNION ALL
    SELECT a FROM d_1980_3 WHERE cell_id = 1
    UNION ALL
    SELECT a FROM d_1980_4 WHERE cell_id = 1
    UNION ALL
    ...

But the above is hardly optimal.

Any suggestions, ideas, brainstorms would be appreciated. Perhaps Pg,
or even an RDBMS in general, is not the right tool for this problem,
in which case suggestions for alternatives would be welcome as well.

Right now I am testing this on a dual dual-core Xeon 3 GHz Xserve with
12 GB RAM. The PGDATA directory is located on an attached RAID that is
configured as RAID5. A reasonable time for a query would be under 500
ms, although ultimately I would love to have the query done in under
250 ms, perhaps with RAID10 and a machine with more RAM. I have
access to a dual quad-core Xeon 3 GHz Xserve with 32 GB RAM and an
internal RAID, but before I move the data over, I want to conclusively
prove that Pg is the best solution (or not). Perhaps flat files are
better, perhaps blobs are better, or perhaps Pg's array column type.
Any ideas/suggestions welcome.
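
For what it's worth, the array-column idea might look roughly like this (purely a sketch, names illustrative): one row per cell per year, with each var held as a 365-element smallint array indexed by yday, which collapses the row count from ~118 billion to ~325 million (13 M * 25).

    -- sketch only: one row per cell per year; var arrays indexed by yday
    CREATE TABLE d_arr (
        cell_id INTEGER  NOT NULL,
        yr      SMALLINT NOT NULL,
        a       SMALLINT[],
        b       SMALLINT[],
        d       SMALLINT[],
        e       SMALLINT[],
        f       SMALLINT[],
        g       SMALLINT[],
        PRIMARY KEY (cell_id, yr)
    );

    -- query two then becomes one row fetch per cell per year
    SELECT a[183] AS a_on_day_183
    FROM d_arr
    WHERE cell_id = 1 AND yr = 1980;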

Oh, one more thing. Once the data are loaded and everything is
working, the data are read-only. They are historical, so they don't
need to be changed.


--
Puneet Kishor

Re: optimizing daily data storage in Pg

From: Andy Colson

On 7/22/2010 9:41 AM, P Kishor wrote:
> I have been struggling with this for a while now, have even gone down
> a few paths but struck out, so I turn now to the community for ideas.
> First, the problem: Store six daily variables for ~ 25 years for cells
> in a grid.
>
>   * Number of vars = 6
>   * Number of cells ~ 13 million
>   * Number of days ~ 9125 (25 * 365)
>
> Optimize the store for two different kinds of queries:
>
> Query one: Retrieve the value of a single var for all or a portion of
> the cells for a single day. This is analogous to an image where every
> pixel is the value of a single var.

 >      SELECT <var> FROM d WHERE yr = ? AND yday = ?;
 >      SELECT <var> FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);


>
> Query two: Retrieve values for all the days or a duration of days for
> a single var for a single cell. This is like grabbing a column out of
> a table in which each row holds all the vars for a single day.
 >      SELECT <var> FROM d WHERE cell_id = ?;
 >      SELECT <var> FROM d WHERE cell_id IN (?,?,?...);



First, I must admit to not reading your entire email.

Second, Query 1 should be fast, regardless of how you lay out the tables.

Third, Query 2 will return 13M rows?  I don't think it matters how you
lay out the tables; returning 13M rows is always going to be slow.


-Andy

Re: optimizing daily data storage in Pg

From: P Kishor

On Thu, Jul 22, 2010 at 4:56 PM, Andy Colson <andy@squeakycode.net> wrote:
> On 7/22/2010 9:41 AM, P Kishor wrote:
>>
>> I have been struggling with this for a while now, have even gone down
>> a few paths but struck out, so I turn now to the community for ideas.
>> First, the problem: Store six daily variables for ~ 25 years for cells
>> in a grid.
>>
>>  * Number of vars = 6
>>  * Number of cells ~ 13 million
>>  * Number of days ~ 9125 (25 * 365)
>>
>> Optimize the store for two different kinds of queries:
>>
>> Query one: Retrieve the value of a single var for all or a portion of
>> the cells for a single day. This is analogous to an image where every
>> pixel is the value of a single var.
>
>>      SELECT <var> FROM d WHERE yr = ? AND yday = ?;
>>      SELECT <var> FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);
>
>
>>
>> Query two: Retrieve values for all the days or a duration of days for
>> a single var for a single cell. This is like grabbing a column out of
>> a table in which each row holds all the vars for a single day.
>
>>      SELECT <var> FROM d WHERE cell_id = ?;
>>      SELECT <var> FROM d WHERE cell_id IN (?,?,?...);
>
>
>
> First, I must admit to not reading your entire email.

I am not sure how to respond to your feedback given that you haven't
read the entire email. Nevertheless, thanks for writing...

>
> Second, Query 1 should be fast, regardless of how you lay out the tables.

It is not fast. Right now I have data for about 250,000 cells loaded.
That comes to circa 92 million rows per year. Performance is pretty
sucky.


>
> Third, Query 2 will return 13M rows?  I don't think it matters how you lay out
> the tables; returning 13M rows is always going to be slow.
>

Yes, I understand that. In reality I will never get 13 M rows. For
display purposes, I will probably get around 10,000 to 50,000
rows. When more rows are needed, it will be to feed a model, so that
can be done offline (without an impatient human being waiting on the
other end).

Right now, my main problem is that I have either too many rows (~4 B
rows) in a manageable number of tables (25 tables), or a manageable
number of rows (~13 M rows) in too many tables (~9,000 tables).


>
> -Andy
>



--
Puneet Kishor http://www.punkish.org
Carbon Model http://carbonmodel.org
Charter Member, Open Source Geospatial Foundation http://www.osgeo.org
Science Commons Fellow, http://sciencecommons.org/about/whoweare/kishor
Nelson Institute, UW-Madison http://www.nelson.wisc.edu
-----------------------------------------------------------------------
Assertions are politics; backing up assertions with evidence is science
=======================================================================

Re: optimizing daily data storage in Pg

From: Scott Marlowe

Does ordering the data in the table by day, cell_id help at all?
How big ARE the indexes we're talking about?  If the total size of all
the indexes you need to do your queries runs into the hundreds of
gigabytes, going from 12 to 32 Gigs of RAM may be like flapping your
arms out the window of your car in terms of effectiveness.  If the
indexes that you need at one time add up to something in the 32 to
256Gig range then adding enough memory to hold most or all of them at
once would help and is doable, but the price goes up fast when you get
to the > 64Gig range.

If everything is read only, then pg may or may not be the best fit.
It sounds more like what you're doing is batch processing, or at least
batch processing friendly.  If you could roll up your data, either in
the db or beforehand while preparing it, that might be a big win.

Is your machine IO bound when running these queries?

What does "iostat -xd 1" say about throughput and % utilization? I'm
assuming you're running an OS with sysstat available.  If you're on
something else, then you'll need to research how to see your IO
workload on that OS.

htop is a nice graphical way of seeing your wait states as well, with
the red bars representing IO wait on a machine.

If your machine is IO bound, and you've gotten enough ram to hold the
working set of your indexes, then you'll need more hard drives under
good controllers to make it faster.  The bad news is that RAID
controllers and lots of hard drives can be expensive; the good news is
that reporting servers (which it sounds like this is) use a lot
of sequential access, and throwing more drives at the problem gives
big gains, usually.

As far as partitioning goes, I think you either need to use fewer
partitions, or just use individual tables without using the parent
table to access them.  It's a well known problem with partitioning
that past a few hundred or so child tables things get slow pretty
fast. Having 200 to 500 tables, maybe even 1,000 is workable, but past
that no, not really.
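
For instance (purely an illustrative sketch, not something benchmarked here), partitioning by year plus roughly month-sized yday ranges gives about 300 child tables (25 * 12), which stays inside that workable range:

    -- illustrative: ~300 child tables instead of ~9,000
    CREATE TABLE d_1980_m01 (
        CHECK ( yr = 1980 AND yday BETWEEN 1 AND 31 )
    ) INHERITS (d);

    CREATE INDEX d_1980_m01_cell_idx ON d_1980_m01 (cell_id, yday);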

If you're CPU bound, on the other hand, then you'll likely need more
CPU horsepower.

Re: optimizing daily data storage in Pg

From: Andy Colson

On 7/23/2010 12:39 AM, P Kishor wrote:
> On Thu, Jul 22, 2010 at 4:56 PM, Andy Colson<andy@squeakycode.net>  wrote:
>> On 7/22/2010 9:41 AM, P Kishor wrote:
>>>
>>> I have been struggling with this for a while now, have even gone down
>>> a few paths but struck out, so I turn now to the community for ideas.
>>> First, the problem: Store six daily variables for ~ 25 years for cells
>>> in a grid.
>>>
>>>   * Number of vars = 6
>>>   * Number of cells ~ 13 million
>>>   * Number of days ~ 9125 (25 * 365)
>>>
>>> Optimize the store for two different kinds of queries:
>>>
>>> Query one: Retrieve the value of a single var for all or a portion of
>>> the cells for a single day. This is analogous to an image where every
>>> pixel is the value of a single var.
>>
>>>       SELECT <var> FROM d WHERE yr = ? AND yday = ?;
>>>       SELECT <var> FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);
>>
>>
>>>
>>> Query two: Retrieve values for all the days or a duration of days for
>>> a single var for a single cell. This is like grabbing a column out of
>>> a table in which each row holds all the vars for a single day.
>>
>>>       SELECT <var> FROM d WHERE cell_id = ?;
>>>       SELECT <var> FROM d WHERE cell_id IN (?,?,?...);
>>
>>
>>
>> First, I must admit to not reading your entire email.
>
> I am not sure how to respond to your feedback given that you haven't
> read the entire email. Nevertheless, thanks for writing...
>

Heh, sorry, my point was, you put a lot of information into your email,
and I was going to only use one bit of it: row counts.

>>
>> Second, Query 1 should be fast, regardless of how you lay out the tables.
>
> It is not fast. Right now I have data for about 250,000 cells loaded.
> That comes to circa 92 million rows per year. Performance is pretty
> sucky.
>
>

This query should return one record, correct?  This should be very fast:
PG should be able to find the record in the index within 5 seeks, and
then find the data in one seek.  Can you post 'explain analyze' for this
query?  (Or it could be the case that I totally misunderstood your data.)
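
For example, for the single-cell form, something like (placeholder literals):

    EXPLAIN ANALYZE
    SELECT a FROM d WHERE yr = 1980 AND yday = 1 AND cell_id IN (1);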


>>
>> Third, Query 2 will return 13M rows?  I don't think it matters how you lay out
>> the tables; returning 13M rows is always going to be slow.
>>
>
> Yes, I understand that. In reality I will never get 13 M rows. For
> display purposes, I will probably get around 10,000 rows to 50,000
> rows. When more rows are needed, it will be to feed a model, so that
> can be offline (without an impatient human being waiting on the other
> end).
>
> Right now, my main problem is that I have either too many rows (~4 B
> rows) in a manageable number of tables (25 tables) or manageable
> number of rows (~13 M rows) in too many tables (~9000 tables).
>

Searching by just cell_id is not going to be very selective, and with
large result sets I can see this one being slow.  As Scott talked about
in his response, this one will come down to hardware.  Have you
dd-tested your hardware?

-Andy