Thread: Fast data, slow data

Fast data, slow data

From
Tim Uckun
Date:
I have a use case in which the most recent data experiences a lot of transactions (inserts and updates) and then the churn kind of calms down.  Eventually the data is relatively static and will only be updated in special and sporadic events.

I was thinking about keeping the high churn data in a different table so that the vacuums on that table can go faster and the vacuums on the rest of the data will rarely be needed.  

Ideally the high churn data would be in an in memory table but failing that I thought that an unlogged table would be good enough.  So now I need a way to flush the oldest data in this table to another logged table and do it in an efficient manner. I don't think partitioning is a good idea in this case because the partitions will be for small time periods (5 to 15 minutes). 

Anybody have a similar problem? If so how did you solve it?

Thanks.


Re: Fast data, slow data

From
Shaun Thomas
Date:
On 06/26/2014 04:29 AM, Tim Uckun wrote:

> I don't think partitioning is a good idea in this case because the
> partitions will be for small time periods (5 to 15 minutes).

Actually, partitioning might be exactly what you want, but not in the
way you might think. What you've run into is actually a pretty common
usage pattern. How we solve problems like this where I work is to use
table inheritance alone. Consider this:

CREATE TABLE my_table
(
   ... columns
);

CREATE TABLE my_table_stable () INHERITS (my_table);

Then you create a job that runs however often you want, and all that job
does is move old rows from my_table to my_table_stable. Like so:

BEGIN;

INSERT INTO my_table_stable
SELECT * FROM ONLY my_table
  WHERE date_col >= now() - INTERVAL '15 minutes';

DELETE FROM ONLY my_table
  WHERE date_col >= now() - INTERVAL '15 minutes';

COMMIT;

Or whatever. But you get the idea.

This way, you still get all the data by selecting from my_table, but the
data is partitioned in such a way that you can put the high turnover
table in another tablespace, or otherwise modify it for performance reasons.
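
For example, a minimal sketch of the tablespace idea (the tablespace name
and path here are made up for illustration; the directory must already exist,
be empty, and be owned by the PostgreSQL system user):

CREATE TABLESPACE fast_disk LOCATION '/mnt/fast_disk/pgdata';

-- ONLY ensures just the hot parent table moves; my_table_stable stays put.
ALTER TABLE ONLY my_table SET TABLESPACE fast_disk;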

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: Fast data, slow data

From
Steve Crawford
Date:
On 06/26/2014 02:29 AM, Tim Uckun wrote:
> I have a use case in which the most recent data experiences a lot of
> transactions (inserts and updates) and then the churn kind of calms
> down.  Eventually the data is relatively static and will only be
> updated in special and sporadic events.
>
> I was thinking about keeping the high churn data in a different table
> so that the vacuums on that table can go faster and the vacuums on the
> rest of the data will rarely be needed.
>
> Ideally the high churn data would be in an in memory table but failing
> that I thought that an unlogged table would be good enough.  So now I
> need a way to flush the oldest data in this table to another logged
> table and do it in an efficient manner. I don't think partitioning is
> a good idea in this case because the partitions will be for small time
> periods (5 to 15 minutes).
>
> Anybody have a similar problem? If so how did you solve it?

It's not at all unusual for a database to experience a high portion of
its activity on a small subset of the data, so in that sense the
"problem" is already solved by appropriately sizing and tuning the
caching and other mechanisms already provided by the OS and DB.

Having said that, there is far too little information here to provide
specific advice on tuning and other approaches that may be of use for
you. In particular:

Is this currently a hypothetical question or do you have a running
system that is experiencing performance problems?

In either case can you provide some parameters including rows and size
of your large table(s), what you mean by "high churn" (rate of inserts,
deletes, updates and how they are grouped into transactions), the size
of the table that is heavily updated, how the database is used
(transactional, data-mining, ...) and anything else you feel might help
the list understand your use-case.

If you have a live database, any metrics/observations you can supply
might be helpful. For example, are particular queries slow? What speed
is required? Do you have lots of simultaneous connections or is
everything through a single connection?

Finally, confirm that the data is not important or is easily recreated
(implied by your in-memory/unlogged-table comments).

Cheers,
Steve




Re: Fast data, slow data

From
Marti Raudsepp
Date:
On Thu, Jun 26, 2014 at 5:49 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> Then you create a job that runs however often you want, and all that job
> does is move old rows from my_table to my_table_stable. Like so:
>
> BEGIN;
> INSERT INTO my_table_stable
> SELECT * FROM ONLY my_table
>  WHERE date_col >= now() - INTERVAL '15 minutes';
> DELETE FROM ONLY my_table
>  WHERE date_col >= now() - INTERVAL '15 minutes';
> COMMIT;

This deserves a caveat: in the default "read committed" isolation
level, this example can delete more rows than it inserts. Even if your
date_col never changes and you don't have transactions running for
that long, it's best not to use this pattern. You could change the
isolation level using SET TRANSACTION, or, much better, use a writable
CTE (wCTE) to solve this atomically:

WITH deleted AS (
  DELETE FROM ONLY my_table
  WHERE date_col >= now() - INTERVAL '15 minutes'
  RETURNING *
)
INSERT INTO my_table_stable
  SELECT * FROM deleted;

Regards,
Marti


Re: Fast data, slow data

From
Shaun Thomas
Date:
On 06/26/2014 10:47 AM, Marti Raudsepp wrote:

> This deserves a caveat: in the default "read committed" isolation
> level, this example can delete more rows than it inserts.

This is only true because I accidentally inverted the date comparisons.
It should have been:

BEGIN;
INSERT INTO my_table_stable
SELECT * FROM ONLY my_table
   WHERE date_col <= now() - INTERVAL '15 minutes';
DELETE FROM ONLY my_table
   WHERE date_col <= now() - INTERVAL '15 minutes';
COMMIT;

Since now() is tied to transaction start time, it's safe in this
context. But you're right, I actually debated including the CTE syntax,
but you did it for me. :)
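
On the now() point, a quick illustration: within a single transaction
now() stays fixed at the transaction start time, while clock_timestamp()
keeps advancing:

BEGIN;
SELECT now(), clock_timestamp();  -- both report roughly the current time
SELECT pg_sleep(2);
SELECT now(), clock_timestamp();  -- now() unchanged; clock_timestamp() has advanced
COMMIT;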

Though you fell into the same trap I did. The correct CTE should be:

WITH deleted AS (
    DELETE FROM ONLY my_table
    WHERE date_col <= now() - INTERVAL '15 minutes'
    RETURNING *
)
INSERT INTO my_table_stable
SELECT * FROM deleted;

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: Fast data, slow data

From
Tim Uckun
Date:
This is what I was thinking but I am worried about two things.

1. If there is a very large set of data in the table that needs to be moved, this will be slow and might take locks that would impact the performance of the inserts and the updates.
2. Constantly deleting large chunks of data might cause vacuum problems.


On Fri, Jun 27, 2014 at 2:49 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
On 06/26/2014 04:29 AM, Tim Uckun wrote:

I don't think partitioning is a good idea in this case because the
partitions will be for small time periods (5 to 15 minutes).

Actually, partitioning might be exactly what you want, but not in the way you might think. What you've run into is actually a pretty common usage pattern. How we solve problems like this where I work is to use table inheritance alone. Consider this:

CREATE TABLE my_table
(
  ... columns
);

CREATE TABLE my_table_stable () INHERITS (my_table);

Then you create a job that runs however often you want, and all that job does is move old rows from my_table to my_table_stable. Like so:

BEGIN;

INSERT INTO my_table_stable
SELECT * FROM ONLY my_table
 WHERE date_col >= now() - INTERVAL '15 minutes';

DELETE FROM ONLY my_table
 WHERE date_col >= now() - INTERVAL '15 minutes';

COMMIT;

Or whatever. But you get the idea.

This way, you still get all the data by selecting from my_table, but the data is partitioned in such a way that you can put the high turnover table in another tablespace, or otherwise modify it for performance reasons.

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com


Re: Fast data, slow data

From
Tim Uckun
Date:
The database is functioning fine now but I am anticipating a much higher workload in the future.  The table in question is probably going to have a few million rows per day inserted into it when it gets busy, if it gets very busy it might be in the tens of millions per day but that's speculation at this point.

I don't want to say that the data is not important but if I drop one or two sensor readings it's not going to be the end of the world.



On Fri, Jun 27, 2014 at 3:45 AM, Steve Crawford <scrawford@pinpointresearch.com> wrote:
On 06/26/2014 02:29 AM, Tim Uckun wrote:
I have a use case in which the most recent data experiences a lot of transactions (inserts and updates) and then the churn kind of calms down.  Eventually the data is relatively static and will only be updated in special and sporadic events.

I was thinking about keeping the high churn data in a different table so that the vacuums on that table can go faster and the vacuums on the rest of the data will rarely be needed.

Ideally the high churn data would be in an in memory table but failing that I thought that an unlogged table would be good enough.  So now I need a way to flush the oldest data in this table to another logged table and do it in an efficient manner. I don't think partitioning is a good idea in this case because the partitions will be for small time periods (5 to 15 minutes).

Anybody have a similar problem? If so how did you solve it?

It's not at all unusual for a database to experience a high portion of its activity on a small subset of the data, so in that sense the "problem" is already solved by appropriately sizing and tuning the caching and other mechanisms already provided by the OS and DB.

Having said that, there is far too little information here to provide specific advice on tuning and other approaches that may be of use for you. In particular:

Is this currently a hypothetical question or do you have a running system that is experiencing performance problems?

In either case can you provide some parameters including rows and size of your large table(s), what you mean by "high churn" (rate of inserts, deletes, updates and how they are grouped into transactions), the size of the table that is heavily updated, how the database is used (transactional, data-mining, ...) and anything else you feel might help the list understand your use-case.

If you have a live database, any metrics/observations you can supply might be helpful. For example, are particular queries slow? What speed is required? Do you have lots of simultaneous connections or is everything through a single connection?

Finally, confirm that the data is not important or is easily recreated (implied by your in-memory/unlogged-table comments).

Cheers,
Steve



Re: Fast data, slow data

From
Shaun Thomas
Date:
On 06/26/2014 05:31 PM, Tim Uckun wrote:

> 1. If there is a very large set of data in the table that needs to be
>  moved this will be slow and might throw locks which would impact the
>  performance of the inserts and the updates.

Well, the locks would only affect the rows being moved. If this is
primarily insert-based, the only performance hit would be from the extra
IO. If you move records frequently enough, these small batches should
not be a problem.

Though I strongly suggest you build a tier library that creates a
partition for every day/week/month so the target movement tables
themselves don't get overly large.
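
As a rough sketch of that (names are hypothetical; date_col is the
timestamp column from the earlier examples), a daily child of the stable
table would look like:

CREATE TABLE my_table_stable_2014_06_26 (
    CHECK (date_col >= DATE '2014-06-26' AND date_col < DATE '2014-06-27')
) INHERITS (my_table_stable);

The CHECK constraint also lets the planner skip children that can't match
a query's date range when constraint exclusion is enabled.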

> 2. Constantly deleting large chunks of data might cause vacuum problems.

This goes back to PostgreSQL and MVCC. If your interval moves roughly
the same amount of data each time, the table will no longer grow past
that point. In the movement function, just have it run a manual
vacuum/analyze, which is not restricted by autovacuum rules. It'll use
more IO, but will finish faster and keep growth in check more consistently.
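
One wrinkle: VACUUM can't run inside a transaction block (and therefore
not inside a plain function), so one way to get both steps is a small psql
script run from cron. A sketch, reusing the placeholder names from earlier
in the thread:

-- move_hot_rows.sql (hypothetical file name), run with something like:
--   psql -d mydb -f move_hot_rows.sql
WITH deleted AS (
    DELETE FROM ONLY my_table
    WHERE date_col <= now() - INTERVAL '15 minutes'
    RETURNING *
)
INSERT INTO my_table_stable
SELECT * FROM deleted;

-- psql autocommits each statement, so by the time this runs the move has
-- committed and VACUUM is outside any transaction block.
VACUUM ANALYZE my_table;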

I wrote a tier library that does exactly this, and I'm working on getting
permission to open source it. It supports tier granularity down to 1 day,
and will age out old data into the partitions beyond that threshold. You
could probably tool something similar to be more aggressive if you want
to keep that 15-minute (or shorter) expiration interval.

I haven't used it myself, but pg_partman might have what you need:

http://pgxn.org/dist/pg_partman/doc/pg_partman.html

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: Fast data, slow data

From
David G Johnston
Date:
Tim Uckun wrote
> The database is functioning fine now but I am anticipating a much higher
> workload in the future.  The table in question is probably going to have a
> few million rows per day inserted into it when it gets busy, if it gets
> very busy it might be in the tens of millions per day but that's
> speculation at this point.
>
> I don't want to say that the data is not important but if I drop one or
> two
> sensor readings it's not going to be the end of the world.

If the server crashes immediately before, or even during, the movement
process, you are going to lose every record - apparently on the order of
tens of thousands - on the unlogged table.


I am not certain that your stated fear/goal is relevant:

"I was thinking about keeping the high churn data in a different table so
that the vacuums on that table can go faster and the vacuums on the rest of
the data will rarely be needed."

http://www.postgresql.org/docs/9.2/interactive/routine-vacuuming.html#VACUUM-BASICS

Section 23.1.4 Updating the Visibility Map

In terms of caching, both indexes and data are placed into memory on a page
basis; anything that is not in use will remain dormant.  There will be some
impact, but whether it is enough to justify the added complexity of partitions
and/or archive tables is use-specific.  The ability to move those tables to
other tablespaces is solid, as is the ability to simply drop a partition to
remove said data from the database - but neither ability is useful in all
situations.
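
For what it's worth, with inheritance-based partitions the "drop the
partition" route is just detaching and dropping the child table; a sketch
with a hypothetical child name:

ALTER TABLE my_table_stable_2014_06_26 NO INHERIT my_table_stable;
DROP TABLE my_table_stable_2014_06_26;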





