Re: using a postgres table as a multi-writer multi-updater queue - Mailing list pgsql-general

From Steve Petrie, P.Eng.
Subject Re: using a postgres table as a multi-writer multi-updater queue
Date
Msg-id E31490052E604F84AFEEE805EB9FB4C1@Dell
Whole thread Raw
In response to using a postgres table as a multi-writer multi-updater queue  (Chris Withers <chris@simplistix.co.uk>)
Responses Re: using a postgres table as a multi-writer multi-updater queue  (Alban Hertroys <haramrae@gmail.com>)
List pgsql-general
Thanks to George and Adrian for their latest responses regarding a DELETE /
AUTOVACUUM / VACUUM approach, to recycling disk space allocated for rows in
a
postgres table, that is used to manage sessions (e.g. HTTP sessions with
visitor web browsers).

My comments are below. In them, I propose an alternative strategy, using a
partial index, instead of a DELETE /
AUTOVACUUM / VACUUM approach.

The comments are a bit lengthy, but I hope forum members will consider them,
and maybe help lead to a breakthrough in using a postgres table for
high-performance session management.

"George Neuner" <gneuner2@comcast.net> wrote in message
news:ukdm5b1ni7lv393coa71vf8d0i1mi0phuh@4ax.com...
> On Sun, 29 Nov 2015 05:02:58 -0500, "Steve Petrie, P.Eng."
> <apetrie@aspetrie.net> wrote:
>
>>I should have clarified, that I use the the term "fuzzy" to refer to the
>>probability mechanism, that hooks a small burst of session row deletion
>>activity, to each one of a randomly-selected portion (presently 1/5) of
>>the
>>HTTP requests that cause a new session row to be INSERTed into the session
>>table.
>>
>>This means that on average, only every 5th HTTP request that creates a new
>>session row, will also incur the session deletion workload. When the
>>session
>>row deletion process occurs, its (aggressive) limit for deletion workload
>>is
>>2X as many expired rows as needed on average, to keep up with the rate of
>>session row creation (so the 2X DELETE limit presently == 10 rows).
>>
>>The idea is to make the process of DELETing expired session rows,
>>automatically scale its activity, to closely and aggressively match the
>>rate
>>of session row creation.
>
> There's nothing really wrong with that, but I wouldn't do it that way
> ... I would bulk delete old records from a separate scheduled task.
>
> Another way to do it would be to have each new session delete exactly
> one old session.  1:1 scales perfectly and spreads the delete load
> evenly across all users.
>
> Not that deleting a handful of records is a lengthy process, but it
> seems unfair to burden some users with it but not others.  I would
> burden (or not) everyone equally.
>
>
>>A heavy burst of new sessions being created will
>>work proportionately more aggressively to DELETE expired session rows.
>>This
>>way, the (HTTP request-driven) PHP app will be self-tuning its own session
>>table space recycling. And there is no process (e.g. cron-driven),
>>external to the PHP app itself, that is doing session row deletion.
>>
>>Based on what I've learned from this forum (but before I have studied
>>AUTOVACUUM in any detail) my thinking is to include an AUTOVACUUM command
>>(outside of any SQL transaction block) in the HTTP request-driven PHP app,
>>immediately following any time the PHP app completes a session row DELETE
>>command.
>>
>>Or maybe the AUTOVACUUM request should occur less frequently?
>
> Reducing the frequency will result in a larger table space on disk.
> Insertions are made at the end of the table so the table keeps growing
> in size regardless of deletions until (some kind of) vacuum is run.
>
> Autovacuum doesn't shrink the table space on disk, it merely compacts
> the table's live data so that any free space is at the end.
>
> If you want to tightly control the growth of the table space, you need
> to run autovacuum _more_ often, not less.
>

Before I start in to implement a DELETE / AUTOVACUUM / VACUUM approach, to
recycling disk space used for a session management table, I would like to
propose, for consideration by this forum, an idea for a different approach.

A row in a session management table, represents a significant "sunk cost" in
both computing time used to create the row, and in disk space allocated.
Postgres has to use a lot of resources to create that row in the first
place.

When the session that originally caused that row to be allocated, eventually
expires -- why delete the associated session managent row ??

Instead of using a DELETE command to destroy the row (and a
resource-intensive AUTOVACUUM / VACUUM process to reclaim the storage
space), why not instead, simply mark that session management row as "free"
(with an UPDATE command) ??

In brief -- Why not just leave the expired session management row allocated
in place, already "hot" and ready to be reused for a new session ??

But -- when the app needs to initiate a new session, it must be able quickly
to: 1. know if there are any "free" rows available for re-use, and if so 2.
select and allocate a specific "free" row for re-use, (3. and if no "free"
row is available, then the app will INSERT a new session row)

* * *
* * *

Would the postgres Partial Index facility be useful here?

Here are quotes from the excellent postgres documentation (9.3.5):

"11.8. Partial Indexes

A partial index is an index built over a subset of a table; the subset is
defined by a conditional expression (called the predicate of the partial
index). The index contains entries only for those table rows that satisfy
the predicate. ... This reduces the size of the index, which will speed up
those queries that do use the index. It will also speed up many table update
operations because the index does not need to be updated in all cases."

"11.2. Index Types

PostgreSQL provides several index types: B-tree ... By default, the CREATE
INDEX command creates B-tree indexes, which fit the most common situations.

B-trees can handle equality and range queries on data that can be sorted
into some ordering. In particular, the PostgreSQL query planner will
consider using a B-tree index whenever an indexed column is involved in a
comparison using one of these operators:

<
<=
=
>=
>

... Also, an IS NULL or IS NOT NULL condition on an index column can be used
with a B-tree index."

* * *
* * *

In the case of my particular app, I propose to add to the session management
table eto_sql_tb_session_www:

1. a new int column session_www_free, and
2. a partial (B-tree) index -- session_www_free_ndx -- on the new column.

CREATE DATABASE eto_sql_db WITH ENCODING 'UTF8';

CREATE SCHEMA its_eto;

CREATE TABLE its_eto.eto_sql_tb_session_www
(
  session_www_code       char(32)      NOT NULL UNIQUE PRIMARY KEY,

  session_www_type       int           NOT NULL,
  session_www_state      int           NOT NULL,
  session_verify_code    char(7)       NOT NULL,

  session_www_serno      SERIAL        NOT NULL UNIQUE,
  session_www_free      int         NULL,

  session_target_serno   int           NULL,
  session_target_data    varchar(1000) NULL,

  session_www_init_utc   timestamp     NOT NULL,
  session_www_last_utc   timestamp     NOT NULL,
  session_www_expiry_utc timestamp     NOT NULL,
  session_www_delete_utc timestamp     NOT NULL,
  session_www_hit_count  int           NOT NULL,
  session_www_act_seqno  int           NULL

);

CREATE INDEX session_www_serno_ndx ON its_eto.eto_sql_tb_session_www
(session_www_serno);

CREATE INDEX session_www_free_ndx ON its_eto.eto_sql_tb_session_www
(session_www_free)
  WHERE session_www_free IS NOT NULL;

* * *
* * *

The app starts with an empty table eto_sql_tb_session_www that has empty
indexes.

When the app goes to initiate a new session, it checks for availability of a
"free" session row, using a command like:

SELECT session_www_code, session_www_serno, ... FROM eto_sql_tb_session_www
  WHERE ((session_www_free IS NOT NULL)) LIMIT 1 FOR UPDATE.

(I am presuming that the postgres query planner will decide to use the
session_www_free_ndx partial index, to very quickly satisfy this query.)

If a "free" session row is found (e.g. with session_www_code ==
'F032...40D9'), the app will allocate that row to the new session, with a
command like:

UPDATE eto_sql_tb_session_www
  SET (..., session_www_free = NULL, ...)
  WHERE (session_www_code = 'F032...40D9')

When performing the UPDATE command, postgres will (MUST !!) remove the entry
for that row, from the session_www_free_ndx partial index, because that
partial index excludes rows WHERE column session_www_free IS NULL.

However, when the app goes to initiate a new session, if there are no
entries in the session_www_free_ndx partial index, the app then creates a
new session row with a unique (MD5) session_www_code value (e.g.
'26AD...50C9'):

INSERT INTO eto_sql_tb_session_www
  (session_www_code, ..., session_www_free, ...)
  VALUES ('26AD...50C9', ..., NULL, ...))

In that INSERT command, the value of the session_www_free column is set to
NULL, so postgres will not (MUST NOT!!)create an entry for that row in the
session_www_free_ndx partial index.

When a session (e.g. with session_www_code == '26AD...50C9') expires, the
app UPDATEs the corresponding session row:

UPDATE eto_sql_tb_session_www
  SET (..., session_www_free = session_www_serno, ...)
  WHERE (session_www_code = '26AD...50C9')

Since the UPDATEd value of column session_www_free IS NOT NULL, postgres
will (MUST!!) create an entry for that row in the session_www_free_ndx
partial index.

Every row in table eto_sql_tb_session_www always has a UNIQUE value for
column session_www_serno, so there can never an attempt to create a
duplicate key value in the session_www_free_ndx partial index.

* * *
* * *

The app uses the 32-character (MD5) key value in column session_www_code, as
the session ID parameter in dynamically-generated HTML pages. So an HTTP
request, generated for that session by a web browser from that page, can be
associated by the app with the correct unique row in table
eto_sql_tb_session_www.

To protect itself from processing an HTTP request coming from an outdated
HTML page, the app uses an additional 6-character hash value, generated from
two columns in table eto_sql_tb_session_www:

session_www_init_utc timestamp
session_www_act_seqno  int

The session_www_init_utc column gets a new timestamp value, every time the
session row is allocated to a new session. Then the timestamp value stays
the same in that column for the duration of that session.

The session_www_act_seqno column int value starts == 0, and simply
increments by 1 with every UPDATE to that session row. A session row gets
updated every time the app receives an HTTP request for that row. The app
will wrap the value in the row back to zero, if it ever reaches its maximum
possible value.

This 6-character hash value, is a second parameter, provided in every
dynamically-generated HTML page.

So there should be no logical problem, with reusing the same session row,
with the same 32-character session_www_code value, for more than one session
(but only ever for one session at a time).

* * *
* * *

Given the above proposed approach, the number of rows in table
eto_sql_tb_session_www will grow as large as necessary to service the
maximum number of concurrent sessions required.

But the app can reuse rows of expired sessions, again and again, instead of
DELETing every row after it is used used for only one session (and also
periodically AUTOVACUUMing or VACUUMing table eto_sql_tb_session_www to
reclaim the storage space).

Suppose there is a large burst of session activity, and table
eto_sql_tb_session_www grows in size to e.g. 100,000 "active" session rows.
Eventually, when system activity slows down (or ceases) again, most (all) of
the "active" rows will become "free" and available for reuse during the next
burst of activity.

And as the system gets busier again, the session_www_free_ndx partial index
will get smaller (because NULL values are being excluded), so postgres
should need to use less resources to search this partial index to find a
"free" session row for reuse. Use less resources, at the most advantageous
time, when the system is busy.

* * *
* * *

In the case of my app, I do still plan to take the postgres server offline
once per day, for a maintenance window of a few minutes. But instead of just
forcing the postgres server offline at an arbitrary time every day, the app
could analyze counts of entries in the two indexes:

session_www_serno_ndx
session_www_free_ndx

to detect when there are not too many active sessions, so postgres can be
taken offline with minimum inconvenience to website visitors.

During the brief offline maintenance window, the app can allocate a fresh
empty eto_sql_tb_session_www table, and copy over to it from the old
eto_sql_tb_session_www table, only rows of active sessions.

As a result of this table copy operation, column session_www_serno SERIAL
numbers will start again at 1, but this is no problem, because this SERIAL
number is only ever used as a key by the app, for the duration of a single
HTTP request, once the (MD5) key value in the HTTP request has been used to
find the row (by the session_www_code  PRIMARY KEY). The value of the
session_www_code column won't get changed, when a row is copied across to
the fresh eto_sql_tb_session_www table.

* * *
* * *

For this approach to succeed, it will be important to ensure that, once
storage space for a row in table eto_sql_tb_session_www has been allocated,
postgres is in fact going to be able to reuse the same storage space for
that row, even as values in that row get changed.

I would expect this proviso to be more likely correct for columns containing
fixed-length values (e.g. char(32), int, char(7), SERIAL, timestamp).

But what about the session_target_data varchar(1000) NULL column in table
eto_sql_tb_session_www ??

Here are some snippets from the postgres doc (9.3.5):

3. Character Types

"... Values of type character are physically padded with spaces to the
specified width n, and are stored and displayed that way."

"... The storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string, which includes the space padding in the case of
character. Longer strings have 4 bytes of overhead instead of 1. Long
strings are compressed by the system automatically, so the physical
requirement on disk might be less."

"... There is no performance difference among these three types, apart from
increased storage space when using the blank-padded type, and a few extra
CPU cycles to check the length when storing into a length-constrained
column. While character(n) has performance advantages in some other database
systems, there is no such advantage in PostgreSQL; in fact character(n) is
usually the slowest of the three because of its additional storage costs. In
most situations text or character varying should be used instead."

Perhaps someone with postgres internals knolwledge could advise, about how
much storage space reuse, the proposed partial index approach might actually
accomplish, with table eto_sql_tb_session_www ... ??

* * *
* * *

If forum members think this approach, of using a partial index might work,
to
implement an efficient session row reuse strategy, I will proceed to try it.

> George
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Steve Petrie, P.Eng.

ITS-ETO Consortium
Oakville, Ontario, Canada
apetrie@aspetrie.net



pgsql-general by date:

Previous
From: Torsten Förtsch
Date:
Subject: Re: Pgbouncer
Next
From: Alban Hertroys
Date:
Subject: Re: using a postgres table as a multi-writer multi-updater queue