Re: using a postgres table as a multi-writer multi-updater queue - Mailing list pgsql-general
From | Steve Petrie, P.Eng. |
---|---|
Subject | Re: using a postgres table as a multi-writer multi-updater queue |
Date | |
Msg-id | E31490052E604F84AFEEE805EB9FB4C1@Dell Whole thread Raw |
In response to | using a postgres table as a multi-writer multi-updater queue (Chris Withers <chris@simplistix.co.uk>) |
Responses |
Re: using a postgres table as a multi-writer multi-updater queue
|
List | pgsql-general |
Thanks to George and Adrian for their latest responses regarding a DELETE / AUTOVACUUM / VACUUM approach, to recycling disk space allocated for rows in a postgres table, that is used to manage sessions (e.g. HTTP sessions with visitor web browsers). My comments are below. In them, I propose an alternative strategy, using a partial index, instead of a DELETE / AUTOVACUUM / VACUUM approach. The comments are a bit lengthy, but I hope forum members will consider them, and maybe help lead to a breakthrough in using a postgres table for high-performance session management. "George Neuner" <gneuner2@comcast.net> wrote in message news:ukdm5b1ni7lv393coa71vf8d0i1mi0phuh@4ax.com... > On Sun, 29 Nov 2015 05:02:58 -0500, "Steve Petrie, P.Eng." > <apetrie@aspetrie.net> wrote: > >>I should have clarified, that I use the the term "fuzzy" to refer to the >>probability mechanism, that hooks a small burst of session row deletion >>activity, to each one of a randomly-selected portion (presently 1/5) of >>the >>HTTP requests that cause a new session row to be INSERTed into the session >>table. >> >>This means that on average, only every 5th HTTP request that creates a new >>session row, will also incur the session deletion workload. When the >>session >>row deletion process occurs, its (aggressive) limit for deletion workload >>is >>2X as many expired rows as needed on average, to keep up with the rate of >>session row creation (so the 2X DELETE limit presently == 10 rows). >> >>The idea is to make the process of DELETing expired session rows, >>automatically scale its activity, to closely and aggressively match the >>rate >>of session row creation. > > There's nothing really wrong with that, but I wouldn't do it that way > ... I would bulk delete old records from a separate scheduled task. > > Another way to do it would be to have each new session delete exactly > one old session. 1:1 scales perfectly and spreads the delete load > evenly across all users. > > Not that deleting a handful of records is a lengthy process, but it > seems unfair to burden some users with it but not others. I would > burden (or not) everyone equally. > > >>A heavy burst of new sessions being created will >>work proportionately more aggressively to DELETE expired session rows. >>This >>way, the (HTTP request-driven) PHP app will be self-tuning its own session >>table space recycling. And there is no process (e.g. cron-driven), >>external to the PHP app itself, that is doing session row deletion. >> >>Based on what I've learned from this forum (but before I have studied >>AUTOVACUUM in any detail) my thinking is to include an AUTOVACUUM command >>(outside of any SQL transaction block) in the HTTP request-driven PHP app, >>immediately following any time the PHP app completes a session row DELETE >>command. >> >>Or maybe the AUTOVACUUM request should occur less frequently? > > Reducing the frequency will result in a larger table space on disk. > Insertions are made at the end of the table so the table keeps growing > in size regardless of deletions until (some kind of) vacuum is run. > > Autovacuum doesn't shrink the table space on disk, it merely compacts > the table's live data so that any free space is at the end. > > If you want to tightly control the growth of the table space, you need > to run autovacuum _more_ often, not less. > Before I start in to implement a DELETE / AUTOVACUUM / VACUUM approach, to recycling disk space used for a session management table, I would like to propose, for consideration by this forum, an idea for a different approach. A row in a session management table, represents a significant "sunk cost" in both computing time used to create the row, and in disk space allocated. Postgres has to use a lot of resources to create that row in the first place. When the session that originally caused that row to be allocated, eventually expires -- why delete the associated session managent row ?? Instead of using a DELETE command to destroy the row (and a resource-intensive AUTOVACUUM / VACUUM process to reclaim the storage space), why not instead, simply mark that session management row as "free" (with an UPDATE command) ?? In brief -- Why not just leave the expired session management row allocated in place, already "hot" and ready to be reused for a new session ?? But -- when the app needs to initiate a new session, it must be able quickly to: 1. know if there are any "free" rows available for re-use, and if so 2. select and allocate a specific "free" row for re-use, (3. and if no "free" row is available, then the app will INSERT a new session row) * * * * * * Would the postgres Partial Index facility be useful here? Here are quotes from the excellent postgres documentation (9.3.5): "11.8. Partial Indexes A partial index is an index built over a subset of a table; the subset is defined by a conditional expression (called the predicate of the partial index). The index contains entries only for those table rows that satisfy the predicate. ... This reduces the size of the index, which will speed up those queries that do use the index. It will also speed up many table update operations because the index does not need to be updated in all cases." "11.2. Index Types PostgreSQL provides several index types: B-tree ... By default, the CREATE INDEX command creates B-tree indexes, which fit the most common situations. B-trees can handle equality and range queries on data that can be sorted into some ordering. In particular, the PostgreSQL query planner will consider using a B-tree index whenever an indexed column is involved in a comparison using one of these operators: < <= = >= > ... Also, an IS NULL or IS NOT NULL condition on an index column can be used with a B-tree index." * * * * * * In the case of my particular app, I propose to add to the session management table eto_sql_tb_session_www: 1. a new int column session_www_free, and 2. a partial (B-tree) index -- session_www_free_ndx -- on the new column. CREATE DATABASE eto_sql_db WITH ENCODING 'UTF8'; CREATE SCHEMA its_eto; CREATE TABLE its_eto.eto_sql_tb_session_www ( session_www_code char(32) NOT NULL UNIQUE PRIMARY KEY, session_www_type int NOT NULL, session_www_state int NOT NULL, session_verify_code char(7) NOT NULL, session_www_serno SERIAL NOT NULL UNIQUE, session_www_free int NULL, session_target_serno int NULL, session_target_data varchar(1000) NULL, session_www_init_utc timestamp NOT NULL, session_www_last_utc timestamp NOT NULL, session_www_expiry_utc timestamp NOT NULL, session_www_delete_utc timestamp NOT NULL, session_www_hit_count int NOT NULL, session_www_act_seqno int NULL ); CREATE INDEX session_www_serno_ndx ON its_eto.eto_sql_tb_session_www (session_www_serno); CREATE INDEX session_www_free_ndx ON its_eto.eto_sql_tb_session_www (session_www_free) WHERE session_www_free IS NOT NULL; * * * * * * The app starts with an empty table eto_sql_tb_session_www that has empty indexes. When the app goes to initiate a new session, it checks for availability of a "free" session row, using a command like: SELECT session_www_code, session_www_serno, ... FROM eto_sql_tb_session_www WHERE ((session_www_free IS NOT NULL)) LIMIT 1 FOR UPDATE. (I am presuming that the postgres query planner will decide to use the session_www_free_ndx partial index, to very quickly satisfy this query.) If a "free" session row is found (e.g. with session_www_code == 'F032...40D9'), the app will allocate that row to the new session, with a command like: UPDATE eto_sql_tb_session_www SET (..., session_www_free = NULL, ...) WHERE (session_www_code = 'F032...40D9') When performing the UPDATE command, postgres will (MUST !!) remove the entry for that row, from the session_www_free_ndx partial index, because that partial index excludes rows WHERE column session_www_free IS NULL. However, when the app goes to initiate a new session, if there are no entries in the session_www_free_ndx partial index, the app then creates a new session row with a unique (MD5) session_www_code value (e.g. '26AD...50C9'): INSERT INTO eto_sql_tb_session_www (session_www_code, ..., session_www_free, ...) VALUES ('26AD...50C9', ..., NULL, ...)) In that INSERT command, the value of the session_www_free column is set to NULL, so postgres will not (MUST NOT!!)create an entry for that row in the session_www_free_ndx partial index. When a session (e.g. with session_www_code == '26AD...50C9') expires, the app UPDATEs the corresponding session row: UPDATE eto_sql_tb_session_www SET (..., session_www_free = session_www_serno, ...) WHERE (session_www_code = '26AD...50C9') Since the UPDATEd value of column session_www_free IS NOT NULL, postgres will (MUST!!) create an entry for that row in the session_www_free_ndx partial index. Every row in table eto_sql_tb_session_www always has a UNIQUE value for column session_www_serno, so there can never an attempt to create a duplicate key value in the session_www_free_ndx partial index. * * * * * * The app uses the 32-character (MD5) key value in column session_www_code, as the session ID parameter in dynamically-generated HTML pages. So an HTTP request, generated for that session by a web browser from that page, can be associated by the app with the correct unique row in table eto_sql_tb_session_www. To protect itself from processing an HTTP request coming from an outdated HTML page, the app uses an additional 6-character hash value, generated from two columns in table eto_sql_tb_session_www: session_www_init_utc timestamp session_www_act_seqno int The session_www_init_utc column gets a new timestamp value, every time the session row is allocated to a new session. Then the timestamp value stays the same in that column for the duration of that session. The session_www_act_seqno column int value starts == 0, and simply increments by 1 with every UPDATE to that session row. A session row gets updated every time the app receives an HTTP request for that row. The app will wrap the value in the row back to zero, if it ever reaches its maximum possible value. This 6-character hash value, is a second parameter, provided in every dynamically-generated HTML page. So there should be no logical problem, with reusing the same session row, with the same 32-character session_www_code value, for more than one session (but only ever for one session at a time). * * * * * * Given the above proposed approach, the number of rows in table eto_sql_tb_session_www will grow as large as necessary to service the maximum number of concurrent sessions required. But the app can reuse rows of expired sessions, again and again, instead of DELETing every row after it is used used for only one session (and also periodically AUTOVACUUMing or VACUUMing table eto_sql_tb_session_www to reclaim the storage space). Suppose there is a large burst of session activity, and table eto_sql_tb_session_www grows in size to e.g. 100,000 "active" session rows. Eventually, when system activity slows down (or ceases) again, most (all) of the "active" rows will become "free" and available for reuse during the next burst of activity. And as the system gets busier again, the session_www_free_ndx partial index will get smaller (because NULL values are being excluded), so postgres should need to use less resources to search this partial index to find a "free" session row for reuse. Use less resources, at the most advantageous time, when the system is busy. * * * * * * In the case of my app, I do still plan to take the postgres server offline once per day, for a maintenance window of a few minutes. But instead of just forcing the postgres server offline at an arbitrary time every day, the app could analyze counts of entries in the two indexes: session_www_serno_ndx session_www_free_ndx to detect when there are not too many active sessions, so postgres can be taken offline with minimum inconvenience to website visitors. During the brief offline maintenance window, the app can allocate a fresh empty eto_sql_tb_session_www table, and copy over to it from the old eto_sql_tb_session_www table, only rows of active sessions. As a result of this table copy operation, column session_www_serno SERIAL numbers will start again at 1, but this is no problem, because this SERIAL number is only ever used as a key by the app, for the duration of a single HTTP request, once the (MD5) key value in the HTTP request has been used to find the row (by the session_www_code PRIMARY KEY). The value of the session_www_code column won't get changed, when a row is copied across to the fresh eto_sql_tb_session_www table. * * * * * * For this approach to succeed, it will be important to ensure that, once storage space for a row in table eto_sql_tb_session_www has been allocated, postgres is in fact going to be able to reuse the same storage space for that row, even as values in that row get changed. I would expect this proviso to be more likely correct for columns containing fixed-length values (e.g. char(32), int, char(7), SERIAL, timestamp). But what about the session_target_data varchar(1000) NULL column in table eto_sql_tb_session_www ?? Here are some snippets from the postgres doc (9.3.5): 3. Character Types "... Values of type character are physically padded with spaces to the specified width n, and are stored and displayed that way." "... The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less." "... There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead." Perhaps someone with postgres internals knolwledge could advise, about how much storage space reuse, the proposed partial index approach might actually accomplish, with table eto_sql_tb_session_www ... ?? * * * * * * If forum members think this approach, of using a partial index might work, to implement an efficient session row reuse strategy, I will proceed to try it. > George > > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general Steve Petrie, P.Eng. ITS-ETO Consortium Oakville, Ontario, Canada apetrie@aspetrie.net
pgsql-general by date: