Re: background triggers? - Mailing list pgsql-general
From | Rafal Pietrak |
---|---|
Subject | Re: background triggers? |
Date | |
Msg-id | 1148453120.20217.102.camel@model.home.waw.pl Whole thread Raw |
In response to | Re: background triggers? (Kenneth Downs <ken@secdat.com>) |
Responses |
Re: background triggers?
Re: background triggers? |
List | pgsql-general |
On Tue, 2006-05-23 at 13:01 -0400, Kenneth Downs wrote: > Rafal Pietrak wrote: > >some other INSERT, return imediately if so, but turn into background for > >a long-lasting job if not". > > > > > > > Rafal, I'm wondering why you want to do this. You may be "fighting the > framework". Yes, most probably. I'm afraid of that :( Still... > If you are trying to do something that is totally unsupported, it is > probably for a pretty good reason, usually dealing with security or data > loss. You can probably get what you want by supported methods, but it > may require looking at the problem in a different way. > > What is it you are trying to accomplish? Is it just performance? OK. here is a 'real life' example. It works more like a post office. Now and then, there arrive a 'delivery man' with a bunch of post to deliver. Post office takes the batch, checks and stamps each and every item, and hands over a receipt. But the actual bin-ing (into delivery channels) and routing (dispatch) is handled without the delivery man standing and waiting for the process to end. In my *real*life* case I have a file with hundreds of tousends of 'transactions' uploaded by operator (several times times a day, and in fact, by 10-30 operators) - those are checked at INSERT time. But after that check and INSERT, I'd like to say to the operator: "OK, jour job is done, don't warry about the rest". But there is more work to do with the batch. I have to 'route the transactions' to their relevant accounts, and see how those change the 'status' of those accounts, consequently, store the updated status within the account itself. This is tedious and time consuming. But it have to be done, because 'customers' query account status for those 'agregate status information' and it would be *very* haevy for the database if those queries required browsing of the entire 'transaction log'. Number of 'Transactions' to number of accounts is like milions to thousends. A plain INSERT of batch takes 5-10minutes on desktop postgresql (800MHz machine, ATA disks). When I attach trigger (*Very* simple funciton) to update the accounts, the INSERT take hours (2-4). But when I make just one single update of all accounts at the end of the batch insert, it takes 20-30min. So my solution was 1) to have an additional table "TABLE dirty_log(tiem timestamp, who text)", which gets inserted a row *after* a batch of INSERTS. 2) and a 'server side process', run every 5-10 minutes, which makes accounts update, and which: A) does NOT launche when another such process is currently running; B) purges DIRTY_LOG table after it's done. This is quite obvoisly a 'user space implementation' of the 'background triggers'. I needed that. Natuaraly, having this 'bacground trigger' I loose acuracy of the account information. But I gain on system efficiency - this is engineering decision. When we have to take those (like the implementation above), it's good to have 'system tools' (like 'background triggers') that support us. But of cource I may be wrong all togather. I'd really like to know the techniq, which is 'along the line' of RDBM systems design, which serves that same purpose. Is there a better solution? NB: the 'batch INSERT' I mentioned above is done by www server. It's quite vital to have the the server process terminated (meaning: not keep it waiting for LISTEN/NOPTIFY event - meaning not use LISTEN/NOTIFY), as apache will keep the connection opened until the process ends. In 'real life', this scenario is applicable also to 'service network': 1. Say, you have 100_000 - 1000_000 vending machines (VM) network. 2. each is is loaded with c.a. 100 item types (300 types in the entire network). 3. each VM dispatches an item every 2-3 minutes. which make overall 'transaction traffic' at the level of over hundreds per second. 4. assume, that for 'customer management', you need to store quite a bit of data with each item-dispense 'transaction'. Meaning: transaction are not very light, and their details have to be kept for long time. 5. obviously, you need to manage your stock (each of the 300 item-types): you keep VM loaded and keep some stock at central store. (ATMs are a good example of such netowrk) So: 1. 'transaction traffic' is so signifficant, that you really have to 'optimise for that' 2. you don't really have to know *exactly* when you run out of stock, because each VM has signifficant local item store, so if you get notified, that a particular VM gets close to the bottom with particular item, you may dispatch a reload in 10min, but it's also OK to dispatch that in 2hours - meaning, the 'acocunt information' does not have to be 'immediately acurate'. Far more important is 'dispatch transaction' performance. 3. normally, you 'keep an eye' on you VM network - meaning, you issue a 'statistics' query quite frequently. If that was a 'haevy query' it would degrade your database performance quite signifficantly - we really need the 'agregate information' stored within 'item-accounts'. Is there a clean, 'along the framework' design that serves this reality? -R
pgsql-general by date: