I LIKE BEING ABLE TO POINT THE BAZOOKA AT MY FOOT! I'M WORKING ON A DOUBLE-BARRELLED BAZOOKA SO I CAN POINT IT AT *BOTH* FEET!! ;)
:)
I've spent years, on and off, worrying about system monitoring. Years ago I made available a giant Python system called Mom (v3) that no one but me could use; its publish-subscribe mechanism required learning a tiny matching language.
I had three successes with Mom.v3 that I really think deserve consideration in any future monitoring system. The first, a minor success of sorts, was that, because I wanted to produce statistical models of a bunch of system measures, I now have years and years of collected data to test new algorithms on. RRD or similar graphs are a loss for anything except visualization. The second, an unalloyed success, was producing an algorithm that told me when my users were filling up a disk partition *before* thresholds were crossed. [1]
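(Just to make "before thresholds were crossed" concrete, here is a crude stand-in - emphatically not the algorithm written up in [1] - that fits a line to recent free-space samples and extrapolates to when the partition would hit zero.)

def hours_until_full(samples):
    """samples: list of (hours_relative_to_now, kb_free) pairs, oldest first."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    # least-squares slope of free space vs. time
    slope = (sum((t - mean_t) * (f - mean_f) for t, f in samples) /
             sum((t - mean_t) ** 2 for t, _ in samples))
    if slope >= 0:
        return None                        # not shrinking; nothing to predict
    latest_t, latest_f = samples[-1]
    return latest_t - latest_f / slope     # hours from "now" (t == 0)

# e.g. hours_until_full([(-3, 900), (-2, 600), (-1, 300), (0, 150)])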
The final huge win of Mom.v3 was that difficult publish-subscribe engine. The landscape is full of single-purpose monitoring tools that don't interact very well outside of their own little worlds. If, instead, you center a monitoring system around a communications protocol (a not-too-difficult one, hopefully), then you can plug in whatever you want. Full-contact, improvisational bazooka juggling can then be indulged in without necessarily endangering the stability of the main system. An example might be useful.
The Life Cycle of a Disk Sample in Mom.v3
A dumb data-collecting agent running locally runs 'df' and yanks out the relevant numbers. The data is packaged up into a network Message format (a collection of property-value pairs, including the agent type, the data points, a timestamp, etc.) and passed off to a forwarder agent running on the same host. That agent is the only program running locally that knows enough to speak to the central publish-subscribe system, called the Kiosk. If the forwarder cannot talk to the Kiosk, it caches messages; otherwise it opens an authenticated, encrypted socket to the Kiosk.
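If it helps to see the shape of that, here is a rough Python sketch of the sort of thing the collector does. This is not Mom.v3's actual code; the field names and the 'df -P -k' parsing are just placeholders for whatever the real agent yanks out.

import subprocess, time

def collect_disk_samples():
    # Run df and turn each partition line into a little bundle of
    # property-value pairs; these would be handed to the local forwarder.
    out = subprocess.run(['df', '-P', '-k'], capture_output=True,
                         text=True, check=True).stdout
    messages = []
    for line in out.splitlines()[1:]:              # skip the header line
        fs, blocks, used, avail, pct, mount = line.split(None, 5)
        messages.append({
            'agent':     'disk',
            'host':      'example-host',           # placeholder
            'timestamp': time.time(),
            'partition': mount,
            'kb_total':  int(blocks),
            'kb_used':   int(used),
            'kb_avail':  int(avail),
        })
    return messages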
Once the message is accepted by the Kiosk and a receipt goes back to the forwarder, the new message is shoved into a processing queue. Now the entire set of subscription rules is run against the new message. A simple subscription rule might look like this:
agent == 'disk'
a trickier one:
DEFINED class AND class == 'security' AND DEFINED message
These property-checking subscriptions are paired with data sinks I called 'transports' in Mom.v3. In the case of this disk agent, we have two transports attached to the subscription. The first shunts the disk-use sample into a log file for later grovelling over. The second sends the sample into the diskwatcher transport, which keeps enough disk samples around to run the impending-disk-doom algorithm. That transport then *adds another message to the queue* with its analysis. You might attach this subscription to a transport that sends out email or a page:
agent == 'diskwatcher' AND class == 'notification' AND degree in 3 4
So, if the disk appears to be filling up fast, you'll hear about it.
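To show the routing without trying to reproduce the matching language, here is a toy Python version of the cascade: plain functions stand in for the subscription rules, and transports are just callables that may return messages to reinject. None of the names below are Mom.v3's; it's only the shape of the thing.

from collections import deque

SUBSCRIPTIONS = []                        # (rule, transports) pairs

def subscribe(rule, *transports):
    SUBSCRIPTIONS.append((rule, transports))

def log_transport(msg):                   # stand-in for the log-file sink
    print('LOG', msg)

def diskwatcher_transport(msg):           # stand-in for the analysis sink
    # pretend the impending-doom analysis fired; reinject a notification
    return [{'agent': 'diskwatcher', 'class': 'notification',
             'degree': 4, 'partition': msg.get('partition')}]

def pager_transport(msg):                 # stand-in for email/page
    print('PAGE A HUMAN:', msg)

subscribe(lambda m: m.get('agent') == 'disk',
          log_transport, diskwatcher_transport)
subscribe(lambda m: (m.get('agent') == 'diskwatcher'
                     and m.get('class') == 'notification'
                     and m.get('degree') in (3, 4)),
          pager_transport)

def process(queue):                       # the Kiosk's processing loop
    while queue:
        msg = queue.popleft()
        for rule, transports in SUBSCRIPTIONS:
            if rule(msg):
                for transport in transports:
                    queue.extend(transport(msg) or ())

process(deque([{'agent': 'disk', 'partition': '/scratch', 'kb_avail': 12}]))

Run that and the single disk sample fans out into a log line, a reinjected notification, and a page, which is the whole cascade in miniature.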
Now, in Mom.v3 rather too much of this message cascade happened in the same process. If an analysis transport went bad, it could muck up the entire Kiosk. Fortunately Python has enough introspective abilities that I could deactivate really badly behaving transports, but this isn't ideal. This publish-subscribe message routing really was incredibly powerful - I had correlation engines, time series models, logs, database sinks, etc. A single data sample message could result in a half-dozen messages being reinjected into the system. But because so much ran in the same process, there were certain things I couldn't try out live. So my current focus - when I have time to code on this - is to generalize a monitoring protocol that'll let me plug in some experimental analysis engine without endangering the other parts that are working correctly. It would also permit different ways of accessing the data, so we're not forced to pretend system monitoring maps well to web pages.
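Nothing like the following exists yet; it's only meant to make concrete what I mean by putting an experimental transport behind a protocol instead of inside the Kiosk's process. Matching messages go out as JSON lines to an external program (the command name below is made up), and if that program blows up, the Kiosk shrugs and keeps routing.

import json, subprocess

class ExternalTransport:
    """Ship matching messages to an external process as JSON lines."""
    def __init__(self, argv):
        self.argv = argv
        self.proc = None

    def _ensure_running(self):
        # (re)start the external process if it isn't running
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(self.argv, stdin=subprocess.PIPE,
                                         text=True)

    def __call__(self, message):
        try:
            self._ensure_running()
            self.proc.stdin.write(json.dumps(message) + '\n')
            self.proc.stdin.flush()
        except OSError:
            pass          # the experiment blew up; the Kiosk carries on

# Plugged into the toy subscribe() above, it might look like:
# subscribe(lambda m: m.get('agent') == 'disk',
#           ExternalTransport(['python', 'my_experimental_engine.py']))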
I'm thinking aloud here. A few weeks before Ingvar announced his common-lisp.net project and this mailing list, I was thinking of contacting him and Chun Tian (the author of cl-net-snmp) to see if they thought we should create a "Common Lisp and Monitoring" mailing list to discuss our separate projects and share what works and what doesn't.
-- wm
[1] http://www.biostat.wisc.edu/~annis/granny/notes/impending-doom.html - lisp code available if anyone is curious