Hi,
While working on turning back on all the cron jobs that we are running on cl-net, I found there's a cron job to check available disk space.
Shouldn't we be using other software, like nagios, to monitor the services? Maybe we can write plugins/task scripts to do any of the cl-net specific tasks, but having a framework like that sounds better.
Bye,
Erik.
Hi, Erik
I can take on this task by running a Lisp-based daemon that checks everything else on the server, including available disk space, but I need more time because I'm a bit busy this week and next with my day job. Considering we're all Lisp programmers and we're maintaining a Lisp site, a Lisp-based solution may be better than nagios or similar solutions.
In detail: I want to run "snmpd" and use cl-net-snmp to check the local server status (including disk space) and send alert mails when necessary. This program could also provide a simple web interface.
--binghe
Hi Chun,
Sounds great!
It'd be even better if we could run the snmp client externally, so that we're not dependent on the host being up and running well enough to be able to send out notifications...
Bye,
Erik.
Hi, Erik
I thought about this, but do we have a secondary host which can be used to monitor the current common-lisp.net?
--binghe
On 2011-5-24, at 04:46, Erik Huelsmann wrote:
Sounds great!
It'd be even better if we could run the snmp client externally, so that we're not dependent on the host being up and running well enough to be able to send out notifications...
On Sat, 2011-05-21 at 19:52 +0200, Erik Huelsmann wrote:
Hi,
While working on turning back on all the cron jobs that we are running on cl-net, I found there's a cron job to check available disk space.
Shouldn't we be using other software, like nagios, to monitor the services? Maybe we can write plugins/task scripts to do any of the cl-net specific tasks, but having a framework like that sounds better.
I think we should use nagios. It makes little sense to write new software instead of using an already written and fairly good monitor such as nagios.
I think it does not make any sense to debate make vs. download. If Binghe wants to do it, he can choose whatever tool he thinks is best. I think he has volunteered already, so please stop discussing the means unless you insist on doing the work yourself. There is plenty of work, so if you feel like doing something (as opposed to whining on the mailing list) and don't know what, just ask.
Thanks. -Hans
On Thu, Jun 9, 2011 at 6:26 PM, Stelian Ionescu sionescu@cddr.org wrote:
I think we should use nagios. It makes little sense to write new software instead of using an already written and fairly good monitor such as nagios
-- Stelian Ionescu a.k.a. fe[nl]ix Quidquid latine dictum sit, altum videtur. http://common-lisp.net/project/iolib
Hey guys,
I just installed "snmpd" on common-lisp.net and made it listen on localhost with full access. (RT #23)
I started up a Clozure CL image (hiding in GNU screen) several days ago, and now, after loading some new code into it, I can see the server's disk usage with the following command:
? (snmp:snmp-walk "localhost" "hrStorageTable")
((#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.1> 1)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.3> 3)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.6> 6)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.7> 7)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.10> 10)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.31> 31)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.32> 32)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.33> 33)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.34> 34)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.35> 35)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageIndex.36> 36)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.1> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageRam (2) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.3> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageVirtualMemory (3) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.6> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageOther (1) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.7> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageOther (1) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.10> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageVirtualMemory (3) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.31> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageFixedDisk (4) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.32> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageFixedDisk (4) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.33> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageFixedDisk (4) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.34> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageFixedDisk (4) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.35> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageFixedDisk (4) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageType.36> #<OBJECT-ID HOST-RESOURCES-TYPES::hrStorageFixedDisk (4) [0]>)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.1> "Physical memory")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.3> "Virtual memory")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.6> "Memory buffers")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.7> "Cached memory")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.10> "Swap space")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.31> "/")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.32> "/var")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.33> "/tmp")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.34> "/home")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.35> "/project")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageDescr.36> "/custom")
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.1> 1024)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.3> 1024)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.6> 1024)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.7> 1024)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.10> 1024)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.31> 4096)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.32> 4096)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.33> 4096)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.34> 4096)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.35> 4096)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageAllocationUnits.36> 4096)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.1> 2097348)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.3> 4194492)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.6> 2097348)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.7> 1140664)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.10> 2097144)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.31> 2580302)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.32> 3870460)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.33> 258022)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.34> 3870460)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.35> 5160607)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageSize.36> 3870460)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.1> 1858212)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.3> 1958300)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.6> 92940)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.7> 1140664)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.10> 100088)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.31> 273168)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.32> 1285679)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.33> 1277)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.34> 1512249)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.35> 4107480)
 (#<SIMPLE-OID HOST-RESOURCES-MIB::hrStorageUsed.36> 2238010))
(Sorry for putting such a big list in this mail.)
To compute the available space (as a percent) of "/tmp", I can use the following form:
(apply #'/ (snmp:snmp-get "localhost" (list "hrStorageUsed.33" "hrStorageSize.33")))
711/129011
(Note: the number 33 can be found by searching for "/tmp" in hrStorageDescr.*, but this is needed only once at program start; it won't change while "snmpd" is running.)
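The one-time index lookup mentioned in the note could be sketched roughly as follows. This is only an illustration under assumptions: the (index . description) pairs are a simplified, hand-copied view of the hrStorageDescr rows from the walk above, not the objects cl-net-snmp actually returns, and the names *storage-descriptions*, storage-index and storage-oid-names are made up for this example.

```lisp
;; Hypothetical, simplified view of the hrStorageDescr rows from the walk
;; above, as (index . description) pairs. Real code would have to build
;; this from the snmp-walk result; that extraction step is not shown.
(defparameter *storage-descriptions*
  '((1 . "Physical memory") (3 . "Virtual memory") (6 . "Memory buffers")
    (7 . "Cached memory") (10 . "Swap space") (31 . "/") (32 . "/var")
    (33 . "/tmp") (34 . "/home") (35 . "/project") (36 . "/custom")))

(defun storage-index (mount-point &optional (table *storage-descriptions*))
  "Return the storage index whose description equals MOUNT-POINT, or NIL."
  (car (rassoc mount-point table :test #'string=)))

(defun storage-oid-names (mount-point)
  "Return the two names to query for MOUNT-POINT, or NIL if it is unknown.
Meant to be computed once, at program start."
  (let ((index (storage-index mount-point)))
    (when index
      (list (format nil "hrStorageUsed.~D" index)
            (format nil "hrStorageSize.~D" index)))))
```

For "/tmp" this yields the same two names that are passed to snmp:snmp-get in the form above.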
So, all I need to do is the following three basic tasks:
* write some trivial functions to compute the available space from the above interface
* run a thread to monitor the available space, and do something when it's too low (I know there are some tricks to prevent duplicate actions)
* send a mail to notify about the events (there are so many mail client packages on Cliki.net)
Optional tasks:
* save historical data using Elephant
* draw historical graphs (Zach has some pure Lisp graphics libraries, I remember)
* show the whole system on the Web (I don't know how to use Hunchentoot, but I know CL-HTTP well)
Let me use this weekend to finish at least the three basic tasks, and we can nurture it to let it do more things. Nagios is too complex and hard for Lispers to extend, but it could be a reasonable choice if my plan fails.
--binghe
On 2011-6-10, at 00:48, Hans Hübner wrote:
I think it does not make any sense to debate make vs. download. If Binghe wants to do it, he can choose whatever tool he thinks is best.
To compute the available space (as a percent) of "/tmp", I can use the following form:
(apply #'/ (snmp:snmp-get "localhost" (list "hrStorageUsed.33" "hrStorageSize.33")))
711/129011
Oops, this is not the "available" space; it is the used space. I should apply (- 1 *) to the result, or check whether the used share is going too high.
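As a small worked example of that correction, here is a sketch using the ratio shown above. The function names are hypothetical, and the size is assumed to be non-zero.

```lisp
(defun used-fraction (used size)
  "Fraction of the storage in use, as an exact rational. SIZE must be non-zero."
  (/ used size))

(defun available-fraction (used size)
  "Fraction still free: one minus the used fraction."
  (- 1 (used-fraction used size)))

(defun as-percent (fraction)
  "Convert a fraction to a floating-point percentage."
  (* 100.0 fraction))

;; With the ratio returned above (711/129011 used):
;; (available-fraction 711 129011) is the exact ratio 128300/129011,
;; and (as-percent 711/129011) is a little over 0.55, so "/tmp" is
;; almost empty rather than almost full.
```

Keeping the exact ratio until the final conversion avoids rounding in the comparison against a threshold such as 9/10.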
--binghe
Hi, Erik (and others)
I've written a basic monitor program (RT ticket #24) that checks all disks every 5 minutes and sends an alert mail if
* a disk is over 90% full, and
* no alert mail (for this disk) was sent during the last check.
It's a pure Lisp solution based on CCL, cl-net-snmp and cl-smtp. Let's test it for several days and see if it works on the next disk-full event.
--binghe
Binghe wrote:
I've written a basic monitor program (RT ticket #24) that checks all disks every 5 minutes and sends an alert mail if
- a disk is over 90% full, and
- no alert mail (for this disk) was sent during the last check.
Sounds pretty good. May I suggest tweaking the hysteresis?
With the logic as described, I would expect to get an alert email every 10 minutes. It might be better to suppress alerts until a check sees that the condition has cleared. An extension might add a "low tide" reset mark some distance from the trigger level (e.g. <= 80%) to avoid small oscillations around the 90% mark.
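The suppression scheme with a "low tide" mark could be sketched as a small state machine. This is a minimal illustration, not the monitor's actual code: the names and the list-of-readings driver are hypothetical, and the 90% and 80% levels are taken from the discussion above.

```lisp
(defparameter *trigger-level* 90/100
  "Send an alert when the used fraction rises above this.")
(defparameter *reset-level* 80/100
  "Re-arm the alert once the used fraction falls to this or below.")

(defun next-alert-state (alerting-p used-fraction)
  "Return two values: the new alerting state, and whether to send a mail now.
ALERTING-P says whether an alert is already outstanding for this disk."
  (cond ((and (not alerting-p) (> used-fraction *trigger-level*))
         (values t t))                  ; crossed the trigger: alert once
        ((and alerting-p (<= used-fraction *reset-level*))
         (values nil nil))              ; reached the low-tide mark: re-arm
        (t
         (values alerting-p nil))))     ; otherwise stay quiet

(defun alerts-for (readings)
  "Run a list of used-fraction READINGS through the state machine and
return one T/NIL flag per reading, marking the checks that send a mail."
  (loop with alerting-p = nil
        for reading in readings
        collect (multiple-value-bind (new-state send-p)
                    (next-alert-state alerting-p reading)
                  (setf alerting-p new-state)
                  send-p)))

;; (alerts-for '(85/100 91/100 92/100 89/100 91/100 79/100 95/100))
;; sends on the second reading, stays quiet while usage hovers around 90%,
;; re-arms at 79%, and sends again on the last reading.
```

With this shape, small oscillations between 80% and 90% produce no extra mail, which is the point of the reset mark.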
A low-frequency (i.e. daily or weekly) nag also helps keep persistent issues from being forgotten.
Just trying to offer my experience if not my time.
- Daniel
Hi, Daniel
Thanks very much for your suggestion. I've adjusted the alert policy according to your rules, and the system will send a daily summary mail to the major SAs. Now I'm considering collecting historical data and reporting more aspects (load average, network, ...).
Regards,
Chun Tian (binghe)
On 2011-6-11, at 03:39, dherring@tentpost.com wrote:
It might be better to suppress alerts until a check sees that the condition has cleared. An extension might add a "low tide" reset mark some distance from the trigger level (e.g. <= 80%) to avoid small oscillations around the 90% mark.
A low-frequency (i.e. daily or weekly) nag also helps keep persistent issues from being forgotten.
clo-devel mailing list clo-devel@common-lisp.net http://lists.common-lisp.net/cgi-bin/mailman/listinfo/clo-devel
Hi Binghe,
please do not send out daily reports. I will not look at them.
Thanks, Hans
2011/6/11 Chun Tian (binghe) binghe.lisp@gmail.com:
I've adjusted the alert policy according to your rules, and the system will send a daily summary mail to the major SAs.
OK, done.
On 2011-6-11, at 16:45, Hans Hübner wrote:
please do not send out daily reports. I will not look at them.
On Sat, 11 Jun 2011, Hans Hübner wrote:
please do not send out daily reports. I will not look at them.
Just to clarify, the idea was to only send out daily reports when there was a problem. Nominally, no report would be sent.
There should be no reason to ignore them, other than "that can wait". When such a decision is made, it may help to change the threshold or schedule a reminder for some time in the future. The daily or weekly nag was meant to be that reminder.
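A reminder of this kind could be decided with a check like the one below. It is only a sketch under assumptions: the names are hypothetical, times are universal times in seconds, and the first alert for a problem is assumed to be sent by other code.

```lisp
(defparameter *nag-interval* (* 24 60 60)
  "Seconds between reminders for an unresolved problem (one day here).")

(defun reminder-due-p (problem-p last-mail-time now)
  "True when a problem is still outstanding and the last mail about it is
at least *NAG-INTERVAL* seconds old. LAST-MAIL-TIME is a universal time,
or NIL if no mail was sent yet. Never true while there is no problem."
  (and problem-p
       last-mail-time
       (>= (- now last-mail-time) *nag-interval*)))

;; Typical call from a periodic check (LAST-MAIL-TIME kept per disk):
;; (reminder-due-p disk-still-full-p last-mail-time (get-universal-time))
```

Because the function returns false whenever there is no problem, nothing is mailed in the nominal case.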
- Daniel
P.S. It is easy to create and test a full disk condition.
$ dd if=/dev/zero of=/tmp/test-file bs=1048576 count=1024
will write a file of 1024*1MB=1GB in size. Tweak the parameters or copy the file as needed.
binghe, Daniel,
I am certainly not opposed to problem reports, but I do not need daily reports when there is no problem. Also, the report should say what the problem is rather than just show a status and let me figure out the problem again.
Thanks for working on this anyway, Hans
On Sat, Jun 11, 2011 at 4:54 PM, Daniel Herring dherring@tentpost.com wrote:
Just to clarify, the idea was to only send out daily reports when there was a problem. Nominally, no report would be sent.
Hi, Daniel
Sorry, obviously I misunderstood your idea about daily reports. I think a "reminder" is reasonable; I will add this feature.
Thank you for your suggestion.
--binghe