Monitoring – return to Nagios Core

Without measurement, you will never understand the performance of your IT infrastructure. Monitoring is the key; not only will it provide the baseline of your network, it will also enable you to react proactively. Informing the users there is a problem is better than acknowledging a problem reported by a user.

What is the load on the infrastructure, when do the peaks occur and how can you mitigate poor response for the users, if applicable? Making a baseline of the infrastructure also allows for easy identification of differences – thus faster alerting and troubleshooting.

In bigger companies, the responsibility for the infrastructure is split across several departments. With these silos of responsibility, there will most likely be an equal number of monitoring systems, each presenting its information in a different format.

On top of that, each of these monitoring systems provides technical data, e.g. host x has a CPU load of 80%, or interface y is dropping packets. That’s fine for technicians, but users will be lost if error messages like this are reported – even worse if all these departments report differently.

Users are better served if their business is monitored like a traffic light. Green=OK, Yellow=some problems and Red=problems. Handing them such an interface is brilliant: it shows them how their systems are performing. If money is needed to improve things, they will understand why far better.
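As a rough illustration of the traffic-light idea, the sketch below folds the states of several technical checks into one colour per business service. The numeric states are the standard Nagios plugin return codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN); the `Light` enum, the `business_light` function and the choice to show UNKNOWN as yellow are my own assumptions for this example, not something Nagios or Opsview provides.

```python
from enum import IntEnum


class Light(IntEnum):
    """Traffic-light colours, ordered from best to worst (hypothetical helper)."""
    GREEN = 0
    YELLOW = 1
    RED = 2


# Standard Nagios plugin return codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
# Showing UNKNOWN as yellow is an assumption made for this sketch.
STATE_TO_LIGHT = {
    0: Light.GREEN,
    1: Light.YELLOW,
    2: Light.RED,
    3: Light.YELLOW,
}


def business_light(states):
    """Return the worst colour among the checks behind one business service."""
    lights = [STATE_TO_LIGHT.get(state, Light.YELLOW) for state in states]
    # A service without any checks is reported as green here.
    return max(lights, default=Light.GREEN)


if __name__ == "__main__":
    # Example: internal mail is fine, external mail is critical.
    print(business_light([0, 2]).name)
```

The user only sees the single colour; the technician can still drill down to the individual check results.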

That was the biggest reason for me – around 3 years ago – to be an advocate of Opsview. It is based on Nagios – the master of all monitoring software – but without the administrative hassle which comes with Nagios.

With a clickable interface, configuration turned into a joy; add-ons like simple slave deployment (a mini monitoring machine taking the local load) made it even better. With every new release, problems were solved and new features added. One example is SNMP trap management: a device tells the monitoring machine it has a problem, instead of waiting for the monitoring machine to detect that a problem occurred.

In those days, there was a Community edition (free, with all the features and support from the community) and an Enterprise edition (not free, with paid support).

Having the Community edition was fine. It worked, and I provided a perfect service to all my customers. As a small company, I do not have the room in the budget to go Enterprise. And hey, these monitored companies – all non-profit – were happy: the availability of their infrastructure increased since problems were detected faster and better. Instead of ‘it does not work’ it became ‘internal email is fine, but external email is not’.

Then the announcement came. The business model Opsview used was “outdated”. Community turned into Core, Pro, Enterprise and MSP. So far, no problem. However, in the same announcement it turned out that Community was stripped of its most important functions. Slave handling removed. Slave clustering removed. SNMP trap processing removed. The other removals, dashboard and data warehouse, are less important to my customers and to me.

Then what? Never touching the Opsview installation again would be an option, but then none of the new goodies would arrive either. On top of that, this monitoring machine is shared with other functions. Freezing the installation can be done, but eventually it will stop working as other components of the OS are upgraded and break it.

Alternatives? Yes, many. A forced migration to a new application can be done, if there is enough time available. And that’s what I don’t have.

So, return to the mother of all monitoring, Nagios. It has been a long time, but the logic is still in my mind. They have become big, with a large installed base – and the required functionality is in the Core edition.

I have already migrated, with a few bumps and glitches. I copied the Nagios configuration from the Opsview directory, but it was not easy, due to the following “points of attention”:

Install the nagios3 full package on the master
Install the core package, not the web interface, on the slave
Path names need changing. I used the default nagios3 install from my OS, so the path names differ from the Opsview ones
Opsview leaves a few small files, which change the operation of a plain Nagios installation significantly, especially with slaves. These files need to be removed.
Tunneling is great. By enabling the tunnel, the slave talks to localhost, and that traffic is transported to the master over ssh. I moved the tunnel to autossh, which monitors the tunnel and restarts it if needed.
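For the path-name point in the list above, a small script can take most of the pain out of the copied configuration. This is only a sketch: the old and new directories in `PATH_MAP` are assumptions about a typical Opsview layout and a typical distribution nagios3 package, and `rewrite_paths` is a hypothetical helper, so check both against your own system before running it over real files.

```python
import re

# Assumed mapping from Opsview-style paths to distribution nagios3 paths.
# These directories are illustrative; verify them on your own machines.
PATH_MAP = {
    "/usr/local/nagios/libexec": "/usr/lib/nagios/plugins",
    "/usr/local/nagios/etc": "/etc/nagios3",
    "/usr/local/nagios/var": "/var/lib/nagios3",
}


def rewrite_paths(text, path_map=PATH_MAP):
    """Replace every old path prefix found in text with its new location."""
    # Longest prefixes first, so the most specific path wins.
    ordered = sorted(path_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(old) for old in ordered))
    # One pass over the text, so a replacement is never rewritten again.
    return pattern.sub(lambda match: path_map[match.group(0)], text)


if __name__ == "__main__":
    line = "command_line /usr/local/nagios/libexec/check_ping -H $HOSTADDRESS$"
    print(rewrite_paths(line))
```

Running it on a copy of the configuration first, and then letting Nagios verify the result with its pre-flight check (`-v`), keeps the live setup safe.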

So, monitoring works. It has a simpler layout and does not require MySQL, so the load on that machine has gone down. My Android app finally works – the Opsview mobile client does not support multidomain SSL certificates. I no longer have the data warehouse, and I have accepted this. And the size of my backup has been reduced by about 1 GB (compressed) due to the removal of the opsview, odw and runtime databases from MySQL.

On my list:

Install naglite – the simplest console for problem reporting
Install NagiosQL – easy maintenance of Nagios configuration
Add servicegroups – for business process monitoring
Replace NSCA with NRD – to be ready for more 🙂
See what ndoutils can do for me
Bits and pieces on enabling graphing for special servicechecks.

At least, the users are satisfied again – and that’s what counts.