At a high level, you need three things to run a trouble-free datacenter (even if your datacenter consists of two mini-tower systems stuffed in a closet): intelligent architecture, monitoring, and trend analysis. Intelligent architecture consists of making good purchase decisions and designing virtual machines that can appropriately handle their load. Monitoring allows you to prevent or respond quickly to emergent situations. Trend analysis helps you to determine how well your reality matches your projections and greatly assists in future architectural decisions. In this article, we’re going to focus on trend analysis. We will set up a data collection and graphing system called “PNP4Nagios” that will allow you to track anything that you can measure. It will hold that data for four years. You can display it in graphs on demand.

What You Get

I know that intro was a little heavy. So, to put it more simply, I’m giving you graphs. Want to know how much CPU that VM has been using? Trying to figure out how quickly your VMs are filling up your Cluster Shared Volumes? Curious about a VM’s memory usage? We have all of that.

Where I find it most useful: Getting rid of vendor excuses. We all have at least one of those vendors that claim that we’re not providing enough CPU or memory or disk or a combination. Now, you can visually determine the reasonableness of their demands.

First, the host and service screens in Nagios will get a new graph icon next to every host and service that track performance data. Also, hovering over one of those graph icons will show a preview of the most recent chart:

p4n_mainscreen

Second, clicking any of those icons will open a new tab with the performance data graph for the selected item.

p4n_chartpage

Just as the Nagios pages periodically refresh, the PNP4Nagios page will update itself.

Additionally, you can do the following:

  • Click-dragging a section on a graph will cause it to zoom. If you’ve ever used the zoom feature in Performance Monitor, this is similar.
  • In the Actions bar, you can:
    • Set a custom time/date range to graph
    • Generate a PDF of the visible charts
    • Generate XML summary data
  • Create a “basket” of the graphs that you view most. The basket persists between sessions, so you can build a dashboard of your favorite charts

What You Need

Fortunately, you don’t need much to get going with PNP4Nagios.

Fiscal Cost

Let’s answer the most important question: what does it cost? PNP4Nagios does not require you to purchase anything. Their site does include a Donate button. If your organization finds PNP4Nagios useful, it would be good to throw a few dollars their way.

You’ll need an infrastructure to install PNP4Nagios on, of course. We’ll wrap that up into the later segments.

Nagios

As its name implies, PNP4Nagios needs Nagios. PNP4Nagios installs alongside Nagios on the same system. We have a couple of walkthroughs for installing Nagios as a Hyper-V guest, divided by distribution.

The installation really doesn’t change much between distributions. The differences lie in how you install the prerequisites and in how you configure Apache. If you know those things about your distribution, then you should be able to use either of the two linked walkthroughs to great effect. If you’d rather see something on your exact distribution, the official Nagios project has stepped up its game on documentation. If we haven’t got instructions for your distribution, maybe they do. There are still things that I do differently, but nothing of critical importance. Also, being a Hyper-V blog, I have included special items just for monitoring Hyper-V, so definitely look at the post-installation steps of my articles.

Also, if you want to use SSL and Active Directory to secure your Nagios installation, we’ve got an article for that.

Disk Space

According to the PNP4Nagios documentation, each item that you monitor will require about 400 kilobytes once it has reached maximum data retention. That assumes that you will leave the default historical interval and retention lengths. More information can be found on the PNP4Nagios site. So, 20 systems with 12 monitors apiece will use about 96 megabytes.
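If you want to rerun that arithmetic for your own environment, here is a minimal sketch. It assumes the roughly 400 kilobytes per monitored item quoted above; the function name is something I made up for illustration.

```shell
# Estimate perfdata storage once every item has reached maximum retention.
# Assumes ~400 KB per monitored item (the figure quoted above); override with $3.
estimate_perfdata_kb() {
    hosts=$1
    items_per_host=$2
    kb_per_item=${3:-400}
    echo $(( hosts * items_per_host * kb_per_item ))
}

estimate_perfdata_kb 20 12   # prints 96000 (KB), or about 96 megabytes
```

If you change the retention settings, pass your own per-item size as the third argument.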

PNP4Nagios itself appears to use around 7 megabytes once installed and extracted.

Downloading PNP4Nagios

PNP4Nagios is distributed on Sourceforge: https://sourceforge.net/projects/pnp4nagios/files/latest/download.

As always, I recommend that you download to a standard workstation and then transfer the files to the Nagios server. Since I operate using a Windows PC and run Nagios on a Linux system, WinSCP is my choice of transfer tool.

On my Linux systems, I create a “Downloads” directory in my home folder and place everything there. The install portion of my instructions will be written using the file’s location as a starting point. So, for me, I begin with cd ~/Downloads.

Installing PNP4Nagios

PNP4Nagios installs quite easily.

PNP4Nagios Prerequisites

Most of the prerequisites for PNP4Nagios automatically exist in most Linux distributions. Most of the remainder will have been satisfied when you installed Nagios. The documentation lists them: http://docs.pnp4nagios.org/pnp-0.6/about#required_software.

  • Perl, at least version 5. To check your installed Perl version: perl -v
  • RRDTool: This one will not be installed automatically or during a regular Nagios build. Most distributions include it in their mainstream repositories. Install with your distribution’s package manager.
    • CentOS and most other RedHat-based distributions: sudo yum install perl-rrdtool
    • SUSE-based systems: sudo zypper install rrdtool
    • Ubuntu and most other Debian-based distributions: sudo apt install rrdtool librrds-perl
  • PHP, at least version 5. This would have been installed with Nagios. Check with: php -v
  • GD extension for PHP. You might have installed this with Nagios. Easiest way to check is to just install it; it will tell you if you’ve already got it.
    • CentOS and most other RedHat-based distributions: sudo yum install php-gd
    • SUSE-based systems: sudo zypper install php-gd
    • Ubuntu and most other Debian-based distributions: sudo apt install php-gd
  • mod_rewrite extension for Apache. This should have been installed along with Nagios. How you check depends on whether your distribution uses “apache2” or “httpd” as the name of the Apache executable:
    • CentOS and most other RedHat-based distributions: sudo httpd -M | grep rewrite
    • Ubuntu, openSUSE, and most Debian and SUSE distributions: sudo apache2ctl -M | grep rewrite
  • There will be a bit more on this in the troubleshooting section near the end of the article, but if you’re running a more current version of PHP (like 7), then you may not have the XML extension built-in. I only ran into this problem on my Ubuntu installation. I solved it with this: sudo apt install php-xml
  • openSUSE was missing a couple of PHP modules on my system: sudo zypper install php-sockets php-zlib

If you are missing anything that I did not include instructions for, you can visit one of my articles on installing Nagios. If I haven’t got one for your distribution, then you’ll need to search for instructions elsewhere.
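To get a quick overview before you start installing packages, a small check like the following can tell you which of the command-line prerequisites are missing. This is only a sketch: the function name is hypothetical, and the default list (perl, php, rrdtool) is my assumption based on the prerequisites above. It only looks for executables on your PATH; it cannot see Perl modules, PHP extensions, or Apache modules.

```shell
# Report which prerequisite commands are missing from PATH.
# With no arguments, checks an assumed default list: perl, php, rrdtool.
missing_prereqs() {
    [ "$#" -gt 0 ] || set -- perl php rrdtool
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
    done
}
```

Run missing_prereqs with no arguments; no output means that all three commands were found. For the PHP extensions, php -m lists the loaded modules, so php -m | grep -i gd is a reasonable spot check.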

Unpacking and Installing PNP4Nagios

As I mentioned in the download section, I place my downloaded files in ~/Downloads. I start from there (with cd ~/Downloads). Start these directions in the folder where you placed your downloaded PNP4Nagios tarball.

  1. Unpack the tarball. I wrote these directions with version 0.6.26. Modify your command as necessary (don’t forget about tab completion!): tar xzf pnp4nagios-0.6.26.tar.gz
  2. Move to the unpacked folder: cd ./pnp4nagios-0.6.26/
  3. Next, you will need to configure the installer. Most of us can just use it as-is. Some of us will need to override some things, such as the Nagios user groups. To determine if that applies to you, open /usr/local/nagios/etc/nagios.cfg and look for the nagios_user and nagios_group entries.

    If both nagios_user and nagios_group are “nagios”, then you don’t need to do anything special.
    Regular configuration: ./configure
    Configuration with overrides: ./configure --with-nagios-user=naguser --with-nagios-group=nagcmd
    Other overrides are available. You can view them all with ./configure --help. One useful override would be to change the location of the emitted perfdata files to an auxiliary volume to control space usage. On my Ubuntu system, I needed to override the location of the Apache conf files: ./configure --with-httpd-conf=/etc/apache2/sites-available
  4. When configure completes, check its output. Verify that everything looks OK. Especially pay attention to “Apache Config File” — note the value because you will access it later. If anything looks off, install any missing prerequisites and/or use the appropriate configure options. You can continue running ./configure until everything suits your needs.
  5. Compile the program: make all. If you have an “oh no!” moment in which you realize that you missed something, you can still re-run ./configure and then compile again.
  6. Because we’re doing a new installation, we will have it install everything: sudo make fullinstall. Be aware that we are now using sudo. That’s because it will need to copy files into locations that your regular account won’t have access to. For an upgrade, you’d likely only want sudo make install. Please check the documentation for additional notes about upgrading. If you didn’t pay attention to the output file locations during configure, they’ll be displayed to you again.
  7. We’re going to be adding a bit of flair to our Nagios links. Enable the pop-up extension with: sudo cp ./contrib/ssi/status-header.ssi /usr/local/nagios/share/ssi/
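If you would rather not eyeball nagios.cfg in step 3, the check can be scripted. The following is a sketch, not part of PNP4Nagios: both function names are hypothetical, and it assumes that nagios_user and nagios_group appear as plain name=value lines in the file.

```shell
# Print the value of a "name=value" directive from a Nagios config file.
# $1 = directive name, $2 = path to the config file
cfg_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$2" | tail -n 1
}

# Print the ./configure command line that matches the Nagios user and group.
# $1 = path to nagios.cfg (defaults to the location used in this article)
suggest_configure() {
    cfg=${1:-/usr/local/nagios/etc/nagios.cfg}
    user=$(cfg_value nagios_user "$cfg")
    group=$(cfg_value nagios_group "$cfg")
    if [ -z "$user" ] || [ -z "$group" ]; then
        echo "could not read nagios_user/nagios_group from $cfg" >&2
        return 1
    fi
    if [ "$user" = "nagios" ] && [ "$group" = "nagios" ]; then
        echo "./configure"
    else
        echo "./configure --with-nagios-user=$user --with-nagios-group=$group"
    fi
}
```

Calling suggest_configure with no arguments reads the default nagios.cfg and prints the command to run. Add any other overrides, such as --with-httpd-conf, yourself.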

Installation is complete. We haven’t wired it into Nagios yet, so don’t expect any fireworks.

Configure Apache Security for PNP4Nagios

If you just use the default Apache security for Nagios, then you can skip this whole section. As outlined in my previous article, I use Active Directory authentication. Really, all that you need to do is duplicate your existing security configuration to the new site. Remember how I told you to pay attention to the output of configure, specifically “Apache Config File”? That’s the file to look in.

My “fixed” file looks like this:

Only a single line needed to be changed to match my Nagios virtual directories.

Initial Verification of PNP4Nagios Installation

Before we go any further, let’s ensure that our work to this point has done what we expected.

  1. If your distribution’s Apache enables and disables sites through symlinks in sites-enabled that point into sites-available (ex: Ubuntu) and you instructed PNP4Nagios to place its file in sites-available, enable the site: sudo a2ensite pnp4nagios.conf
  2. Restart Apache.
    1. CentOS and most other RedHat-based distributions: sudo service httpd restart
    2. Almost everyone else: sudo service apache2 restart
  3. If necessary, address any issues with Apache starting. For instance, Apache on my openSUSE box really did not like the “Order” and “Allow” directives.
  4. Once Apache starts correctly, access http://yournagiosserveraddress/pnp4nagios. For instance, my internal URL is http://nagios.siron.int/pnp4nagios. Remember that you copied over your Nagios security configuration, so you will log in using the same credentials that you use on a normal Nagios site.
  5. Fix any problems indicated by the web page. Continue reloading the Apache server and the page as necessary until you get the green light:
    p4n_greenlight
  6. Remove the file that validates the installation: sudo rm /usr/local/pnp4nagios/share/install.php

Installation was painless on my CentOS and Ubuntu systems. openSUSE gave me more drama. In particular, it complained about “PHP zlib extension not available” and “PHP socket extension not available”. Very easy to fix: sudo zypper install php-sockets php-zlib. Don’t forget to restart Apache after making these changes.

Initial Configuration of Nagios for PNP4Nagios

At this point, you have PNP4Nagios mostly prepared to do its job. However, if you try to access the URL, you’ll get a message that says that it doesn’t have any data: “perfdata directory “/usr/local/pnp4nagios/var/perfdata/” is empty. Please check your Nagios config.” Nagios needs to start feeding it data.

We start by making several global changes. If you are comparing my walkthrough to the official PNP4Nagios documentation, be aware that I am guiding you to a Bulk + NPCD configuration. I’ll talk about why after the how-to.

Global Nagios Configuration File Changes

In the text editor of your choice, open /usr/local/nagios/etc/nagios.cfg. Find each of the entries that I show in the following block and change them accordingly. Some don’t need anything other than to be uncommented:
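The PNP4Nagios 0.6 documentation for bulk mode with NPCD lists the directives involved. The sketch below checks a nagios.cfg against those documented values. Treat it as an assumption-laden aid: the function name is hypothetical, the paths assume the default /usr/local/pnp4nagios prefix, and the values come from the documentation rather than from my own file. It deliberately leaves out the two long *_perfdata_file_template lines; copy those from the documentation.

```shell
# Check a nagios.cfg for the bulk-mode-with-NPCD directives.
# Values follow the PNP4Nagios 0.6 documentation and the default install prefix.
# $1 = path to nagios.cfg (defaults to the location used in this article)
check_perfdata_settings() {
    cfg=${1:-/usr/local/nagios/etc/nagios.cfg}
    missing=0
    for want in \
        'process_performance_data=1' \
        'service_perfdata_file=/usr/local/pnp4nagios/var/service-perfdata' \
        'service_perfdata_file_mode=a' \
        'service_perfdata_file_processing_interval=15' \
        'service_perfdata_file_processing_command=process-service-perfdata-file' \
        'host_perfdata_file=/usr/local/pnp4nagios/var/host-perfdata' \
        'host_perfdata_file_mode=a' \
        'host_perfdata_file_processing_interval=15' \
        'host_perfdata_file_processing_command=process-host-perfdata-file'
    do
        # -F: literal text, -x: whole line must match, -q: no output
        if ! grep -Fxq "$want" "$cfg"; then
            echo "missing or different: $want"
            missing=1
        fi
    done
    return $missing
}
```

Run check_perfdata_settings after editing. Any line it prints is a directive that is still commented out, absent, or set to another value. The match is exact, so extra whitespace or a trailing comment on a line will also be reported.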

 

Next, open /usr/local/nagios/etc/objects/commands.cfg. At the end, you’ll find some existing commands that mention “perfdata”. After those, add the commands from the following block. If you don’t use the initial Nagios sample files, then just place these commands in any active cfg file that makes sense to you.
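For reference, the PNP4Nagios 0.6 documentation defines two commands for bulk mode with NPCD; each one simply moves the collected perfdata file into NPCD’s spool directory. The sketch below wraps them in a shell function so that you can review the text before appending it. The function name is hypothetical, and the paths assume the default /usr/local/pnp4nagios prefix.

```shell
# Print the two bulk-mode command definitions from the PNP4Nagios 0.6 docs.
# Paths assume the default install prefix; adjust them if you overrode it.
print_pnp_commands() {
    cat <<'EOF'
define command {
    command_name    process-service-perfdata-file
    command_line    /bin/mv /usr/local/pnp4nagios/var/service-perfdata /usr/local/pnp4nagios/var/spool/service-perfdata.$TIMET$
}

define command {
    command_name    process-host-perfdata-file
    command_line    /bin/mv /usr/local/pnp4nagios/var/host-perfdata /usr/local/pnp4nagios/var/spool/host-perfdata.$TIMET$
}
EOF
}
```

To append the definitions to the cfg file that you chose, pipe the output through sudo tee -a followed by that file’s path. The command names must match the *_perfdata_file_processing_command values in nagios.cfg.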

Configuring NPCD

The performance collection method that we’re employing involves the Nagios Perfdata C Daemon (NPCD). The default configuration will work perfectly for this walkthrough. If you need something more from it, you can edit /usr/local/pnp4nagios/etc/npcd.cfg. We just want it to run as a daemon:
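A minimal sketch of starting it by hand, based on the invocation shown in the PNP4Nagios documentation. The function names and the PNP_PREFIX variable are mine, and the prefix assumes the default install location.

```shell
# Assumed install prefix from this walkthrough; override if you changed it.
PNP_PREFIX=${PNP_PREFIX:-/usr/local/pnp4nagios}

# Start NPCD in the background: -d daemonizes, -f names the config file.
start_npcd() {
    sudo "$PNP_PREFIX/bin/npcd" -d -f "$PNP_PREFIX/etc/npcd.cfg"
}

# Succeeds only when a process named exactly "npcd" is running.
npcd_running() {
    pgrep -x npcd >/dev/null 2>&1
}
```

Run start_npcd once, then npcd_running && echo "NPCD is up" to confirm. Since make fullinstall also installs an init script, sudo service npcd start should work as well on most systems.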

Enable it to run automatically at startup.

  • Most Red Hat and SUSE based distributions: sudo chkconfig --add npcd
  • Ubuntu and most other Debian-based distributions: sudo update-rc.d npcd defaults

Configuring Hosts in Nagios for PNP4Nagios Graphing

If you made it here, you’ve successfully completed all the hard work! Now you just need to tell Nagios to start collecting performance data so that PNP4Nagios can graph it.

Note: I deviate substantially from the PNP4Nagios official documentation. If you follow those directions, you will quickly and easily set up every single host and every single service to gather data. I didn’t want that because I don’t find such a heavy hand to be particularly useful. You’ll need to do more work to exert finer control. In my opinion, that extra bit of work is worth it. I’ll explain why after the how-to.

If you followed the path of least resistance, every single host in your Nagios environment inherits from a single root source. Open /usr/local/nagios/etc/objects/templates.cfg. Find the define host object with a name of generic-host. Most likely, this is your master host object. Look at its configuration:

The line to notice is process_perf_data, which the sample templates set to 1. Now that you’ve enabled performance data processing in nagios.cfg, this means that Nagios and PNP4Nagios will now start graphing for every single host in your Nagios configuration. Sound good? Well, wait a second. What it really means is that it will graph the output of the check_command for every single host in your Nagios configuration. What is check_command in this case? Probably check_ping or check_icmp. The performance data that those commands output consists of the round-trip average and packets lost during pings from the Nagios server to the host in question. Is that really useful information? To track for four years?

I don’t really need that information. Certainly not for every host. So, I modified mine to look like this:

What we have:

  • Our existing hosts are untouched. They’ll continue not recording performance data just as they always have.
  • A new, small host definition called “perf-host”. It also does not set up the recording of host performance data. However, its “action_url” setting will cause it to display a link to any graphs that belong to this host. You can use this with hosts that have graphed services but you don’t want the ping statistics tracked. To use it, you would set up/modify hosts and host templates to inherit from this template in addition to whatever host templates they already inherit from. For example: use perf-host,generic-host.
  • A new, small host definition called “perf-host-pingdata”. It works exactly like “perf-host” except that it will capture the ping data as well. The extra bit on the end of the “action_url” will cause it to draw a little preview when you mouseover the link. To use it, you will set up/modify hosts and host templates to inherit from this template in addition to whatever host templates they already inherit from. For example: use perf-host-pingdata,generic-host.
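Here is a sketch of what those two templates could look like. It is a reconstruction, not a copy of my file: the action_url values follow the pattern in the PNP4Nagios 0.6 documentation, the function name is hypothetical, and it assumes that you have also set process_perf_data to 0 on generic-host.

```shell
# Print two host templates matching the description above.
# action_url patterns follow the PNP4Nagios 0.6 documentation.
print_perf_host_templates() {
    cat <<'EOF'
define host {
    name                perf-host
    process_perf_data   0
    action_url          /pnp4nagios/index.php/graph?host=$HOSTNAME$&srv=_HOST_
    register            0
}

define host {
    name                perf-host-pingdata
    process_perf_data   1
    action_url          /pnp4nagios/index.php/graph?host=$HOSTNAME$&srv=_HOST_' class='tips' rel='/pnp4nagios/index.php/popup?host=$HOSTNAME$&srv=_HOST_
    register            0
}
EOF
}
```

The class='tips' and rel= portion is the “extra bit” that produces the mouseover preview. The unbalanced single quotes are intentional; Nagios wraps the whole value in its own quotes when it builds the link.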

Note: When setting the inheritance:

  • perf-host or perf-host-pingdata must come before any other host templates in a use line.
  • In some instances, including a space after the comma in a use line causes Nagios to panic if the name line of the referenced template does not also use a space (ex: you are using tabs instead of spaces on the name generic-host line). Make sure that all of your use directives have no spaces after any commas and you will never have a problem. Ex: use perf-host,generic-host.

Remember to check the configuration and restart Nagios after any changes to the .cfg files:
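A sketch of that check-then-restart sequence, assuming the default /usr/local/nagios prefix and a service named nagios. The function name and the NAGIOS_PREFIX variable are mine.

```shell
# Assumed Nagios install prefix; override if yours differs.
NAGIOS_PREFIX=${NAGIOS_PREFIX:-/usr/local/nagios}

# Validate the configuration and restart Nagios only if validation passes.
reload_nagios() {
    sudo "$NAGIOS_PREFIX/bin/nagios" -v "$NAGIOS_PREFIX/etc/nagios.cfg" &&
        sudo service nagios restart
}
```

Because of the &&, a configuration error leaves the running Nagios instance alone and you get the validation output to work from.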

Couldn’t You Just Set a Single Root Host for Inheritance?

An alternative to the above would be:

In this configuration, perf-host inherits directly from generic-host. You could then have all of your other systems inherit from perf-host instead of generic-host. The problem is that even in a fairly new Nagios installation, a fair number of hosts already inherit from generic-host. You’d need to determine which of those you wanted to edit and carefully consider how inheritance works. If you’re going to all of that trouble, it seems to me that maybe you should just directly edit the generic-host template and be done with it.

Truthfully, I’m only telling you what I do. Do whatever makes sense to you.

Configuring Services in Nagios for PNP4Nagios Graphing

You’ll get much more use out of service graphing than host graphing. Just as with hosts, the default configuration enables performance graphing for all services. Not all services emit performance data, and you may not want data from all services that do produce data. So, let’s fine-tune that configuration as well.

Still in /usr/local/nagios/etc/objects/templates.cfg, find the define service object with a name of generic-service. Disable performance data collection on it and add a stub service that enables performance graphing:

When you want to capture performance data from a service, prepend the new stub service to its use line. Ex: use perf-service,generic-service. The warnings from the host section about the order of items and the lack of a space after the comma in the use line transfer to the service definition.
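A sketch of such a stub service follows. As with the host templates, this is a reconstruction rather than my exact file: the action_url follows the pattern in the PNP4Nagios 0.6 documentation, the function name is hypothetical, and it assumes that generic-service now has process_perf_data set to 0.

```shell
# Print a stub service template that turns performance graphing back on.
# action_url pattern follows the PNP4Nagios 0.6 documentation.
print_perf_service_template() {
    cat <<'EOF'
define service {
    name                perf-service
    process_perf_data   1
    action_url          /pnp4nagios/index.php/graph?host=$HOSTNAME$&srv=$SERVICEDESC$' class='tips' rel='/pnp4nagios/index.php/popup?host=$HOSTNAME$&srv=$SERVICEDESC$
    register            0
}
EOF
}
```

Because perf-service comes first in the use line, its process_perf_data value wins over the one inherited from generic-service.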

Remember to check the configuration and restart Nagios after any changes to the .cfg files:

Example Configurations

In case the above doesn’t make sense, I’ll show you what I’m doing.

Most of the check_nt services emit performance data. I’m especially interested in CPU, disk, and memory. The uptime service also emits data, but for some reason, it doesn’t use the defined “counter” mode. Instead, it’s just a graph that steadily increases at each interval until you reboot, then it starts over again at zero. I don’t find that terribly useful, especially since Nagios has its own perfectly capable host uptime graphs. So, I first configure the “windows-server” host to show the performance action_url. Then I configure the desired default Windows services to capture performance data.

My /usr/local/nagios/etc/objects/windows.cfg:

Now, my hosts that inherit from the default Windows template have the extra action icon, but my other hosts do not:

p4n_hostswithicons

The same story on the services page; services that track performance data have an icon, but the others do not:

p4n_serviceswithicons

Troubleshooting your PNP4Nagios Deployment

Not getting any data? First of all, be patient, especially when you’re just getting started. I have shown you how to set up the bulk mode with NPCD which means that data captures and graphing are delayed. I’ll explain why later, but for now, just be aware that it will take some time before you get anything at all.

If it’s been some time, say, 15 minutes, and you’re still not getting any data, go to verify.pnp4nagios.org/ and download the verify_pnp_config file. Transfer it to your Nagios host. I just plop it into my Downloads folder as usual. Navigate to the folder where you placed yours, then run:
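A sketch of the invocation for the configuration in this article. I’m going from the script’s documented options here (--mode, --config, and --pnpcfg), so confirm them against the usage text on the verify site. The paths assume the default install locations, and the function name is mine.

```shell
# Run the verification script for a bulk-mode-with-NPCD setup.
# Must be called from the folder that contains verify_pnp_config.
run_pnp_verify() {
    perl ./verify_pnp_config --mode bulk+npcd \
        --config=/usr/local/nagios/etc/nagios.cfg \
        --pnpcfg=/usr/local/pnp4nagios/etc
}
```

If the script needs to read files that your account cannot access, run the perl command with sudo.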

That should give you the clues that you need to fix most any problems.

I did have one leftover problem, but only on my Ubuntu system where I had updated to PHP 7. The verify script passed everything, but trying to load any PNP4Nagios page gave me this error: “Call to undefined function simplexml_load_file()”. I only needed to install the PHP XML package to fix that: sudo apt install php-xml. I didn’t look up the equivalent on the other distributions.

Plugin Output for Performance Graphing

To determine if a plugin can be graphed, you could just look at its documentation. Otherwise, you’ll need to manually execute it from /usr/local/nagios/libexec. For instance, we’ll just use the first one that shows up on an Ubuntu system, check_apt:

p4n_testcheckoutput

See the pipe character (|) there after the available updates report? Then the jumble of characters after that? That’s all in the standard format for Nagios performance charting. That format is:

  1. A pipe character after the standard Nagios service monitoring result.
  2. A human-readable label. If the label includes any special characters, the entire label should be enclosed in single quotes.
  3. An equal sign (=)
  4. The reported value.
  5. Optionally, a unit of measure.
  6. A semicolon, optionally followed by a value for the warning level. If the warning level is visible on the produced chart, it will be indicated by a horizontal yellow line.
  7. A semicolon, optionally followed by a value for the critical level. If the critical level is visible on the produced chart, it will be indicated by a horizontal red line.
  8. A semicolon, optionally followed by the minimum value for the chart’s y-axis. Must be the same unit of measure as the value in #4. If not specified, PNP4Nagios will automatically set the minimum value. If this value would make the current value invisible, PNP4Nagios will set its own minimum.
  9. A semicolon, optionally followed by the maximum value for the chart’s y-axis. Must be the same unit of measure as the value in #4. If not specified, PNP4Nagios will automatically set the maximum value. If this value would make the current value invisible, PNP4Nagios will set its own maximum.
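To make the field order concrete, here is a small sketch that splits a line of plugin output into those pieces. It is illustrative only: the function name is hypothetical, the sample line is modeled on check_apt-style output rather than captured from my system, and it does not handle quoted labels that contain spaces.

```shell
# Read plugin output on stdin; print one row per metric:
#   label value unit warn crit min max   ("-" marks an omitted field)
parse_perfdata() {
    awk '
    function dash(s) { return (s == "" ? "-" : s) }
    {
        bar = index($0, "|")
        if (bar == 0) next                 # no performance data on this line
        n = split(substr($0, bar + 1), metric, " ")
        for (i = 1; i <= n; i++) {
            eq = index(metric[i], "=")
            if (eq == 0) continue
            label = substr(metric[i], 1, eq - 1)
            split(substr(metric[i], eq + 1), f, ";")
            value = f[1]
            unit = ""
            # Separate the leading number from an optional unit of measure.
            if (match(value, /^-?[0-9.]+/)) {
                unit = substr(value, RLENGTH + 1)
                value = substr(value, 1, RLENGTH)
            }
            printf "%s %s %s %s %s %s %s\n", label, value, dash(unit), dash(f[2]), dash(f[3]), dash(f[4]), dash(f[5])
        }
    }'
}

echo "APT OK: 0 packages available for upgrade (0 critical updates). |available_upgrades=0;;;0 critical_updates=0;;;0" | parse_perfdata
# prints:
#   available_upgrades 0 - - - 0 -
#   critical_updates 0 - - - 0 -
```

In that sample, both metrics omit the unit, the warning level, and the critical level, and both set the minimum to 0. To try it against a real plugin, pipe the plugin’s output from /usr/local/nagios/libexec into parse_perfdata.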

This format is defined by Nagios and PNP4Nagios conforms to it. You can read more about the format at: verify.pnp4nagios.org/

My plugins did not originally emit any performance data. I have been working on that and should hopefully have all of that work completed before you read this article.

My PNP4Nagios Configuration Philosophy

I had several decision points when setting up my system. You may choose to diverge as it meets your needs. I’ll use this section to explain why I made the choices that I did.

Why “Bulk with NPCD” Mode?

Initially, I tried to set up PNP4Nagios in “synchronous” mode. That would cause Nagios to call on PNP4Nagios to process performance data immediately after every check’s results were returned. I chose that initially because it seemed like the path of least resistance.

It didn’t work for me. I’m betting that I did something wrong. But, I didn’t get my problem sorted out. I found a lot more information on the NPCD mode. So, I switched. Then I researched the differences. I feel like I made the correct choice.

You can read up on the available modes yourself: http://docs.pnp4nagios.org/pnp-0.6/modes.

In synchronous mode, Nagios can’t do anything while PNP4Nagios processes the return information. That’s because it all occurs in the same thread; we call that behavior “blocking”. According to the PNP4Nagios documentation, that method “will work very good up to about 1,000 services in a 5-minute interval”. I assume that’s CPU-driven, but I don’t know. I also don’t know how to quantify or qualify “will work very good”. I also don’t know what sort of environments any of my readers are using.

Bulk mode moves the processing of data from per-return-of-results to gathering results for a while and then processing them all at once. The documentation says that testing showed that 2,000 services were processed in .06 seconds. That’s easier to translate to real-world systems, although I still don’t know the overall conditions that generated that benchmark.

When we add NPCD onto bulk mode, then we don’t block Nagios at all. Nagios still does the bulk gathering, but NPCD processes the data, not Nagios. I chose this method as it means that as long as your Nagios system is multi-core and not already overloaded, you should not encounter any meaningful interruption to your Nagios service by adding PNP4Nagios. It should also work well with most installation sizes. For really big Nagios/PNP4Nagios installations (also not qualified or quantified), you can follow their instructions on configuring “Gearman Mode”.

One drawback to this method: Your “4 Hour” charts will frequently show an empty space at the right of their charts. That’s because they will be drawn in-between collection/processing periods. All of the data will be filled in after a few minutes. You just may not have instant gratification.

Why Not Just Allow Every Host and Service to be Monitored?

The default configuration of PNP4Nagios results in every single host and every single service being enabled for monitoring. From an “ease-of-configuration” standpoint, that’s tempting. Once you’ve set the globals, you literally don’t have to do anything else.

However, we are also integrating directly with Nagios’ generated HTML pages. Whereas PNP4Nagios can determine that a service doesn’t have performance data because Nagios won’t have generated anything, the front-end just has an instruction to add a linked icon to every single service. So, if you just globally enable it, then you’ll get a lot of links that don’t work.

If you’re the only person using your environment, maybe that’s OK. But, if you share the environment, then you’ll start getting calls wanting you to “fix” all those broken links. It won’t take long before you’re spending more time explaining (and re-explaining) that not all of the links have anything to show.

Why Not Just Change the Inheritance Tree?

If you want, you could have your performance-enabled hosts and services inherit from the generic-host/generic-service templates, then have later templates, hosts, and services inherit from those. If that works for you, then take that approach.

I chose to employ multiple inheritance as a way of overriding the default templates because it seemed like less effort to me. When I went to modify the services, I simply copied “perf-service,” to the clipboard and then selectively pasted it into the use line of every service that I wanted. That was easier for me than a selective find-replace operation or manual replacement. It also seems to me that it would be easier to revert that decision if I make a mistake somewhere.

I can envision very solid arguments for handling this differently. I won’t argue. I just think that this approach was best for my situation.