Monitor bacula backup jobs with zabbix

Subject

Bacula has an email notification mechanism built in out of the box, and it does send out a notification after each job. This works, but it does not help when a job is stuck and never finishes, and who reads emails after all?

So I decided to let my zabbix monitoring solution handle this.

Symptoms

Here is what I want my solution to accomplish: discover active backup jobs, create items and triggers for them, and also draw some nice graphs. The discovery part is not that important, as my installation is pretty stable and new hosts or new backup jobs don’t come every day. Still, I don’t like manual typing.

Platform/Tools

My server is Ubuntu 14.04.3 LTS, bacula 5.2.6, zabbix 2.2.6.

I’ll be using my beloved perl language; v5.18.2 is currently installed.

Solution

Discovery script

Create new Template

Template App Bacula

Create new Discovery rule

Discovery rule name: Bacula Jobs Discovery, discovery key: bacula.jobs.discovery.

Please also pay attention to the macro {#JOB_NAME}; it will be used to create the discovered items.

Discovery script on the target server

To make our discovery work, we need to add a UserParameter entry to the zabbix agent configuration on the target machine and also create a discovery script, which will be called by this zabbix item.

Create file /etc/zabbix/zabbix_agentd.d/userparameter_bacula.conf:

#
UserParameter=bacula.jobs.discovery, sudo /usr/lib/zabbix/externalscripts/zabbix_bacula.pl -D
#

Pay attention to the bacula.jobs.discovery key: it has to match the key defined in the discovery rule. Also, make sure the Include directive in your main zabbix agent configuration file covers the /etc/zabbix/zabbix_agentd.d directory.

Now to the discovery script zabbix_bacula.pl. When called with the command line option -D, it delivers a JSON object with a list of the configured backup jobs. We will probably extend it in the future with other options to do something else.

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;
use JSON;

# bconsole prints the job type as a number: 66 is the ASCII code of 'B' (Backup)
my $JOB_TYPE_BACKUP = 66;

# declare the perl command line flags/options we want to allow
my %options=();
getopts("Ds:", \%options);

if ($options{D}) {
        # start with an empty list, so valid JSON is printed even if no job matches
        my $discovery = { 'data' => [] };
        open(my $fh, '-|', 'echo "show jobs" | bconsole') or die $!;
        while (my $line = <$fh>) {
                if ($line =~ /^Job:(.*)/) {
                        # split "name=... JobType=... ..." into key/value pairs
                        my %job;
                        foreach my $t (split(' ', $1)) {
                                my ($k, $v) = split(/=/, $t, 2);
                                $job{$k} = $v if defined $v;
                        }
                        # keep backup jobs only
                        if (defined $job{JobType} && $job{JobType} eq $JOB_TYPE_BACKUP) {
                                push(@{$discovery->{'data'}}, {'{#JOB_NAME}' => $job{name}});
                        }
                }
        }
        close($fh);
        print encode_json($discovery);
}

Here is the resulting JSON output, formatted for readability:

{
    "data":
        [
            {
                "{#JOB_NAME}":"BackupClient1"
            },
            {
                "{#JOB_NAME}":"BackupCatalog"
            }
        ]
}
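Before pointing zabbix at the script, it can be worth checking that the output really is valid JSON in the expected shape. Here is a minimal sketch of such a check. The helper name discovered_jobs is made up for this illustration, and it uses the core JSON::PP module instead of JSON so that it runs without extra packages:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: takes the raw discovery output and returns the job names.
# Dies if the text is not valid JSON or has no "data" array.
sub discovered_jobs {
    my ($raw) = @_;
    my $doc = decode_json($raw);
    die "no 'data' array in discovery output\n"
        unless ref($doc) eq 'HASH' && ref($doc->{data}) eq 'ARRAY';
    return map { $_->{'{#JOB_NAME}'} } @{ $doc->{data} };
}

# Sample input copied from the output shown above
my $sample = '{"data":[{"{#JOB_NAME}":"BackupClient1"},{"{#JOB_NAME}":"BackupCatalog"}]}';
print "$_\n" for discovered_jobs($sample);
```

On the real machine the sample string would be replaced by the captured output of zabbix_bacula.pl -D.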

Finally create (or modify) /etc/sudoers.d/zabbix to enable sudo for zabbix user:


zabbix ALL=NOPASSWD: /usr/lib/zabbix/externalscripts/zabbix_bacula.pl -D

That’s not all though: we also need to add the user zabbix to the group bacula so that it can run the bconsole command. Check /etc/group:


bacula:x:116:zabbix
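Instead of reading /etc/group by eye, the membership can also be checked from perl. This is only a sketch: user_in_group is a name I made up, and it looks at the member list of the group entry only, so it would miss a user whose primary group is bacula.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $user is listed as a member of $group.
sub user_in_group {
    my ($user, $group) = @_;
    my @entry = getgrnam($group);       # (name, passwd, gid, members)
    return 0 unless @entry;             # the group does not exist
    my %members = map { $_ => 1 } split(' ', $entry[3] // '');
    return $members{$user} ? 1 : 0;
}

my $status = user_in_group('zabbix', 'bacula') ? 'is' : 'is NOT';
print "zabbix $status in group bacula\n";
```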

The final caveat was a timeout problem with bconsole. On my development system bconsole took about 15 seconds to run, and I was seeing a mysterious ZBX_NOTSUPPORTED, which I first blamed on an incomplete sudoers configuration. There were also error messages like:

zbx_waitpid() killed by signal 15

in the zabbix agent log file. It turned out the real cause was the timeout for external scripts in the zabbix agent configuration file /etc/zabbix/zabbix_agentd.conf:

#Option: Timeout
# Spend no more than Timeout seconds on processing
#
# Mandatory: no
# Range: 1-30
# Default:
# Timeout=3
Timeout=30

Setting Timeout to 30 solved the problem. The timeout value has to be adjusted on the server side as well; otherwise mysterious Interrupted system call error messages will appear in the zabbix server log file.
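To pick a sensible Timeout it helps to measure how long the bconsole call really takes on the machine in question. A small sketch of such a measurement follows. The helper seconds_taken is my own invention, and the command is the same pipeline the discovery script runs:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: runs a shell command and returns the elapsed wall-clock seconds.
sub seconds_taken {
    my ($command) = @_;
    my $start = time();
    system($command);
    return time() - $start;
}

# The same pipeline the discovery script runs, with its output discarded
my $elapsed = seconds_taken('echo "show jobs" | bconsole >/dev/null 2>&1');
printf("bconsole took %.1f seconds\n", $elapsed);
```

If the measured time gets close to 30 seconds, raising Timeout will not be enough, since 30 is the upper limit of the allowed range.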

Item prototypes

To start with, we will have one reporting item per job name. Item prototypes will be created using the {#JOB_NAME} macro from the discovery rule, so we expect items of the form bacula.job.exit_code[JOB_NAME]=JOB_EXIT_CODE, where the job exit code is one of OK, Error, etc.

ScreenShot112

The picture above is self-explanatory, just don’t forget to use Zabbix trapper as an item type.

Trigger prototypes

We will keep it simple: just one trigger to raise an alarm if the backup job status is not OK. The screenshot below explains it:

ScreenShot113

Please be careful to create this prototype in the Discovery rule and not directly in the Template.

Making bacula report job status to zabbix

Wrapper script for Message resource

The idea is to modify the Message resource in the bacula director configuration file so that it sends messages to zabbix. Unfortunately, there is no option to run an external script or to specify one as a destination. So I decided to create a wrapper script, which is used in place of the mail command in the Message resource and does both: it sends the email and also passes the information on to zabbix.

Let’s create /etc/bacula/scripts/bacula_message.pl:

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;
use Data::Dumper;
use Sys::Syslog qw(:standard :macros);

my %options=();
getopts('c:d:e:i:j:l:n:r:s:t:MDO',\%options);
...

It takes all substitution variables specified in the bacula documentation for the mail command, plus the -s option for the mail subject and one of the -M, -O or -D options, which correspond to the MailCommand, the OperatorCommand and the Daemon message resource in the bacula director configuration file.

Here are the substitution variables as specified in the bacula documentation:

  • %c = Client’s name
  • %d = Director’s name
  • %e = Job Exit code (OK, Error, …)
  • %i = Job Id
  • %j = Unique Job name
  • %l = Job level
  • %n = Job name
  • %r = Recipients
  • %t = Job type (e.g. Backup, …)
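To see how these variables end up inside the wrapper, here is a small sketch that feeds a sample argument list through the same option string. The helper parse_message_args and all the values are made up for illustration; bacula would substitute the real ones.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parses a wrapper-style argument list into a hash,
# using the same option string as bacula_message.pl.
sub parse_message_args {
    my @args = @_;
    local @ARGV = @args;    # getopts() reads its input from @ARGV
    my %opts;
    getopts('c:d:e:i:j:l:n:r:s:t:MDO', \%opts) or die "bad options\n";
    return %opts;
}

# Made-up values standing in for what bacula would substitute
my %opts = parse_message_args(
    '-M', '-n', 'BackupClient1', '-e', 'OK', '-t', 'Backup',
    '-r', 'root@localhost', '-s', 'Bacula: Backup job BackupClient1 OK',
);
print "job=$opts{n} exit=$opts{e} mail=", ($opts{M} ? 'yes' : 'no'), "\n";
```

Passing each value as a separate argument, as bacula does with the quoted '%n', '%e' and so on, keeps values with spaces in one piece.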

 

Modifying bacula config

Now we will use this wrapper script in the Message resource configuration in /etc/bacula/bacula-dir.conf. Here are the excerpts, with the original mail command configurations commented out:

...
# mailcommand = "mail -s \"Bacula: %t job %n %e of %c %l\" %r"
mailcommand = "/etc/bacula/scripts/bacula_message.pl -M -c '%c' -d '%d' -e '%e' -i '%i' -j '%j' -l '%l' -n '%n' -r '%r' -s 'Bacula: %t job %n %e of %c %l' -t '%t'"
...
# operatorcommand = "mail -s \"Bacula: Intervention needed for %j\" %r"
operatorcommand = "/etc/bacula/scripts/bacula_message.pl -O -c '%c' -d '%d' -e '%e' -i '%i' -j '%j' -l '%l' -n '%n' -r '%r' -s 'Bacula: Intervention needed for %j' -t '%t'"
...
# mailcommand = "mail -s \"Bacula daemon message:\" %r"
mailcommand = "/etc/bacula/scripts/bacula_message.pl -D -c '%c' -d '%d' -e '%e' -i '%i' -j '%j' -l '%l' -n '%n' -r '%r' -s 'Bacula daemon message' -t '%t'"
 
...

To be on the safe side, we pass all substitution variables that bacula provides to the mail command and decide how to use them later inside our wrapper script.

Send mail

This is the simplest part of our script. We assume that if sending mail worked directly from bacula, it will work from our script as well:

...
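#Will send out an email only if one of -M,-D or -O defined and also %r for recipients provided
if ($options{'r'} && ($options{'M'} || $options{'O'} || $options{'D'})) {
	# List form of system() bypasses the shell, so quotes in the subject cannot break the command.
	# The message body arrives on STDIN, which mail inherits from this script.
	system('mail', '-s', $options{'s'} // '', split(' ', $options{'r'}));
}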
...

Send job status

The next step is to send the job status to zabbix. We will use the job name as the key parameter and the job exit code as its value:

my $ZABBIX_SERVER = '127.0.0.1';
my $ZABBIX_HOST = 'Zabbix server';

my $JOB_EXIT_CODE_KEY = 'bacula.job.exit_code[%s]';

my $zabbix_sender = `which zabbix_sender`;
chomp($zabbix_sender);
# Quote key and value: job names and exit codes may contain spaces
my $zabbix_sender_cmd_line = "$zabbix_sender -z $ZABBIX_SERVER -s \"$ZABBIX_HOST\" -k \"%s\" -o \"%s\"";

system(sprintf($zabbix_sender_cmd_line,sprintf($JOB_EXIT_CODE_KEY,$options{'n'}),$options{'e'}) . " >/dev/null");

Don’t forget to change $ZABBIX_SERVER and $ZABBIX_HOST variables to the real values!
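As an alternative to building one long command string, the arguments can be kept as a list, which avoids shell quoting problems altogether. The following is just a sketch: sender_args is a made-up helper, the sample values are invented, and the actual call is left commented out so the sketch runs without zabbix_sender installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: builds the zabbix_sender argument list for one job result.
# -z (server), -s (host), -k (key) and -o (value) are the same flags used above.
sub sender_args {
    my ($server, $host, $job_name, $exit_code) = @_;
    my $key = sprintf('bacula.job.exit_code[%s]', $job_name);
    return ('-z', $server, '-s', $host, '-k', $key, '-o', $exit_code);
}

# Sample values only; the exit code contains a space on purpose
my @args = sender_args('127.0.0.1', 'Zabbix server', 'BackupClient1', 'Fatal Error');
print join(' | ', 'zabbix_sender', @args), "\n";

# To actually send, pass the list to system() so that no shell is involved:
# system('zabbix_sender', @args) == 0 or warn "zabbix_sender failed: $?\n";
```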

Send extended job information

This is left for the next post.

Discussion

What has been done so far? We created a zabbix Template with a Discovery rule to identify the bacula backup jobs configured on the machine where bacula-director is running. When these jobs are discovered, an item and a trigger are created for each of them to monitor the exit status and to raise an alarm if this exit status indicates an error.

These are the artifacts created:

  • zabbix Template
  • zabbix configuration add-on file, which adds custom keys to the zabbix agent configuration. This needs to be done on our target machine, where the bacula director is running
  • Discovery support script to be installed on the target machine
  • Wrapper script to be used by the bacula director in place of the traditional mail command, which sends information to the zabbix server after each backup job completes. This script is also installed on the target machine
  • A modified bacula director configuration file, which uses our wrapper script instead of the mail command

In my next article, I’ll add more items to monitor other backup job parameters as well as the status of storage Pools.

Caveats

Just one thing tripped me up a few times: at which level to create item and trigger prototypes. Although I used a zabbix Template to create the discovery rule, a few times I mistakenly created item and trigger prototypes at the Template level and not in the Discovery rule.

The issue with the zabbix agent and server timeouts was also a bit tricky and took some time to figure out. It may require some experimenting to find proper values for a particular environment. On my test machine the bconsole command run by the discovery script took about 20 sec to finish, while it took about 3 sec in my production environment.
