Name

knoerre - fast check tool and http server for nagios remote checks

Synopsis

knoerre [ key ]

Description

knoerre is a tool for checking a wide range of server parameters. Its primary purpose is to serve check values to a (remote) requesting instance such as nagios using simplified HTTP.
It was developed as a replacement for the net-snmp package, which is oversized, sometimes very buggy, sometimes difficult to configure and often slow.
knoerre is meant to be run by tcpserver from DJB’s software suite ucspi-tcp. Only the brave among you will dare to use (x)inetd instead.
The usage of DJB’s daemontools and ucspi-tcp (for tcpserver) is strongly recommended.

knoerre can be easily set up with knoerre-conf(1).
Access restrictions by IP address can be set up with knoerre-update-tcprules(1).

A key is a specific request to knoerre, e.g. "load1". All keys can be used locally or via HTTP request, e.g. knoerre load1, knoerre diskusage/home or GET /load1 HTTP/1.1. A key given on the command line takes precedence over reading an HTTP request from stdin (supplied by tcpserver). An HTTP request is internally limited to 512 bytes.

Because keys can be given on the command line, knoerre can also be used with other kinds of nagios remote checks: called via ssh, NRPE or the slow snmpd. Nevertheless the usage of tcpserver is strongly recommended. With tcpserver, a request like load1 is answered approx. 25% faster than a local "/bin/cat /proc/loadavg" runs. A local "knoerre load1" is 4 times faster than "/bin/cat".
Here’s a short speed comparison for 5000 remote "load1" requests:

net-snmp default, default nagios check_snmp: 8 mins 50 secs
NRPE: 43 secs
tcpserver/knoerre: 3 secs

Process control

With the recommended usage of daemontools and ucspi-tcp you don’t have to care about starting, stopping or restarting knoerre. Because knoerre is started on demand by tcpserver(1), there is no continuously running knoerre process as with other daemons. The controlling tcpserver process can be managed with svc(8).

Built-In checks

Some basic checks are built into knoerre. These built-in checks don’t need to call an external program.

cachedvalue
Return cached value from a file
Format: cachedvalue/XXXXX/absolute/path/to/file
where XXXXX is the max age in minutes the file may have.
Return the contents of the given file. The file should contain one line beginning with "OK ", "WARNING " or "CRITICAL " causing knoerre to exit with the matching exit code.
These conditions will also cause a critical exit code:
lstat error, file’s modtime is older than XXXXX minutes, not a regular file, empty file, file too large, open error, read error
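The file read by cachedvalue is typically written by some other job. A minimal sketch of such a writer, assuming a Python cron job (the function name and the file mode are illustrative and not part of knoerre):

```python
import os
import tempfile


def write_cached_status(path, state, message):
    """Atomically write a one-line status file for the cachedvalue check."""
    if state not in ("OK", "WARNING", "CRITICAL"):
        raise ValueError("state must be OK, WARNING or CRITICAL")
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory and rename it, so a
    # concurrent reader never sees an empty or half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(f"{state} {message}\n")
        # mkstemp creates the file with mode 0600. Widening it is an
        # assumption: knoerre may run under a different account.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The rename keeps the file from ever being empty, which would otherwise cause a critical exit code.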

cat
Cat content of a file.
Format: cat/absolute/path/to/file
"Cat" the content of the file given after "cat/" (up to PATHNAME_SIZE length of data). The first line contains the filename and the date of the file (if no error occurred). The last line of the file should contain an integer value for nagios to check. You can also use this check to test whether an NFS-mounted filesystem is actually working by "cat"ting a file which contains just "1" in a line. To prevent blocking knoerre processes, however, you should prefer the nfs check. If an error or timeout happens then 9999 or a bigger value is returned.

catwts
Cat content of a file with timestamp check.
Format: catwts/maxage/maxagecode/absolute/path/to/file
Like cat it writes out the data of a file, but it additionally does a "freshness check": if the file’s modification timestamp is older than "maxage" minutes then "CRITICAL" is printed, with maxagecode in the last line.

cmdline
Return the number of instances of a process by cmdline match.
Format: cmdline/XXXX
where XXXX is a string which should be part of the cmdline.
Like process but uses /proc/.../cmdline to also detect script processes, e.g. python loadlogger.py, whose process name is only "python".

cmp
Compare a string to the content of a file.
Format: cmp/string/absolute/path/to/file
Compare a string to the content of a file. If the string is equal to the content (LF is ignored) then 0 is returned otherwise 1. If an error or timeout happens then 9999 or a bigger value is returned.

cpu
Show CPU usage in percent values.
Format: cpuXY/SECONDS
where X is one of (u|n|s|i|w|I), Y is one of (t|c) and the optional SECONDS is the measuring interval.
The CPU usage times can be shown as ’t’otal since kernel start or as ’c’urrent values over a measuring interval (default 10 seconds). The CPU times are ’u’ser, ’n’ice, ’s’ystem, ’i’dle or I/O ’w’ait. The ’I’ values are "inverted" against 100 percent, e.g. 99 is printed for an idle value of 1%. If you need an immediate response with an up-to-date measured value you should use knoerred, which has a special measuring thread.
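How such percentages can be derived from two snapshots of /proc/stat is sketched here. This is only an illustration: the function names, the rounding and the restriction to the first eight counters are assumptions, and knoerre’s own calculation may differ.

```python
# Position of each time class in the aggregate "cpu" line of /proc/stat.
FIELD_INDEX = {"u": 0, "n": 1, "s": 2, "i": 3, "w": 4}


def cpu_times(stat_text):
    """Return the first eight counters (user .. steal) of the "cpu" line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            # Later fields (guest times) are already contained in user/nice.
            return [int(value) for value in parts[1:9]]
    raise ValueError("no aggregate cpu line found")


def cpu_percent(before, after, kind):
    """Share of one time class between two snapshots, in percent."""
    inverted = kind == "I"
    index = FIELD_INDEX["i" if inverted else kind]
    deltas = [new - old
              for old, new in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    percent = round(100 * deltas[index] / total) if total else 0
    # "I" is the idle share inverted against 100 percent.
    return 100 - percent if inverted else percent
```

Comparing a snapshot against an all-zero "cpu" line gives the ’t’otal values since kernel start, while two snapshots taken SECONDS apart give the ’c’urrent values.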

ctxtswitch
Return context switches per second.
Format: ctxtswitch
Format: ctxtswitch/SECONDS
Count the context switches per second. If no seconds are given a default of 10 is used.

direntries
Return the number of entries recursively in a directory.
Format: direntries/absolute/path/to/dir
It counts entries in a dir - not inodes. This check is equal to "direntries" in recursive mode. See direntries(1) .

dirlevels
Return the maximum recursion level
Format: dirlevels/absolute/path/to/dir
Step recursively into dir, count recursion level and print the max count. One "@" can be used as wildcard (like asterisk in a shell).

diskinodes
Return used disk inodes percentage.
Format: diskinodes/absolute/path/to/fs
Like diskusage but for inodes and not diskspace.

diskusage
Return used disk space percentage.
Format: diskusage/absolute/path/to/fs
Return the amount of used space on the filesystem given after "diskusage/". NOTE: Because just one simple stat() call is used, you can also use this check to test for the existence of files, e.g. "/var/lib/mysql/mysql.sock". See nagios-check-diskfree(1).

diskusagelocal
Return highest used local disk space percentage.
Format: diskusagelocal
Get the stats of mounted local (e.g. ext3) filesystems, print the two fullest filesystems and, in the last line, the highest fill rate in percent.

dmesg
Check for kernel errors.
Format: dmesg
Search the kernel message ring buffer for "bad" lines, kernel errors and warnings. Like kernellog the search results are weighted, e.g. "Hardware Error" gets 2000 points while a "segmentation fault" gets 1 point. See also kernellog.

fileexists
Return whether file exists.
Format: fileexists/X/absolute/path/to/file
with X one of [fdcbplsaFDCBPLSA] for f(ile), d(ir), c(har dev), b(lock dev), p(ipe), l(ink), s(ocket) or a(ny type).
If the file exists and matches the type then 0 is returned, otherwise 1. An upper case type letter logically inverts the test. If the file is a small regular file then its content is also printed before the last line.

filesizes
Return (max) filesize(s) in KB.
Format: filesizes/absolute/path/to/file
Get the filesize in KB of a single file or the maximum filesize of a group of files. You can use one dot or ’@’ as one wildcard (like asterisk in a shell). See Examples.

filesizesbypattern
Return (max) filesize(s) by given filename pattern.
Format: filesizesbypattern/XXXXX/Y/absolute/path/to/file-or-dir
Format: filesizesbypatternmaxage/XXXXX/Y/ZZ/absolute/path/to/file-or-dir
where XXXXX is a filename pattern, e.g. log, the digit Y is the recursive search depth and the number ZZ is the max age (modtime) in days from 1 to 99.
Get the filesize in KB of a single file or the maximum filesize of a group of files by a given filename pattern and a maximum depth to search in. You can use one dot or ’@’ as one wildcard (like asterisk in a shell).

filesizesbysuffix
Return (max) filesize(s) by given filename suffix.
Format: filesizesbysuffix/XXXXX/Y/absolute/path/to/file-or-dir
where XXXXX is a filename suffix, e.g. .gif, and the digit Y is the recursive search depth.
Get the filesize in KB of a single file or the maximum filesize of a group of files by a given filename suffix and a maximum depth to search in. You can use one dot or ’@’ as one wildcard (like asterisk in a shell). See Examples.

filetimestamp
Return age of file in minutes.
Format: filetimestamp/X/absolute/path/to/file
with X one of [acmoACMO] using access, change or modification time or the oldest of these.
Upper case means that no error but just 0 is returned if the file does not exist. If the file is a small regular file then its content is also printed before the last line.

kernellog
Count "bad lines" in kernellog.
Format: kernellog/XX/absolute/path/to/kernellog
where XX is a two-digit number.
Like tslogentries you specify as first parameter the number of chars from the beginning of a log line which must be equal to the beginning of the last line of the kernellog. If you use e.g. kernellog/07/var/log/kernel on Aug 29, then all lines starting with "Aug 29 " are scanned but not lines starting with "Aug 28".
"Bad entries" are hardcoded in source and are strings like "access beyond end of device", "ector repair", "kernel BUG" and more.
Up to 10 "bad lines" of kernellog are returned in lines above the count return value for nagios. On very big files only the last part (default 1MB) is searched. See also dmesg.
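The prefix selection described above can be sketched as follows. The function name is hypothetical and the sketch covers only the selection of lines, not the hardcoded "bad entries" matching.

```python
def lines_matching_last_prefix(lines, prefix_len):
    """Select the lines that start like the last line of the log.

    The first prefix_len chars of the last line form the pattern, so with
    syslog-style timestamps a length of 7 selects the entries of the
    current day.
    """
    if not lines:
        return []
    prefix = lines[-1][:prefix_len]
    return [line for line in lines if line.startswith(prefix)]
```

With prefix_len 7 and a last line written on Aug 29, the pattern is "Aug 29 " and lines starting with "Aug 28" are skipped.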

load1 load5 load15
Return load average per 1/5/15 minutes.
Just return the load average value requested in the last line and all of /proc/loadavg in the line above.
If knoerre was compiled with gcc-option -DOPENVZDEFAULT then the load value will be divided by the number of cpu cores online as listed in /proc/stat. Additionally the number of cores will be appended to the line with loadavg data.
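The per-core division can be illustrated with a small sketch that works on the text of /proc/loadavg and /proc/stat. The function names are made up for this example, and counting the "cpuN" lines of /proc/stat is an assumption about how the online cores are determined.

```python
def count_cores(stat_text):
    """Count the per-core "cpuN" lines of /proc/stat."""
    return sum(1 for line in stat_text.splitlines()
               if line.startswith("cpu") and line[3:4].isdigit())


def load_value(loadavg_text, key="load1", cores=None):
    """Pick the load1/load5/load15 value, optionally divided by cores."""
    index = {"load1": 0, "load5": 1, "load15": 2}[key]
    value = float(loadavg_text.split()[index])
    if cores:
        value /= cores
    return value
```

A load of 0.80 on a host with 4 cores would then be reported as 0.20.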

loaduser
Return the highest process count of a single account.
Format: loaduser/XXX/YYY
where XXX and YYY are the min/max uid of the processes to be checked.
Return the highest number of running processes of a single account. For every uid in the given range all processes are counted. 32-bit UIDs are also supported. Up to 3 top users and their process counts are printed, and the value in the last line is the maximum process count.

logcheckerr
Count lines with errors in a logfile
Format: logcheckerr/absolute/path/to/logfile
Lines with "error" or "fail" are counted with a weight of 100 and "warning" lines with a weight of 10. Up to 10 "faulty" lines of logfile are returned in lines above the count return value for nagios. On very big files only the last part (default 1MB) is searched.
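The weighting can be sketched like this. It assumes that matching is case-insensitive, that a line counts only once with its highest weight and that the first 10 faulty lines are kept; none of this is stated above, so treat it as an illustration only.

```python
def logcheck_score(lines, max_reported=10):
    """Return (faulty_lines, score) using the weights 100 and 10."""
    score = 0
    faulty = []
    for line in lines:
        lowered = line.lower()
        if "error" in lowered or "fail" in lowered:
            weight = 100
        elif "warning" in lowered:
            weight = 10
        else:
            continue
        score += weight
        if max_reported > len(faulty):
            faulty.append(line)
    return faulty, score
```

Two "error" lines and one "warning" line would give a value of 210.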

longprocp
Return minutes of the longest running user process.
Format: longprocp/XXX/YYY[/A[/B[/C]]]
where XXX and YYY are the min/max uid of the processes to be checked and the optional A, B, ... are names of processes to be excluded from check (up to 15).
Check for long running processes. This check returns the time in minutes of the longest running user process. Its goal is to detect suspicious processes like PHP shells of hacked user accounts. The only difference from longprocs is that min/max uid and process excludes are given in the HTTP request and are not configured in /etc/knoerrerc. This is useful when you want to build a monolithic version of knoerre which does not read knoerrerc.

longprocs
Return minutes of the longest running user process.
Format: longprocs
Check for long running processes. This check returns the time in minutes of the longest running user process. Its goal is to detect suspicious processes like PHP-shells of hacked user accounts. The values for min/max uid and optional exclude process names must be specified in /etc/knoerrerc. See nagios-check-longuserprocesses(1) .

mailqsize
Return postfix mailqueue size.
Format: mailqsize
Format: mailqsize/XXXXX
Return the size of the mailqueue (active and deferred subdirs) on a postfix server. See postfix-mailqsize(1). With the second format you can specify up to 4 subdirs to check and an optional mode character. Just use any combination of the single chars a(ctive), d(eferred), m(aildrop) or i(ncoming). With ’M’ as mode char (maximum count) you don’t get the sum of all emails but the maximum count of one of the specified dirs.

maxdirentries
Return the maximum number of entries recursively in directories.
Format: maxdirentries/X/absolute/path/to/dir
where the digit X is the recursive search depth.
This check is equal to "direntries" in max mode. See direntries(1).

maxfilesizes
Return biggest file size recursively.
Format: maxfilesizes/X/absolute/path/to/dir
Format: maxfilesizessum/X/absolute/path/to/dir
where the digit X is the recursive search depth.
Find the biggest files and print paths and sizes in MB. The return value is the size of the biggest file in MB or the sum of the sizes of the scanned files.

mountopts
Check mountpoint and options
Format: mountopts/XXXXX/absolute/path/to/mountpoint
where XXXXX is an option string which should match the beginning of the mount options.
The actual mount options and mountpoint are read from /proc/mounts. If the given option string matches the actual mount options then 0 is returned, otherwise 1. If an error (e.g. a nonexistent mountpoint) or a timeout happens then 9999 or a bigger value is returned.

mounts
Check mounts of fstab
Format: mounts
mounts checks whether all entries of /etc/fstab are actually mounted and does a statfs() to check whether a (nfs) mount is lost. It returns 1 if a mount is missing and 2 if a mount is listed in /proc/mounts but is actually lost. Like the nfs key it uses fork() to avoid blocking on lost mounts. See also procmounts.

mysqlerr
Count errors in mysqld errlog
Format: mysqlerr/absolute/path/to/mysqld.err
Like kernellog you must specify the absolute path to the MySQL daemon error logfile. Only lines with a timestamp of the current day are examined. Every "Note" counts once, every "Warning" counts ten times and every "ERROR" has a weight of 100.

netlinksdown
Count net interfaces without link
Format: netlinksdown
Check all network interfaces for missing link (cable).

nettraf
Count network traffic
Format: nettraf/XXXX/SECONDS
where XXXX is the device name and the optional SECONDS is the measuring interval.
Traffic data is read from /proc/net/dev. Units are KiB and KiB/s. The line before the last shows the total traffic during the measuring interval and the measuring interval itself. If you need an immediate response with an up-to-date measured value you should use knoerred, which has a special measuring thread.

nfs
Check availability of a nfs-mounted fs.
Format: nfs/absolute/path/to/file
Check the availability of a nfs-mounted fs. It does this by "cat"ting the content of the file given after "nfs/", which should contain "1". If this file does not exist, or NFS is not available and the timeout of 2 seconds expires, then a value bigger than 1 is returned. For NFS this check should be preferred over cat because it forks a child which may block and is then killed. See nagios-check-nfs(1).

proccount
Number of all processes
Format: proccount
Format: proccounttg
Format: proccountovz
"proccount" shows the count of all processes as shown by /proc/loadavg (including "threads"). "proccounttg" counts processes by stepping through /proc and counting every PID dir (no "threads", just processes with pid==tgid). The alternative "proccountovz" is disabled by default. It additionally shows the three "top" instances of OpenVZ in the line before the last line.

process
Count instances of a process.
Format: process/XXXXX
Format: process0/XXXXX
Format: processd/XXXXX
Format: process/OpenVZ-CTID_YYYY/XXXXX
Format: processd/OpenVZ-CTID_YYYY/XXXXX *** CURRENTLY NOT IMPLEMENTED ***
where XXXXX is the name of a process as in /proc/.../stat and YYYY is the CTID to match on an OpenVZ host.
If the key is "processd" then only "real" daemons running as session/process leader with PPID 1 are counted.
With "process" a value of 999999999999999999 is returned if no such process runs. To get just 0 instead you must use "process0".
See nagios-check-process(1).

procmounts
Check mounts of /proc/mounts
Format: procmounts
See also mounts. procmounts checks all mounts of /proc/mounts for being alive. It returns 2 if a mount is lost.

rsbackup
Return the minutes since the last backup.
Format: rsbackup
The last backup time in format YYYYMMDD is taken from "/var/log/backup.timestamp" and the difference to the current time is returned. See nagios-check-backup(1) .

time
Measure execution time of a command
Format: time/XXXXX
where XXXXX must be the executable /usr/bin/XXXXX which will be measured.
The return value is the execution time of the command from fork()/execve() until SIGCHLD. The execution time is measured in microseconds.

timediff
System clock difference between local and remote.
Format: timediff/XXXXX
where XXXXX must be the unix timestamp from the requesting server in seconds since epoch.
The difference between remote and local system time is returned as a (positive) value in seconds.
A sample check in a shell:
lynx -dump http://172.16.1.1:8888/timediff/$(date +%s)

tslogentries
Count last lines in a logfile with the same beginning of line.
Format: tslogentries/XY/absolute/path/to/file
where the digit X is the field count and the optional Y is a separator char.
If you have logfiles with a timestamp at the beginning of every log line then you can count e.g. how many mails were sent or files were transferred today. The first argument must be a digit taken as field count, optionally followed by a char taken as field separator, to create a matching pattern. The pattern is created from the last line using the field count and the separator. If no separator char is specified then ’ ’ (space) is used as default. The second argument is the path. You can use one dot or ’@’ as one wildcard like an asterisk in a shell. See Examples.

sockets
Count sockets / sockets per port
Format: sockets/PROTO/XXXXXX/YYYY
Format: sockets/PROTO/XXXXXX/YYYY/ZZZZZZZZ
Format: sockets/set-WWWW[/XXXXXX[/YYYY]]
where WWWW is a set of protocols, XXXXXX is local, remote, wlocal, wremote, all or wall. YYYY is the port as 4-digit hexstring and ZZZZZZZZ is an optional IP address to be excluded from counting.
PROTO is one of tcp, udp, tcp6, udp6 or set-WWWW. It is also the name of the proc file in /proc/net/ which is read to get socket data. If you specify a set of protocols then "t" stands for tcp, "T" for tcp6, "u" for udp and "U" for udp6. With the set syntax the specification of remote/local and port number is optional and all sockets are counted, e.g. sockets/set-tTuU gives you all sockets. If you want to know e.g. the number of sockets of a locally running apache then you should use the key sockets/tcp/local/0050, and if you want to count outgoing ssh connections excluding connections to 172.16.0.1 then you should use sockets/tcp/remote/0016/010010AC. Sockets in state "06" (TIME_WAIT) are ignored unless you prefix local/remote/all with ’w’.
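The hex strings in such keys can be produced with a small helper. This sketch assumes the byte order used for the 172.16.0.1 example above (the reversed byte order that /proc/net/tcp shows on little-endian hosts); the function names are illustrative and not part of knoerre.

```python
import socket


def port_to_hex(port):
    """Port number as a 4-digit upper-case hex string."""
    if port not in range(0x10000):
        raise ValueError("port out of range")
    return f"{port:04X}"


def ipv4_to_proc_hex(address):
    """IPv4 address with its four bytes reversed, as 8 hex digits."""
    return socket.inet_aton(address)[::-1].hex().upper()


def sockets_key(proto, scope, port, exclude_ip=None):
    """Build a sockets key such as sockets/tcp/local/0050."""
    key = f"sockets/{proto}/{scope}/{port_to_hex(port)}"
    if exclude_ip is not None:
        key += "/" + ipv4_to_proc_hex(exclude_ip)
    return key
```

For example, sockets_key("tcp", "remote", 22, "172.16.0.1") yields sockets/tcp/remote/0016/010010AC.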

swap
Used swap space in MB
Format: swap
Used swap space in MB is calculated from the values of /proc/meminfo. MemTotal and SwapTotal in MB are printed in the line before the last. If you don’t need this data you should use swaps because /proc/swaps holds just swap information and nothing else. The "swap" key is disabled by default.

swaps
Used swap(s) space in MB
Format: swaps
This is an alternative version to swap. The amount of used swap space is calculated by adding the "Used" fields in /proc/swaps. The number of active swaps is printed in the line before the last. This should be preferred over swap unless you need the MemTotal output.
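Adding up the "Used" column can be sketched as follows. The sketch works on the text of /proc/swaps, whose sizes are given in KiB; the function name and the rounding down to whole MB are assumptions.

```python
def used_swap_mb(swaps_text):
    """Return (number_of_active_swaps, used_swap_in_mb)."""
    rows = swaps_text.splitlines()[1:]  # skip the header line
    # Columns: Filename, Type, Size, Used, Priority (sizes in KiB).
    used_kib = [int(row.split()[3]) for row in rows if row.strip()]
    return len(used_kib), sum(used_kib) // 1024
```

Two swaps with 51200 KiB and 1024 KiB in use would be reported as 2 active swaps and 51 MB.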

tcp
Check for open TCP port
Format: tcp/XXXXXXXX/YYYY
where XXXXXXXX is the ip address and YYYY the port to connect to.
Check for open port and return an error code if connect() fails. If connect() succeeds return the time needed in microseconds. This is useful to check e.g. a local (127.0.0.1) running tomcat server on port 8080 with tcp/127.0.0.1/8080.

uptime
Return uptime
Format: uptime
Format: uptimeI
Format: uptimeI/INVERSIONLEVEL
Return uptime or an "inverted" uptime in seconds. The inverted value is (INVERSIONLEVEL - uptime) or 0 if the value would be negative. The inversion level may be specified in the key string, e.g. uptimeI/3600. If no inversion level is specified then a default of 86400 is used.
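The inversion is simple arithmetic; as a sketch (the function name is illustrative):

```python
def inverted_uptime(uptime_seconds, inversion_level=86400):
    """Return (inversion_level - uptime), clamped at 0."""
    return max(0, inversion_level - uptime_seconds)
```

A host that rebooted 10 minutes ago gives 3000 for uptimeI/3600, and the value drops to 0 once the uptime reaches the inversion level, so an ordinary "value too high" threshold in nagios can flag recent reboots.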

wc-l
Count lines of a file.
Format: wc-l/absolute/path/to/file
Just like the shell command "wc -l" it counts the lines of a file. You can use it e.g. to check for apache running out of semaphores with wc-l/proc/sysvipc/sem.

knoerrerc

The (optional) resource config file is "/etc/knoerrerc". It only holds some basic settings like external commands or parameters for "longprocs".
To specify an external program which is called by knoerre, use "CMD programurl command arg1 arg2 .. arg15", e.g.

CMD loadavg cat /proc/loadavg

NOTE1: The number of args is limited to 15.
NOTE2: knoerre doesn’t use insecure and oversized popen(). You don’t get a shell to execute the external program.
NOTE3: You can’t specify a path to your external program. For security reasons knoerre uses an internal path list to search for the program.

Parameters for the longprocs function can be specified like this:

LONGPROC_UID_MIN 630
LONGPROC_UID_MAX 65533
LONGPROC_EXCLUDES vsftpd bash sftp-server
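To illustrate the format, here is a sketch of how such a file could be read. It is not knoerre’s own parser: the function name and the returned structure are made up, and silently ignoring unknown lines is an assumption.

```python
MAX_ARGS = 15  # limit stated in NOTE1


def parse_knoerrerc(text):
    """Parse CMD and LONGPROC_* lines; words are separated by whitespace."""
    config = {"commands": {}, "uid_min": None, "uid_max": None,
              "excludes": []}
    for raw in text.splitlines():
        words = raw.split()  # no quoting: space always separates
        if not words:
            continue
        keyword, rest = words[0], words[1:]
        if keyword == "CMD" and len(rest) >= 2:
            name, command, args = rest[0], rest[1], rest[2:]
            if len(args) > MAX_ARGS:
                raise ValueError("more than 15 args for CMD " + name)
            config["commands"][name] = [command] + args
        elif keyword == "LONGPROC_UID_MIN" and rest:
            config["uid_min"] = int(rest[0])
        elif keyword == "LONGPROC_UID_MAX" and rest:
            config["uid_max"] = int(rest[0])
        elif keyword == "LONGPROC_EXCLUDES":
            config["excludes"] = rest
    return config
```

Keeping each command as a list of words mirrors the fact that no shell is involved when the external program is executed.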

Files

knoerre uses one configuration file and one access restrictions file for its tcpserver daemon:

/etc/knoerrerc
rc-file for non-monolithic knoerre

/etc/knoerre.tcprules.cdb
tcprules for use with tcpserver

See Also

tcpserver(1) , knoerre-conf(1) , knoerre-update-tcprules(1) , svc(8) , check_remote_by_http(1) , check_remote_by_http_time(1)

http://cr.yp.to/ucspi-tcp.html

http://cr.yp.to/daemontools.html

Examples

Here’s a simple example of a client and server communication:
server$ tcpserver -v -RHl localhost 0 8888 knoerre
client$ lynx -dump -mime-header http://server:8888/load1
HTTP/1.0 200 OK
Server: knoerre/0.8.5m
Content-Type: text/plain


1.51
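A client other than lynx can do the same request. The following sketch uses Python’s standard http.client; the host name and port are placeholders, the function names are made up, and treating the last non-empty line as the check value follows the descriptions of the checks above.

```python
import http.client


def split_reply(body):
    """Return (info_lines, value); the last non-empty line is the value."""
    lines = [line for line in body.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty reply")
    return lines[:-1], lines[-1]


def fetch(host, port, key, timeout=5.0):
    """Request one key from a knoerre server and split its reply."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/" + key)
        body = conn.getresponse().read().decode("ascii", "replace")
    finally:
        conn.close()
    return split_reply(body)
```

With the server from the example above, fetch("server", 8888, "load1") would return the info lines and the load value as strings.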


You can also use something like

echo "GET /loadavg HTTP/1.1" | knoerre
or
knoerre loadavg

This example shows the usage of a @ as wildcard:

$ knoerre filesizes/home/www/@/log/access_log
/home/www/user_hans/log/access_log
52222

A very "complex" example with three arguments (suffix, depth and path) and wildcard usage is this:

$ knoerre filesizesbysuffix/.gif/2/home/@/html/typo3temp
/home/www/user_hans/html/typo3temp/pics/30363cbb32.gif
201

Also filesizesbysuffix:

$ knoerre filesizesbysuffix/cache_pages.ibd/1/var/lib/mysql
/var/lib/mysql/user-database-1/cache_pages.ibd=3022848
3022848
$ knoerre filesizesbysuffix/.ibd/1/var/lib/mysql
/var/lib/mysql/user-database-2/index_rel.ibd=3248128
3248128

Which user sent the most emails today?

$ knoerre tslogentries/1/home/www/@/log/mail.log
/home/www/user_hans/log/mail.log
858

Which user runs the most processes?

$ knoerre loaduser/1/60000
hans=32 jack=3 john=1
32

Is /home rw-mounted and nosuid?

$ grep home /proc/mounts
/dev/sda7 /home ext3 rw,nosuid,nodev,data=ordered 0 0
$ knoerre/knoerre mountopts/rw,nosuid/home
/home==rw,nosuid?
/dev/sda7 /home ext3 rw,nosuid,nodev,data=ordered 0 0
0

Security

knoerre does not support dropping privileges itself. When it is used as a remote check tool with tcpserver you can drop privileges with tcpserver. knoerre does not need to run as root, but for different checks and different dirs and files you may need different rights. Don’t use setuid bits; uid/euid checks are not made.

Keys that are too long are truncated or answered with an HTTP redirection. HTTP requests are limited to 512 bytes.

Keys containing ".." are answered with http-redirection.

All stat-calls are lstat()-calls.

No writes are made to filesystem(s), all open()-calls are read-only. Data is only written to stdout/stderr.

No external libs are used. Only standard C-lib is used. No stdio-functions are used. "External" input data is used with bound checks. Arrays are "oversized" to avoid off-by-one errors.

An internal timeout prevents "dead" knoerre processes with blocking read() and waiting for data which will never come.

The number of syscalls and the number of different syscalls are low. The source code and the executable file are small.

Using external commands with "CMD" in /etc/knoerrerc can be a security risk because the external program is forked/exec’ed by knoerre.

knoerre doesn’t use insecure and oversized popen() to execute external commands. You don’t get a shell to execute an external program. You can’t put strings in quotes. Space does always separate. You can’t specify a path to your external program. knoerre uses an internal path list to search for the program.

It’s strongly recommended that you only allow your nagios server to access knoerre over TCP. Use one entry "knoerre: ALL" in /etc/hosts.deny and one entry with the nagios server’s IP address in /etc/hosts.allow. After changing them you must run knoerre-update-tcprules(1) to update tcpserver’s cdb file. Always keep in mind that host-based authentication is not real authentication.

To encrypt network traffic please use e.g. ipsec or vpn.

Caveats

Due to "leaf optimization" in direntries recursive mode it can produce wrong results on non-unix-like filesystems.

The maximum internal absolute pathname length is 16384 chars.

Author

Frank Bergmann, http://www.tuxad.com

