Server telemetry: the context

Are you also using UDP (the User Datagram Protocol) to transfer telemetry metrics from your cluster of cloud servers, or your swarm of IoT devices, to your central metrics server?
So how do you know that all UDP traffic is received and processed correctly? I did not have a clue... and therefore, as a good server telemetry citizen, I have put together an automated solution to monitor the quality of the UDP traffic. Today I will focus on the receiving side (the metrics server). We assume that one UDP Listener (your app) is active in your environment; the solution can easily be modified to handle more than one.

The central metrics server is typically under heavy load because all nodes transmit their metrics data on a regular basis (every second, minute, ...). The traffic volumes add up very quickly. Therefore you always want to choose a lightweight network protocol to transmit the data.

UDP is a simple, lightweight, stateless protocol that imposes minimal strain on your metrics server and your nodes (as opposed to TCP, the Transmission Control Protocol). Important downsides are that neither delivery nor the sequence of the packets is guaranteed. Note that the integrity of each packet is protected by a packet checksum.

The solution is developed for this Linux environment:

  • Ubuntu 14.04.2 LTS
  • Collectd 5.7
  • Grafana 4.0.2
  • InfluxDB 1.1.1

A sneak preview of the result (focus on the last 3 panels):

UDP Stats preview screencast via rolf.huijbrechts.be

The Plan of Attack

The plan consists of 3 parts.

The first part configures the monitoring solution.

The second part explains how to test the solution by gradually increasing the volume of UDP packets on a specific port of the central metrics server, and assess how that data is represented in the graphs of the Grafana Dashboard.

The third part is a screencast so you can see the changes in the Grafana Dashboard over a period of time.

We want a simple way to identify whether critical errors have occurred. We also want to monitor the system and use automated alerts so we can prevent those critical errors from happening in the first place.

The main idea is to extract various network-related metrics from the Linux network stack on each node using the Collectd server telemetry collector daemon, forward and store that data in the InfluxDB time-series database, and then visualize it in a comprehensive Grafana dashboard for monitoring (passive) and alerting (reactive) purposes.

 

Part 1: Setup

Linux Network Interface Errors

The first step is to figure out how to extract the relevant data from the Linux network stack. We prefer to handle this with Bash shell scripts (as opposed to writing a C program, which I could have done instead, but that would be less reusable for other people).

I want to focus on a few specific areas that are relevant for this use case.

First you need to collect diagnostics data from the network interface (typically eth0). This data is exposed through the Linux command ifconfig.


    # Bash
    $ ifconfig

    # Output
    eth0 Link encap:Ethernet HWaddr 04:01:09:2d:c9:01
    inet addr:146.185.155.172 Bcast:146.185.155.255 Mask:255.255.255.0
    inet6 addr: fe80::601:9ff:fe2d:c901/64 Scope:Link
    UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
    RX packets:1094621 errors:0 dropped:0 overruns:0 frame:0
    TX packets:21712 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:80396343 (80.3 MB) TX bytes:4480641 (4.4 MB)

The metrics about the total number of errors when receiving (RX) or transmitting (TX) data are interesting. These are cumulative counters. Such errors typically indicate a physical/electrical problem (e.g. a faulty cable). Note that the output does not contain statistics at the UDP protocol level (we handle that in the next section).
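
If you only want those raw counters for a quick check on the command line, the Linux kernel also exposes them as plain files under /sys. A minimal sketch, assuming your interface is named eth0:


    # Bash
    # Cumulative RX/TX error counters for eth0 (assumption: the interface is named eth0)
    cat /sys/class/net/eth0/statistics/rx_errors
    cat /sys/class/net/eth0/statistics/tx_errors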

Linux UDP errors

Linux provides various statistics about the UDP sockets that are active on your central system. Each socket is typically linked to a UDP Listener that is part of your app.

How does that work, when “nodes are communicating over UDP”? Your app establishes a socket (a UDP Listener socket on a specific IP port). The Linux network stack pushes incoming UDP packets into a dedicated RX Queue for that socket. Your app pops the incoming UDP packets in sequence from that RX Queue and processes the data (which takes time). When the inflow of UDP packets becomes too high and the UDP Listener cannot keep up with that pace, the size of the RX Queue grows; it shrinks again whenever the UDP Listener pops packets faster than they come in. When the RX Queue is full (memory-wise), new incoming UDP packets are dropped until the UDP Listener catches up with the data in the queue.
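
You can watch this RX Queue in real time. A quick sketch using netstat: the Recv-Q and Send-Q columns show the bytes currently sitting in the receive and send queue of each socket (port 12399 is only an example of where your UDP Listener might be bound):


    # Bash
    # Show UDP sockets with their current receive/send queue sizes (bytes).
    # Assumption: your UDP Listener is bound to port 12399; adjust the grep pattern.
    netstat --udp --all --numeric | grep ':12399'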

This workflow exposes a number of configuration options, and some interesting metrics that we can use to measure the quality of the UDP traffic.

The size of the UDP RX Queue for each UDP socket is configurable. The default on Ubuntu 14.x is typically too low (e.g. 128K) for a busy metrics server. Therefore we increase it permanently to 25MB as follows:


    # Bash
    $ sysctl -w net.core.rmem_max=26214400
    $ sysctl -w net.core.rmem_default=26214400
    $ echo "net.core.rmem_max=26214400" >> /etc/sysctl.conf
    $ echo "net.core.rmem_default=26214400" >> /etc/sysctl.conf
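
To double-check that the new limits are active (a read-only sanity check you can run at any time):


    # Bash
    # Read back the effective socket receive-buffer limits
    sysctl net.core.rmem_max net.core.rmem_default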

 

The diagnostics data is retrieved by reading the Linux proc files /proc/net/udp (IPv4) and /proc/net/udp6 (IPv6). These are live metrics, so when a socket is closed the data for that socket is no longer displayed. This is not a problem for our solution because we snapshot the data at regular intervals (seconds) into a time-series database :).


    # Bash
    cat /proc/net/udp; cat /proc/net/udp6;

    # Output
    sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
    37: AC9BB992:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17903 2 ffff88001c257500 0
    37: 0100007F:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17902 2 ffff88001c257880 0
    37: 00000000:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17895 2 ffff88001c256e00 0
    58: 0100007F:C790 0100007F:300E 01 00000000:00000000 00:00000000 00000000 0 0 516227 2 ffff88000afd4000 0
    75: 0100007F:00A1 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 15365 2 ffff88001c256380 0
    214: 00000000:A42C 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 15363 2 ffff88001c257180 0

    sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
    37: 000080FE00000000FF09010601C92DFE:007B 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17905 2
    ffff880000a1ccc0 0
    37: 00000000000000000000000001000000:007B 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17904 2
    ffff880000a1c000 0
    37: 00000000000000000000000000000000:007B 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17896 2
    ffff880000a1c880 0
    184: 00000000000000000000000000000000:300E 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 999 0 16699 2
    ffff880000a1c440 0

The "drops" column (the last field, #13 in the data rows) lists the number of dropped packets for each UDP socket (one socket per row).

The "tx_queue" and "rx_queue" columns list the memory used (in bytes) by the transmission queue and the receive queue of each UDP socket. Note that in the data rows these two values are combined into a single colon-separated field (#5) and printed in hexadecimal.
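
As a quick manual check, before automating anything, you can already sum the drops column over all IPv4 and IPv6 UDP sockets on the command line. A small sketch, assuming GNU tail (the header line of each proc file is skipped):


    # Bash
    # Sum the "drops" field (#13) over all UDP sockets (IPv4 + IPv6)
    tail --quiet --lines=+2 /proc/net/udp /proc/net/udp6 | awk '{sum += $13} END {print sum + 0}'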

Collecting The Data Using CollectD

Now we need to configure Collectd to collect that data on a regular basis.

The network interface diagnostics data will be collected using the Collectd standard plugin “interface”.

 

Collecting the UDP diagnostics data requires more manual work. There is no specific Collectd plugin available for it. However, we can use the generic plugin "curl_json" and inject our custom solution. You specify an HTTP URL for that plugin, which the Collectd daemon accesses at regular intervals; the plugin expects a JSON response.

You need to configure an internal web server (something you probably already have) on the system where you collect the data, and hook up a custom CGI script. This script will execute the Linux command(s) that we described earlier, extract the relevant data, and transform it into a JSON object using the command-line tools awk and jq. The JSON is returned as the HTTP response.

The CollectD configuration file collectd.conf => plugin curl_json:


    # UDP STATS metrics:
    # Grafana queries:
    # select mean(value) from "collectd.[[myhostname]].curl_json-udpstats.gauge-total_nbr_of_drops" WHERE $timeFilter GROUP BY time($interval) fill(null)
    # select mean(value) from "collectd.[[myhostname]].curl_json-udpstats.gauge-total_queues_memory_used" WHERE $timeFilter GROUP BY time($interval) fill(null)
    LoadPlugin curl_json
    <Plugin curl_json>
        <URL "http://localhost:8080/cgi-bin/cgi-collectd-udpstats.sh">
            Instance "udpstats"
            User "myuser"
            Password "mypassword"
            <Key "total_nbr_of_drops">
                Type "gauge"
            </Key>
            <Key "total_queues_memory_used">
                Type "gauge"
            </Key>
        </URL>
    </Plugin>

 

The Bash commands below create the CGI script that Collectd will invoke (via the web server) and install it:


    # Bash
    cat <<- 'CONTENT' > /etc/scripts/cgi-collectd-udpstats.sh
    #!/bin/bash
    echo "Content-Type:application/json"
    echo "Cache-Control: no-cache, no-store, must-revalidate"
    echo "Pragma: no-cache"
    echo "Expires: 0"
    echo ""

    # collect the raw source data in a tmp file
    OUTPUTFILE_UDP=`mktemp /tmp/udpanalysis_proc_net_udp_udp6.XXXXXXXXXXXXXXXXXXXX` || exit 1
    OUTPUTFILE_SUM_DROPS=`mktemp /tmp/udpanalysis_sum_drops.XXXXXXXXXXXXXXXXXXXX` || exit 1
    OUTPUTFILE_QUEUES=`mktemp /tmp/udpanalysis_queues.XXXXXXXXXXXXXXXXXXXX` || exit 1
    OUTPUTFILE_SUM_QUEUES=`mktemp /tmp/udpanalysis_sum_queues.XXXXXXXXXXXXXXXXXXXX` || exit 1
    OUTPUTFILE_ALL_SUMS=`mktemp /tmp/udpanalysis_all_sums.XXXXXXXXXXXXXXXXXXXX` || exit 1
    JSONFILE_FINAL=`mktemp /tmp/udpanalysis_final.XXXXXXXXXXXXXXXXXXXX` || exit 1

    # get proc net udp/udp6 output
    tail --lines=+2 /proc/net/udp >> ${OUTPUTFILE_UDP}
    tail --lines=+2 /proc/net/udp6 >> ${OUTPUTFILE_UDP}
    #####cat ${OUTPUTFILE_UDP}

    #sum(drops $13)
    cat ${OUTPUTFILE_UDP} | awk '{print $13}' | awk '{mysum +=strtonum($1)} END {print mysum}' > ${OUTPUTFILE_SUM_DROPS}
    #####cat ${OUTPUTFILE_SUM_DROPS}

    #sum(tx_queue,rx_queue): both are hex values, combined in data field $5 as "tx_queue:rx_queue"
    cat ${OUTPUTFILE_UDP} | awk '{split($5, q, ":"); print strtonum("0x" q[1]); print strtonum("0x" q[2])}' >> ${OUTPUTFILE_QUEUES}
    #####cat ${OUTPUTFILE_QUEUES}
    cat ${OUTPUTFILE_QUEUES} | awk '{mysum +=strtonum($1)} END {print mysum}' > ${OUTPUTFILE_SUM_QUEUES}
    #####cat ${OUTPUTFILE_SUM_QUEUES}

    #sum-concat
    cat ${OUTPUTFILE_SUM_DROPS} >> ${OUTPUTFILE_ALL_SUMS}
    cat ${OUTPUTFILE_SUM_QUEUES} >> ${OUTPUTFILE_ALL_SUMS}
    #####cat ${OUTPUTFILE_ALL_SUMS}

    # jq: wrap json
    cat ${OUTPUTFILE_ALL_SUMS} | jq --raw-input --null-input '[inputs] | map(tonumber) | {"total_nbr_of_drops": .[0], "total_queues_memory_used": .[1]}' > ${JSONFILE_FINAL}
    #####cat ${JSONFILE_FINAL}

    if [[ -s ${JSONFILE_FINAL} ]]
    then
        cat ${JSONFILE_FINAL}
    else
        # fallback JSON with sentinel values; the heredoc delimiter must start at the beginning of the line
        read -r -d '' JSONCONTENT <<'MYINPUT'
    {
    "mjd_error_occurred": true,
    "total_nbr_of_drops": 1111111111,
    "total_queues_memory_used": 2222222222
    }
    MYINPUT
        echo "$JSONCONTENT"
    fi
    CONTENT

    # Continue
    cat /etc/scripts/cgi-collectd-udpstats.sh

    chmod ugo=+r-w+x /etc/scripts/cgi-collectd-udpstats.sh
    ln --symbolic --force /etc/scripts/cgi-collectd-udpstats.sh /usr/lib/cgi-bin/cgi-collectd-udpstats.sh
    ll /usr/lib/cgi-bin/

The Bash script uses the standard tool awk for processing the text output and the tool jq to transform the result into JSON. Note also that the CGI Script must have specific file permissions.
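
If you do not already run a CGI-capable web server on the metrics host, even Python 2's built-in CGI server is enough for a quick trial. A sketch only, not meant for production; the ~/udpstats-www directory is an arbitrary choice, and the port matches the URL in the plugin configuration above:


    # Bash
    # Serve ./cgi-bin/ on http://localhost:8080/ for testing purposes only
    mkdir --parents ~/udpstats-www/cgi-bin
    ln --symbolic --force /etc/scripts/cgi-collectd-udpstats.sh ~/udpstats-www/cgi-bin/
    cd ~/udpstats-www && python -m CGIHTTPServer 8080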

This is an example of the output of the script at http://localhost:8080/cgi-bin/cgi-collectd-udpstats.sh:


    {
    "total_nbr_of_drops": 1024000,
    "total_queues_memory_used": 12800
    }
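
Before wiring it into Collectd, it is worth fetching the URL by hand, the same way the curl_json plugin will. A quick check (the credentials match the User/Password from the plugin configuration; drop them if your web server does not enforce authentication):


    # Bash
    # Fetch the JSON once and pretty-print it with jq
    curl --silent --user myuser:mypassword http://localhost:8080/cgi-bin/cgi-collectd-udpstats.sh | jq '.'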

Passing The Data From CollectD To InfluxDB

This tunnel is already in place because we are already sending metrics from all our nodes to the central metrics server (note that the central metrics server is itself also a node from which server telemetry data is collected).

The Collectd daemon on each node transforms the metrics data into the Graphite format and sends it to a Graphite-compatible service (in our case the InfluxDB database service). The InfluxDB input plugin for Graphite processes the incoming data and stores it in a time-series database.

The CollectD configuration for outbound Graphite traffic in collectd.conf:


    LoadPlugin write_graphite
    <Plugin write_graphite>
        <Node "ikke">
            # @important Use the DNS name for your node servers, Use localhost if this is the node config for the central metrics server
            Host "localhost"
            #####Host "node1.foo.com"
            Port "12399"
            Protocol "udp"
            LogSendErrors true
            Prefix "collectd."
            #Postfix "-suffix"
            ###RHMOD @important StoreRates: if true (default) then convert COUNTER values to RATES. ***preferred***
            ### if false the COUNTER values are stored as is, i. e. as an increasing integer number.
            StoreRates true
            AlwaysAppendDS false
            EscapeCharacter "_"
        </Node>
    </Plugin>

The InfluxDB configuration for inbound Graphite traffic in influxdb.conf:


    [[graphite]]
    enabled = true
    database = "mydata"
    bind-address = ":12399"
    protocol = "udp"
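
Once Collectd is running with this configuration, you can verify that the measurements actually arrive in InfluxDB. A quick sketch with the influx command-line client; the measurement name pattern assumes the "collectd." prefix and the "udpstats" instance configured above:


    # Bash
    # List the udpstats measurements and show the latest drop total
    influx -database 'mydata' -execute 'SHOW MEASUREMENTS WITH MEASUREMENT =~ /udpstats/'
    influx -database 'mydata' -execute 'SELECT last(value) FROM /curl_json-udpstats.gauge-total_nbr_of_drops/'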

Visualizing The Time-series Data In InfluxDB Using Grafana

We want to visualize the data so we can easily detect critical errors, and also monitor the overall state of the system so we can prevent further critical errors from happening.

We will add three Grafana panels to the existing Network section of the dashboard:

  • Network interface errors.
  • UDP Statistics (packet drops).
  • UDP Statistics (queues total memory used).

These are the InfluxDB Queries which are specified in these panels:

1. Network interface errors:


    #InfluxQL
    select mean(value) from "collectd.[[myhostname]].interface-eth0.if_errors.tx" WHERE $timeFilter GROUP BY time($interval) fill(null)
    select mean(value) from "collectd.[[myhostname]].interface-eth0.if_errors.rx" WHERE $timeFilter GROUP BY time($interval) fill(null)

2. UDP Statistics (packet drops):


    #InfluxQL
    select mean(value) from "collectd.[[myhostname]].curl_json-udpstats.gauge-total_nbr_of_drops" WHERE $timeFilter GROUP BY time($interval) fill(null)

3. UDP Statistics (queues total memory used):


    #InfluxQL
    select mean(value) from "collectd.[[myhostname]].curl_json-udpstats.gauge-total_queues_memory_used" WHERE $timeFilter GROUP BY time($interval) fill(null)

Next we define the visual properties and the thresholds in each panel, and then we are ready. The configured dashboard panels look like this:

Grafana Dashboard: UDP Stats preview screencast via rolf.huijbrechts.be

 

Part 2: Test It

Test Overview

We now have a monitoring solution in place, so this is a good time to validate that work. We could use the production system for that, but it is safer to set up a separate test environment.

Install The Nmap Tools

Install the required tools ncat and nping (from the makers of nmap) on both servers #1 and #2.


    # Bash
    mkdir --parents ~/udpstats
    cd ~/udpstats
    apt-get --yes install alien

    MYRPM=nmap-7.40-1.x86_64.rpm
    MYDEB=nmap_7.40-2_amd64.deb
    rm ${MYRPM}*
    rm ${MYDEB}*
    wget https://nmap.org/dist/${MYRPM}
    alien ${MYRPM}
    ll *.deb
    dpkg --install ${MYDEB}

    MYRPM=nping-0.7.40-1.x86_64.rpm
    MYDEB=nping_0.7.40-2_amd64.deb
    rm ${MYRPM}*
    rm ${MYDEB}*
    wget https://nmap.org/dist/${MYRPM}
    alien ${MYRPM}
    ll *.deb
    dpkg --install ${MYDEB}

    MYRPM=ncat-7.40-1.x86_64.rpm
    MYDEB=ncat_7.40-2_amd64.deb
    rm ${MYRPM}*
    rm ${MYDEB}*
    wget https://nmap.org/dist/${MYRPM}
    alien ${MYRPM}
    ll *.deb
    dpkg --install ${MYDEB}

    nmap --version
    nping --version
    ncat --version

Testing Network Interface Errors

I have omitted the test at the network interface level (eth0). These errors are typically caused by a physical electrical error on the Ethernet controller or the cabling. This data is monitored in the dashboard panel "Network Errors".

Grafana Dashboard: UDP Stats preview screencast via rolf.huijbrechts.be

We proceed with testing the quality of the UDP traffic.

Testing The Quality Of The UDP Traffic

We need a server #2 on the transmitting side with a process that can send UDP packets. It is essential that we can define the rate at which the packets are sent, in order to monitor on the receiving side how fast the packets are being processed. We also want to be able to define the (text) content of the UDP packet so we can follow it flowing through the system. The command-line tool nping from www.nmap.org is a good fit.

We need a server #1 on the receiving side with a process that can act as a UDP Listener. It is essential that we can influence the speed of the response to the client so we can emulate a server process that responds fast enough, or too slowly. This directly influences the size of the RX Queue of the UDP socket, and that is what we want to monitor after all. The command-line tool ncat from www.nmap.org is a good fit.

The servers are contained in a private network. You might have to open port 12399 in the firewall on server #1 for packets coming from your server #2. Replace "1.2.3.4" with the IP address of your server #2.


    # Bash Server#1
    ufw allow from 1.2.3.4 to any port 12399

Setup Server#1 As The Receiver

We start the UDP Listener and some other watchers in separate terminal windows so we can see what is going on.

Note that the UDP Listener responds with a delay of several seconds. This will simulate our slow process on the receiving side.

1. Terminal Window #1 @ SERVER#1:


    # Bash
    ncat --udp --keep-open --crlf --listen 12399 --sh-exec 'MYDELAY=13; sleep ${MYDELAY}; echo "Hi, this is the UDP server on port 12399. I delay each response with ${MYDELAY} seconds..."'

2. Terminal Window #2 @ SERVER#1:


    # Bash
    watch --interval 1 --differences "ifconfig eth0"

    # Output
    eth0 Link encap:Ethernet HWaddr 04:01:09:2d:c9:01
    inet addr:146.185.155.172 Bcast:146.185.155.255 Mask:255.255.255.0
    inet6 addr: fe80::601:9ff:fe2d:c901/64 Scope:Link
    UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
    RX packets:1061 errors:0 dropped:0 overruns:0 frame:0
    TX packets:926 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:462555 (462.5 KB) TX bytes:123351 (123.3 KB)

3. Terminal Window #3 @ SERVER#1:


    # Bash
    watch --interval 1 --differences "netstat --statistics --udp"

    # Output
    IcmpMsg:
    InType0: 59
    InType3: 11
    OutType3: 23
    OutType8: 59
    Udp:
    750 packets received
    24 packets to unknown port received.
    0 packet receive errors
    615 packets sent
    UdpLite:
    IpExt:
    InOctets: 1153067
    OutOctets: 805239
    InNoECTPkts: 4257

4. Terminal Window #4 @ SERVER#1:


    # Bash
    watch --interval 1 --differences "cat /proc/net/udp; cat /proc/net/udp6;"

    # Output
    Every 1.0s: cat /proc/net/udp; cat /proc/net/udp6; Thu Jan 5 23:20:03 2017

    sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
    25: 00000000:306F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 26511 2 ffff88000f18a700 0
    37: AC9BB992:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17930 2 ffff88000f18a000 0
    37: 0100007F:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17929 2 ffff88000f18a380 0
    37: 00000000:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17922 2 ffff88000f18aa80 0
    57: 00000000:958F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 15092 2 ffff88001c21c700 0
    75: 0100007F:00A1 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 15094 2 ffff88001c21d880 0
    121: 0100007F:BBCF 0100007F:300E 01 00000000:00000000 00:00000000 00000000 0 0 16569 2 ffff88000f18ae00 0
    sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
    25: 00000000000000000000000000000000:306F 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 26510 2
    ffff880013ef5100 0
    37: 000080FE00000000FF09010601C92DFE:007B 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17932 2
    ffff880013ef4cc0 0
    37: 00000000000000000000000001000000:007B 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17931 2
    ffff880013ef4000 0
    37: 00000000000000000000000000000000:007B 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 17923 2
    ffff880013ef4880 0
    184: 00000000000000000000000000000000:300E 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 999 0 17034 2
    ffff880013ef4440 0

Setup Server#2 As The Transmitter

We have 3 scenarios that we want to test:

  • Scenario#1: Send low volume UDP traffic.
  • Scenario#2: Send medium volume UDP traffic.
  • Scenario#3: Send high volume UDP traffic.

Each command is set up to run long enough for us to gather sufficient metrics data.

The actual parameters (process timeout, total number of packets, packet rate per second) were determined after some testing and are specific to this environment. This means that the parameter values will be different for your environment.
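
While the scenarios run, you can keep an eye on the same aggregated numbers that Collectd reports by polling the CGI endpoint from Part 1 on server #1 (this assumes the Part 1 setup is also installed in the test environment):


    # Bash Server#1
    # Poll the aggregated drop/queue numbers once per second and highlight changes
    watch --interval 1 --differences "curl --silent --user myuser:mypassword http://localhost:8080/cgi-bin/cgi-collectd-udpstats.sh"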

Scenario#1: Send low volume UDP traffic.

This should just work, and we won't see any significant memory usage in the UDP RX Queue because the UDP Listener processes the incoming packets in less than a second.

The Grafana dashboard will show everything is OK; no packets are dropped and the queue size will remain zero or almost zero.

Nping will send 100 packets at a rate of 1/second for a maximum duration of 60 seconds.

Terminal Window @ Server#2:


    # Bash
    timeout 60s nping -v4 --count 100 --rate 1 --udp --dest-port 12399 server1.foo.com --data-string "ROLF-PACKET-SHOULD-ARRIVE"

Scenario#2: Send medium volume UDP traffic.

This scenario should increase the memory used by the UDP RX Queue, but it should not overflow the queue (25MB).

In the Grafana panels we should see that gradually more memory is used for the UDP RX Queue, but no packets are dropped. The queue's memory size should fluctuate because the client keeps sending UDP packets and the UDP Listener on the server processes them at its own pace.

Nping will send 1 million packets at a rate of 1000/second for a maximum duration of 120 seconds.

Terminal Window @ Server#2:


    # Bash
    timeout 120s nping -v4 --count 1000000 --rate 1000 --udp --dest-port 12399 server1.foo.com --data-string "ROLF-PACKET-SHOULD-ARRIVE"

Scenario#3: Send high volume UDP traffic.

This should overflow the UDP RX Queue (we capped it earlier at 25MB). The panel "UDP Stats (packet drops)" should show that UDP packets are being dropped. The panel "UDP Stats (queues total memory size)" should show that the maximum amount of memory for the queue (25MB) is reached very quickly and does not increase any further.

Nping will send 5 million packets at a rate of 1250 per second for a maximum duration of 120 seconds.

Note that the rate is somewhat higher. Tests have shown that this will indeed overflow the RX Queue in this environment. We have also increased the total number of packets being sent because we want a big enough time window to monitor this situation (when the RX Queue is full, packets are dropped at a very fast pace, so otherwise the nping command would end before we can grab enough metrics data).

Terminal Window @ Server#2:


    # Bash
    timeout 120s nping -v4 --count 5000000 --rate 1250 --udp --dest-port 12399 server1.foo.com --data-string "ROLF-PACKET-SHOULD-ARRIVE"

 

Part 3: The Screencast

I have made a screencast for you so you can see the changes over time in the Grafana Dashboard.

The volume of the incoming traffic increases gradually, and you can clearly see the impact on the size of the UDP RX Queue and whether any UDP packets are being dropped.

You can also see when the last panel generates a warning alert (orange area) and when it generates an error alert (red area). This will generate several notifications on the backend.

UDP Stats screencast via rolf.huijbrechts.be

 

Epilogue

Is the combination Grafana + CollectD + InfluxDB a stable and scalable solution for server telemetry? Yes, I believe it is. We have been using it since 2014 (with a single-node InfluxDB server). If a scalability issue comes up, the InfluxDB server will probably be the bottleneck; we can then scale that one up, or move to InfluxEnterprise, which provides, among other things, a cluster of InfluxDB time-series data stores.

You could set up a similar project to monitor the quality of your TCP traffic; the relevant metrics will probably be different.

 
