Universal Agent: Redesigning our Agents ecosystem from the ground up for better operability - kentik

In this post, we'll cover a feature that was delivered a while back but had the gem of a long-term project hidden in it, and now is the time to talk about it. I'm talking about our (now not so) recently released Kentik NMS product; we'll come back to it in a moment.

Over the years, Kentik has built a number of Agent binaries – each one to carry out a specific function as a Telemetry Agent for its own type of telemetry.

kproxy lets you proxy flows from inside of your network to our public flow ingest cluster
kprobe is used as a DNS tap to provide the magic mapping between DNS and Flow records to unlock OTT observability
kbgp is a local BGP hub, which extends your BGP sessions to our BGP ingest enrichment cluster
ksynth is the Synthetic Monitoring agent you run (privately) or we run (publicly), which performs Synthetic Tests

You'll notice that the one missing here is our SNMP poller: it now exists as what we call a "Capability" of the Universal Agent, which we released when we unveiled Kentik NMS.

In a nutshell, you install Universal Agent, enable the NMS capability on it and you're off to the races. Hang in there, this is what this post is all about!

Operability challenges of Telemetry Agents

Managing large fleets of telemetry agents always comes with operational complexities – let's lay out a few observations we've made over the years in that field. In everything that follows, "operability" is a key term.

Observable agents

As your Telemetry comes to rely on these agents, they quickly become a critical part of your infrastructure and therefore need to be observable. Some examples:

If a Flow Proxy (currently named kproxy) becomes faulty, users need to be alerted. If they aren't, they will assume the trough in their traffic charts is due to a network outage and waste valuable time troubleshooting the wrong problem.
The team in charge of running your telemetry systems is often different from the one building and running the network. While its members may not be daily Kentik users, they need to monitor the agents in a scalable way, with as little integration work as possible to operationalize them.
Agents running on a host (virtual or physical) can go wrong for multiple reasons: maybe the host itself is not doing well (i.e. it's not the Agent's fault), or maybe the host is doing just fine but the function the Agent performs is not. In other words, users want to be able to determine by themselves why agents are not doing their job.
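
The last point, telling host trouble apart from trouble in the function the agent performs, can be sketched as a small triage routine. This is only an illustration: the field names and thresholds below are hypothetical and are not Kentik's actual metrics or logic.

```python
from dataclasses import dataclass


@dataclass
class HostVitals:
    cpu_percent: float      # hypothetical host-level metric
    memory_percent: float   # hypothetical host-level metric


@dataclass
class FunctionVitals:
    heartbeat_age_s: float  # seconds since the agent function last reported
    error_rate: float       # fraction of failed operations, 0.0 to 1.0


def triage(host: HostVitals, function: FunctionVitals) -> str:
    """Return a coarse verdict on where a problem most likely sits."""
    host_unhealthy = host.cpu_percent > 90 or host.memory_percent > 90
    function_unhealthy = (
        function.heartbeat_age_s > 120 or function.error_rate > 0.05
    )
    if host_unhealthy:
        # A struggling host can explain a struggling agent, so check it first.
        return "host"
    if function_unhealthy:
        return "function"
    return "healthy"
```

Checking the host first is a design choice made for this sketch: it avoids blaming the agent's function for symptoms that an overloaded machine would produce anyway.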

Frictionless upgrade path

When running large infrastructure, the last thing engineers want to do is have to upgrade a large fleet of Agents: "if it ain't broke, don't fix it" is usually the governing principle. Operational realities require the upgrade path to be as frictionless as possible:

Bug fixes can require upgrading a large fleet of agents, so the upgrade itself needs to be as frictionless as possible to maintain a constant state of operation.
New features that require Telemetry Agent upgrades tend to be adopted late because of the aforementioned conservative approach.
Security updates to large Telemetry Agent fleets can get delayed because of the complexity of deploying upgrades. These updates are always critical and should be seamless enough not to incur delays.

Agent proliferation vs. one-size-fits-all

With the rise of observability, agent proliferation in your infrastructure has been skyrocketing. Each new agent comes with its own upgrade track, bugs, security context... In other words, the operational complexity of one's telemetry setup grows quickly with the number of agents required to operate one's infrastructure. All telemetry agents share common goals, requirements, and functions: they need to be deployed, monitored, and updated.

The first way that comes to mind to deliver these common functions is to collapse all agents into a single Swiss Army knife agent. The operational ease of this solution is appealing, but it comes with a few significant drawbacks:

Every function carried by the agent eventually requires updates, so serving many functions from a single agent means that agent has to be updated more often. Depending on the number of functions collapsed together, this can significantly increase the update pace, and therefore the operational tax.
Each function performed by the agent comes with its own bugs and security weaknesses, so collapsing multiple agents into one often increases the bug and security risk per agent.
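
A back-of-the-envelope illustration of the first drawback, using made-up release counts (the numbers and function names are purely hypothetical): if each function ships its fixes independently and releases are not batched, a monolithic agent has to be redeployed for every one of them.

```python
# Hypothetical number of releases per year for each function (illustrative only).
releases_per_year = {"flow_proxy": 4, "dns_tap": 3, "bgp_hub": 2, "synthetics": 6}

# Separate agents: each binary is only touched when its own function changes,
# so the busiest single agent sets the worst-case update pace per binary.
busiest_single_agent = max(releases_per_year.values())

# One monolithic agent: any function's release means redeploying the whole
# binary (assuming releases are not batched together).
monolithic_agent = sum(releases_per_year.values())

print(busiest_single_agent, monolithic_agent)  # prints: 6 15
```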

For the reasons above, the ideal setup is one where we reap the benefits of a single agent while still keeping multiple ones. Let's discuss our new approach to agentry in the next section!

Introducing Universal Agent

What is Universal Agent?

"One ~~Ring~~ Agent to rule them all, One ~~Ring~~ Agent to find them, One ~~Ring~~ Agent to bring them all, and in the ~~darkness~~ Kentik Platform bind them"

With the aforementioned challenges in mind, our engineering team produced a modular design centered around a new deployable binary, named Universal Agent.

Universal Agent acts as a host governor module (literary pun intended), tasked with offering a common foundation to the "capabilities" running under it. It acts as the sole controller towards our SaaS platform, handles the download and enablement of other agents (now named "capabilities"), and handles under-the-hood update cadences for both itself and its governed capabilities. It also collects and ships to the Kentik SaaS platform not only host-level metrics, but also specific metrics for each capability.
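
To make the governor idea more concrete, here is a minimal sketch of one process that owns a set of pluggable capabilities and reports host-level and per-capability metrics together. The class, method names, and metric keys are all hypothetical; this is a toy model of the design described above, not Kentik's implementation, and it leaves out downloads, updates, and the connection to the SaaS platform.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]


class Governor:
    """Toy stand-in for a governing agent (hypothetical, for illustration)."""

    def __init__(self, host_metrics: Callable[[], Metrics]) -> None:
        self._host_metrics = host_metrics
        self._capabilities: Dict[str, Callable[[], Metrics]] = {}

    def enable(self, name: str, capability_metrics: Callable[[], Metrics]) -> None:
        # Register a capability; a real governor would also fetch and start it.
        self._capabilities[name] = capability_metrics

    def disable(self, name: str) -> None:
        # Removing an unknown capability is treated as a no-op.
        self._capabilities.pop(name, None)

    def collect(self) -> Dict[str, Metrics]:
        """Gather host-level metrics plus one metric set per enabled capability."""
        report = {"host": self._host_metrics()}
        for name, capability_metrics in self._capabilities.items():
            report[f"capability/{name}"] = capability_metrics()
        return report


# Usage: one governor, one enabled capability, a single combined report.
governor = Governor(lambda: {"cpu_percent": 12.0})
governor.enable("nms", lambda: {"metrics_per_second": 250.0})
report = governor.collect()
print(report)
```

The point of the shape is that capabilities stay separate units that can be enabled or disabled on their own, while deployment and reporting go through a single entry point.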

What benefits does Universal Agent offer?

Operational peace of mind
Universal Agent is now the central piece of Kentik's Telemetry Agent strategy. Its setup process is trivial and its enrollment is entirely driven by the Kentik Portal UI our users all know and love.

Furthermore, Universal Agent updates are transparently and gracefully managed "under the hood", and the same goes for any Capability run by the agent: little, if any, operator intervention is now needed to keep an Agent and its Capabilities up to date.

Central management & monitoring
The Settings > Universal Agents screen is now the central place where you manage your complete Kentik telemetry agent ecosystem. This interface lets you identify any agent or capability deployed on your network and its current running state.

Agent Observability
Each deployed Universal Agent reports host-level metrics, accessible directly from the Settings > Universal Agents screen.

As a bonus, all agent host-level metrics are also available in Metrics Explorer under the /kentik/agent measurement tree without any extra work needed. Universal Agents are now observable, with their vitals available for dashboarding like any other NMS device.

One single binary to access all of our telemetry collection capabilities
Once it is deployed, Universal Agent gives instant access to all the telemetry functions we've ported over as "capabilities". These get installed and enabled with a simple click. While NMS was the initial capability we shipped Universal Agent with, our entire ecosystem of telemetry agents will follow over time and be integrated as Capabilities.

Observability for each Agent Capability
Each enabled Capability comes with its own set of metrics, designed to describe its function. These metrics also get shipped for free to our NMS subsystem and displayed at the Agent > Capability level in the Universal Agent Management UI. Again, as these metrics are stored in our Metrics subsystem, they can be accessed via Metrics Explorer and also alerted upon.

In the example above, a Universal Agent's NMS Capability will show how many Metrics Per Second it is currently handling, as well as the Network Devices it is polling.
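
As a rough idea of what alerting on a capability metric could look like, here is a small sketch that flags a sustained drop in a Metrics Per Second series. The function, its parameters, and the rule itself are assumptions made for illustration; they do not describe how Kentik's alerting evaluates metrics.

```python
from typing import Sequence


def should_alert(samples: Sequence[float], floor: float, consecutive: int) -> bool:
    """True when the last `consecutive` samples are all below `floor`."""
    if consecutive <= 0 or len(samples) < consecutive:
        # Not enough data to decide; stay quiet rather than guess.
        return False
    return all(value < floor for value in samples[-consecutive:])


# Hypothetical Metrics Per Second readings, oldest first.
steady_drop = [300.0, 280.0, 0.0, 0.0, 0.0]
brief_dip = [300.0, 0.0, 250.0, 0.0, 0.0]

print(should_alert(steady_drop, floor=1.0, consecutive=3))
print(should_alert(brief_dip, floor=1.0, consecutive=3))
```

Requiring several consecutive low samples is one simple way to avoid alerting on a single missed polling cycle.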

What's next?

With this foundation built, we have already started producing new Capabilities leveraging this new model:

Our newly released Syslog Server is one of these new capabilities
As part of the same release, we also released a Trap Receiver capability

We've already started porting over our existing Agents to this new "Capability" model – watch this space for more announcements in that field real soon!

Lastly, we will be leveraging our brand new NMS Alerting platform in the very near future to provide automated alerts on Agent and Capability health.