Riemann is an event stream processor.

Every time something important happens in your system, submit an event to Riemann. Just handled an HTTP request? Send an event with the time it took. Caught an exception? Send an event with the stacktrace. Small daemons can watch your servers and send events about disk capacity, CPU use, and memory consumption. Every second. It's like top for hundreds of machines at once.

Riemann filters, combines, and acts on flows of events to understand your systems.

Riemann architecture diagram

Events

Events are just structs. They're sent over Protocol Buffers, and in Riemann are treated as immutable maps. Each event has these (optional) fields:

host A hostname, e.g. "api1", "foo.com"
service e.g. "API port 8000 reqs/sec"
state Any string less than 255 bytes, e.g. "ok", "warning", "critical"
time The time of the event, in Unix epoch seconds
description Freeform text
tags Freeform list of strings, e.g. ["rate", "fooproduct", "transient"]
metric A number associated with this event, e.g. the number of reqs/sec.
ttl A floating-point time, in seconds, that this event is considered valid for. Expired states may be removed from the index.

In addition to these standard fields, events can contain any number of *custom* fields with string values. See Howto: Custom event attributes for more details.
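To make this concrete, here's what an event might look like once Riemann has it as an immutable map. Every value here is made up for illustration, including the custom :datacenter field:

```clojure
; A hypothetical event as a Clojure map. All values are illustrative.
{:host        "api1"
 :service     "API port 8000 reqs/sec"
 :state       "ok"
 :time        1234567890
 :description "Requests per second on the API tier"
 :tags        ["rate" "fooproduct"]
 :metric      1693.2
 :ttl         30
 :datacenter  "us-east-1"}   ; a custom field with a string value
```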

The index

The index is a table of the current state of all services tracked by Riemann. Each event is uniquely indexed by its host and service. The index just keeps track of the most recent event for a given (host, service) pair.

Streams run quickly and retain as little state as possible. The index is Riemann's state table; its picture of the world. The dashboard, network clients, and streams can all query the index to see what the system looks like now. The Dashboard is just an HTML view of the index.

Events entered into the index have a :ttl field which indicates how long that event is valid for. Events that sit in the index for longer than their TTL are removed from the index and reinserted into the event streams with state "expired". Instead of polling for failure, just let events expire. Services which fail to check in regularly can be handled by the same state-transition streams you use for error handling.
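As a sketch of that idea: Riemann ships an (expired ...) stream which matches events whose state is "expired", so a service that stops checking in can trigger the same kind of notification as one that reports an error. The email address is, of course, hypothetical:

```clojure
(streams
  ; Fires for any event whose TTL lapsed in the index and was
  ; reinserted with state "expired" -- i.e. a service that
  ; stopped checking in.
  (expired
    (email "delacroix@vonbraun.com")))
```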

Streams

Streams live in the (streams ...) section of your config file. You can think of (streams) as the top-level stream, the source from which all events flow. In this documentation, we'll often leave it implied.

Here's a simple stream which sends critical events from any service beginning with "riak" to delacroix@vonbraun.com.

(streams
  (where (and (service #"^riak")
              (state "critical"))
         (email "delacroix@vonbraun.com")))

The first argument to where is the predicate expression, which where uses to test each event. Events which match the predicate are passed to each of where's children; if they don't match, nothing happens. In this case, there's one child: the email stream.

You can think of streams like rivers in the real world. Events flow through tributaries and streams, pool in lakes and dams, and are filtered by grates and boulders. Riemann streams aren't really queries; they're more like pipelines which events flow through.

Stream composition

Let's consider a more complex problem. We want to send an email every time a service on a given host starts to report a new state; e.g., goes from reporting "ok", "ok", "ok", ... to "error", "error", "error". But in the event of a cascading failure we don't want to receive hundreds of emails, so let's limit it to five per hour; if more transitions occur, we'll roll them all up into the last email for that hour.

Since streams are just functions which take events as arguments, they're composable. We'll chain together four streams to solve the problem.

We need to detect state transitions for each service independently, so we'll start by splitting the event stream with (by).

(by [:host :service])

The (by) stream is like a river delta, or a fuel manifold. It splits the event stream into n distinct streams, each one independent. We get one fork for each unique host and service, so all the events from host 1 and service 1 flow together.

Since we haven't given (by) any child streams yet, it'll just drop every event on the floor. Most streams can take any number of children, so it's common to see something like (by [:host :service] (some-stream ...) (some-other-stream ...) ...).

For each one of those independent streams, we'll detect state transitions with (changed).

(by [:host :service]
    (changed :state))

Whenever an event arrives with a different state than the previous event, (changed :state) passes on the new event to its children. Otherwise, nothing happens.

With those state transitions in hand, we'll cap the email rate with (rollup 5 3600): at most five events per 3600 seconds. In a given hour, rollup will allow four events to pass through immediately. Then it aggregates all successive events in a buffer, which is passed on at the end of the hour.

(by [:host :service]
    (changed :state
             (rollup 5 3600
                     (email "delacroix@vonbraun.com"))))

Rollup's child is an (email) stream, which emails events it receives to Delacroix. I think there's a problem in the sim units today!
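One refinement worth knowing about: (changed) fires on the very first event it sees for a fork, since there's no prior state to compare against. It accepts an :init option to seed that comparison, so a service's first "ok" report doesn't generate mail. A sketch, keeping the hypothetical address from above:

```clojure
(by [:host :service]
    ; Assume the prior state was "ok": only genuine transitions
    ; away from ok (and back again) reach rollup and email.
    (changed :state {:init "ok"}
             (rollup 5 3600
                     (email "delacroix@vonbraun.com"))))
```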

To learn more, see the howto articles on streams, and check out the Streams API for details.

Queries

Clients can query the index for particular events using a simple query language. Dashboard views are each powered by a single query. Queries can be applied to the index of past events, but can also tap into the stream of events flowing to the index in real time.

# Simple equality
state = "ok"

# Wildcards
(service =~ "disk%") or 
(state != "critical" and host =~ "%.trioptimum.com")

# Standard operator precedence applies
metric_f > 2.0 and not host = nil

# Anything with a tag "product"
tagged "product"

# All states
true

Query messages return a list of matching events. See the query tests for tons of examples, or read the full grammar.
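For a rough sense of what issuing a query looks like from code, here's a sketch using the Clojure client (riemann-clojure-client). The namespace and function names are taken from that library, but treat the exact API as an assumption and check its README:

```clojure
(require '[riemann.client :as r])

; Connect over TCP (the transport that supports queries) and ask
; for every event in the index with state "ok". Host and port are
; illustrative defaults.
(def c (r/tcp-client {:host "127.0.0.1" :port 5555}))
(r/query c "state = \"ok\"")
```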

Servers

"Riemann" usually refers to the JVM process that handles events, i.e. bin/riemann. Internally, that process has several distinct servers which are specified in the configuration file.

The TCP and UDP servers listen on port 5555 for TCP connections and UDP datagrams. They accept a stream of protocol buffer messages containing events (or queries, control messages, etc). Those events are then applied to a tree of streams. The TCP server also supports querying the index for the current state of the system. The UDP protocol is much faster but lossy. The TCP protocol is slower, but provides reliable delivery and acknowledgement of receipt.

The websockets server (ws-server) accepts HTTP websockets connections. This server streams events which match a given query to clients.
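These servers are started from the config file itself. A minimal sketch along the lines of the stock riemann.config; the host and ports are illustrative (5555 and 5556 are the conventional defaults):

```clojure
(let [host "127.0.0.1"]
  ; Accept events and queries over TCP, events over UDP, and
  ; stream query results to websocket clients.
  (tcp-server {:host host :port 5555})
  (udp-server {:host host :port 5555})
  (ws-server  {:host host :port 5556}))
```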