Architecture

Overview

Our infrastructure is designed to be highly scalable, asynchronous, secure, and put data privacy first. We built our infrastructure from the ground up and out of many different components to create a best-in-class experience around these objectives.

Client Architecture

Before we get into our architecture it may be helpful to go over an example client application. For this example, we will cover a small event-driven application.

Typically, event-driven systems consist of:

  • a message bus

  • event producers

  • event consumers

The message bus layer can be anything from Kafka, Rabbit, SQS, and many more. The order, payment, and shipping applications represent event producers that generate information for consumers which are typically backend applications that execute tasks based on the generated messages.

Connecting to Batch

In order to integrate with Batch, clients can run plumber, use the HTTP or gRPC API's or run our kafka sink connector.

The plumber client is then configured to consume a copy of the message queue and deliver them to Batch.

Collections:

Once the plumber client is connected to the application message bus. The event collection process begins.

​Plumber pulls events off the message bus and then sends them into Batch's infrastructure where they are processed by the writer and eventually collected into an S3 bucket.

Search:

All collected events are searchable via our search API and dashboard.

Replay:

The replay process allows clients to use the search API to filter and replay specific events or all events they have stored in a collection.

Client destinations can range from simple HTTP endpoints to Kafka, RabbitMQ, or SQS.

Batch Architecture Overview

Below is the full overview of our architecture. From the client-side plumber deployment all the way down to our replay service.

  • The collectors are highly scalable endpoints. Designed to collect events/messages over gRPC or HTTP.

  • The writers do a lot of heavy lifting:

    • It pulls messages off our cache (Kafka)

    • Discovers the schema of inbound messages/events

    • Generates optimally formatted parquet data

    • Write the messages to search

    • Writes (partitioned) parquet data to S3

    • Updates our internal metrics service on the statistics about the collected messages/events

  • Our S3 storage are highly optimized data lakes storing a copy of all messages and formatted as parquet files.

  • SearchCache is a large cluster of servers designed to return quick results of your most recent data.

  • Replayer pulls the original formatted message off of S3 and replays to your desired endpoint.

Stack

This is some of the tech that Batch uses:

  • ​Docker for container runtime

  • ​Kubernetes for container orchestration

  • ​RabbitMQ for internal service messaging

  • ​Kafka for event collection buffer

  • ​ElasticSearch for short-term storage & search

  • ​PostgreSQL for traditional, relational data storage

  • ​Timescale for metrics

  • ​Etcd for long-term cache

  • AWS S3 & Athena for long-term storage & replay

We ❤️ Golang and are a proud Go shop - Go is perfectly suited for building reliable and performant distributed systems. Our backend is 100% Go.

We utilize event driven systems architecture (with a pinch of event sourcing) to further increase service reliability.

Finally, we use Batch for our internal message system which allows us to rebuild state if things ever go wrong.

The excellent folks over at Community do something very similar (except using Elixir) and they've had awesome success. <TODO: Link to tweets or blog post>

NOTE: If this sounds interesting to you and something you'd like to work on - shoot us a message [email protected].