Event replay is just what it sounds like - the ability to replay an arbitrary number of events that occurred in the past to some destination.
The events are messages that were collected from your messaging system(s) and usually represent a "state" that your system or component was in at that particular time.
There are many reasons why you'd want to replay events:
To gain insight into what happened during an outage
To debug issues as you're developing your software
To test your software with real event data
To feed data for machine learning
and many, many more
Event replay is a core component of event driven system architectures such as event sourcing, in which the event is the source of truth: rather than relying on a "status" in a database column, the event (and the surrounding events) specifies what the "status" is at that specific time.
If you are doing event sourcing, you will 100% need a replay mechanism.
Event driven architecture is the umbrella term for system and software architectures that operate by emitting and reacting to events. Event driven architectures utilize producers to emit events that indicate state changes and consumers that consume those events and react to the state change.
Core concepts of event driven architectures:
Loose coupling
Asynchronous communication
Idempotency
Some examples of event driven architectures:
Billing service emits "payment processed" messages which are then consumed by the account service to "unlock" the associated account (see the sketch after this list)
State management in Redux
Using Python Celery to execute tasks
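As a concrete (and deliberately simplified) illustration of that first example, here is a sketch of the billing to account flow. The in-memory queue stands in for a real message bus, and the service names and event fields are assumptions, not a prescribed format:

```python
# Minimal sketch: a producer emits an event, a consumer reacts to it.
# The in-memory queue is a stand-in for Kafka/RabbitMQ/etc.
import json
import queue

broker = queue.Queue()  # stand-in for a real message bus

def billing_service(account_id: str, amount: float) -> None:
    """Producer: emits a 'payment_processed' event after charging the customer."""
    event = {"type": "payment_processed", "account_id": account_id, "amount": amount}
    broker.put(json.dumps(event))

def account_service() -> None:
    """Consumer: reacts to payment events by unlocking the associated account."""
    while not broker.empty():
        event = json.loads(broker.get())
        if event["type"] == "payment_processed":
            print(f"unlocking account {event['account_id']}")

billing_service("acct-123", 49.99)
account_service()  # -> unlocking account acct-123
```

Note that the billing service knows nothing about the account service (loose coupling) and does not wait for it (asynchronous communication).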
You can read more about event driven architectures here:
Event sourcing is an event-driven architecture pattern used for building distributed systems. The pattern focuses on:
Events being your source of truth (and not a column in a DB)
Storing ALL events somewhere, forever
Having an API for reading current state (the API keeps track of the latest events)
Current state can be reconstructed by replaying events
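As a rough sketch of that last point, reconstructing current state by replaying events can look like the following; the event types and the "balance" projection are made up for illustration:

```python
# Replay a stream of events to derive current state -- no 'balance' column needed.
events = [
    {"type": "account_opened", "account_id": "acct-1", "balance": 0},
    {"type": "funds_deposited", "account_id": "acct-1", "amount": 100},
    {"type": "funds_withdrawn", "account_id": "acct-1", "amount": 30},
]

def project_balance(event_stream):
    """Fold the event stream into the current balance."""
    balance = 0
    for event in event_stream:
        if event["type"] == "account_opened":
            balance = event["balance"]
        elif event["type"] == "funds_deposited":
            balance += event["amount"]
        elif event["type"] == "funds_withdrawn":
            balance -= event["amount"]
    return balance

print(project_balance(events))  # -> 70
```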
Event sourcing is not easy and is likely an unfamiliar way to build software, but you gain significant advantages:
Increased reliability - because of loose coupling, if one system stops consuming events, only that part of the system becomes inaccessible/inoperable
Provide all engineering teams with a complete data source by replaying only the events they are interested in
As opposed to giving the team access to a shared database
Serves as a built-in audit-log
Enables a path for migrating away from a monolithic architecture
You can read more about the event sourcing pattern here:
Event sourcing goes hand-in-hand with another pattern called CQRS. You can read more about CQRS here.
Event replay sounds deceptively simple - read events from somewhere, send them to a destination. What's the big deal?
Building a super basic replayer is pretty straightforward - write your events to a message system, write a service that archives the events somewhere and write another service to "read" and "replay" the events.
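A toy version of that basic replayer might look like the following sketch, which archives events to a local file and replays them by time range. The file-based archive and field names are stand-ins for illustration, not how Batch (or any production system) actually stores events:

```python
# Super basic replayer: archive events as they arrive, replay them later by time range.
import json

def archive(event: dict, path: str = "archive.jsonl") -> None:
    """Archiver service: append every event to durable storage."""
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

def replay(start_ts: int, end_ts: int, path: str = "archive.jsonl"):
    """Replayer service: re-read archived events that fall inside a time range."""
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if start_ts <= event["ts"] <= end_ts:
                yield event  # in a real system: publish back onto the bus / to a destination

archive({"ts": 100, "type": "payment_processed"})
archive({"ts": 200, "type": "payment_failed"})
print(list(replay(150, 250)))  # -> only the second event
```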
The difficulties start arising once you want something beyond just a "replay events from/to date".
Namely:
How do you replay events that contain specific data?
You now need to index your events
You now have to store your events in a queryable format
How do you scale past 1B events?
How do you make replay self-serve?
Who maintains all of these components?
While Kafka has replays, if you want to do anything beyond "replay from this offset until this offset", you will want something more sophisticated.
In our experience, Kafka replays work best when used as a recovery mechanism rather than an "on-demand replay mechanism" to facilitate event driven architectures.
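For reference, an offset-based Kafka replay along those lines might look like this sketch, which uses the third-party kafka-python client; the topic, partition, offsets and broker address are placeholders:

```python
# Replay a fixed offset range from a single Kafka partition.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092", enable_auto_commit=False)
tp = TopicPartition("payments", 0)
consumer.assign([tp])

start_offset, end_offset = 1000, 2000
consumer.seek(tp, start_offset)  # rewind the partition to the starting offset

for msg in consumer:
    if msg.offset > end_offset:
        break
    print(msg.offset, msg.value)  # re-process (or forward) the historical event
```

Anything beyond this (filtering by payload contents, replaying across topics, self-serve replays for other teams) means building and maintaining additional machinery on top.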
A data lake is a collection of structured, semi-structured or unstructured data that you do not know the exact purpose of at the time of collection. Where a data warehouse is a carefully constructed SQL database that serves a specific data-analytics purpose, a data lake is the step before a data warehouse - you collect data that you may want to use in the future.
One huge difference between data lakes and data warehouses is that data lakes may contain terabytes or petabytes of data while a data warehouse will be much smaller in size.
For example, your data lake may contain ALL of your traffic logs while a data warehouse may contain only order fulfillment data.
An important characteristic of a data lake is that it must be highly scalable - as in, you should be able to store a near-infinite amount of data in it and be able to retrieve it within a reasonable amount of time (seconds or minutes, rather than days or weeks).
This usually boils down to the following:
Data should be stored in a search-friendly format such as Parquet
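As a small illustration, here is a sketch of writing events to Parquet with the third-party pyarrow library (the field names are assumptions); the columnar layout is what makes scans over individual fields cheap:

```python
# Write a handful of events to a Parquet file and read the inferred schema back.
import pyarrow as pa
import pyarrow.parquet as pq

events = [
    {"event_id": "1", "type": "payment_processed", "amount": 49.99, "ts": 1613000000},
    {"event_id": "2", "type": "payment_failed", "amount": 12.50, "ts": 1613000060},
]

table = pa.Table.from_pylist(events)     # infer a columnar schema from the events
pq.write_table(table, "events.parquet")  # columnar layout = cheap scans over single fields

print(pq.read_table("events.parquet").schema)
```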
Data lake hydration is the act of populating your data lake with data.
There are many ways to get data into a data lake:
You can set up pipelines to push data from your system components (such as Kafka) into your data lake of choice
You can write scripts that fetch and push data to your data lake
You can set up various integrations from other vendors (such as Segment) to push analytics data into your data lake
Batch's data lake hydration removes the need for the first two options above.
Upon receiving an event, Batch will automatically:
Generate a Parquet and Athena schema for your events
Generate optimally partitioned and sized Parquet files
Batch write the generated files to your S3 bucket
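To give a feel for what partitioned Parquet output looks like (Batch generates and writes these files for you; this is purely illustrative), here is a pyarrow sketch that partitions events by a hypothetical date field and writes to a local path instead of an S3 bucket:

```python
# Write events as a partitioned Parquet dataset; in practice the root path
# would be an S3 bucket rather than a local directory.
import pyarrow as pa
import pyarrow.parquet as pq

events = [
    {"event_id": "1", "type": "payment_processed", "date": "2021-02-01"},
    {"event_id": "2", "type": "payment_failed", "date": "2021-02-02"},
]

table = pa.Table.from_pylist(events)
pq.write_to_dataset(table, root_path="events/", partition_cols=["date"])
# Produces a layout such as:
#   events/date=2021-02-01/<file>.parquet
#   events/date=2021-02-02/<file>.parquet
# which query engines like Athena can prune by partition.
```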
You should give us a try if:
You want to gain visibility into your message queues
You are working on something that requires data replay
You are exploring event driven architectures
You want to turn your message queue data into a usable data lake
Really... there are a lot of reasons. Chances are that if it has something to do with message queues - we are probably a good fit for you.
Take a look at the Use Cases section to get a better idea as to what's possible with the Batch platform.
There are a lot of ways to get data into Batch; you have the following options:
plumber running in relay mode
HTTP API
gRPC API
Kafka connector (BETA)
Getting data out is simple. Either:
Replay the data using our replay feature
Download the contents of the S3 bucket we provide for you ("team" plan and above)
Export the data as a JSON dump
We are actively working on this and should have it delivered by 05.2021
In short - nothing, but we will ask you very nicely to increase your storage limits by either purchasing a storage add-on or upgrading your plan 😇
Each collection is allocated a certain amount of reserved storage on our hot storage nodes. Whether you use the data or not - the resources are allocated and thus used.
Because we do automatic schema inference and have the ability to accept virtually any semi-structured data, we believe that most folks will not need more than 1 to 3 collections.
If you have a very specific need for a lot of collections - reach out to us and let's chat - we'll do our best to accommodate you.
No - because we store your events forever, there is no need to think about retention.
As of 02.2021, we support the following tech:
Ingress / Collection:
Kafka
RabbitMQ
Amazon SQS
Google Cloud Pub/Sub
Azure Service Bus
MQTT
ActiveMQ
Egress / Replay:
Kafka
RabbitMQ
HTTP
If you have a need to replay data to a destination that we do not support or are interfacing with a message bus that is not listed above - let us know and we'll make it happen.
As of 02.2021, Batch's collectors are hosted in AWS, us-west-2 (Oregon, US).
If you are pushing a large volume of events (1,000+ per sec), you may need our collectors to be located closer to you geographically. If so, contact us and we'll see how we can accommodate you.
35.165.253.176
52.25.81.114
54.190.100.78
No. Exactly-once delivery has been a hot topic for many years and, while in most cases Batch will indeed deliver only one copy of a message, there is no way to guarantee that.
Instead, build your systems to be idempotent. If your systems are able to handle dupe events, you will not have to worry about the delivery semantics of upstream (or downstream) services and will ultimately have a much more robust and reliable distributed system.
Batch can guarantee at-least-once delivery
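A minimal sketch of an idempotent consumer looks like the following; the event shape and the in-memory set (which would be a durable store in production) are assumptions for illustration:

```python
# Idempotent consumer: duplicate deliveries of the same event are detected and skipped.
processed_ids = set()  # in production this would live in a durable store (DB, Redis, ...)

def handle(event: dict) -> None:
    if event["event_id"] in processed_ids:
        return  # duplicate delivery -- safe to ignore
    processed_ids.add(event["event_id"])
    print(f"processing {event['event_id']}")

handle({"event_id": "evt-1", "type": "payment_processed"})
handle({"event_id": "evt-1", "type": "payment_processed"})  # second delivery is a no-op
```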
The short answer is: no. In most cases, 99.99999% of messages should arrive in the order they were collected; however, there is a chance that something arrives out of order (because of networking, outages, etc.).
The longer answer:
While it is technically possible to create a system that has order guarantees, a well-designed distributed system would ultimately be better off by being able to deal with out-of-order delivery.
Choosing to guarantee order has a lot of ramifications in systems design - every component of your distributed system has to be designed, implemented and used in a very specific manner. It also means that your system will be highly susceptible to problems if an out-of-order event arrives as a result of a misbehaving component.
Similar to the advice given for exactly-once delivery, you should build your systems to be capable of dealing with out-of-order events. In most cases, this means that you should timestamp your events and your services should be aware of whether the event they have received is the newest event.
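For example, a sketch of that timestamp-based check could look like the following; the field names and in-memory bookkeeping are assumptions:

```python
# Keep the newest event timestamp seen per key and ignore anything older that arrives late.
latest_seen = {}  # account_id -> newest event timestamp applied

def apply_event(event: dict) -> None:
    key, ts = event["account_id"], event["timestamp"]
    if ts <= latest_seen.get(key, 0):
        return  # stale/out-of-order event -- skip it
    latest_seen[key] = ts
    print(f"applying {event['type']} for {key}")

apply_event({"account_id": "acct-1", "timestamp": 200, "type": "account_unlocked"})
apply_event({"account_id": "acct-1", "timestamp": 100, "type": "account_locked"})  # ignored
```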
Of course, there are exceptions - some systems need order guarantees. If you suspect your system/platform is one of them - talk to us - we would love to hear about it and see how we can help.
Storage is an add-on that expands the amount of space your account has.
You can read more about it in pricing/add-ons.
Replay is an add-on that expands the amount of data you can replay.
You can read more about it in pricing/add-ons.
Collections are an add-on that extends the number of collections you can create.
You can read more about the add-on in pricing/add-ons.
Every event that Batch collects is stored in two places:
Search cache (hot storage)
Used for powering observability via console
S3 (cold storage)
Used for event replays
Search cache contains about 10% of your most recent data while S3 contains 100% of your data.
When purchasing additional storage, you are really purchasing .
Generally, you should not need to know when hot or cold storage is used - it should be transparent and is an internal implementation detail.
Please see the throughput section in the what are/collections doc.
Please see the throughput section in the what are/replays doc.
Everything on the Batch platform is "team-oriented" - even if you do not have a team (yet), your account will be ready to be used by a "team" whenever you are ready.
Specifically, when you create a schema, destination or collection - they are owned by your team, not your account.
This way, when you invite additional team members into your team - they will all have access to everything your account has created.
You can delete the collection via the console.
If you need to delete specific data, please submit a ticket and include the search query that matches the data you wish to delete.
Pro and Team accounts do not have a trial period.
Because of this, you will only be able to ingest data, create collections and create replays once you've attached a billing method to the account.
We would love to hear it - shoot us a message here.