You should give us a try if:
- You want to gain visibility into your message queues
- You are using an encoded message/event format (such as Protobuf, Avro, FlatBuffers)
- You are working on something that requires data replay
- You are exploring event driven architectures
- You want to turn your message queue data into a usable data lake
Really... there are a lot of reasons. Chances are that if it has something to do with message queues - we are probably a good fit for you.
There are a lot of ways to get data into Batch. You have the following options:
Getting data out is simple. Either:
- Replay the data using our replay feature
- Download the contents of the S3 bucket we provide for you
  - Or if you bring your own bucket, you already have the data 👍
- Export the data as a JSON dump
  - We are actively working on this and should have it delivered by 05.2021
Event replay is just what it sounds like - the ability to replay an arbitrary number of events that occurred in the past to an arbitrary destination.
The events are messages that were collected from your messaging system(s) and usually represent a "state" that your system or component was in at that particular time.
There are many reasons why you'd want to replay events:
Event replay is a core component of event driven system architectures such as event sourcing, in which the event is the source of truth. This means that rather than relying on a "status" in a database column, the event (and the surrounding events) specifies what the "status" is at that specific time.
If you are doing event sourcing, you will 100% need a replay mechanism.
Event driven architecture is the umbrella term for system and software architectures that operate by emitting and reacting to events. Event driven architectures use producers to emit events that indicate state changes, and consumers that receive those events and react to the state change.
Core concepts of event driven architectures:
- Loose coupling
- Asynchronous communication
Some examples of event driven architectures:
- Billing service emits "payment processed" messages which are then consumed by the account service to "unlock" the associated account
- State management in Redux
- Using Python Celery to execute tasks
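The billing example above can be sketched with a tiny in-memory bus. This is only an illustration, not how Batch or any particular broker works: `EventBus`, `account_service` and the `payment_processed` event name are all hypothetical, and delivery here is a plain synchronous function call, whereas a real message queue would deliver asynchronously.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

Event = Dict[str, str]


class EventBus:
    """In-memory stand-in for a message queue (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        # The producer does not know who (if anyone) is listening: loose coupling.
        for handler in self._handlers[event_type]:
            handler(event)


unlocked_accounts: Set[str] = set()


def account_service(event: Event) -> None:
    # Consumer: reacts to the state change announced by the billing service.
    unlocked_accounts.add(event["account_id"])


bus = EventBus()
bus.subscribe("payment_processed", account_service)

# Producer: the billing service emits an event and moves on.
bus.publish("payment_processed", {"account_id": "acct-42"})
```

The billing side never calls the account service directly, so either one can be replaced or taken offline without changing the other.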
You can read more about event driven architectures here:
Event sourcing is an event-driven architecture pattern used for building distributed systems. The pattern focuses on:
- Events being your source of truth (and not a column in a DB)
- Storing ALL events somewhere, forever
- Having an API for reading current state (the API keeps track of the latest events)
- Current state can be reconstructed by replaying events
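As a rough sketch of the last point, current state can be computed by folding over the event log. The `Event` type, the `deposited`/`withdrew` event kinds and the `current_balance` function are made up for this example and are not part of any Batch API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrew" (hypothetical event kinds)
    amount: int


def current_balance(events: Iterable[Event]) -> int:
    """Rebuild the balance by replaying every event in order."""
    balance = 0
    for event in events:
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrew":
            balance -= event.amount
    return balance


# The log is the source of truth; the balance is never stored directly.
event_log: List[Event] = [
    Event("deposited", 100),
    Event("withdrew", 30),
    Event("deposited", 5),
]

balance_now = current_balance(event_log)
# Replaying a prefix of the log gives the state at an earlier point in time.
balance_after_two = current_balance(event_log[:2])
```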
Event sourcing is not easy and is likely an unfamiliar way to build software, but you gain significant advantages:
- Increased reliability - because of loose coupling, one system not consuming events means that only that part of the system is inaccessible/inoperable
- Provides all engineering teams with a complete data source by replaying only the events they are interested in
  - As opposed to giving the team access to a shared database
- Serves as a built-in audit-log
- Enables a path for migrating away from a monolithic architecture
You can read more about the event sourcing pattern here:
Event replay sounds deceptively simple - read events from somewhere, send them to a destination. What's the big deal?
Building a super basic replayer is pretty straightforward - write your events to a message system, write a service that archives the events somewhere and write another service to "read" and "replay" the events.
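A minimal sketch of such a basic replayer, assuming events are dictionaries that carry a `ts` timestamp. The in-memory `archive` list, `archive_event` and `replay` are hypothetical names; a real archive would live in durable storage and the destination would be a message system rather than a Python callable.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List

Event = Dict[str, object]

# Hypothetical archive; a real one would be durable storage, not a list.
archive: List[Event] = []


def archive_event(event: Event) -> None:
    """The "archiver" service: persist every event it sees."""
    archive.append(event)


def replay(start: datetime, end: datetime,
           destination: Callable[[Event], None]) -> int:
    """The "replayer" service: resend events with start <= ts <= end."""
    sent = 0
    for event in archive:
        if start <= event["ts"] <= end:
            destination(event)
            sent += 1
    return sent


def at(hour: int) -> datetime:
    # Helper for building example timestamps on an arbitrary day.
    return datetime(2022, 2, 1, hour, tzinfo=timezone.utc)


for hour, name in [(9, "signup"), (10, "payment"), (11, "refund")]:
    archive_event({"ts": at(hour), "name": name})

# Replay everything between 10:00 and 11:00 into a list acting as the destination.
replayed: List[Event] = []
sent_count = replay(at(10), at(11), replayed.append)
```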
The difficulties start arising once you want something beyond just a "replay events from/to date".
- How do you replay events that contain specific data?
  - You now need to index your events
  - You now have to store your events in a queryable format
- How do you scale past 1B events?
- How do you make replay self-serve?
- Who maintains all of these components?
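To give a feel for the indexing question, here is a toy inverted index that maps a field value to the events containing it. It is a sketch under simplifying assumptions (flat events, everything in memory) and is not a description of how Batch indexes data; `index` and `find` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

events = [
    {"id": 1, "type": "payment", "customer": "acme"},
    {"id": 2, "type": "refund", "customer": "acme"},
    {"id": 3, "type": "payment", "customer": "globex"},
]

# Inverted index: (field, value) -> positions of matching events in `events`.
index: Dict[Tuple[str, object], List[int]] = defaultdict(list)
for position, event in enumerate(events):
    for field, value in event.items():
        index[(field, value)].append(position)


def find(field: str, value: object) -> List[dict]:
    """Return the events whose `field` equals `value`, in archive order."""
    # .get avoids creating empty entries for values that were never indexed.
    return [events[i] for i in index.get((field, value), [])]


# Only these events would need to be replayed for customer "acme".
acme_events = find("customer", "acme")
```

Even this toy version has to be kept in sync with the archive, which is where the scaling and maintenance questions above come from.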
While Kafka has replays, if you want to do anything outside of "replay from this offset until this offset" - you will want something more sophisticated.
In our experience, Kafka replays work best when used as a recovery mechanism rather than an "on-demand replay mechanism" to facilitate event driven architectures.
In short - nothing, but we will ask you very nicely to increase your storage limits by either purchasing a storage add-on or upgrading your plan.
We support nearly all popular messaging technologies and several non-messaging technologies such as MongoDB, PostgreSQL and MySQL.
As of 02.2022, Batch's collectors are hosted in AWS, us-west-2 (Oregon, US).
If you are pushing a large volume of events (1,000+ per sec), you may need our collectors to be located closer to you geographically. If so, contact us and we can spin up collectors closer to you.
Instead, build your systems to be idempotent. If your systems are able to handle duplicate events, you will not have to worry about the delivery semantics of upstream (or downstream) services and will ultimately have a much more robust and reliable distributed system.
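One common way to get idempotency is to give every event a unique ID and have the consumer remember which IDs it has already applied. The sketch below assumes events carry such an `id` field; `IdempotentConsumer` is a hypothetical class, and a real service would keep the seen IDs in durable storage rather than in memory.

```python
from typing import Dict, Set


class IdempotentConsumer:
    """Applies each event at most once, keyed on the event's unique ID."""

    def __init__(self) -> None:
        self.seen_ids: Set[str] = set()
        self.balance = 0

    def handle(self, event: Dict[str, object]) -> bool:
        event_id = str(event["id"])
        if event_id in self.seen_ids:
            # Duplicate delivery: safe to ignore, state is unchanged.
            return False
        self.seen_ids.add(event_id)
        self.balance += int(event["amount"])
        return True


consumer = IdempotentConsumer()
payment = {"id": "evt-1", "amount": 50}

first_result = consumer.handle(payment)   # applied
second_result = consumer.handle(payment)  # same event delivered again, ignored
```

With this in place, an upstream service that delivers at-least-once cannot cause the same payment to be counted twice.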
The short answer is: no. In most cases, 99.99999% of messages should arrive in the order they were collected; however, there is a chance that something arrives out of order (because of networking, outages, etc.).
The longer answer:
While it is technically possible to create a system that has order guarantees, a well-designed distributed system would ultimately be better off by being able to deal with out-of-order delivery.
Choosing to guarantee order has a lot of ramifications in systems design - every component of your distributed system has to be designed, implemented and used in a very specific manner. It also means that your system will be highly susceptible to problems if an out-of-order event arrives as a result of a misbehaving component.
Similar to the advice given for exactly-once delivery, you should build your systems to be capable of dealing with out-of-order events. In most cases, this means that you should timestamp your events, and each service should be aware of whether the event it has received is the newest one.
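A minimal sketch of that idea: the consumer keeps the timestamp of the latest event it has applied per account and ignores anything older. The `AccountStatusView` class and the `account_id`, `ts` and `status` fields are assumptions made for this example.

```python
from typing import Dict, Optional, Tuple


class AccountStatusView:
    """Keeps only the newest status per account, based on event timestamps."""

    def __init__(self) -> None:
        # account_id -> (timestamp of the applied event, status)
        self._latest: Dict[str, Tuple[int, str]] = {}

    def apply(self, event: Dict[str, object]) -> bool:
        account_id = str(event["account_id"])
        ts = int(event["ts"])
        current = self._latest.get(account_id)
        if current is not None and ts <= current[0]:
            # Older than (or same age as) what we already have: ignore it.
            return False
        self._latest[account_id] = (ts, str(event["status"]))
        return True

    def status(self, account_id: str) -> Optional[str]:
        entry = self._latest.get(account_id)
        return entry[1] if entry else None


view = AccountStatusView()
# The "unlocked" event (ts=2) arrives before the older "locked" event (ts=1).
applied_new = view.apply({"account_id": "acct-42", "ts": 2, "status": "unlocked"})
applied_old = view.apply({"account_id": "acct-42", "ts": 1, "status": "locked"})
```

The late "locked" event is dropped, so the view ends up in the same state regardless of arrival order.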
Because we write events to two different locations (Search and S3), support dynamic schemas (schema inference) and make heavy use of batching, the TTV can fluctuate. Refer to the table below for estimates:
You can delete the collection via the console.