There are a lot of different ways to use the Batch platform.
The goal of this doc is to illustrate some of the different ways that you can use Batch to solve real problems you are facing.
Observability is the #1, simplest and most obvious use case.
It takes very little to get started but provides a massive benefit: being able to see what is flowing through your message queue(s).
- 1.You are a backend developer tasked with implementing a new feature in a `billing` service that you have never touched before.
- 2.You know that `billing` consumes and produces messages to Kafka, but you don't know what the messages look like or what they're supposed to contain.
- 3.Prior to starting development, it would be nice to "see" actual messages (without having to look through other codebases to see how a message is constructed or how it is read).
This is an extremely common use case and one of the primary reasons we built Batch. It should be trivial for anyone to see what's inside of your message queue - without having to ask devops or your data team to give you a dump of messages.
plumber relay kafka \
--address "your-kafka-address.com:9092" \
--token YOUR-COLLECTION-TOKEN-HERE \
--topics orders \
--sasl-username your-username \
(3) Wait for events to be pushed to Kafka (or trigger functionality that will do it)
(4) Inspect only the billing-related messages that you care about (such as account creation)
That is probably pretty obvious but the point is that data science usually involves a really tedious process of wrangling the data before it is "usable" - it has to be cleaned up (such as removing bad characters and garbage text), parsed and mutated into a usable form such as JSON or parquet.
On top of that, there are many different data sources that data science teams can use - traffic logs, application logs, 3rd party data (e.g. Salesforce, Segment), databases and many more. Often, though, the most useful source is the data on your message bus that your backend systems rely on to make decisions.
For example, your order processing, account setup or email campaign process all might use a message bus such as RabbitMQ or Kafka to asynchronously achieve the tasks at hand.
Gaining access to such a rich data source would be huge but it will likely take quite an effort because:
- 1.The data science team would need to talk to the folks responsible for the message bus instance (say, RabbitMQ in this scenario) to gain access.
- 2.Both teams would need to come to a consensus on what data the data science team is interested in.
- 3.Someone (usually the devs or devops) would need to figure out how to pull the associated data and where to store it for data science consumption.
- 1.Best case scenario, someone implements a robust pipeline with an off-the-shelf solution (that won't break in 3 months). Most often, though, this will consist of writing a script or two that is run on an interval and periodically dumps data to some (to be determined) location.
- 4.And finally, the data science team has access to the data!
With Batch's replay functionality, you can avoid having to write brittle recovery procedures.
All you have to do is:
- 2.Have your data science team log into the Batch dashboard and sift through the collected events.
- 3.When good candidate data is found, the data science team can initiate a replay that includes ONLY the events they are interested in and point the destination to their dedicated Kafka instance (or whatever tech they're using).
- 4.The replayed data will contain the original event that was sent to the event bus.
In the above example, we have saved a ton of time - the data science team can view AND pull the data without having to involve another party and without having to write a line of code.
Implementing a disaster recovery mechanism is not easy .. or fun. It is usually (and unfortunately) an afterthought brought on by a new requirement in order to get a certification or land a client.
And while there is no argument that having a disaster recovery strategy is a good idea - it does require a massive time investment that usually involves nearly every part of engineering. But it doesn't have to be that way.
Full disclosure: This use-case assumes that your company is already utilizing event sourcing - your events are your source of truth and thus your databases and whatever other data storage can be rebuilt from events. If that is not the case, Batch will only be a part of your disaster recovery strategy and not the complete solution.
Most/common disaster recovery mechanisms:
- Represent the CURRENT state of the system - they do not evolve
- Are brittle - if any part of the stack changes, the recovery process breaks
- Are built by engineers already on tight deadlines - corners will be cut
- Are not regularly tested
The end result is that you've built something that won't stand the test of time and will likely fail when you need it the most.
Assuming that your message bus contains ALL of your events, you can hook up Batch to your messaging tech and have us automatically archive ALL of your events and store them in parquet, in S3.
If/when disaster strikes (or when you are performing periodic disaster recovery test runs), all you have to do to "restore state" is to replay ALL of your events back into your system.
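To illustrate what "restoring state" by replay means in an event-sourced system, here is a minimal Python sketch. It is not Batch's implementation: the event types (`account_created`, `funds_deposited`, `funds_withdrawn`), the `sequence` field and the `rebuild_state` helper are all hypothetical and only show the general idea of folding events back into state.

```python
from typing import Dict, Iterable


def apply_event(balances: Dict[str, int], event: dict) -> None:
    """Apply a single (hypothetical) event to the in-memory state."""
    account = event["account_id"]
    kind = event["type"]
    if kind == "account_created":
        balances[account] = 0
    elif kind == "funds_deposited":
        balances[account] += event["amount"]
    elif kind == "funds_withdrawn":
        balances[account] -= event["amount"]
    # Unknown event types are ignored in this sketch.


def rebuild_state(events: Iterable[dict]) -> Dict[str, int]:
    """Rebuild state from scratch by replaying every event in order."""
    balances: Dict[str, int] = {}
    # Assumption: each event carries a monotonically increasing "sequence".
    for event in sorted(events, key=lambda e: e["sequence"]):
        apply_event(balances, event)
    return balances


# Archived events may arrive out of order; the replay sorts them first.
archived_events = [
    {"sequence": 3, "type": "funds_withdrawn", "account_id": "a1", "amount": 30},
    {"sequence": 1, "type": "account_created", "account_id": "a1"},
    {"sequence": 2, "type": "funds_deposited", "account_id": "a1", "amount": 100},
]

restored = rebuild_state(archived_events)
print(restored)
```

The important property is that the state is derived entirely from the events, so anything that can replay the archive in order can rebuild it.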
If a 3rd party vendor (such as Batch) is unavailable - you still have access to all of your data in parquet. You can implement a "panic" replay mechanism that just reads all of your events from S3, bypassing the need to have Batch perform the replay.
We have participated in multiple disaster recovery initiatives throughout our careers. In almost all cases, the implementation process was chaotic or short staffed or underfunded or all of the above.
While building Batch, we realized that this is exactly the type of solution we wished we had when we worked on implementing strategies. It would've saved everyone involved a lot of time, money and sanity.
Setting the stage:
- 1.Your software consumes events from your message queue and uses the data within each message to perform XYZ actions
- 2.Your tests have "fake" or synthetic event data - maybe your local tests seed your message queue or maybe you just inject the fake data for tests
Injecting fake data is OK, but it is most definitely not foolproof. Synthetic fixtures go stale when your event schema changes or when your data contents change - maybe "timestamp" used to contain a unix timestamp in seconds but now contains a unix timestamp in milliseconds.
What do you do?
Normally, you'd find out about this at the most inopportune time possible - once the code is deployed to dev .. or prod.
This way, rather than using synthetic data in your tests, you would use real event data that is used in dev and production. The result is more confidence in your software and in your tests.
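The seconds-versus-milliseconds drift mentioned earlier can be sketched in a few lines of Python. The event shapes, `parse_event_time` and `check` are hypothetical and stand in for your own consumer code and test; the sketch only shows how a stale synthetic fixture keeps passing while an event shaped like current production data does not.

```python
from datetime import datetime, timezone


def parse_event_time(event: dict) -> datetime:
    """Consumer code written when "timestamp" held unix seconds."""
    return datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)


def check(event: dict) -> bool:
    """Return True if the event parses to a plausible point in time."""
    try:
        parsed = parse_event_time(event)
    except (ValueError, OverflowError, OSError):
        # A millisecond value read as seconds lands far outside the
        # range that datetime can represent.
        return False
    return 2000 <= parsed.year <= 2100


# Synthetic fixture written long ago: timestamp in seconds.
synthetic_event = {"timestamp": 1_600_000_000}

# Event shaped like current real traffic: timestamp in milliseconds.
real_event = {"timestamp": 1_600_000_000_000}

print("synthetic fixture ok:", check(synthetic_event))
print("real event ok:", check(real_event))
```

A test suite seeded only with the synthetic fixture stays green here, which is exactly the false confidence that real event data removes.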
Load testing is a hot topic - there are MANY ways to do load testing and there are equally as many tools to make it happen. Most load testing strategies involve more than one tool and more than one logical process. You may have a load testing suite that hammers one particular endpoint using hey and other parts of the site using locust.
Typically, load testing tooling falls into two categories:
- Performance oriented
- Verify that you're able to handle N requests per second (>10,000 req/s)
- Scenario oriented
- The ability to chain a bunch of requests together in a specific order at "relatively high" speeds (<10,000 req/s)
While most load testing tooling is geared towards HTTP testing, Batch is unique in that it lets you load test your service's message consumption functionality.
In other words - as part of your CI process (or another external process), you can test how well your service is able to consume a specific amount of messages that arrive on your message bus.
The process for implementing this would be fairly straightforward:
- 1.Connect plumber with your message queue
- 4.Start the replay
- 5.Instrument and monitor how well and fast your service is able to consume the replayed events
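For the final instrumentation step, a minimal Python sketch of measuring consumption rate might look like the following. It assumes the replayed events are already available to your consumer as an iterable of dicts; `measure_consumption` and `handle_order` are hypothetical names, and a real setup would read from your message bus and report to your metrics system instead of returning a dict.

```python
import time
from typing import Callable, Iterable


def measure_consumption(handler: Callable[[dict], None],
                        events: Iterable[dict]) -> dict:
    """Run handler over every event and record counts and throughput."""
    processed = 0
    failed = 0
    started = time.perf_counter()
    for event in events:
        try:
            handler(event)
            processed += 1
        except Exception:
            # Count failures instead of aborting the whole run.
            failed += 1
    elapsed = time.perf_counter() - started
    rate = processed / elapsed if elapsed > 0 else float("inf")
    return {
        "processed": processed,
        "failed": failed,
        "elapsed_s": elapsed,
        "events_per_s": rate,
    }


def handle_order(event: dict) -> None:
    """Hypothetical stand-in for your service's message handler."""
    if "order_id" not in event:
        raise ValueError("missing order_id")


# Stand-in for a replay: 1000 well-formed events plus one malformed one.
replayed = [{"order_id": i} for i in range(1000)] + [{"bad": True}]
stats = measure_consumption(handle_order, replayed)
print(stats)
```

Tracking failures alongside throughput matters because a consumer that looks fast while silently dropping events is not actually keeping up.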
Before we dive into this - let's establish what a "data lake" is:
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale ... A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications ... A data lake is different, because it stores relational data from line of business applications, and non-relational data from mobile apps, IoT devices, and social media.
To simplify it further - your data lake may contain huge amounts of "unorganized" data without any advance knowledge as to how you'll be querying the data. Whereas a data warehouse will likely be smaller and already be "organized" into a usable data set.
All of this roughly translates to - if you make business decisions from your data, you likely need both.
And if you don't already have a data lake - you may be wondering "how do you create a data lake?"
This is where Batch comes in beautifully.
When Batch collects a message from your message bus, the following things happen behind the scenes:
- 1.We decode and index the event
- 2.We generate (or update) an internal "schema" for the event
- 3.We generate (or update) a parquet schema
- 4.We generate (or update) an Athena schema and create/update the table
- 5.We store the event in our hot storage
- 6.We batch and store the event in parquet format in S3 (using the inferred schema)
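The "generate (or update) a schema" steps above can be pictured with a small Python sketch. This is not Batch's actual inference logic, which the text does not describe; `infer_schema`, `merge_schema` and the `"mixed"` marker are hypothetical and only illustrate how a schema can be derived from one event and then widened as new events arrive.

```python
def infer_schema(event: dict) -> dict:
    """Map each field to a type name, recursing into nested objects."""
    schema = {}
    for key, value in event.items():
        if isinstance(value, dict):
            schema[key] = infer_schema(value)
        else:
            schema[key] = type(value).__name__
    return schema


def merge_schema(existing: dict, new: dict) -> dict:
    """Update an existing schema with fields seen in a newer event."""
    merged = dict(existing)
    for key, new_type in new.items():
        if key not in merged:
            merged[key] = new_type  # newly seen field
        elif isinstance(merged[key], dict) and isinstance(new_type, dict):
            merged[key] = merge_schema(merged[key], new_type)
        elif merged[key] != new_type:
            merged[key] = "mixed"  # conflicting types across events
    return merged


first = {"order_id": 1, "customer": {"email": "a@example.com"}}
second = {"order_id": "A-2", "customer": {"email": "b@example.com", "vip": True}}

schema = infer_schema(first)
schema = merge_schema(schema, infer_schema(second))
print(schema)
```

A real pipeline would map these type names onto parquet and Athena column types and would need a policy for conflicting types; the sketch only flags them.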
Because we are storing your messages/events in S3 in parquet (with optimal partitioning) - we are essentially creating and hydrating a functional data lake for you.
You own this data lake and can use the events however you like - whether it be by using Batch replays, Athena searches or piping the data to another destination.
More often than not, message queues contain important information that is critical to operating your business. While some tech like Kafka retains messages, most messaging tech has no such functionality.
The common approach to this is to write some scripts that periodically back up your queues and hope that, if disaster strikes, your engineers can cobble the backups together.
With Batch, there is no need to perform periodic backups - messages are "backed up" the moment they are relayed to Batch and are immediately visible and queryable.
Batch replays enable you to restore your message queue data without having to write a single line of code. Just search for `*` and replay the data to your message queue.