Collections

A collection is a logical grouping of related events or messages.

Whenever you push data to Batch, you push it to a specific collection that you have created.

Each collection you create gets its own unique token, which is used to identify the type of data the collection will receive. A collection is directly tied to a schema.

Event collection

Event collection refers to the process by which Batch "accepts" event data on our platform.

An event can be sent to us in many ways:

  1. Via our HTTP API

  2. Via our gRPC API

  3. Via our connectors

Once an event is collected by our collector, we inspect it and write it to our internal Kafka cluster. From there, the event goes through our schema identification (or generation) pipeline and is written permanently to our hot and cold storage.

It usually takes <5s for a collected event to become visible in our dashboard.

Collection grouping

Generally, you will want to group collections by the message envelope that your events use.

In some cases, you may want to also group them by the source of the events.

Scenario 1 - Without common message envelope

Event 1:

{
    "id": "c3347b39-5e6c-4bf4-ab5e-4efc03367a00",
    "name": "Dade Murphy",
    "age": 17,
    "address_1": "123 JusLivin Ave",
    "address_2": "Oklahoma City, OK, 73008"
}

Event 2:

{
    "id": "0ae30a29-2abe-491d-92ef-277d26515b43",
    "order_id": "23940ce9-64f7-4919-98c8-6089f94b911a",
    "price": 50.99,
    "quantity": 1,
    "name": "Dermot McValuables",
    "address_1": "123 Shipping Ln",
    "address_2": "Beverly Hills, CA, 90210"
}

While the two events are similar, they represent entirely different data sets: one identifies a person and their personal attributes, while the other is billing related.

Both events share some fields, and while you could send both to the same collection (Batch's automatic schema inference would do its magic), you probably should NOT combine these events into the same collection, as it will be hard to determine which event represents what.

In this scenario, your best bet is to create two collections, persons and orders, and send each event to its corresponding collection.

NOTE: If both of the events are located on the same messaging system, you will have no choice but to send the events to the same collection. At that point, it would be best to come up with a unified message schema or include a "type" designator in the event to make it easier to search for.
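As a sketch of the two-collection approach, a small router could pick a collection based on each event's shape. The function and collection names below are illustrative assumptions, not part of Batch's API:

```python
def pick_collection(event: dict) -> str:
    """Route an event to a collection name based on its shape.

    Illustrative heuristic: only order events carry an "order_id" field.
    The collection names ("persons", "orders") come from the scenario
    above; they are not Batch-defined identifiers.
    """
    return "orders" if "order_id" in event else "persons"
```

Each event would then be pushed using the token of whichever collection the router returns.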

Scenario 2 - With common message envelope

Event 1:

{
    "event_type": "person",
    "person": {
        "id": "c3347b39-5e6c-4bf4-ab5e-4efc03367a00",
        "name": "Dade Murphy",
        "age": 17,
        "address_1": "123 JusLivin Ave",
        "address_2": "Oklahoma City, OK, 73008"
    }
}

Event 2:

{
    "event_type": "order",
    "order": {
        "id": "0ae30a29-2abe-491d-92ef-277d26515b43",
        "order_id": "23940ce9-64f7-4919-98c8-6089f94b911a",
        "price": 50.99,
        "quantity": 1,
        "name": "Dermot McValuables",
        "address_1": "123 Shipping Ln",
        "address_2": "Beverly Hills, CA, 90210"
    }
}

In this scenario, because both of the events are using a common message envelope (that indicates the event type), we are able to use a single collection for both of the event types.
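The envelope pattern above can be sketched as a small wrapper that tags a raw payload with its type before it is sent to the shared collection. The helper name is an assumption for illustration, not part of Batch's tooling:

```python
def to_envelope(event_type: str, payload: dict) -> dict:
    """Wrap a raw event payload in a common envelope.

    Produces the shape used in Scenario 2: an "event_type" field plus
    the payload nested under a key matching that type.
    """
    return {"event_type": event_type, event_type: payload}
```

With every event wrapped this way, a single collection can hold both persons and orders, and the "event_type" field keeps them easy to search.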

Throughput

Batch can collect upwards of 100,000 events per second; however, this varies dramatically depending on:

  • Where you are sending the events from

  • What you are using to send us events

Where

The lower the latency between your event relay mechanism and our servers, the higher the throughput you will be able to achieve.

Specifically, you want to be as close as possible to AWS's us-west-2 region.

What

In order to achieve collection rates of 10,000+ events per second, you will have to use our gRPC API and make use of batching.

The good news is that you don't have to write any code for this - you can use plumber relay mode for this, which uses our gRPC API under the hood.

NOTE: You will likely need to tweak the batching settings to find an optimal rate.

As with all things, there are pros and cons to utilizing heavy batching (5,000+ events per batch):

  1. PRO: Will use less bandwidth

  2. PRO: Will increase your overall total throughput

  3. CON: Will increase the time before your event(s) become visible/available in Batch

Similarly, using low batching has its pros and cons:

  1. PRO: Your events will become immediately visible/available in Batch

  2. CON: Lower overall throughput

  3. CON: Higher bandwidth usage
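To make the trade-off concrete, here is a minimal batching buffer: it holds events until a batch-size threshold is reached, so larger batches mean fewer (bigger) sends but a longer wait before the last events are flushed. This is a sketch under assumed names, not Batch's client library or plumber itself:

```python
class EventBatcher:
    """Accumulate events and flush them in batches of `batch_size`.

    A larger `batch_size` trades event visibility latency for throughput
    and bandwidth; `flush_fn` stands in for whatever actually sends the
    batch (e.g. a gRPC call).
    """

    def __init__(self, batch_size: int, flush_fn):
        self.batch_size = batch_size
        self.flush_fn = flush_fn
        self._buf = []

    def add(self, event: dict) -> None:
        """Buffer an event, flushing automatically when the batch fills."""
        self._buf.append(event)
        if len(self._buf) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send any buffered events as one batch and clear the buffer."""
        if self._buf:
            self.flush_fn(self._buf)
            self._buf = []
```

Tuning `batch_size` in a loop like this is the same exercise as tweaking plumber's batching settings: raise it until throughput plateaus, then back off if the visibility delay becomes unacceptable.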
