Data Lake

What is a data lake?

A data lake is a collection of structured, semi-structured, or unstructured data whose exact purpose you do not know at the time of collection. Where a data warehouse is a carefully constructed SQL database that serves a specific data-analytics purpose, a data lake is the step before a data warehouse - you collect data that you may want to use in the future.

One huge difference between data lakes and data warehouses is size: a data lake may contain terabytes or petabytes of data, while a data warehouse will typically be much smaller.

For example, your data lake may contain ALL of your traffic logs while a data warehouse may contain only order fulfillment data.

An important characteristic of a data lake is that it must be highly scalable - that is, you should be able to store a near-infinite amount of data in it and retrieve it within a reasonable amount of time (seconds or minutes, rather than days or weeks).

This usually boils down to the following:

  • Data lakes should be powered by "storage and search" oriented technology

  • Data should be stored in a search-friendly format, such as Parquet

Batch satisfies both of these requirements by writing all ingested data as optimized Parquet files to an S3 bucket.
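For illustration, here is a minimal sketch (Python with pyarrow; the bucket and column names are hypothetical) of how event data can be written as partitioned Parquet files that a query engine such as Athena can scan efficiently:

# Illustrative sketch only - the bucket and column names are made up.
import pyarrow as pa
import pyarrow.parquet as pq

events = pa.table({
    "event_id": ["a1", "a2", "a3"],
    "path": ["/checkout", "/cart", "/checkout"],
    "year": [2023, 2023, 2023],
    "month": [5, 5, 6],
})

# Columnar Parquet files let engines such as Athena read only the columns
# and partitions a query touches, which is what makes the format
# "search friendly".
pq.write_to_dataset(
    events,
    root_path="s3://example-data-lake/events",  # hypothetical bucket
    partition_cols=["year", "month"],
)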

Data lake hydration

Data lake hydration is the act of populating your data lake with data.

There are many ways to get data into a data lake:

  1. You can set up pipelines to push data from your system components (such as Kafka) into your data lake of choice

  2. You can write scripts that fetch and push data to your data lake

  3. You can set up various integrations from other vendors (such as Segment) to push analytics data into your data lake

Batch's data lake hydration removes the need to do #1 and #2.
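For comparison, a hand-rolled hydration script of the kind described in #2 might look roughly like the sketch below (Python with boto3 and pyarrow; the API endpoint, bucket, and key layout are hypothetical):

# Hypothetical manual hydration script: fetch events from an internal API
# and push them to the data lake bucket as a Parquet file.
import io

import boto3
import pyarrow as pa
import pyarrow.parquet as pq
import requests

# Assumes the (hypothetical) endpoint returns a list of JSON objects.
events = requests.get("https://internal.example.com/api/events").json()

table = pa.Table.from_pylist(events)
buffer = io.BytesIO()
pq.write_table(table, buffer)

boto3.client("s3").put_object(
    Bucket="example-data-lake",              # hypothetical bucket
    Key="events/2023/05/01/events.parquet",  # hypothetical key layout
    Body=buffer.getvalue(),
)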

Upon receiving an event, Batch will automatically:

  • Generate a Parquet and Athena schema for your events

  • Generate optimally partitioned and sized Parquet files

  • Batch-write the generated files to your S3 bucket

Using a custom data lake

By default, Batch will write all ingested events into its own internal S3 bucket.

This might be OK if you do not require access to the Parquet files or simply do not want to manage additional AWS resources.

However, if you need the data to reside in your own S3 bucket and/or you would like your teams to have access to the Parquet data, you can set this up as follows:

  1. Create an S3 bucket policy that allows our AWS account to read, write, and list files in your S3 bucket:

https://docs.aws.amazon.com/athena/latest/ug/cross-account-permissions.html

{
   "Version": "2012-10-17",
   "Id": "BatchS3AccessPolicyID",
   "Statement": [
      {
         "Sid": "MyStatementSid",
         "Effect": "Allow",
         "Principal": {
            "AWS": "arn:aws:iam::589147263245:root"
         },
         "Action": [
            "s3:GetBucketLocation",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:ListBucketMultipartUploads",
            "s3:ListMultipartUploadParts",
            "s3:AbortMultipartUpload",
            "s3:PutObject",
            "s3:DeleteObject"
         ],
         "Resource": [
            "arn:aws:s3:::my-athena-data-bucket",
            "arn:aws:s3:::my-athena-data-bucket/*"
         ]
      }
   ]
}
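If you prefer to apply this bucket policy programmatically rather than through the AWS console, here is a boto3 sketch (the bucket and file names are placeholders):

# Attach the bucket policy shown above so Batch's AWS account can read and
# write Parquet files in the bucket. Bucket and file names are placeholders.
import json

import boto3

with open("batch-s3-access-policy.json") as f:  # the policy document above
    policy = json.load(f)

boto3.client("s3").put_bucket_policy(
    Bucket="my-athena-data-bucket",
    Policy=json.dumps(policy),
)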

  2. Create a collection that uses your S3 bucket.

  3. You're done!

Once all of the above is done, Batch will store all Parquet data in your S3 bucket, using the following "directory" structure:

s3://your-bucket/batch-collections/{{collection_id}}/year=XXXX/month=XX/day=XX/*.parquet
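Because the files are laid out in Hive-style year=/month=/day= partitions, tools that understand partitioned datasets can read them directly. For example, a sketch using pyarrow (the bucket and collection ID are placeholders):

# Read one day of Parquet data from the Hive-partitioned layout.
# Replace the bucket and collection ID placeholders with your own values.
import pyarrow.dataset as ds

dataset = ds.dataset(
    "s3://your-bucket/batch-collections/your-collection-id/",
    format="parquet",
    partitioning="hive",  # picks up the year=/month=/day= directories
)

# Partition pruning: only files under the matching partitions are scanned.
table = dataset.to_table(
    filter=(ds.field("year") == 2023)
    & (ds.field("month") == 5)
    & (ds.field("day") == 1)
)
print(table.num_rows)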

Best practices

  1. Do not modify the bucket contents

    1. While you have complete access to and control over all of the Parquet data, modifying the structure (or contents) of the bucket can affect our ability to replay data.

  2. Modifying the S3 bucket policy may affect our ability to write data

    1. If you modify the bucket policy, make sure that newly collected events continue to be written to S3 (a quick check is sketched below). If not, contact us.
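For example, you can list the most recently modified objects under the collection prefix with boto3 (the bucket and collection ID are placeholders):

# Spot-check that new Parquet files are still arriving by listing the most
# recently modified objects under the collection prefix (placeholders below).
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="your-bucket",
    Prefix="batch-collections/your-collection-id/",
)

newest = sorted(
    response.get("Contents", []),
    key=lambda obj: obj["LastModified"],
    reverse=True,
)[:5]

for obj in newest:
    print(obj["LastModified"], obj["Key"])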
