Data Science

Overview

All data that is ingested by the Batch collectors is written as parquet files into an AWS S3 bucket and exposed via AWS Athena.

Customers who bring their own S3 bucket get direct access to the parquet data, which can then be used to power out-of-band data-science tasks.

Athena

Each collection in Batch is exposed in AWS Athena as a table. The table is fully managed and kept up to date by Batch as the schema in your collection evolves.

To get access to Athena, send us an email that includes your AWS account ID, and our support representatives will enable access.
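Once access is enabled, a collection can be queried like any other Athena table. The query below is a minimal sketch: the database name my_database and the table name my_collection are placeholders, not names that Batch is known to use, so substitute whatever appears in your Athena catalog.

```sql
-- my_database and my_collection are hypothetical placeholders;
-- use the database and table names shown in your Athena console.
select *
from "my_database"."my_collection"
limit 10;
```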

Snowflake

Snowflake (and most other data warehousing platforms) supports AWS S3 and the parquet format out of the box, which makes integration simple and quick.

Your data team has the following options:

  • Use Batch's parquet data as an external table

  • Perform a traditional, periodic load (COPY INTO) of the parquet data into a Snowflake table

To access the parquet data in S3, you will need to provide Snowflake with your AWS key and secret, along with the full S3 path to the parquet data. If the S3 path ends with a trailing /, Snowflake will recursively walk all of the "directories" under it and search for parquet data.
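As an illustration of the external table option, the sketch below creates a stage over a whole collection prefix (note the trailing /) and an external table on top of it. The names collection_stage and batch_events_ext are hypothetical, the $-prefixed path segments are placeholders to replace with your own IDs, and auto_refresh is disabled on the assumption that no S3 event notifications have been configured.

```sql
-- Stage over the whole collection prefix; the trailing / lets Snowflake
-- pick up every parquet file beneath it. Stage name is a placeholder.
create or replace stage collection_stage
  url = 's3://batchsh-datalakes/$team_id/$account_id/$datalake_id/$collection_id/'
  credentials = (aws_key_id = '***' aws_secret_key = '***')
  file_format = (type = parquet);

-- External table that reads the staged files in place (no data is copied).
-- Table name is a placeholder.
create or replace external table batch_events_ext
  with location = @collection_stage
  auto_refresh = false
  file_format = (type = parquet);

-- With auto_refresh off, register newly written files manually.
alter external table batch_events_ext refresh;

-- Each row is exposed through the VARIANT column named value.
select value from batch_events_ext limit 10;
```

Because the data stays in S3, this option avoids a load step, at the cost of slower queries than a native Snowflake table.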

Example of a traditional load of a single parquet file via S3

1. Create a schema
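The statement for this step is not shown in the original; a minimal sketch, assuming a hypothetical schema name batch_data and that a database is already selected in your session:

```sql
-- batch_data is a placeholder schema name; any valid name works.
create schema if not exists batch_data;

-- Make it the current schema so the objects created in the
-- following steps land in it.
use schema batch_data;
```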

2. Create a parquet format
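The statement for this step is not shown in the original either. A minimal sketch, using the name my_parquet_format because that is the name the infer_schema call in step 4 refers to:

```sql
-- Named file format that tells Snowflake the staged files are parquet.
create or replace file format my_parquet_format
  type = parquet;
```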

3. Create a stage

-- Replace the $-prefixed path segments with your own IDs.
create or replace stage onefilestage
  url = 's3://batchsh-datalakes/$team_id/$account_id/$datalake_id/$collection_id/year=2021/month=11/day=23/1637629076391797918.parquet'
  credentials = (aws_key_id = '***' aws_secret_key = '***')
  file_format = (format_name = 'my_parquet_format');

4. Create a table (that points to the stage & format)

create table mytable
  using template (
  select array_agg(object_construct(*))
    from table(
      infer_schema(
        location=>'@onefilestage',
        file_format=>'my_parquet_format'
      )
    )
  );

5. Load the data into the table

copy into mytable
  from (select
    $1:raw_json::varchar, $1:batch::variant, $1:client::variant
    from @onefilestage
  );
