Data Science
Overview
All data that is ingested by the Batch collectors is written as parquet files into an AWS S3 bucket and exposed via AWS Athena.
Customers who bring their own S3 bucket can get direct access to the parquet data, which can then be used to power out-of-band data science tasks.
Athena
Each collection in Batch is exposed in AWS Athena as a table. The table is fully managed and kept up to date by Batch as the schema in your collection evolves.
To get access to Athena, please send us an email that includes your AWS account ID and our support representatives will enable access.
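Once access has been enabled, collections can be queried with standard SQL from the Athena console or any Athena-compatible client. A minimal sketch, where "your_database" and "your_collection" are placeholders for the database and collection table that Batch manages for you:

```sql
-- "your_database" and "your_collection" are placeholders; substitute the
-- database and collection/table names shown in your Athena console.
SELECT *
FROM "your_database"."your_collection"
LIMIT 100;
```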
Snowflake
Snowflake (and most other data warehousing platforms) supports AWS S3 and the parquet format out of the box, which makes integration simple and quick.
Your data team has the following options:
1. Use Batch's parquet data as an external table
2. Perform a traditional, periodic load (COPY INTO) of the parquet data into a Snowflake table
In order to gain access to the parquet data in S3, you will need to provide Snowflake with your AWS access key ID and secret access key, as well as the full S3 path to the parquet data. By including a trailing / in the S3 path, Snowflake will recursively walk all of the "directories" and search for parquet data.
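For the external table option, a minimal sketch is shown below. The bucket path, credentials, object names (batch_parquet_format, batch_parquet_stage, batch_events), and the order_id field are placeholders, not part of Batch's actual schema; adjust them to match your environment and collection.

```sql
-- Placeholders throughout; substitute your own bucket path, credentials, and names.
CREATE OR REPLACE FILE FORMAT batch_parquet_format
  TYPE = PARQUET;

-- The trailing "/" on the URL causes Snowflake to walk all sub-"directories".
CREATE OR REPLACE STAGE batch_parquet_stage
  URL = 's3://your-bucket/batch/parquet/'
  CREDENTIALS = (AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>')
  FILE_FORMAT = (FORMAT_NAME = 'batch_parquet_format');

-- The external table reads the parquet files in place; no load step required.
CREATE OR REPLACE EXTERNAL TABLE batch_events
  WITH LOCATION = @batch_parquet_stage
  AUTO_REFRESH = FALSE
  FILE_FORMAT = (TYPE = PARQUET);

-- Refresh the file metadata manually (or enable AUTO_REFRESH with S3 event notifications).
ALTER EXTERNAL TABLE batch_events REFRESH;

-- Parquet fields are exposed under the VALUE variant column;
-- "order_id" is an assumed field name in your parquet data.
SELECT value:"order_id"::STRING AS order_id
FROM batch_events
LIMIT 10;
```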
Example of a traditional load of a single parquet file via S3
1. Create a schema
2. Create a parquet format
3. Create a stage
4. Create a table to load the data into
5. Load the data into the table with COPY INTO, referencing the stage & format
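A minimal end-to-end sketch of these five steps is shown below. All names (the batch schema, file format, stage, table, bucket path, and file path) are placeholders, and the column list is an assumption about your collection's schema; adjust it to match your data.

```sql
-- 1. Create a schema (name is a placeholder)
CREATE SCHEMA IF NOT EXISTS batch;

-- 2. Create a parquet file format
CREATE OR REPLACE FILE FORMAT batch.parquet_format
  TYPE = PARQUET;

-- 3. Create a stage that points at the parquet data in S3
CREATE OR REPLACE STAGE batch.parquet_stage
  URL = 's3://your-bucket/batch/parquet/'
  CREDENTIALS = (AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>')
  FILE_FORMAT = (FORMAT_NAME = 'batch.parquet_format');

-- 4. Create the target table (columns are assumptions about your collection's schema)
CREATE OR REPLACE TABLE batch.orders (
  order_id   STRING,
  amount     NUMBER(12, 2),
  created_at TIMESTAMP_NTZ
);

-- 5. Load a single parquet file into the table, casting fields out of the
--    parquet row variant ($1); the file path after the stage is a placeholder
COPY INTO batch.orders
FROM (
  SELECT
    $1:order_id::STRING,
    $1:amount::NUMBER(12, 2),
    $1:created_at::TIMESTAMP_NTZ
  FROM @batch.parquet_stage/orders/2023/01/data.parquet
);
```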