Hugging Face
Crunchy Data Warehouse supports accessing a wide variety of public S3 buckets.
Hugging Face is a widely used platform for sharing machine learning models and training data. You can query its files directly by using an hf:// prefix instead of s3://. A Hugging Face file URL looks something like this:
https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k/blob/main/data/train-00000-of-00001.parquet
To build the path for the warehouse, replace the https://huggingface.co/ host with hf:// and remove the extra /blob/main/ segment. The statement to create a foreign table from Hugging Face data then looks like this:
CREATE FOREIGN TABLE word_problems ()
SERVER crunchy_lake_analytics OPTIONS
(path 'hf://datasets/microsoft/orca-math-word-problems-200k/data/train-00000-of-00001.parquet');
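Once the foreign table exists, it can be queried like any other table. The following is a minimal sketch that assumes the word_problems table defined above was created successfully and that the empty column list lets the warehouse take its columns from the Parquet file; it only uses standard SQL and does not assume any particular column names.

```sql
-- Count the rows exposed through the foreign table.
SELECT count(*) FROM word_problems;

-- Peek at a few rows to see which columns the Parquet file provides.
SELECT * FROM word_problems LIMIT 5;
```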
You can also use a wildcard path that includes the user and project name to create a single foreign table over a batch of Parquet files:
CREATE FOREIGN TABLE word_problems ()
SERVER crunchy_lake_analytics OPTIONS
(path 'hf://datasets/microsoft/orca-math-word-problems-200k@~parquet/**/*.parquet');
Queries against hf:// URLs are currently not cached, so the data is fetched from Hugging Face each time. If you access a data set frequently, we recommend moving the data to S3 or loading it into a Postgres table.
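One way to load the data into a Postgres table is a plain CREATE TABLE ... AS statement. This is a sketch rather than a prescribed method: it assumes the word_problems foreign table from the earlier example exists, and word_problems_local is a hypothetical name chosen for illustration.

```sql
-- Copy the remote Parquet data into a regular Postgres table once,
-- so later queries read local storage instead of fetching from Hugging Face.
-- word_problems_local is a hypothetical table name.
CREATE TABLE word_problems_local AS
SELECT * FROM word_problems;

-- Subsequent queries can target the local copy.
SELECT count(*) FROM word_problems_local;
```

The local copy is a snapshot, so it would need to be rebuilt if the data set on Hugging Face changes.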