Hugging Face

Crunchy Bridge for Analytics supports accessing a wide variety of public S3 buckets. Hugging Face is a widely used platform for sharing machine learning models and training data. You can query files hosted there directly by using an hf:// prefix instead of s3://. Hugging Face file URLs look something like this:

https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k/blob/main/data/train-00000-of-00001.parquet

To build the hf:// path, drop the https://huggingface.co/ host and the extra /blob/main/ segment from the URL. A statement that creates a foreign table over the Hugging Face data then looks like this:

CREATE FOREIGN TABLE word_problems ()
SERVER crunchy_lake_analytics
OPTIONS (path 'hf://datasets/microsoft/orca-math-word-problems-200k/data/train-00000-of-00001.parquet');
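If you have many files to register, the URL rewrite can be scripted. The following is a minimal Python sketch, not part of Crunchy Bridge for Analytics: the function name to_hf_path is hypothetical, and it assumes the URL has the shape shown above (datasets/<owner>/<name>/blob/main/<file path>). Other revisions and URL shapes are rejected rather than guessed at.

```python
from urllib.parse import urlparse


def to_hf_path(url: str) -> str:
    """Convert a Hugging Face dataset file URL into an hf:// path.

    Hypothetical helper. Assumes the URL shape
    https://huggingface.co/datasets/<owner>/<name>/blob/main/<file path>.
    """
    parsed = urlparse(url)
    if parsed.netloc != "huggingface.co":
        raise ValueError(f"not a huggingface.co URL: {url}")

    parts = parsed.path.strip("/").split("/")
    # parts: ['datasets', owner, name, 'blob', 'main', <file path segments...>]
    if (
        len(parts) < 6
        or parts[0] != "datasets"
        or parts[3] != "blob"
        or parts[4] != "main"
    ):
        raise ValueError(f"unexpected URL shape: {url}")

    # Keep datasets/<owner>/<name>, drop blob/main, keep the file path.
    return "hf://" + "/".join(parts[:3] + parts[5:])


if __name__ == "__main__":
    print(to_hf_path(
        "https://huggingface.co/datasets/microsoft/"
        "orca-math-word-problems-200k/blob/main/data/"
        "train-00000-of-00001.parquet"
    ))
```

The returned string can be pasted into the path option of CREATE FOREIGN TABLE.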

Files accessed through hf:// URLs are currently not cached, so each query reads the data from Hugging Face again. If you access a dataset frequently, we recommend moving the data to S3 or loading it into a Postgres table.