Tutorial
This tutorial walks you through building a curated image dataset from scratch. By the end you will have searched for images, downloaded them alongside video footage, clustered them by similarity, refined their quality, and produced final resized outputs, all within a single working directory.
The example uses "crowd" as the subject. Along the way the tutorial shows two paths through the pipeline: one that extracts and clusters faces by identity (using ArcFace), and one that clusters images directly by visual similarity (using CLIP). Pick whichever fits your use case, or combine both.
What you will build
A curated image dataset starting from the search term "crowd". The pipeline uses 14 tools:
- Search for images across multiple engines
- Fetch images and videos from the collected URLs
- Extract frames from downloaded videos
- Extract faces from all collected images (optional; only needed for face datasets)
- Cluster by similarity (ArcFace for faces, CLIP for general images)
- Copy selected clusters into a curation folder
- Analyze images for metadata (hashes, blur scores)
- Filter out low-quality images
- Review manually with the web UI
- Dedup to remove near-duplicates
- Augment with transformations to increase diversity
- Upscale images to higher resolution
- Frame (resize) for the final dataset
Steps 1–3 are covered in Collecting images, steps 4–5 in Extracting features, steps 6–11 in Selecting and refining, and steps 12–14 in Final preparation.
Every dtst command reads from and writes to buckets and tracks metadata in sidecars. See Concepts for details on both.
The directory structure
As you work through the steps, your working directory grows into this layout:
crowd.yaml
scratch/
  crowd/
    results.jsonl
    images/
      search1/
      search2/
    frames/
    videos/
    faces/
    cluster/
      000/
      001/
      noise/
    select/
    filtered/
    rejected/
    duplicated/
    final/
      1024/
      512/
      256/
Prerequisites
Install dtst:
Copy .env.example to .env and fill in the API keys for whichever search engines you want to use:
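A minimal sketch of the copy step, assuming a POSIX shell and that .env.example sits in the current directory. The guard is an addition for illustration, so that an existing .env with real keys is never overwritten:

```shell
# Assumes the current directory is the one that contains .env.example.
if [ -f .env ]; then
  # Keep any keys that were already filled in.
  echo ".env already exists; leaving it untouched"
elif [ -f .env.example ]; then
  cp .env.example .env
  echo "Created .env; open it and fill in your API keys"
else
  echo "No .env.example found in $(pwd)" >&2
fi
```

After copying, edit .env by hand. Only the engines you plan to search with need a key.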
Create a working directory
Every dtst pipeline lives in a working directory. Create one for this tutorial:
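One way to create the folder, assuming a POSIX shell. The scratch/crowd path matches the -d value used throughout the tutorial:

```shell
# Create the working directory for this tutorial.
# -p also creates the parent scratch/ folder and does nothing if the path already exists.
mkdir -p scratch/crowd
```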
All the commands in the following pages use -d scratch/crowd to point at this folder. You can also set working_dir in a config file to avoid repeating it (see Configuration).
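As a sketch of the config-file alternative, the snippet below writes a crowd.yaml that contains a working_dir entry. The file name is taken from the directory layout shown earlier. Treating it as the config file, and placing working_dir as a top-level key, are both assumptions, so check Configuration for the actual schema before relying on this:

```shell
# Assumption: crowd.yaml is the config file and working_dir is a top-level key.
# Refuse to overwrite an existing config file.
if [ -e crowd.yaml ]; then
  echo "crowd.yaml already exists; add working_dir to it by hand" >&2
else
  cat > crowd.yaml <<'EOF'
working_dir: scratch/crowd
EOF
fi
```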