Tutorial

This tutorial walks you through building a curated image dataset from scratch. By the end you will have searched for images, downloaded them along with video footage, clustered them by similarity, filtered them for quality, and produced resized final outputs, all within a single working directory.

The example uses "crowd" as the subject. Along the way the tutorial shows two paths through the pipeline: one that extracts and clusters faces by identity (using ArcFace), and one that clusters images directly by visual similarity (using CLIP). Pick whichever fits your use case, or combine both.

What you will build

A curated image dataset starting from the search term "crowd". The pipeline uses 14 tools:

Search for images across multiple engines
Fetch images and videos from the collected URLs
Extract frames from downloaded videos
Extract faces from all collected images (optional — for face datasets)
Cluster by similarity (ArcFace for faces, CLIP for general images)
Copy selected clusters into a curation folder
Analyze images for metadata (hashes, blur scores)
Filter out low-quality images
Review manually with the web UI
Dedup to remove near-duplicates
Augment with transformations to increase diversity
Upscale images to higher resolution
Frame (resize) for the final dataset

Steps 1–3 are covered in Collecting images, steps 4–5 in Extracting features, steps 6–11 in Selecting and refining, and steps 12–14 in Final preparation.

Every dtst command reads from and writes to buckets and tracks metadata in sidecars. See Concepts for details on both.

The directory structure

As you work through the steps, your working directory grows into this layout:

crowd.yaml
scratch/
  crowd/
    results.jsonl
    images/
      search1/
      search2/
      frames/
    videos/
    faces/
    cluster/
      000/
      001/
      noise/
    select/
      filtered/
      rejected/
      duplicated/
    final/
      1024/
      512/
      256/
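
To see which stages have produced output so far, it can help to count the files under each top-level folder of the working directory. The sketch below uses only standard shell tools. The helper name count_files_per_folder is hypothetical and not part of dtst, and a folder only shows up once the step that writes to it has run.

```shell
# Hypothetical helper (not part of dtst): print "<folder> <file count>"
# for each top-level folder of a working directory.
count_files_per_folder() {
  dir="$1"
  for sub in "$dir"/*/; do
    # Skip the unexpanded glob when the directory has no subfolders.
    [ -d "$sub" ] || continue
    # Count regular files at any depth; tr strips the padding some wc builds add.
    n=$(find "$sub" -type f | wc -l | tr -d ' ')
    printf '%s %s\n' "$(basename "$sub")" "$n"
  done
}

# Only report when the tutorial's working directory already exists.
if [ -d scratch/crowd ]; then
  count_files_per_folder scratch/crowd
fi
```

Top-level files such as results.jsonl are not listed, because the glob matches directories only.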

Prerequisites

Install dtst:

uv tool install git+https://github.com/Ucodia/dtst.git
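
After installing, you may want to confirm that the dtst executable is reachable from your shell. This is a minimal sketch using standard shell built-ins. The helper name on_path is hypothetical, and the hint about uv tool update-shell assumes the executable was installed but uv's tool directory is not yet on PATH.

```shell
# Hypothetical helper: succeed if the named command can be found on PATH.
on_path() {
  command -v "$1" >/dev/null 2>&1
}

if on_path dtst; then
  echo "dtst found at $(command -v dtst)"
else
  # uv places tool executables in its own bin directory;
  # `uv tool update-shell` adds that directory to PATH.
  echo "dtst not found on PATH; try: uv tool update-shell" >&2
fi
```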

Copy .env.example to .env and fill in the API keys for whichever search engines you want to use:

cp .env.example .env
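
Before running a search, it can be useful to check which keys are still blank. This is a rough sketch, assuming .env uses plain KEY=value lines with one assignment per line. The helper name empty_env_keys is hypothetical, and the actual key names come from .env.example, which is not reproduced here.

```shell
# Hypothetical helper: list variables in an env file that have no value.
# Assumes unquoted KEY=value lines; comments and blank lines are ignored.
empty_env_keys() {
  # `|| true` keeps the exit status clean when every key is filled in.
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=[[:space:]]*$' "$1" | cut -d= -f1 || true
}

if [ -f .env ]; then
  empty_env_keys .env
fi
```

Keys for engines you do not plan to use can stay empty, so treat the output as a reminder rather than an error.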

Create a working directory

Every dtst pipeline lives in a working directory. Create one for this tutorial:

mkdir -p scratch/crowd

All the commands in the following pages use -d scratch/crowd to point at this folder. You can also set working_dir in a config file to avoid repeating it (see Configuration).
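
As a sketch of the config-file option, the snippet below writes a file containing only the working_dir key. The key name comes from the text above. Its placement at the top level of the file, and the use of crowd.yaml (the name shown in the directory layout) as the config file, are assumptions, so check Configuration for the authoritative format.

```shell
# Create the working directory (-p makes this safe to repeat).
mkdir -p scratch/crowd

# Write a minimal config unless one already exists.
# Assumption: working_dir is a top-level key in the YAML config.
if [ ! -e crowd.yaml ]; then
  cat > crowd.yaml <<'EOF'
working_dir: scratch/crowd
EOF
fi
```

The guard avoids overwriting a config you have already started editing.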