Tutorial

This tutorial walks you through building a curated image dataset from scratch. By the end you will have searched for images, downloaded them alongside video footage, clustered them by similarity, refined their quality, and produced final resized outputs, all within a single working directory.

The example uses "crowd" as the subject. Along the way the tutorial shows three paths through the pipeline: one that extracts and clusters faces by identity (using ArcFace), one that clusters images directly by visual similarity (using CLIP), and one that detects and crops specific object classes (using OWL-ViT). Pick whichever fits your use case, or combine them.

What you will build

A curated image dataset starting from the search term "crowd". The pipeline uses 17 tools:

  1. Search for images across multiple engines
  2. Fetch images and videos from the collected URLs
  3. Extract frames from downloaded videos
  4. Extract faces from all collected images (optional — for face datasets) or extract classes from object detections (optional — for object datasets)
  5. Cluster by similarity (ArcFace for faces, CLIP for general images)
  6. Copy selected clusters into a curation folder
  7. Analyze images for metadata (hashes, blur scores)
  8. Filter out low-quality images
  9. Review manually with the web UI
  10. Dedup to remove near-duplicates
  11. Augment with transformations to increase diversity
  12. Upscale images to higher resolution
  13. Rename images with sequential, prefixed filenames
  14. Format (normalize) image formats, channels, and metadata
  15. Frame (resize) for the final dataset
  16. Validate that the final dataset is consistent

Steps 1–3 are covered in Collecting images, steps 4–5 in Extracting features, steps 6–11 in Selecting and refining, and steps 12–16 in Final preparation.

Every dtst command reads from and writes to buckets and tracks metadata in sidecars. See Concepts for details on both.

The directory structure

As you work through the steps, your working directory grows into this layout:

crowd.yaml
scratch/
  crowd/
    results.jsonl
    images/
      search1/
      search2/
      frames/
    videos/
    faces/
    cluster/
      000/
      001/
      noise/
    select/
      filtered/
      rejected/
      duplicated/
    final/
      formatted/
      1024/
      512/
      256/
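
To see how far along the pipeline a working directory is, a small check like the following can help. It is only a sketch: the folder names are taken from the layout above, and it assumes you run it from the directory that contains scratch/.

```shell
# Report which top-level folders from the layout exist so far.
# Assumption: the current directory is the one containing scratch/.
root="scratch/crowd"
present=0
missing=0
for dir in images videos faces cluster select final; do
  if [ -d "$root/$dir" ]; then
    echo "present: $root/$dir"
    present=$((present + 1))
  else
    echo "missing: $root/$dir"
    missing=$((missing + 1))
  fi
done
echo "$present present, $missing not created yet"
```

Folders reported as missing are expected early on, since each one appears only after its step has run.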

Prerequisites

Install dtst:

uv tool install git+https://github.com/Ucodia/dtst.git
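
To confirm the install worked, a quick check along these lines may be useful. It is a sketch that assumes the installed executable is named dtst (this page refers to "dtst commands" but does not state the entry-point name explicitly).

```shell
# Assumption: the package installs an executable named "dtst".
if command -v dtst >/dev/null 2>&1; then
  dtst_status="found"
  echo "dtst found at $(command -v dtst)"
else
  dtst_status="missing"
  # uv's tool directory may not be on PATH yet; this uv subcommand adds it.
  echo "dtst is not on PATH; try running: uv tool update-shell"
fi
```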

Copy .env.example to .env and fill in the API keys for whichever search engines you want to use:

cp .env.example .env
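
After editing .env, it can be handy to list which keys are still blank. The helper below is a hypothetical sketch, not part of dtst, and it assumes .env uses plain KEY=value lines; the actual key names come from .env.example.

```shell
# Assumption: .env holds KEY=value lines; list_empty_keys is a made-up helper name.
list_empty_keys() {
  # Print the names of keys whose value is blank, ignoring comment lines.
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=[[:space:]]*$' "$1" | cut -d= -f1 || true
}

if [ -f .env ]; then
  echo "Keys still empty in .env:"
  list_empty_keys .env
else
  echo ".env not found; run the cp command above first"
fi
```

Keys for search engines you do not plan to use can stay empty.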

Create a working directory

Every dtst pipeline lives in a working directory. Create one for this tutorial:

mkdir -p scratch/crowd

All the commands in the following pages use -d scratch/crowd to point at this folder. You can also set working_dir in a config file to avoid repeating it (see Configuration).
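
If you prefer not to retype the path while following along, one option is to keep it in a shell variable and pass it to -d. This is a sketch only: DTST_DIR is a hypothetical variable name chosen for illustration, and dtst itself does not read it.

```shell
# Assumption: DTST_DIR is just a local convenience variable, not a dtst setting.
DTST_DIR="scratch/crowd"
mkdir -p "$DTST_DIR"   # harmless if the directory already exists
echo "Pass this to each command as: -d $DTST_DIR"
```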