The challenge: predict per-pixel foreground/background masks given an image and sparse user "scribbles" (small markings with bg=0 / fg=1; most pixels are unlabeled). Below are the methods we tried, in chronological order. Train mIoU is the average over 228 training images, computed honestly out-of-fold for the deep models.
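For reference, the per-image mIoU metric over the two classes can be sketched as follows (a minimal sketch assuming binary 0/1 masks; the project's exact evaluation code is not shown):

```python
import numpy as np

def miou(pred, gt, classes=(0, 1)):
    """Mean IoU over background/foreground for one image.

    pred, gt: integer arrays of the same shape with values in `classes`.
    Classes absent from both pred and gt are skipped (empty union).
    """
    ious = []
    for c in classes:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[1, 1],
                 [0, 0]])
gt   = np.array([[1, 0],
                 [0, 0]])
# fg IoU = 1/2, bg IoU = 2/3, so mIoU = 7/12
print(miou(pred, gt))
```

The train score reported above is this quantity averaged over all 228 images.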
Big-picture progression: the project's reported baseline (KNN, k=11) achieved 0.499 mIoU. Switching from per-image scribble-only learning to a globally trained U-Net with full ground-truth supervision was the single biggest jump (+0.29). CutMix augmentation, model ensembling, and finally pseudo-labeling each added a few more points, for a total of 0.499 → 0.843.
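The KNN baseline can be sketched roughly as follows. This is an illustrative reconstruction, not the project's code: the feature vector here (row, col, pixel intensity) and the brute-force distance computation are assumptions; only k=11 comes from the text.

```python
import numpy as np

def knn_scribble_baseline(image, scribbles, k=11):
    """Per-image KNN baseline (sketch): label every pixel by the majority
    vote of its k nearest scribbled pixels in feature space.

    image:     (H, W) float array of pixel intensities.
    scribbles: (H, W) int array with 0 = bg scribble, 1 = fg scribble,
               -1 = unlabeled (most pixels).
    Features are assumed to be (row, col, intensity); the real baseline
    may use a richer feature set.
    """
    h, w = image.shape
    rr, cc = np.mgrid[0:h, 0:w]
    feats = np.stack([rr.ravel(), cc.ravel(), image.ravel()], axis=1).astype(float)
    labeled = scribbles.ravel() >= 0
    train_x, train_y = feats[labeled], scribbles.ravel()[labeled]
    # Brute-force distances from every pixel to every labeled pixel.
    d = np.linalg.norm(feats[:, None, :] - train_x[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]          # indices of k nearest scribbles
    votes = train_y[nn].mean(axis=1)           # fraction of fg neighbors
    return (votes >= 0.5).astype(int).reshape(h, w)
```

Because each image is solved independently from a handful of labeled pixels, this caps out well below the globally trained U-Net, which sees full masks across the whole training set.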