The segauge medical segmentation leaderboard

Every score with a confidence interval and a ranking-stability test. Sliced by failure mode. On public data. Reproducible with one command.

Most segmentation leaderboards report a single Dice number with no interval, so you cannot tell a real lead from sampling noise. The medical-imaging metrics community has shown this repeatedly: removing one test case flips most teams' ranks in a majority of challenges, and the reported "winner" often sits inside the runner-up's confidence interval. This leaderboard is that critique turned into a running tool.

Preview (GPU via Modal) - whole-kidney on KiTS23, N=20

This is an early, small-N preview that demonstrates the methodology end to end on real, public data. The wide intervals are the point: they show how little a bare leaderboard number tells you at this sample size. Scaling to more models and organs is the next step.

Dataset: kits23 (20 cases; ground-truth labels licensed under CC-BY-NC-SA-4.0). Scored with segauge v0.2.0 (95% bootstrap CI, 2000 resamples, seed 0). See Methodology and Reproduce.
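To make the "95% bootstrap CI, 2000 resamples, seed 0" setting concrete, here is a minimal sketch of a percentile bootstrap over per-case scores. It is an illustration under stated assumptions, not segauge's implementation: the function name `bootstrap_ci`, the percentile method, and the Dice values are all hypothetical.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores.

    Hypothetical helper for illustration; segauge may use a different
    interval construction.
    """
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping the sample size at n.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    alpha = 1.0 - level
    lo = means[int((alpha / 2) * (n_resamples - 1))]
    hi = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return sum(scores) / n, lo, hi

# Made-up per-case Dice values for 20 cases (NOT the leaderboard's data).
dice = [0.93, 0.91, 0.95, 0.88, 0.92, 0.79, 0.94, 0.90, 0.96, 0.87,
        0.93, 0.91, 0.85, 0.94, 0.92, 0.89, 0.95, 0.90, 0.83, 0.93]

mean, lo, hi = bootstrap_ci(dice)
print(f"Dice {mean:.3f} [{lo:.3f}, {hi:.3f}]")
```

With only 20 cases, one or two hard cases move the resampled mean noticeably, which is why the intervals in the table are wide.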

A leaderboard you can run yourself: pip install segauge and one segbench run.

Kidney

Model             Fair  Dice                  HD95 (mm)          NSD                   ASSD (mm)
TotalSegmentator  yes   0.904 [0.881, 0.922]  6.13 [4.37, 8.24]  0.824 [0.778, 0.863]  1.5 [1.14, 1.95]
CT-FM             yes   0.891 [0.848, 0.92]   7.24 [4.88, 10.4]  0.805 [0.752, 0.85]   1.79 [1.25, 2.47]
MOOSE             yes   0.905 [0.879, 0.928]  6.58 [4.24, 9.36]  0.834 [0.786, 0.874]  1.54 [1.07, 2.1]

Ranking stability (by Dice, paired bootstrap over n=20 cases):

  • MOOSE: 0.905 (P(rank 1) = 64%, mean rank 1.4 [1, 3])
  • TotalSegmentator: 0.904 (P(rank 1) = 36%, mean rank 1.7 [1, 3])
  • CT-FM: 0.891 (P(rank 1) = 0%, mean rank 2.9 [2, 3])
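The ranking-stability figures above can be understood through a paired bootstrap: each replicate draws one set of case indices, applies it to every model, and re-ranks the models on that resample. The sketch below assumes this scheme; `ranking_stability`, the model names, and the scores are hypothetical and are not taken from segauge.

```python
import random

def ranking_stability(per_case, n_resamples=2000, seed=0):
    """Estimate P(rank 1) and mean rank per model by paired bootstrap.

    per_case maps model name -> list of per-case scores, aligned by case.
    Hypothetical helper; higher score is treated as better.
    """
    rng = random.Random(seed)
    names = list(per_case)
    n = len(per_case[names[0]])
    first = {m: 0 for m in names}
    rank_sum = {m: 0 for m in names}
    for _ in range(n_resamples):
        # Paired: the SAME resampled case indices are used for all models.
        idx = [rng.randrange(n) for _ in range(n)]
        means = {m: sum(per_case[m][i] for i in idx) / n for m in names}
        # Exact ties are broken by insertion order (sorted() is stable).
        ordered = sorted(names, key=lambda m: means[m], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            rank_sum[m] += rank
            if rank == 1:
                first[m] += 1
    return {m: {"p_rank1": first[m] / n_resamples,
                "mean_rank": rank_sum[m] / n_resamples} for m in names}

# Made-up scores for three hypothetical models on eight cases.
model_a = [0.92, 0.90, 0.95, 0.85, 0.93, 0.88, 0.91, 0.94]
model_b = [0.93, 0.89, 0.94, 0.87, 0.92, 0.89, 0.90, 0.95]
model_c = [x - 0.02 for x in model_a]  # worse than model_a on every case

result = ranking_stability(
    {"model_a": model_a, "model_b": model_b, "model_c": model_c})
for name, stats in result.items():
    print(name, stats)
```

Pairing matters here: because every model is scored on the same resampled cases, case difficulty cancels out of the comparison and only the between-model differences drive rank changes.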

Statistical separability:

  • MOOSE and TotalSegmentator are not statistically separable (Δ=0.00137, CI includes 0)
  • MOOSE and CT-FM are not statistically separable (Δ=0.0148, CI includes 0)
  • TotalSegmentator and CT-FM are not statistically separable (Δ=0.0135, CI includes 0)

Pairwise intervals use a Bonferroni family-wise correction over 3 comparisons (each computed at 98.3% so the family holds 95%).
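As a worked example of the correction: with a 95% family level and 3 comparisons, each pairwise interval is computed at 1 - 0.05/3, roughly 98.3%. The sketch below pairs that arithmetic with a bootstrap interval on the per-case score difference and calls two models separable only when the interval excludes 0. The helper names and data are hypothetical, and the percentile interval is an assumption rather than segauge's documented method.

```python
import random

def bonferroni_level(family_level, n_comparisons):
    """Per-comparison confidence level under a Bonferroni correction."""
    return 1.0 - (1.0 - family_level) / n_comparisons

def paired_diff_ci(a, b, level, n_resamples=2000, seed=0):
    """Percentile bootstrap CI for the mean per-case difference a - b.

    Hypothetical helper; a and b must be aligned by case.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    boot = []
    for _ in range(n_resamples):
        # Resampling the differences keeps each case's pair together.
        boot.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    boot.sort()
    alpha = 1.0 - level
    lo = boot[int((alpha / 2) * (n_resamples - 1))]
    hi = boot[int((1 - alpha / 2) * (n_resamples - 1))]
    return sum(diffs) / n, lo, hi

level = bonferroni_level(0.95, 3)
print(round(level, 4))  # 0.9833

# Made-up per-case Dice for two hypothetical models (not leaderboard data).
a = [0.92, 0.90, 0.95, 0.85, 0.93, 0.88, 0.91, 0.94]
b = [0.93, 0.89, 0.94, 0.87, 0.92, 0.89, 0.90, 0.95]

delta, lo, hi = paired_diff_ci(a, b, level)
separable = not (lo <= 0.0 <= hi)
print(f"delta={delta:.4f} CI=[{lo:.4f}, {hi:.4f}] separable={separable}")
```

Bonferroni is conservative: widening each interval to 98.3% guarantees the family-wise level at the cost of some power, which is a reasonable trade when only three comparisons are involved.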


Contamination policy: a model is ranked only on a dataset it was not trained on. Cells marked "no" in the Fair column are shown for context but excluded from ranking.

Dataset citation: Heller et al., The KiTS23 Challenge (2023). Source: https://github.com/neheller/kits23