The segauge medical segmentation leaderboard
Every score with a confidence interval and a ranking-stability test. Sliced by failure mode. On public data. Reproducible with one command.
Most segmentation leaderboards report a single Dice number with no interval, so you cannot tell a real lead from sampling noise. The medical-imaging metrics community has shown this repeatedly: removing one test case flips most teams' ranks in a majority of challenges, and the reported "winner" often sits inside the runner-up's confidence interval. This leaderboard is that critique turned into a running tool.
Preview (GPU via Modal) - whole-kidney on KiTS23, N=20
This is an early, small-N preview that demonstrates the methodology end to end on real, public data. The wide intervals are the point: they show how little a bare leaderboard number tells you at this sample size. Scaling to more models and organs is the next step.
Dataset: kits23 (20 cases, ground-truth labels CC-BY-NC-SA-4.0). Scored with segauge v0.2.0 (95% bootstrap CI, 2000 resamples, seed 0). See Methodology and Reproduce.
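The interval settings above (95% level, 2000 resamples, seed 0) can be illustrated with a percentile bootstrap over per-case scores. This is a minimal sketch, not segauge's implementation: the percentile method, the function name `bootstrap_ci`, and the Dice values are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = scores.size
    # Resample case indices with replacement: one row per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_resamples, n))
    means = scores[idx].mean(axis=1)
    alpha = 1.0 - level
    lo, hi = np.quantile(means, [alpha / 2, 1.0 - alpha / 2])
    return float(scores.mean()), float(lo), float(hi)

# Hypothetical per-case Dice values for 20 cases (NOT the KiTS23 results).
dice = np.array([0.93, 0.91, 0.88, 0.95, 0.90, 0.72, 0.94, 0.89, 0.92, 0.96,
                 0.85, 0.91, 0.93, 0.79, 0.94, 0.90, 0.87, 0.95, 0.92, 0.89])

mean, lo, hi = bootstrap_ci(dice)
print(f"{mean:.3f} [{lo:.3f}, {hi:.3f}]")
```

With only 20 cases, a single poor case (such as the 0.72 above) moves the lower bound noticeably, which is why the intervals in this preview are wide.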
A leaderboard you can run yourself: pip install segauge and one segbench run.
Kidney
| Model | Fair | Dice | HD95 (mm) | NSD | ASSD (mm) |
|---|---|---|---|---|---|
| TotalSegmentator | yes | 0.904 [0.881, 0.922] | 6.13 [4.37, 8.24] | 0.824 [0.778, 0.863] | 1.5 [1.14, 1.95] |
| CT-FM | yes | 0.891 [0.848, 0.92] | 7.24 [4.88, 10.4] | 0.805 [0.752, 0.85] | 1.79 [1.25, 2.47] |
| MOOSE | yes | 0.905 [0.879, 0.928] | 6.58 [4.24, 9.36] | 0.834 [0.786, 0.874] | 1.54 [1.07, 2.1] |
Ranking stability (by Dice, paired bootstrap over n=20 cases):
- MOOSE: 0.905 (P(rank 1) = 64%, mean rank 1.4 [1, 3])
- TotalSegmentator: 0.904 (P(rank 1) = 36%, mean rank 1.7 [1, 3])
- CT-FM: 0.891 (P(rank 1) = 0%, mean rank 2.9 [2, 3])
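The rank probabilities above can be sketched as a paired bootstrap: resample the same case indices for every model, recompute each model's mean, and record the resulting ranks. The code below is an illustrative sketch under stated assumptions, not segauge's code: `rank_stability`, the model names, and the synthetic scores are hypothetical, and ties are broken by column order.

```python
import numpy as np

def rank_stability(scores, n_resamples=2000, seed=0):
    """scores: dict of model name -> per-case scores on the SAME cases (higher is better)."""
    names = list(scores)
    mat = np.column_stack([np.asarray(scores[m], dtype=float) for m in names])
    rng = np.random.default_rng(seed)
    n_cases = mat.shape[0]
    # Paired resampling: each replicate uses one set of case indices for all models.
    idx = rng.integers(0, n_cases, size=(n_resamples, n_cases))
    means = mat[idx].mean(axis=1)            # shape (n_resamples, n_models)
    order = np.argsort(-means, axis=1)       # best model first in each replicate
    ranks = np.argsort(order, axis=1) + 1    # rank 1 = highest mean
    out = {}
    for j, name in enumerate(names):
        r = ranks[:, j]
        lo, hi = np.quantile(r, [0.025, 0.975])
        out[name] = {
            "p_rank1": float((r == 1).mean()),
            "mean_rank": float(r.mean()),
            "rank_interval": (float(lo), float(hi)),
        }
    return out

# Synthetic per-case scores for three hypothetical models on 20 shared cases.
data_rng = np.random.default_rng(1)
base = data_rng.uniform(0.80, 0.95, size=20)
scores = {
    "model_a": np.clip(base + data_rng.normal(0, 0.02, 20), 0, 1),
    "model_b": np.clip(base + data_rng.normal(0, 0.02, 20), 0, 1),
    "model_c": np.clip(base - 0.03 + data_rng.normal(0, 0.02, 20), 0, 1),
}

stability = rank_stability(scores)
for name, s in stability.items():
    print(name, s)
```

Because every replicate assigns each rank exactly once, the P(rank 1) values sum to 1 and the mean ranks of three models sum to 6, which is a quick sanity check on any such table.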
Statistical separability:
- MOOSE and TotalSegmentator are not statistically separable (Δ=0.00137, CI includes 0)
- MOOSE and CT-FM are not statistically separable (Δ=0.0148, CI includes 0)
- TotalSegmentator and CT-FM are not statistically separable (Δ=0.0135, CI includes 0)
Pairwise intervals use a Bonferroni family-wise correction over 3 comparisons: each interval is computed at the 98.3% level (1 − 0.05/3), so the three intervals jointly hold at least 95% coverage.
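The separability test can be sketched as a paired bootstrap on per-case score differences, with the per-comparison level set by the Bonferroni split. This is a hedged illustration only: `pairwise_diff_cis`, the model names, and the scores are hypothetical, and the percentile interval is an assumption rather than a confirmed detail of segauge.

```python
from itertools import combinations
import numpy as np

def pairwise_diff_cis(scores, family_level=0.95, n_resamples=2000, seed=0):
    """Bonferroni-corrected bootstrap CIs for paired mean differences (hypothetical helper)."""
    names = list(scores)
    pairs = list(combinations(names, 2))
    # Bonferroni: split the family-wise error rate evenly across all comparisons.
    alpha_each = (1.0 - family_level) / len(pairs)
    rng = np.random.default_rng(seed)
    n_cases = len(scores[names[0]])
    idx = rng.integers(0, n_cases, size=(n_resamples, n_cases))
    results = {}
    for a, b in pairs:
        # Paired design: difference per case, then bootstrap the mean difference.
        diff = np.asarray(scores[a], dtype=float) - np.asarray(scores[b], dtype=float)
        boot = diff[idx].mean(axis=1)
        lo, hi = np.quantile(boot, [alpha_each / 2, 1.0 - alpha_each / 2])
        results[(a, b)] = {
            "delta": float(diff.mean()),
            "ci": (float(lo), float(hi)),
            "separable": not (lo <= 0.0 <= hi),
        }
    return results, 1.0 - alpha_each

# Hypothetical scores: model_a and model_b trade wins case by case; model_c is clearly lower.
base = np.linspace(0.80, 0.95, 20)
wiggle = np.where(np.arange(20) % 2 == 0, 0.01, -0.01)
scores = {"model_a": base + wiggle, "model_b": base - wiggle, "model_c": base - 0.10}

results, level_each = pairwise_diff_cis(scores)
print(f"per-comparison level: {level_each:.3f}")
for pair, r in results.items():
    print(pair, r)
```

Working on per-case differences rather than on two independent intervals is what makes the comparison paired: case difficulty shared by both models cancels out of the difference.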
Contamination policy: a model is ranked only on datasets it was not trained on. Entries marked "no" in the Fair column are shown for context but excluded from ranking.
Dataset citation: Heller et al., The KiTS23 Challenge (2023). Source: https://github.com/neheller/kits23.