Ben Van Calster, Bavo De Cock, Bart De Moor, Angelo Votino, Alessia di Legge, Thierry Van den Bosch, Arnaud Installé, Letizia Zannoni, Ligita Jokubkiene, Lil Valentin, Beryl Benacerraf, Povilas Sladkevicius, Dirk Timmerman
To estimate intra- and inter-rater agreement and reliability in describing ultrasound images of the endometrium using the International Endometrial Tumor Analysis (IETA) terminology.
Four expert and four non-expert raters assessed video clips of transvaginal ultrasound examinations of the endometrium from 99 women with postmenopausal bleeding, sonographic endometrial thickness ≥4.5 mm, and no fluid in the uterine cavity. The following features were rated: endometrial echogenicity (nine categories), endometrial midline (four categories), bright edge (yes, no), endometrial-myometrial junction (four categories), color score (1 to 4), vascular pattern (seven categories), irregularly branching vessels (yes, no), and color splashes (yes, no). The color content of the endometrial scan was estimated using a visual analogue scale (VAS) graded from 0 to 100. Each rater assessed the clips twice, more than 2 months apart. The raters were blinded to their own previous results and to those of the other raters.
Inter-rater differences in the prevalence of most IETA variables were substantial, and some variable categories were rare. Specific agreement was poor for variables with many categories. For binary variables, specific agreement was better for absence than for presence of a category. For variables with more than two outcome categories, specific agreement was best for undefined endometrial midline (93% and 96% for expert and non-expert raters, respectively), regular endometrial-myometrial junction (72% and 70%), and three-layer endometrial pattern (67% and 56%). The most reliable gray-scale ultrasound variable was uniform versus non-uniform echogenicity (multirater kappa, κ, 0.55 and 0.52 for expert and non-expert raters); the least reliable were the appearance of the endometrial-myometrial junction (κ 0.25 and 0.16) and the nine-category endometrial echogenicity variable (κ 0.29 and 0.28). The most reliable color Doppler variable was color score (mean weighted κ 0.77 and 0.69). Intra- and inter-rater agreement and reliability were similar for experts and non-experts.
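To illustrate how specific agreement and kappa can diverge for a rare binary feature, the following minimal sketch computes both for two raters. It is a simplification under stated assumptions: the study reports multirater and weighted kappa across eight raters, whereas this example uses plain two-rater Cohen's kappa. The function names and the ratings are hypothetical toy values, not data from the study.

```python
import math


def specific_agreement(r1, r2, category):
    """Proportion of specific agreement for one category between two raters.

    Computed as 2 * (cases both raters assign to the category) divided by
    the total number of times either rater used the category.
    """
    both = sum(1 for a, b in zip(r1, r2) if a == category and b == category)
    total = r1.count(category) + r2.count(category)
    return 2 * both / total if total else math.nan


def cohen_kappa(r1, r2):
    """Unweighted Cohen's kappa for two raters (chance-corrected agreement)."""
    n = len(r1)
    categories = set(r1) | set(r2)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n
    p_exp = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    if p_exp == 1:  # both raters used a single identical category throughout
        return math.nan
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical ratings of a rare binary feature on 20 clips (toy data).
rater_a = ["yes"] * 4 + ["no"] * 16
rater_b = ["yes"] * 2 + ["no"] * 2 + ["yes"] * 2 + ["no"] * 14

agree_present = specific_agreement(rater_a, rater_b, "yes")
agree_absent = specific_agreement(rater_a, rater_b, "no")
kappa = cohen_kappa(rater_a, rater_b)

print(f"specific agreement (present): {agree_present:.3f}")
print(f"specific agreement (absent):  {agree_absent:.3f}")
print(f"Cohen's kappa:                {kappa:.3f}")
```

In this toy example the raters agree on 16 of 20 clips, yet agreement on presence is far lower than agreement on absence because the feature is rare, and kappa is modest once chance agreement is removed.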
Agreement and reliability when using IETA terminology were limited. This may have implications when assessing the association between a particular ultrasound feature and a specific histological diagnosis, because lack of reproducibility weakens the observed relationship between a feature and the outcome. Future studies should investigate whether using fewer variable categories or offering practical training could improve agreement and reliability.