Differential diagnosis and mutation stratification of desmoid-type fibromatosis on MRI using radiomics.

PURPOSE
Diagnosing desmoid-type fibromatosis (DTF) requires an invasive tissue biopsy with β-catenin staining and CTNNB1 mutational analysis, and is challenging due to its rarity. The aim of this study was to evaluate radiomics for distinguishing DTF from soft tissue sarcomas (STS), and in DTF, for predicting the CTNNB1 mutation types.


METHODS
Patients with histologically confirmed extremity STS (non-DTF) or DTF and at least a pretreatment T1-weighted (T1w) MRI scan were retrospectively included. Tumors were semi-automatically annotated on the T1w scans, from which 411 features were extracted. Prediction models were created using a combination of various machine learning approaches. Evaluation was performed through a 100x random-split cross-validation. The model for DTF vs. non-DTF was compared to classification by two radiologists on a location matched subset.


RESULTS
The data included 203 patients (72 DTF, 131 STS). The T1w radiomics model showed a mean AUC of 0.79 on the full dataset. Addition of T2w or T1w post-contrast scans did not improve the performance. On the location-matched cohort, the T1w model had a mean AUC of 0.88, while the radiologists had AUCs of 0.80 and 0.88, respectively. For the prediction of the CTNNB1 mutation types (S45F, T41A and wild-type), the T1w model showed AUCs of 0.61, 0.56, and 0.74, respectively.


CONCLUSIONS
Our radiomics model was able to distinguish DTF from STS with high accuracy similar to two radiologists, but was not able to predict the CTNNB1 mutation status.


Introduction
Sporadic desmoid-type fibromatosis (DTF) is a rare borderline, soft tissue tumor arising in musculoaponeurotic structures [1]. Worldwide epidemiological data is lacking, but population studies in Scandinavia and the Netherlands show a low incidence of 2.4-5.4 cases per million per year [2,3]. Early recognition and diagnosis of DTF is therefore challenging.
On MRI, DTF can display a wide variety of enhancement patterns [4]. DTF has imaging characteristics that are often associated with soft tissue sarcomas (STS), such as crossing of fascial boundaries, an invasive growth pattern, little central necrosis, mild hyperintensity on T1-weighted (T1w) MRI, and a hyperintense, heterogeneous appearance with hypointense bands on T2-weighted (T2w) MRI [5]. Hence, the distinction between DTF and STS, i.e. non-DTF, can be difficult. An invasive tissue biopsy, with additional immunohistochemical staining for β-catenin and mutation analysis of the CTNNB1 (β-catenin) gene, is therefore currently required to differentiate DTF from non-DTF [6].
As DTF is a borderline tumor that is unable to metastasize and requires a different treatment regimen than malignant STS, this distinction is highly relevant. Differentiation between DTF and STS based on imaging would be beneficial because the rarity of DTF makes clinical and pathological recognition challenging. Furthermore, DTF exhibits an aggressive growth pattern, and growth might be stimulated after (surgical) trauma, including biopsies [7]. Avoiding (multiple) harmful biopsies that potentially cause tumor growth is therefore of great importance.
Several studies have addressed the prognostic role of the CTNNB1 mutation in DTF [8][9][10], as serine 45 (S45F) tumors appear to have a higher risk of recurrence after surgery compared to threonine 41 (T41A) and wild type (WT) (i.e. no CTNNB1 mutation [11]) tumors [12]. The CTNNB1 mutation status is obtained for diagnostic purposes and to guide the clinical work-up but, for now, has no therapeutic consequences [13]. The majority of DTF harbors a CTNNB1 mutation at either T41A or S45F [8]. Assessment of the mutation status is currently done by Sanger Sequencing or Next Generation Sequencing, which are time consuming and expensive.
In radiomics, large amounts of quantitative imaging features are related to clinical outcome [14]. Radiomics may serve as a non-invasive surrogate to contribute to diagnosis, prognosis and treatment planning [15,16]. Based on the results of previous studies in cancer [17], we hypothesized that radiomics may also be useful in DTF.
This study investigated whether a radiomics model based on MRI is able to 1) distinguish DTF from non-DTF in the extremities, and 2) to predict the CTNNB1 mutation status of DTF. Additionally, in the DTF vs. non-DTF distinction, we evaluated which of the included MRI sequences has the highest predictive value.

Data collection
Approval by the Erasmus Medical Center (MC) institutional review board (MEC-2016-339) was obtained. Patients diagnosed at or referred to the Erasmus MC between 1990 and 2018 with a histologically proven primary or recurrent DTF were included. This resulted in a multicenter imaging dataset, as patients referred to our sarcoma expert institute often received imaging at their referring hospital. The most frequently used imaging modality prior to treatment was T1w-MRI, and its availability was used as an inclusion criterion [18]. When available, other sequences such as T2w, T1w post-contrast, dynamic contrast-enhanced (DCE), proton density (PD) and diffusion-weighted imaging (DWI) MRI were collected.
For the differential diagnosis (DTF vs. non-DTF), histologically confirmed malignant extremity STS were included. Benign STS were excluded, because this distinction is clinically less relevant. Non-extremity STS were excluded because of the infrequent use of MRI. Although DTF tumors commonly occur in the abdominal wall, their differential diagnosis is broad and includes pseudo-tumors such as myositis, nodular fasciitis and hematomas, and tumors such as lipomas, STS, endometriosis, carcinomas, lymphomas and metastases [19]. Hence, we decided to focus on the distinction between DTF and STS, and included patients with a histologically proven primary fibromyxosarcoma, myxoid liposarcoma or leiomyosarcoma of the extremities. Similar to the DTF, patients with at least a pre-treatment T1w-MRI were retrospectively included.
Sex, age at diagnosis, and tumor location were collected. For the DTF, in case of a missing CTNNB1 mutation status, Sanger Sequencing was performed after review of formalin-fixed paraffin-embedded tumor sections by a pathologist. Cases with a known CTNNB1 mutation did not undergo additional review by a pathologist. Poor scan quality (e.g. artifacts), poor DTF DNA quality with failure of sequencing, and CTNNB1 mutation other than S45F, T41A or WT led to exclusion.

Radiomics feature extraction
The tumors were all manually segmented once on the T1w-MRI by one of two clinicians under supervision of a musculoskeletal radiologist (4 years of experience). A subset of 30 DTF was segmented by both clinicians, in which inter-observer variability was evaluated through the pairwise Dice Similarity Coefficient (DSC), with DSC > 0.70 indicating good agreement [20]. To transfer the segmentations to the other sequences, all sequences were automatically aligned to the T1w-MRI using image registration with the Elastix software [21]. For each lesion, per MRI sequence, 411 features quantifying intensity, shape and texture were extracted. Details can be found in Appendix A and Table A.2.
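As an illustration of the agreement measure used above, the following minimal sketch computes the DSC between two binary segmentations. It is not the study's implementation: the flattened 0/1 mask representation, the function name `dice_coefficient`, the toy masks, and the convention for two empty masks are all assumptions made for this example.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice Similarity Coefficient between two binary masks.

    Masks are flat sequences of 0/1 voxel labels (an assumed representation).
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must contain the same number of voxels")
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * overlap / (size_a + size_b)


# Hypothetical toy segmentations from two observers.
observer_1 = [0, 1, 1, 1, 0, 0]
observer_2 = [0, 0, 1, 1, 1, 0]
dsc = dice_coefficient(observer_1, observer_2)
good_agreement = dsc > 0.70  # threshold taken from the text
```

In this toy case the two masks share two of their three foreground voxels, so the DSC falls below the 0.70 threshold.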

Decision model creation
To create a decision model from the features, the WORC toolbox was used, see Fig. 1 [22][23][24]. In WORC, the decision model creation consists of several steps, e.g. feature selection, resampling, and machine learning. WORC performs an automated search amongst a variety of algorithms for each step and determines which combination of algorithms maximizes the prediction performance on the training set. More details can be found in Appendix B.
For the differential diagnosis cohort, a binary classification model was created using a variety of machine learning models. For the DTF cohort (predicting the CTNNB1 mutation), a multiclass classification model was created using random forests.

Evaluation
Evaluation of all models was done through a 100x random-split cross-validation. In each iteration, the data was randomly split in 80 % for training and 20 % for testing in a stratified manner, to make sure the distribution of the classes in all sets was similar to the original (Fig. A.1). Within the training set, model optimization was performed using an internal cross-validation (5x). Hence, all optimization was done on the training set to eliminate any risk of overfitting on the test set.
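The outer loop of this evaluation scheme can be sketched as follows. This is a simplified illustration in plain Python, not the WORC implementation: the helper `stratified_split`, the rounding rule for the per-class test size, and the random seed are assumptions, and the internal 5x cross-validation used for model optimization is deliberately left out.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, rng):
    """Split sample indices into train/test while preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Per-class test size, rounded to the nearest integer (assumed rule).
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Class counts taken from the study: 72 DTF and 131 non-DTF.
labels = ["DTF"] * 72 + ["non-DTF"] * 131
rng = random.Random(0)  # arbitrary seed, for reproducibility of the sketch

# 100 random splits, each with 80 % training and 20 % testing data.
splits = [stratified_split(labels, 0.20, rng) for _ in range(100)]
```

In each iteration, a model would be optimized and fitted on the training indices only and then scored once on the held-out test indices.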
Performance was evaluated using the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve, balanced classification accuracy (BCA), sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV). For the multiclass models, we reported the multiclass AUC [25] and overall BCA [26]. The positive classes included: DTF in the differential diagnosis, and the presence of the mutation in the mutation analysis. The 95 % confidence intervals were constructed using the corrected resampled t-test, thereby taking into account that the samples in the cross-validation splits are not statistically independent [27]. Both the mean and the confidence intervals are reported. ROC confidence bands were constructed using fixed-width bands [28].
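For readers unfamiliar with the corrected resampled t-test, the sketch below shows one common way to turn it into a confidence interval: the sample variance of the per-iteration scores is scaled by (1/k + n_test/n_train) instead of 1/k, where k is the number of splits. The function name, the example scores, and the default critical value (approximately 1.984, the two-sided 95 % t quantile for 99 degrees of freedom) are assumptions for illustration and may differ in detail from the authors' code.

```python
import math
import statistics


def corrected_resampled_ci(scores, n_train, n_test, t_crit=1.984):
    """Confidence interval for the mean score over random train/test splits.

    The variance term is inflated by n_test / n_train to account for the
    overlap between the splits. `t_crit` is the two-sided critical value of
    the t distribution with len(scores) - 1 degrees of freedom; the default
    is an approximation for 100 splits (assumed here).
    """
    k = len(scores)
    mean = statistics.fmean(scores)
    variance = statistics.variance(scores)  # sample variance, k - 1 denominator
    corrected_se = math.sqrt((1.0 / k + n_test / n_train) * variance)
    return mean, mean - t_crit * corrected_se, mean + t_crit * corrected_se


# Hypothetical AUCs from four splits with 160 training and 40 test samples.
example_scores = [0.7, 0.8, 0.9, 0.8]
mean_auc, ci_low, ci_high = corrected_resampled_ci(
    example_scores, n_train=160, n_test=40, t_crit=2.0
)
```

Because of the extra n_test/n_train term, the resulting interval is wider than the one obtained by treating the splits as independent samples.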
To assess the predictive value of the various features, models were trained based on: 1) volume; 2) age and sex; 3) T1w-MRI imaging; 4) T1w-MRI imaging, age and sex. Model 1 was created to verify that the imaging models were not solely based on volume. Model 2 was created to evaluate potential age and sex biases. In model 4, the imaging and clinical characteristics are combined by using both the imaging features and age and sex as features, for a total of 413 features. This allows WORC to determine how best to combine the imaging and clinical characteristics. Additionally, a model was made for each combination of T1w-MRI and one of the other included MRI sequences (e.g. based on T1w-MRI and T2w-MRI) to evaluate the added value of these other sequences. When a sequence was missing for a patient, feature imputation was used to estimate the missing values.
The code for the feature extraction, model creation and evaluation has been published open-source [29].

Model insight
To explore the predictive value of individual features, the Mann-Whitney U univariate statistical test was used. P-values were corrected for multiple testing using the Bonferroni correction, and were considered statistically significant at a p-value < 0.05. Feature robustness to variations in the segmentations was assessed on the subset of 30 DTF segmented by two observers using the intra-class correlation coefficient (ICC), where an ICC > 0.75 indicated good reliability [30]. To evaluate model reliability, a separate model was trained using only these features with a good reliability. To gain insight into the models, the patients were ranked based on the consistency of the model predictions. Typical examples for each class consisted of the patients that were correctly classified in all cross-validation iterations; atypical vice versa.
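The multiple-testing step can be illustrated with a short sketch. It assumes the per-feature p-values have already been computed by the Mann-Whitney U test; the function name `bonferroni` and the example p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: multiply each p-value by the number of tests.

    Returns the adjusted p-values (capped at 1.0) and a list of flags that
    mark which features remain significant at level `alpha`.
    """
    n_tests = len(p_values)
    adjusted = [min(1.0, p * n_tests) for p in p_values]
    significant = [p_adj < alpha for p_adj in adjusted]
    return adjusted, significant


# Hypothetical uncorrected p-values for four features.
raw_p_values = [1e-5, 1e-4, 0.01, 0.2]
adjusted_p, is_significant = bonferroni(raw_p_values)
```

With 411 features per sequence, as in this study, an uncorrected p-value would need to be below roughly 0.05 / 411 to remain significant after correction.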

Classification by radiologists
To compare the models with clinical practice, the tumors were classified by two musculoskeletal radiologists (5 and 4 years of experience), who had access to all available MRI sequences, age, and sex. They were specifically instructed to distinguish between STS and DTF. Classification was made on a ten-point scale to indicate the radiologists' certainty. As only extremity STS were selected for the non-DTF group, a location-matched database was used. This included all extremity DTF and the same number of non-DTF. Agreement between the radiologists was evaluated using Cohen's kappa. The radiomics models were evaluated in this cohort as well. In each cross-validation iteration, these models were trained on 80 % of the full dataset, but tested only on patients from the location-matched cohort in the other 20 % of the dataset. The DeLong test was used to compare the AUCs [31].
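As a reminder of how Cohen's kappa corrects the observed agreement for chance, a minimal sketch is given below. It assumes that each radiologist's ten-point score has already been converted to a categorical label; how this was done in the study is not stated here, and the function name and example ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labelled the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(ratings_a)
    # Fraction of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two radiologists for six tumors.
radiologist_1 = ["DTF", "DTF", "STS", "STS", "DTF", "STS"]
radiologist_2 = ["DTF", "STS", "STS", "STS", "DTF", "DTF"]
kappa = cohens_kappa(radiologist_1, radiologist_2)
```

In this toy example the raters agree on four of six cases, while half of the cases would be expected to agree by chance.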

Study selection and population
The dataset included 203 patients; see Table 1 for the clinical characteristics. The differential diagnosis cohort consisted of 64 fibromyxosarcomas, 31 leiomyosarcomas, 36 myxoid liposarcomas, and 72 DTFs (65 primary, 7 recurrent), of which 61 were suitable for the mutation analysis.
On the subset of 30 DTF that was segmented by both observers, the mean DSC was 0.77 (standard deviation of 0.20), indicating good agreement. An example of the image registration results is depicted in Fig. 2.

Differential diagnosis
The performance of models 1-6 for the differential diagnosis is shown in Table 3. Model 1, based on volume, showed little predictive value (mean AUC of 0.69). Model 2, based on age and sex, performed better (mean AUC of 0.86). Model 3, based on T1w-MRI, had a mean AUC of 0.79, thus performing worse than age and sex, but better than volume alone. Model 4, combining the T1w-MRI, age, and sex, showed little improvement in terms of mean AUC (0.88) over model 2. Addition of a T2w-MRI, i.e. model 5, or T1w post-contrast MRI, i.e. model 6, both with or without FatSat, yielded a minor overall improvement over model 3; see Fig. 3. The models using either only non-FatSat or FatSat scans, both for the T2w and T1w post-contrast MRI, fared similarly; see Table A.1.

Comparison with radiologists
As described in the methods, for the comparison with radiologists, a location-matched cohort consisting of all extremity DTFs and an equal number of extremity non-DTFs was used. To this end, all 20 extremity DTFs and 20 randomly selected extremity non-DTFs were included in the location-matched cohort. The performance of radiomics and the radiologists in this cohort is shown in Table 4; models 1, 5, and 6 were omitted from the results for brevity. The AUCs of the radiomics models (model 2: 0.93; model 3: 0.88; model 4: 0.98) were generally higher than those of radiologists 1 (0.80) and 2 (0.88). This is confirmed by the ROC curves in Fig. 4. Cohen's kappa between the two radiologists was 0.

CTNNB1 mutation status stratification
Table 5 depicts the performance of the radiomics models for the CTNNB1 mutation stratification. Model 4, using T1w-MRI, age, and sex, had a high specificity (S45F: 0.83, T41A: 0.59 and WT: 0.72), but a sensitivity similar to guessing (S45F: 0.15, T41A: 0.49 and WT: 0.56). This indicates a strong bias in the models towards the negative classes, i.e. not-S45F, not-T41A and not-WT. As model 4 did not perform well, models 1, 2, and 3 were omitted from the results, as these contain a subset of these features. Adding the T2w or T1w post-contrast imaging, i.e. models 5 and 6, did not improve the performance. Hence, the models using either only non-FatSat or FatSat scans were omitted, as these contain subsets of the scans from models 5 and 6.

Table 3
Performance of the radiomics models for the DTF differential diagnosis based on: model 1: volume only; model 2: age and sex only; model 3: T1w imaging features, including volume; model 4: the combination of T1w imaging features and age and sex; model 5: the combination of T1w and T2w imaging features; and model 6: the combination of T1w and T1w post-contrast imaging features. Outcomes are presented with the 95% confidence interval.

Fig. 3. Receiver operating characteristic curves of the radiomics models based on volume (1); age and sex (2); T1-weighted (T1w) features (3); T1w features, age, and sex (4); T1w + T2-weighted imaging features (5); and T1w + T1w post-contrast imaging features (6). The grey crosses identify the 95 % confidence intervals of the 100x random-split cross-validation; the orange curve depicts the mean.

Table 4
Performance of the two radiologists and the radiomics models in differentiating between DTF (n = 20) and non-DTF (n = 20) in the location-matched cohort. Outcomes are presented with the 95% confidence interval.

Model insight
As the CTNNB1 mutation status stratification models did not perform well, the model insight analysis was only conducted for the differential diagnosis. The p-values from the Mann-Whitney U test between the DTF and non-DTF patients of all features are shown in Table A.3. In the feature importance analysis, 76 T1w-MRI features had significant p-values (5.4 × 10⁻⁸ to 4.8 × 10⁻²). These included two intensity features (entropy and peak), two shape features (radial distance and volume), and 72 texture features. The p-value of age (1 × 10⁻¹¹) was lower than that of all imaging features. The ICC values of all T1w-MRI features are shown in Table A.4. Of the 411 features, 270 (66 %) had an ICC > 0.75 and thus good reliability. Only using these features with a good reliability in model 3 did not alter the performance.
As we are mostly interested in which imaging features define typical DTF, and not age and sex, the patient ranking was conducted for model 3. Of the 203 patients, 104 tumors (24 DTFs, 80 non-DTFs) were always classified correctly by model 3, i.e. in all 100 cross-validation iterations. Nineteen tumors (17 DTFs, 2 non-DTFs) were always classified incorrectly. In Fig. 5, MRI slices of such typical and atypical examples of DTFs are shown.
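The patient ranking described above can be sketched as follows. The data layout (a dictionary mapping each patient to the labels predicted in the iterations where that patient was in the test set), the function name, and the example patients are assumptions made for illustration.

```python
def rank_patients(predictions, truth):
    """Identify patients that were always or never classified correctly.

    predictions: dict of patient id -> list of predicted labels, one per
        cross-validation iteration in which the patient was tested (assumed).
    truth: dict of patient id -> true label.
    """
    always_correct, always_incorrect = [], []
    for patient_id, predicted in predictions.items():
        if not predicted:
            continue  # patient never ended up in a test set
        n_correct = sum(label == truth[patient_id] for label in predicted)
        if n_correct == len(predicted):
            always_correct.append(patient_id)
        elif n_correct == 0:
            always_incorrect.append(patient_id)
    return always_correct, always_incorrect


# Hypothetical predictions for three patients.
truth = {"p1": "DTF", "p2": "non-DTF", "p3": "DTF"}
predictions = {
    "p1": ["DTF", "DTF", "DTF"],
    "p2": ["DTF", "DTF"],
    "p3": ["DTF", "non-DTF"],
}
typical, atypical = rank_patients(predictions, truth)
```

Patients with mixed predictions, such as the third one here, fall into neither group.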

Discussion
This study showed that radiomics based on T1w-MRI can distinguish DTF from STS. Adding T2w or T1w post-contrast MRI did not substantially improve the model. The DTF CTNNB1 mutation status could not be predicted through radiomics. To our knowledge, this is the first study to evaluate the DTF differential diagnosis and mutation status through an automated radiomics approach.
Age and sex appeared to be strong predictors for the diagnosis of DTF, performing better than T1w-MRI. The combination of imaging, age and sex did not improve the model. This implies that age and sex are sufficient for distinguishing DTF from STS. In line with previous nationwide DTF cohort studies, females represented the majority of our cohort, with a lower median age compared to the median age of the patients from the non-DTF group [2,32]. The relation in our database may however be too strong, and thereby not representative of clinical practice. For example, above 63 years of age, our database included 60 non-DTF and only a single DTF. While the peak incidence of DTF is between 20 and 40 years, DTF can affect patients of all ages, with reported ranges from 2 to 90 years [32]. Simply classifying all tumors in patients above 63 years as non-DTF, regardless of any tumor (imaging) information, is unfeasible. Such a model cannot be applied in the general population, whereas the model purely based on T1w-MRI can, as it does not use any population-based information. Our cohort might be biased due to the focus on MRI and the extremity as a location, while other modalities (e.g. CT or ultrasound) may be used for certain locations or for certain types of patients. Further research should include the expansion of our dataset to make especially the age distribution more representative.
To estimate the clinical value of our model, we compared the performance with the assessment of two radiologists. The model based on imaging performed similarly to the radiologists. The model combining age, sex and imaging features, using the same dataset as the radiologists, had a higher AUC than the musculoskeletal radiologists. However, this model may suffer from the selection bias mentioned in the previous paragraph. The agreement between the radiologists was intermediate, indicating observer dependence in the prediction. The radiomics model is observer independent, assuming the segmentation is reproducible as indicated by the high DSC and ICC, and will always give the same prediction on the same image.
The DTF differential diagnosis is highly important for treatment decisions, but difficult on imaging due to its rarity, while using invasive biopsies brings risks such as tumor growth. The use of our T1w-MRI radiomics model may therefore aid early recognition and diagnosis of DTF, thus shortening the diagnostic delay by enabling direct referral to an STS expertise center. Since all routine MRI protocols include a T1w-MRI, our radiomics method is generalizable, feasible and applicable for use in daily clinical practice. After further model optimization, it may serve as a quick, non-invasive, and low-cost alternative for a biopsy, currently limited to extremities due to the used dataset.
Additionally, we investigated the predictive value of sequences other than T1w-MRI. The number of available sequences was however limited due to the multicenter imaging dataset. Although T2w-MRI is often used to correlate DTF signal intensity with prognosis or response to therapy [33][34][35][36], in the current study T2w-MRI added little predictive value to the T1w-MRI, similar to the T1w post-contrast MRI. This may however be attributed to the fact that these sequences were only available for a subset of the patients. Our cohort contained too few patients with PD, DCE, or DWI sequences to be analyzed. However, there is little to no indication of the added value of these sequences in DTF [37][38][39].

Table 5
Performance of the random forest multilabel radiomics models for the DTF CTNNB1 mutation stratification based on: model 4: T1w imaging features, age and sex; model 5: T1w + T2w imaging features; and model 6: T1w + T1w post-contrast imaging features. Model 4 was evaluated for a single class (S45F, T41A, and WT) or the overall performance (All). Outcomes are presented with the 95% confidence interval.
The second aim of this study was to predict the DTF CTNNB1 mutation status. Our radiomics model was not able to stratify the CTNNB1 mutation type, which is in line with the absence of literature linking DTF MRI appearance to the CTNNB1 mutation.
The current study had several limitations. First, due to the rarity of DTF, the DTF sample size was limited and possibly too small for the mutation stratification model to learn from. This also resulted in little statistical power for the mutation analysis, as shown by the large width of our confidence intervals, and for the comparison with the radiologists in the differential diagnosis. Besides primary tumors, the DTF cohort also contained recurrent tumors. As this number was low and, to our knowledge, there are no indications that recurrent DTF appear different on MRI than primary DTF, the expected influence is small. Within the DTF cohort, the WT group was relatively large and might have been subject to incorrect allocation, as Sanger Sequencing is not always sensitive enough to detect all mutations [11]. The results of the CTNNB1 mutation status stratification showed a strong bias towards the majority classes, which may be attributed to the class imbalance. Although we exploited commonly used imbalanced learning strategies such as resampling and ensembling, other strategies may improve the performance. Second, only extremity DTFs were included for comparison with STS. This was due to the limited availability of MRI in non-extremity soft tissue tumors. However, this is not representative for the entire DTF population, which also occurs frequently in the abdominal wall and trunk [3]. Third, the current radiomics approach requires manual annotations. While accurate, this process is also time consuming and subject to some observer variability as indicated by our DSC, and thus limits the transition to clinical practice. Automatic segmentation methods, for example deep learning, may help to overcome these limitations [40]. Lastly, the dataset originated from 68 different scanners, which resulted in substantial heterogeneity in the acquisition protocols.
The lack of standard imaging parameters can be problematic as these can affect the appearance of the tumor and thus the radiomics performance. However, our method was able to create diagnostic models despite these differences. As these models were trained on a variety of imaging protocols, there is an increased chance that the reported performance can be reproduced in a routine clinical setting when using other MRI scanners. Using a single scanner with dedicated tumor protocols may improve the model performance, but will limit the generalizability.
Future work should firstly focus on the prospective validation of our findings. Although we did use a multicenter imaging dataset and performed a rigorous cross-validation experiment strictly separating training from testing data, we did not validate our model on an independent, external dataset. Afterwards, the radiomics model could be used to predict clinical outcomes of DTF receiving active surveillance or

Conclusions
Our radiomics approach is capable of distinguishing DTF from non-DTF tumors on T1w-MRI, and can potentially aid diagnosis and shorten diagnostic delay. The performance of the model was similar to that of two experienced musculoskeletal radiologists. The model was not able to predict the CTNNB1 mutation status of DTF tumors. Further optimization and external validation of the model are needed to incorporate radiomics in clinical practice.