Machine Learning–Based Prognostic Gene Signature for Early Triple-Negative Breast Cancer

Article information

J Korean Cancer Assoc. 2024;.crt.2024.937
Publication date (electronic) : 2024 November 19
doi : https://doi.org/10.4143/crt.2024.937
1Division of Hemato-oncology, Department of Internal Medicine, Korea University Anam Hospital, Korea University College of Medicine, Seoul, Korea
2Department of Pathology, Korea University Anam Hospital, Korea University College of Medicine, Seoul, Korea
3Department of Hospital Pathology, Seoul St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul, Korea
Correspondence: Kyong Hwa Park, Division of Hemato-oncology, Department of Internal Medicine, Korea University Anam Hospital, Korea University College of Medicine, 73 Goryeodae-ro, Seongbuk-gu, Seoul 02841, Korea Tel: 82-2-920-6841 E-mail: khpark@korea.ac.kr
Co-correspondence: Sangjeong Ahn, Department of Pathology, Korea University Anam Hospital, Korea University College of Medicine, 73 Goryeodae-ro, Seongbuk-gu, Seoul 02841, Korea Tel: 82-2-920-5621 E-mail: vanitas80@korea.ac.kr
*Ju Won Kim and Jonghyun Lee contributed equally to this work.
Received 2024 September 26; Accepted 2024 November 18.

Abstract

Purpose

This study aimed to develop a machine learning–based approach to identify prognostic gene signatures for early-stage triple-negative breast cancer (TNBC) using next-generation sequencing data from Asian populations.

Materials and Methods

We utilized next-generation sequencing data to analyze gene expression profiles and identify potential biomarkers. Our methodology involved integrating various machine learning techniques, including feature selection and model optimization. We employed logistic regression, Kaplan-Meier survival analysis, and receiver operating characteristic (ROC) curves to validate the identified gene signatures.

Results

We identified a gene signature significantly associated with relapse in TNBC patients. The predictive model demonstrated robustness and accuracy, with an area under the ROC curve of 0.9087, sensitivity of 0.8750, and specificity of 0.9231. The Kaplan-Meier survival analysis revealed a strong association between the gene signature and patient relapse, further validated by logistic regression analysis.

Conclusion

This study presents a novel machine learning-based prognostic tool for TNBC, offering significant implications for early detection and personalized treatment. The identified gene signature provides a promising approach for improving the management of TNBC, contributing to the advancement of precision oncology.

Introduction

Breast cancer is the most commonly diagnosed cancer among women worldwide, with approximately 2.3 million new cases estimated each year [1]. It is a diverse disease, exhibiting variations in biological behavior, responses to treatment, and prognosis. Triple-negative breast cancer (TNBC) represents a subtype that accounts for 10%-15% of all breast cancer cases and is characterized by the absence of estrogen receptors, progesterone receptors, and human epidermal growth factor receptor 2 expression [2]. Compared to other breast cancer subtypes, TNBC is associated with a poorer prognosis because it lacks the benefit of specific therapies that target these proteins [3].

Patients with operable stage I-III TNBC undergo a variety of treatment combinations, including neoadjuvant chemotherapy, surgery, adjuvant chemotherapy, immunotherapy, and radiation therapy. Currently, there are no available surrogate biomarkers that can accurately predict prognosis at the time of diagnosis, allowing for personalized treatment. Although the achievement of pathologic complete response (pCR) is known to provide assistance in predicting future prognosis for patients who have received neoadjuvant chemotherapy [4], the prognostic value of pCR is still limited [5].

The integration of machine learning and artificial intelligence (AI) into healthcare has led to ongoing research in predicting the prognosis of cancer [6,7]. It is important to utilize bioinformatics techniques for predicting the prognosis of TNBC, given its relatively rare prevalence and the need for long-term follow-up. Previous attempts have been made to predict the prognosis of TNBC using machine learning algorithms [8-10], but most were retrospective studies based on genomic datasets from Western populations, which may not fully capture the clinical characteristics of Asian patients. Here, we present important genomic features that can predict the prognosis of TNBC by integrating national sequencing data and clinical information.

Materials and Methods

The model’s conceptual diagram is shown in Fig. 1. First, mutations underwent preprocessing and were combined at the gene level. In real-world scenarios, it is challenging to obtain the entire gene set, and not all genes are assumed to be significant. With this in mind, we employed a two-step feature selection process to identify the gene set with the highest predictive power. In the first phase, a random forest model was used to identify the top 10 genes contributing most significantly to prediction. To enhance interpretability and facilitate clinical application, we then applied logistic regression in the second phase to calculate the final risk scores. During this phase, we used p-value-based thresholding, ultimately narrowing down to a final set of five genes (S1 Table).

Fig. 1.

Gene signature identification for relapse prediction in triple-negative breast cancer (TNBC) patients. An analytical workflow, beginning with variant call format (VCF) files, aggregates gene-level mutations after ClinVar and exonic function filtering. A decision tree classifier is employed to identify 10 gene candidates, which are further refined through logistic regression with backward feature selection to ascertain five critical gene signatures. These signatures are determined to predict relapse in TNBC patients with heightened accuracy. Their significance is validated by Kaplan-Meier analysis and area under the receiver operating characteristic curves.

1. Patients and study design

Among patients who had received treatment for early or metastatic triple-negative breast cancer at Korea University Anam Hospital, those with available next-generation sequencing (NGS) data and clinical response to treatment were chosen for this study. Clinical characteristics such as age, sex, date of diagnosis, molecular subtype, treatment history, and survival data of each patient were collected from their medical records.

2. NGS: KM-00 protocol

Targeted sequencing of tumor tissues was performed using CancerSCAN (Samsung Genome Institute). In patients whose tumor tissues were not available, 10 mL of whole blood was collected with Cell-Free DNA BCT for ctDNA preparation and analyzed by the Axen Cancer Panel (Macrogen). Sequential blood samples were collected within 24 hours before treatment, before the start of the second cycle, at the time of response evaluation, and at the end of treatment.

Genomic DNA from formalin-fixed paraffin-embedded (FFPE) samples or plasma was extracted using the QIAamp FFPE tissue kit (Qiagen) or QIAamp circulating nucleic acid kit (Qiagen), respectively. Cell-free DNA purity was measured using an Agilent High Sensitivity DNA Kit and a 2100 Bioanalyzer instrument (Agilent Technologies). When required, additional purification was performed using Agencourt AMPure XP (Beckman Coulter) to further remove contaminating nucleic acid. Centrally isolated genomic DNA samples that passed quality control were sent to the K-MASTER genomic analysis laboratories. K-MASTER used two previously established tissue-based NGS panels (FIRST and CancerSCAN) to detect major genomic aberrations, including mutations, copy number alterations (CNAs), and small insertions and deletions in cancer-related genes. CancerSCAN has been further upgraded to K-MASTER v1.0 and v1.1. Detailed laboratory and bioinformatics protocols are available in the Supplementary Methods.

3. Preprocessing of profile data

The initial genomic datasets underwent preprocessing under the protocol previously described [11]. Following annotation of mutation variants, we applied the following criteria to filter potential gene candidates:

- The gene should exhibit an exonic function, falling into categories such as frameshift insertion, frameshift deletion, nonframeshift insertion, nonframeshift deletion, and nonsynonymous single nucleotide variant.

- The ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) functional description must align with pathogenic, likely pathogenic, drug response, risk factor, uncertain significance, or an unknown classification.

Subsequently, the mutations that met these filter criteria were grouped on a gene-level basis. This gene-level categorization was then transformed into a binary classification, where genes with mutations were designated as 1 (positive), while those without mutations were labeled as 0 (negative). This binary representation created a patient-gene mutation variant table. The target label for our analysis was the presence or absence of relapse. Relapse, in this context, was defined as either a recurrence of breast cancer diagnosis or the initiation of chemotherapy subsequent to the initial breast cancer diagnosis. The follow-up period ended in December 2022. Positive labels were assigned to cases with breast cancer relapse, while negative labels were assigned to patients without any relapse. To facilitate model development and validation, we divided the entire patient-gene mutation variant dataset into training and test sets at a ratio of 7:3. The split was stratified by relapse status so that the distribution of relapse cases was preserved in both sets.
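The preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record field names (`patient`, `gene`, `exonic_func`, `clinvar`), the toy data, the exact category strings, and the helper names `build_mutation_table` and `stratified_split` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Exonic functions and ClinVar classes retained by the filter (strings are assumed).
KEPT_EXONIC = {
    "frameshift insertion", "frameshift deletion",
    "nonframeshift insertion", "nonframeshift deletion",
    "nonsynonymous SNV",
}
KEPT_CLINVAR = {
    "pathogenic", "likely pathogenic", "drug response",
    "risk factor", "uncertain significance", "unknown",
}

def build_mutation_table(variants, patients, genes):
    """Return {patient: {gene: 0/1}}; 1 if any variant passes both filters."""
    table = {p: {g: 0 for g in genes} for p in patients}
    for v in variants:
        if v["exonic_func"] in KEPT_EXONIC and v["clinvar"] in KEPT_CLINVAR:
            table[v["patient"]][v["gene"]] = 1
    return table

def stratified_split(labels, test_frac=0.3, seed=0):
    """Split patient IDs 7:3 while keeping the relapse ratio similar in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for patient, label in labels.items():
        by_class[label].append(patient)
    train, test = [], []
    for members in by_class.values():
        members = sorted(members)
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy records; every value here is invented for illustration.
variants = [
    {"patient": "P1", "gene": "FGF10", "exonic_func": "nonsynonymous SNV", "clinvar": "unknown"},
    {"patient": "P2", "gene": "RET", "exonic_func": "synonymous SNV", "clinvar": "benign"},
]
table = build_mutation_table(variants, ["P1", "P2"], ["FGF10", "RET"])
labels = {f"P{i}": int(i % 3 == 0) for i in range(1, 31)}  # 10 relapse, 20 non-relapse
train_ids, test_ids = stratified_split(labels)
```

Splitting within each label group separately is what keeps the proportion of relapse cases comparable between the training and test sets.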

4. Selection of gene candidates and signatures

We first removed highly correlated gene features, using a correlation threshold of 0.9. To select gene candidates, we then trained a random forest model, tuning its hyperparameters by 5-fold cross-validation. This process identified 10 gene candidates based on the random forest model's analysis.
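As an illustration of the correlation filter only (the random forest step is not shown), the following sketch drops any gene whose absolute Pearson correlation with an already retained gene exceeds 0.9. The use of Pearson correlation, the greedy keep-first rule, and the gene names are assumptions; the text does not state which correlation measure or tie-breaking rule was used.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a gene unless |r| with an already kept gene exceeds threshold."""
    kept = []
    for gene, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(gene)
    return kept

# Toy binary mutation columns (one value per patient); gene names are placeholders.
columns = {
    "GENE_A": [1, 0, 1, 0, 1, 0],
    "GENE_B": [1, 0, 1, 0, 1, 0],  # identical to GENE_A, so it is dropped
    "GENE_C": [0, 0, 1, 1, 0, 1],
}
selected = drop_correlated(columns)
```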

To compute a risk score, we fit a logistic regression model on the 10 gene candidates and applied backward elimination with a p-value threshold of 0.1 to select the gene signatures. To compare the full model, which uses the entire set of gene signatures, with models that predict relapse from individual genes, we also fit a separate logistic regression model for each single gene.
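The backward elimination loop can be sketched generically as follows. The function `backward_eliminate` and the `fit_pvalues` hook are hypothetical names introduced here; in practice the hook would refit a logistic regression on the remaining genes and return their p-values, whereas the stand-in below uses fixed, made-up values purely to show the control flow.

```python
def backward_eliminate(features, fit_pvalues, threshold=0.1):
    """Drop the least significant feature until every p-value is <= threshold.

    fit_pvalues(features) must refit the model on the given features and
    return {feature: p_value}; it is a caller-supplied (hypothetical) hook.
    """
    current = list(features)
    while current:
        pvalues = fit_pvalues(current)
        worst = max(current, key=lambda f: pvalues[f])
        if pvalues[worst] <= threshold:
            break
        current.remove(worst)
    return current

# Stand-in for a real logistic regression fit: fixed, made-up p-values
# that do not change between refits (a real model's p-values would).
FAKE_P = {"G1": 0.01, "G2": 0.40, "G3": 0.08, "G4": 0.15}

def fake_fit(features):
    return {f: FAKE_P[f] for f in features}

kept = backward_eliminate(["G1", "G2", "G3", "G4"], fake_fit)
```

Removing one feature per iteration and refitting matters because p-values of the remaining features generally change once a correlated feature leaves the model.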

5. Statistical analysis

We evaluated model performance using receiver operating characteristic (ROC) curves and the associated area under the ROC curve (AUROC). To assess how well the model predicts survival, we used the Kaplan-Meier method, dividing patients into two groups based on the median predicted value and comparing their survival over time. Differences between the groups were tested for significance with the log-rank test. All analyses were carried out using Python 3.9 and R software ver. 4.3.1.
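For readers unfamiliar with these methods, the sketch below gives a minimal standard-library implementation of the Kaplan-Meier estimator and the two-group log-rank test. It is an illustrative reimplementation under textbook definitions, not the code used in the study; the function names and toy follow-up times are assumptions.

```python
import math

def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time (events: 1=relapse, 0=censored)."""
    survival, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1 - d / at_risk
        curve.append((t, survival))
    return curve

def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value) with 1 df."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)        # at risk in both groups
        n_a = sum(1 for u in times_a if u >= t)    # at risk in group A
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value

# Toy follow-up times (arbitrary units); values are invented.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 0])
chi2, p_value = logrank([1, 2], [1, 1], [3, 4], [0, 0])
```

The sketch assumes at least one event occurs while both groups still have patients at risk; otherwise the variance term is zero and the statistic is undefined.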

Results

1. Patients’ characteristics

A total of 102 TNBC patients were included in either the training set for machine learning or the test set for validation of the algorithm. The characteristics of the included patients are summarized in Table 1. The median age of the patients was 50 years. Patients with clinical stage III disease accounted for the highest proportion (31.4%), while seven patients had stage IV disease at initial diagnosis. Of all patients, 42.2% underwent surgery after neoadjuvant chemotherapy, 28.4% received adjuvant treatment after surgery, and 7.8% underwent surgery only. NGS was conducted on tissues collected at the initial diagnosis for 68.6% of the patients.

Demographic characteristics of study population

2. Gene candidates and signature selection

The initial dataset contained a total of 892 gene features. After applying a filter to eliminate redundant features due to multicollinearity, we retained 156 gene features. The tuned random forest model demonstrated a classification performance with the following values: AUROC of 0.8269, accuracy of 0.7143, sensitivity of 0.6667, and specificity of 0.7500. From this, we identified 10 gene candidates, specifically FGF10, GATA1, CARD1, DHTKD1, MDM2, TBX21, BCL6, SF3B1, AXIN1, and RET. Following a feature selection process known as backward elimination, we ended up with a set of five gene signatures: FGF10, GATA1, DHTKD1, TBX21, and RET. The details are summarized in Table 2. To calculate the final risk scores, we used the formula:

Odds ratio of fitted logistic regression

RiskScore=1/exp (1.3128×FGF10+2.3964×GATA1+1.2139×DHTKD1+2.2145×TBX21+1.0940×RET–2.6747).
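To make the use of these coefficients concrete, the sketch below evaluates the linear predictor for a patient's binary mutation profile. The conversion to a probability assumes the standard logistic link, 1/(1+exp(−z)), which is how a fitted logistic regression is usually turned into a score between 0 and 1; the printed formula does not show this form explicitly, so this is an assumption rather than the authors' stated definition. The function names are illustrative.

```python
import math

# Coefficients and intercept as printed in the risk score formula above.
COEFFICIENTS = {
    "FGF10": 1.3128, "GATA1": 2.3964, "DHTKD1": 1.2139,
    "TBX21": 2.2145, "RET": 1.0940,
}
INTERCEPT = -2.6747

def linear_predictor(mutations):
    """Weighted sum of binary mutation indicators (1 = mutated, 0 = not) plus intercept."""
    return INTERCEPT + sum(COEFFICIENTS[g] * mutations.get(g, 0) for g in COEFFICIENTS)

def risk_probability(mutations):
    """ASSUMPTION: standard logistic link, 1 / (1 + exp(-z)); not confirmed by the text."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(mutations)))

no_mutation = risk_probability({})
fgf10_and_ret = risk_probability({"FGF10": 1, "RET": 1})
```

Because all coefficients are positive, each additional mutated signature gene raises the score under this assumed form.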

On the training split, the risk score derived from the five gene signatures demonstrated superior predictive performance compared to models based on single genes (Fig. 2A). The AUROC of the risk score was 0.8327, outperforming individual genes such as FGF10 (0.7307), GATA1 (0.6138), DHTKD1 (0.6571), TBX21 (0.6138), and RET (0.6525). We then assessed classification performance on the test split, presenting the results in Table 3 and Fig. 2B. Our logistic regression model successfully distinguished between relapse and normal cases with an AUROC of 0.9087 and an accuracy of 0.9048 (sensitivity, 0.8750; specificity, 0.9231; positive predictive value [PPV], 0.8750; negative predictive value [NPV], 0.9231). Notably, a model that relied solely on FGF10 exhibited remarkable performance (AUROC, 0.8990; accuracy, 0.9048; sensitivity, 0.8750; specificity, 0.9231; PPV, 0.8750; NPV, 0.9231). However, the other gene signatures showed lower performance when used individually to predict relapse, with AUROC values below 0.7 for GATA1 (AUROC, 0.5625; accuracy, 0.6667; sensitivity, 0.1250; specificity, 1.0000; PPV, 1.0000; NPV, 0.6500), DHTKD1 (AUROC, 0.6827; accuracy, 0.6667; sensitivity, 0.7500; specificity, 0.6154; PPV, 0.5455; NPV, 0.8000), TBX21 (AUROC, 0.5240; accuracy, 0.6190; sensitivity, 0.1250; specificity, 0.9231; PPV, 0.5000; NPV, 0.6316), and RET (AUROC, 0.6058; accuracy, 0.5714; sensitivity, 0.7500; specificity, 0.4615; PPV, 0.4615; NPV, 0.7500).
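The relationship between these metrics can be checked with a short worked example. The confusion-matrix counts below (8 relapse and 13 non-relapse test patients, with 7 true positives and 12 true negatives) are not stated in the text; they are inferred here from the reported ratios for the FGF10-only model and should be read as an assumption. With these counts, sensitivity is 7/8 = 0.8750, and for a single binary predictor the AUROC equals the mean of sensitivity and specificity.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard metrics from confusion-matrix counts for a binary predictor."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # With a single 0/1 predictor the ROC curve has one interior point,
        # so the trapezoidal area reduces to the mean of the two rates.
        "auroc_binary": (sensitivity + specificity) / 2,
    }

# ASSUMED counts (8 relapse, 13 non-relapse test patients), inferred from the
# reported ratios for the FGF10-only model rather than stated in the text.
fgf10 = classification_metrics(tp=7, fn=1, tn=12, fp=1)
rounded = {name: round(value, 4) for name, value in fgf10.items()}
```

The same identity reproduces the single-gene AUROC values reported for GATA1, DHTKD1, TBX21, and RET from their sensitivity and specificity.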

Fig. 2.

Area under the receiver operating characteristic (AUROC) curves of predictive models. Panel A illustrates the AUROC curve derived from the training dataset, while Panel B from the test dataset, each corresponding to the 5-gene signature-based logistic regression model (full model) and individual gene models for FGF10, GATA1, DHTKD1, TBX21, and RET. Panel A indicates the full model’s superior performance over single gene-based models, whereas Panel B exhibits the FGF10 single gene-based model’s performance, comparable with the full model.

Classification performance on the test dataset

In Fig. 3, we visually represented the expression of the five gene signatures using a heatmap. This visualization helped distinguish expression patterns between normal and relapse cases in TNBC patients. The five gene signatures exhibit a comparatively higher likelihood of mutation within the recurrence group.

Fig. 3.

Heatmap depicting expression of the five gene signatures. This heatmap contrasts the expression levels of five gene signatures between patient groups categorized by relapse and normalcy.

3. Survival analysis of the prognostic gene signature in relapse of TNBC

To delve deeper into the connection between gene signatures and the likelihood of relapse in TNBC patients, we conducted a survival analysis using the Kaplan-Meier statistical method. This analysis revealed that the entire set of gene signatures exhibited a noteworthy association with the relapse of TNBC patients, as indicated by p-values below 0.01 (refer to Fig. 4B-F). Furthermore, the results from our logistic regression analysis indicated a substantial distinction in relapse risk between patients categorized as high and low risk (using a threshold of 0.45), with a p-value below 0.01 (Fig. 4).

Fig. 4.

Kaplan-Meier survival analysis for gene-based models. (A) The survival probabilities using the full model with five gene signatures. (B-F) The prognostic estimations based on single gene models for FGF10, GATA1, DHTKD1, TBX21, and RET. Each survival curve distinctly demonstrates the divergence between the positive and negative (high risk and low risk) groups.

Discussion

With the two-stage gene signature identification, we selected the five most significant genes for predicting relapse in TNBC patients with considerable prediction performance (AUROC, 0.9087). The gene signatures were FGF10, GATA1, DHTKD1, TBX21, and RET. In the subsequent survival analysis, each gene signature showed diverging patterns (Fig. 4).

First, the FGF10 gene encodes a protein called fibroblast growth factor 10 (FGF10), which plays crucial roles via FGFR2b signaling in both the epithelium and mesenchyme during mammary gland development [12]. In humans, FGF10 expression has been observed in 10% of breast adenocarcinomas, and the activation of FGFR2b signaling is associated with a poorer prognosis [12]. Besides breast cancer, FGF10 is also known to stimulate cell migration and invasion in pancreatic cancer cells through its interaction with FGFR2-IIIb, resulting in aggressive tumor proliferation [13].

GATA-binding factor 1, or GATA-1, is the founding member of the GATA family of transcription factors and is also referred to as the erythroid transcription factor. GATA factors are zinc finger DNA-binding proteins that regulate the development of various tissues by either activating or repressing transcription [14]. Consequently, GATA factors play a crucial role in coordinating cellular maturation, proliferation arrest, and cell survival [15]. They have been associated with colorectal and ovarian cancer progression, as well as gemcitabine resistance in pancreatic cancer [16-18]. In breast cancer, there have been reports suggesting that GATA-1 induction may lead to epithelial-mesenchymal transition, which is considered an indicator of poor prognosis [19].

The DHTKD1 gene encodes a component of a mitochondrial 2-oxoglutarate-dehydrogenase-complex-like protein involved in the degradation pathways of several amino acids, including lysine [20]. Mutations in this gene are associated with 2-aminoadipic 2-oxoadipic aciduria and Charcot-Marie-Tooth Disease [20]. Research related to cancer progression, particularly in breast cancer prognosis prediction, is rare. However, in a recent study on bladder cancer, it was found that circDHTKD1 was positively associated with lymph node metastasis and significantly upregulated in bladder cancer [21]. DHTKD1 deficiency leads to mitochondrial dysfunction [22], which in turn can contribute to cancer progression by altering mitochondrial metabolism [23]. This alteration can lead to an increase in the production of mitochondrial reactive oxygen species and changes in cellular redox status. This, in turn, affects the activities of transcription factors such as HIF1α and FOS-JUN, leading to changes in gene expression and the stimulation of cancer cell proliferation [24].

The TBX21 gene belongs to a phylogenetically conserved gene family characterized by a common DNA-binding domain known as the T-box [25]. Studies conducted in mice have demonstrated that the Tbx21 protein serves as a Th1 cell-specific transcription factor responsible for regulating the expression of the hallmark Th1 cytokine, interferon-γ [26]. Recently, Zhao et al. [27] suggested that TBX21 may be used to assess the survival of patients and may promote cancer stemness through the TBX21–IL-4 pathway in lung adenocarcinoma. Further investigation is needed to elucidate the additional roles of immunotherapy in TNBC, where the response to immunotherapies is becoming a key prognostic factor.

The RET gene, located on human chromosome 10q11.2, encodes a transmembrane receptor and belongs to the tyrosine protein kinase family of proteins [28]. RET gene alterations, including amplification, missense mutations, known fusions, novel fusions, and rearrangements, have been observed in breast cancer, confirming its involvement in the development of breast cancer [29]. Previous studies have shown that the receptor tyrosine kinase RET is overexpressed in a subset of estrogen receptor (ER)–positive breast cancers, and the interaction between RET and ER plays a crucial role in responses to endocrine therapy [30]. RET expression varies significantly among different breast cancer subtypes, with the lowest RET expression in the basal-like subtype and the highest expression in the luminal A subtype [31]. Further research is needed to understand the role of RET in TNBC, particularly in relation to its molecular subtypes.

Attempts to predict the prognosis of TNBC have been made by classifying its subtypes. Lehmann classified TNBC into six subtypes based on gene expression profiles [32] and later refined this to four: basal-like 1 (BL1), basal-like 2 (BL2), mesenchymal (M), and luminal androgen receptor subtype [33]. Multi-omics analysis was also proposed to suggest the immune cell composition of each subtype and differential pharmacological vulnerabilities [34]. Although various classification methods have been proposed to address the heterogeneity of TNBC, they have not yet achieved perfect precision treatment [35,36].

This clinical unmet need led to attempts to incorporate AI [37]. Alsaleem et al. [38] employed artificial neural network techniques to predict distant metastasis-free survival and breast cancer-specific survival in TNBC patients, discovering a two-gene prognostic signature (ACSM4 and SPDYC). This signature showed good predictive power with a probability of survival hazard ratio of 0.60 (95% confidence interval, 0.47 to 0.76) when high compared to low. However, the identified genes ACSM4 and SPDYC, which are involved in CDK1/2 expression and the metabolism of carboxylic acids and fatty acid beta oxidation, respectively, did not show a clear association with TNBC biology.

In the process of identifying a gene set that effectively explains the prognosis of TNBC and its cancer biology, we selectively curated a series of mutation groups. One such decision involved excluding nonsense (stopgain) mutations. In the analysis that included all stopgain mutations, the derived top 5 gene set predominantly consisted of oncogenes, making it difficult to explain cancer biology through a loss-of-function perspective. This inclusion also tended to diminish the overall predictive accuracy (AUROC, 0.8269 to 0.8173). Based on prior knowledge of each gene’s functional relevance, we concluded that the original gene set (FGF10, GATA1, CARD1, DHTKD1, MDM2, TBX21, BCL6, SF3B1, AXIN1, and RET) provides greater insight and interpretability.

This study has several limitations. Our focus was solely on gene-level mutations categorized as binary labels, employed for analytical purposes. However, this approach may limit the depth of information captured by the model. There exists an opportunity for enhancement by integrating diverse data types, such as clinical data and digital pathology images. Thus, our future research endeavors aim to develop a comprehensive model capable of accommodating both gene signatures and the varied data modalities present in this study. Another point is the relatively small dataset used in this study, comprising only 102 samples, which resulted in a limited test set. While the full model slightly outperformed in terms of AUROC, the FGF10-only model also demonstrated nearly comparable performance. This may indicate the strong predictive power of FGF10; however, due to the limited size of the dataset, this performance may not be fully validated. Consequently, we are planning future studies to further validate this model in a larger data collection environment or with external datasets.

To our knowledge, this is the first study to use machine learning to analyze a cohort exclusively comprising Asian TNBC patients. It also boasts the advantage of using targeted NGS, which is actively utilized in clinics. Our model exhibited noteworthy predictive performance despite employing only five genes with minimal sequencing and computational expenses, contrasting with previous studies that used complex methods with long turnaround times. Three of the five genes we identified (FGF10, GATA-1, RET) have already been confirmed to have direct or indirect influences on cancer biology. Additionally, TBX21 is associated with an immune signature, lending credibility to its role in predicting TNBC prognosis. Consequently, this approach holds promise for practical implementation within real-world clinical settings.

Electronic Supplementary Material

Supplementary materials are available at the Cancer Research and Treatment website (https://www.e-crt.org).

Notes

Ethical Statement

The research protocol was approved by the Institutional Review Board of Korea University Anam Hospital (IRB No. 2023AN0080). Informed consent was obtained from study participants for genomic analysis and the presentation of research findings. This study was conducted in accordance with the Declaration of Helsinki and Good Clinical Practice guidelines.

Author Contributions

Conceived and designed the analysis: Park KH.

Collected the data: Kim JW, Park KH.

Contributed data or analysis tools: Ahn S.

Performed the analysis: Lee J, Lee SH, Ahn S.

Wrote the paper: Kim JW, Lee J.

Conflict of Interest

Conflict of interest relevant to this article was not reported.

Funding

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: RS-2022-KH129295).

References

1. Lukasiewicz S, Czeczelewski M, Forma A, Baj J, Sitarz R, Stanislawek A. Breast cancer-epidemiology, risk factors, classification, prognostic markers, and current treatment strategies: an updated review. Cancers (Basel) 2021;13:4287.
2. Dawson SJ, Provenzano E, Caldas C. Triple negative breast cancers: clinical and prognostic implications. Eur J Cancer 2009;45 Suppl 1:27–40.
3. Ismail-Khan R, Bui MM. A review of triple-negative breast cancer. Cancer Control 2010;17:173–6.
4. Masuda H, Masuda N, Kodama Y, Ogawa M, Karita M, Yamamura J, et al. Predictive factors for the effectiveness of neoadjuvant chemotherapy and prognosis in triple-negative breast cancer patients. Cancer Chemother Pharmacol 2011;67:911–7.
5. Huang M, O’Shaughnessy J, Zhao J, Haiderali A, Cortes J, Ramsey SD, et al. Association of pathologic complete response with long-term survival outcomes in triple-negative breast cancer: a meta-analysis. Cancer Res 2020;80:5427–34.
6. Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J 2015;13:8–17.
7. Ferroni P, Zanzotto FM, Riondino S, Scarpato N, Guadagni F, Roselli M. Breast cancer prognosis using a machine learning approach. Cancers (Basel) 2019;11:328.
8. Chen Z, Wang M, De Wilde RL, Feng R, Su M, Torres-de la Roche LA, et al. A machine learning model to predict the triple negative breast cancer immune subtype. Front Immunol 2021;12:749459.
9. Thalor A, Kumar Joon H, Singh G, Roy S, Gupta D. Machine learning assisted analysis of breast cancer gene expression profiles reveals novel potential prognostic biomarkers for triple-negative breast cancer. Comput Struct Biotechnol J 2022;20:1618–31.
10. Chen DL, Cai JH, Wang CC. Identification of key prognostic genes of triple negative breast cancer by LASSO-based machine learning and bioinformatics analysis. Genes (Basel) 2022;13:902.
11. Lee Y, Lee S, Sung JS, Chung HJ, Lim AR, Kim JW, et al. Clinical application of targeted deep sequencing in metastatic colorectal cancer patients: actionable genomic alteration in K-MASTER project. Cancer Res Treat 2021;53:123–30.
12. Rivetti S, Chen C, Chen C, Bellusci S. Fgf10/Fgfr2b signaling in mammary gland development, homeostasis, and cancer. Front Cell Dev Biol 2020;8:415.
13. Nomura S, Yoshitomi H, Takano S, Shida T, Kobayashi S, Ohtsuka M, et al. FGF10/FGFR2 signal induces cell migration and invasion in pancreatic cancer. Br J Cancer 2008;99:305–13.
14. Zheng R, Blobel GA. GATA transcription factors and cancer. Genes Cancer 2010;1:1178–88.
15. Crispino JD. GATA1 in normal and malignant hematopoiesis. Semin Cell Dev Biol 2005;16:137–47.
16. Liu Z, Zhu Y, Li F, Xie Y. GATA1-regulated JAG1 promotes ovarian cancer progression by activating Notch signal pathway. Protoplasma 2020;257:901–10.
17. Yu J, Liu M, Liu H, Zhou L. GATA1 promotes colorectal cancer cell proliferation, migration and invasion via activating AKT signaling pathway. Mol Cell Biochem 2019;457:191–9.
18. Chang Z, Zhang Y, Liu J, Guan C, Gu X, Yang Z, et al. GATA1 promotes gemcitabine resistance in pancreatic cancer through antiapoptotic pathway. J Oncol 2019;2019:9474273.
19. Li Y, Ke Q, Shao Y, Zhu G, Li Y, Geng N, et al. GATA1 induces epithelial-mesenchymal transition in breast cancer cells through PAK5 oncogenic signaling. Oncotarget 2015;6:4345–56.
20. Xu WY, Zhu H, Shen Y, Wan YH, Tu XD, Wu WT, et al. DHTKD1 deficiency causes Charcot-Marie-Tooth disease in mice. Mol Cell Biol 2018;38:e00085–18.
21. Lu Q, Yin H, Deng Y, Chen W, Diao W, Ding M, et al. circDHTKD1 promotes lymphatic metastasis of bladder cancer by upregulating CXCL5. Cell Death Discov 2022;8:243.
22. Xu W, Zhu H, Gu M, Luo Q, Ding J, Yao Y, et al. DHTKD1 is essential for mitochondrial biogenesis and function maintenance. FEBS Lett 2013;587:3587–92.
23. Wallace DC. Mitochondria and cancer. Nat Rev Cancer 2012;12:685–98.
24. Bardella C, Pollard PJ, Tomlinson I. SDH mutations in cancer. Biochim Biophys Acta 2011;1807:1432–43.
25. Zhang X, Wen X, Feng N, Chen A, Yao S, Ding X, et al. Increased expression of T-Box transcription factor protein 21 (TBX21) in skin cutaneous melanoma predicts better prognosis: a study based on The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) databases. Med Sci Monit 2020;26:e923087.
26. Zhao S, Shen W, Du R, Luo X, Yu J, Zhou W, et al. Three inflammation-related genes could predict risk in prognosis and metastasis of patients with breast cancer. Cancer Med 2019;8:593–605.
27. Zhao S, Shen W, Yu J, Wang L. TBX21 predicts prognosis of patients and drives cancer stem cell maintenance via the TBX21-IL-4 pathway in lung adenocarcinoma. Stem Cell Res Ther 2018;9:89.
28. Traugott AL, Moley JF. The RET protooncogene. In: Sturgeon C, ed. Endocrine neoplasia. Cancer Treatment and Research, Vol. 153. Springer; 2010. p. 303–19.
29. Morandi A, Plaza-Menacho I, Isacke CM. RET in breast cancer: functional and therapeutic implications. Trends Mol Med 2011;17:149–57.
30. Lo Nigro C, Rusmini M, Ceccherini I. RET in breast cancer: pathogenic implications and mechanisms of drug resistance. Cancer Drug Resist 2019;2:1136–52.
31. Zhai Q, Li H, Sun L, Yuan Y, Wang X. Identification of differentially expressed genes between triple and non-triple-negative breast cancer using bioinformatics analysis. Breast Cancer 2019;26:784–91.
32. Lehmann BD, Bauer JA, Chen X, Sanders ME, Chakravarthy AB, Shyr Y, et al. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Invest 2011;121:2750–67.
33. Lehmann BD, Jovanovic B, Chen X, Estrada MV, Johnson KN, Shyr Y, et al. Refinement of triple-negative breast cancer molecular subtypes: implications for neoadjuvant chemotherapy selection. PLoS One 2016;11:e0157368.
34. Lehmann BD, Colaprico A, Silva TC, Chen J, An H, Ban Y, et al. Multi-omics analysis identifies therapeutic vulnerabilities in triple-negative breast cancer subtypes. Nat Commun 2021;12:6276.
35. Burstein MD, Tsimelzon A, Poage GM, Covington KR, Contreras A, Fuqua SA, et al. Comprehensive genomic analysis identifies novel subtypes and targets of triple-negative breast cancer. Clin Cancer Res 2015;21:1688–98.
36. Yin L, Duan JJ, Bian XW, Yu SC. Triple-negative breast cancer molecular subtyping and treatment progress. Breast Cancer Res 2020;22:61.
37. Ensenyat-Mendez M, Llinas-Arias P, Orozco JI, Iniguez-Munoz S, Salomon MP, Sese B, et al. Current triple-negative breast cancer subtypes: dissecting the most aggressive form of breast cancer. Front Oncol 2021;11:681476.
38. Alsaleem MA, Ball G, Toss MS, Raafat S, Aleskandarany M, Joseph C, et al. A novel prognostic two-gene signature for triple negative breast cancer. Mod Pathol 2020;33:2208–20.


Fig. 1.

Gene signature identification for relapse prediction in triple-negative breast cancer (TNBC) patients. The analytical workflow begins with variant call format (VCF) files and aggregates gene-level mutations after ClinVar and exonic function filtering. A decision tree classifier is employed to identify 10 candidate genes, which are further refined through logistic regression with backward feature selection to yield five signature genes. These genes are used to predict relapse in TNBC patients, and their significance is validated by Kaplan-Meier analysis and area under the receiver operating characteristic curves.

Fig. 2.

Area under the receiver operating characteristic (AUROC) curves of predictive models. Panel A shows the curves derived from the training dataset and Panel B those derived from the test dataset, each for the 5-gene signature-based logistic regression model (full model) and the individual gene models for FGF10, GATA1, DHTKD1, TBX21, and RET. In Panel A, the full model outperforms the single gene-based models, whereas in Panel B the FGF10 single gene-based model performs comparably to the full model.
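As a rough illustration of the quantity plotted in Fig. 2, the AUROC can be read as the probability that a randomly chosen relapse case receives a higher predicted score than a randomly chosen non-relapse case. The sketch below computes it that way in plain Python. The function name `auroc` and the toy labels and scores are hypothetical and are not taken from the study data.

```python
def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (1 = relapse, 0 = no relapse); not from the study.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_auc = auroc(toy_labels, toy_scores)
print(round(toy_auc, 4))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of this size; a rank-based formulation would be preferable for large datasets.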

Fig. 3.

Heatmap depicting expression of the five gene signatures. This heatmap contrasts the expression levels of five gene signatures between patient groups categorized by relapse and normalcy.

Fig. 4.

Kaplan-Meier survival analysis for gene-based models. (A) The survival probabilities using the full model with five gene signatures. (B-F) The prognostic estimations based on single gene models for FGF10, GATA1, DHTKD1, TBX21, and RET. Each survival curve distinctly demonstrates the divergence between the positive and negative (high risk and low risk) groups.
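The survival curves in Fig. 4 are Kaplan-Meier (product-limit) estimates. A minimal sketch of that estimator is given below, assuming each patient contributes a follow-up time and an event flag (1 = relapse observed, 0 = censored). The function name `kaplan_meier` and the toy follow-up data are hypothetical and do not reproduce the study cohort.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0   # relapses observed at time t
        n_leaving = 0  # relapses plus censored patients at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


# Hypothetical follow-up times (months) and event flags; not from the study.
toy_times = [5, 8, 8, 12, 15, 20]
toy_events = [1, 1, 0, 1, 0, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
for t, s in toy_curve:
    print(t, round(s, 4))
```

Applying the same function separately to the model-positive and model-negative groups would give the two curves compared in each panel.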

Table 1.

Demographic characteristics of study population

Clinical characteristic No. (%) (n=102)
Age (yr), median (range) 50 (30-85)
Stage
 I 17 (16.7)
 II 20 (19.6)
 III 32 (31.4)
 IV 7 (6.9)
 NE 26 (25.5)
Treatment
 Neoadjuvant therapy followed by surgery 43 (42.2)
 Surgery followed by adjuvant therapy 29 (28.4)
 Surgery only 10 (9.8)
 Chemotherapy only 7 (6.9)
 NE 13 (12.7)
Tumor status
 Initial diagnosis 70 (68.6)
 Surgical tissue after neoadjuvant therapy 26 (25.5)
 NE 6 (5.9)

NE, not evaluable.

Table 2.

Odds ratio of fitted logistic regression

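| Variable | Odds ratio | Confidence interval (low) | Confidence interval (high) | p-value |
|---|---|---|---|---|
| FGF10 | 3.72 | 1.30 | 10.93 | 0.015 |
| GATA1 | 10.98 | 1.52 | 227.98 | 0.040 |
| DHTKD1 | 3.37 | 1.17 | 10.39 | 0.028 |
| TBX21 | 9.16 | 1.16 | 199.58 | 0.067 |
| RET | 2.99 | 0.99 | 9.76 | 0.058 |
| Bias | 0.07 | 0.02 | 0.20 | < 0.001 |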
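To show how the odds ratios in Table 2 translate into a predicted relapse probability, the sketch below multiplies the baseline odds (the "Bias" row) by the odds ratio of each altered gene and converts the result to a probability. It assumes each gene enters the model as a 0/1 indicator, which the table itself does not state. Because the odds ratios are rounded to two decimals, the outputs are approximate. The function name `relapse_probability` is hypothetical.

```python
import math

# Odds ratios as reported in Table 2 (rounded values).
ODDS_RATIOS = {
    "FGF10": 3.72,
    "GATA1": 10.98,
    "DHTKD1": 3.37,
    "TBX21": 9.16,
    "RET": 2.99,
}
BASELINE_ODDS = 0.07  # "Bias" row: odds of relapse when no signature gene is altered


def relapse_probability(features):
    """features maps gene name -> 0/1 indicator (assumed coding, not from the source)."""
    log_odds = math.log(BASELINE_ODDS)
    for gene, value in features.items():
        log_odds += math.log(ODDS_RATIOS[gene]) * value
    return 1.0 / (1.0 + math.exp(-log_odds))


p_none = relapse_probability({})
p_fgf10 = relapse_probability({"FGF10": 1})
p_fgf10_gata1 = relapse_probability({"FGF10": 1, "GATA1": 1})
print(round(p_none, 3), round(p_fgf10, 3), round(p_fgf10_gata1, 3))
```

Under these assumptions, a patient with no altered signature gene has a predicted probability well below the 0.45 threshold used for the full model in Table 3, while alterations in both FGF10 and GATA1 push the prediction above it.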

Table 3.

Classification performance on the test dataset

Model Threshold AUROC Accuracy Sensitivity Specificity PPV NPV
Full 0.45 0.9087 0.9048 0.8750 0.9231 0.8750 0.9231
FGF10 0.5 0.8990 0.9048 0.8750 0.9231 0.8750 0.9231
GATA1 0.5 0.5625 0.6667 0.1250 1.0000 1.0000 0.6500
DHTKD1 0.5 0.6827 0.6667 0.7500 0.6154 0.5455 0.8000
TBX21 0.5 0.5240 0.6190 0.1250 0.9231 0.5000 0.6316
RET 0.5 0.6058 0.5714 0.7500 0.4615 0.4615 0.7500

AUROC, area under the receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value.
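The threshold-dependent metrics in Table 3 all follow from a 2×2 confusion matrix. The sketch below shows those definitions. The counts used for the full model (7 true positives, 1 false positive, 12 true negatives, 1 false negative) are an inference: they are one set of counts that reproduces the reported values, and the test-set counts are not given in the table itself. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Assumed counts, chosen to be consistent with the "Full" row of Table 3.
full_model = classification_metrics(tp=7, fp=1, tn=12, fn=1)
for name, value in full_model.items():
    print(name, round(value, 4))
```

With so few patients, a single reclassified case shifts sensitivity by 0.125 under these assumed counts, which is worth keeping in mind when comparing rows of the table.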