Decision tree-based modelling for identification of potential interactions between type 2 diabetes risk factors: a decade follow-up in a Middle East prospective cohort study

Azra Ramezankhani; Esmaeil Hadavandi; Omid Pournik; Jamal Shahrabi; Fereidoun Azizi; Farzad Hadaegh

doi:10.1136/bmjopen-2016-013336


Diabetes and endocrinology

Research

Azra Ramezankhani¹, Esmaeil Hadavandi²,³, Omid Pournik⁴, Jamal Shahrabi², Fereidoun Azizi⁵, Farzad Hadaegh¹

¹Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Science, Shahid Beheshti University of Medical Sciences, Tehran, Iran
²Industrial Engineering Department, Amirkabir University of Technology, Tehran, Iran
³Department of Industrial Engineering, Birjand University of Technology, Birjand, Iran
⁴Department of Community Medicine, School of Medicine, Iran University of Medical Sciences, Tehran, Iran
⁵Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran

Correspondence to Dr Farzad Hadaegh; fzhadaegh{at}endocrine.ac.ir

Abstract

Objective To use the decision tree (DT) method to develop prediction models for the incidence of type 2 diabetes (T2D) and to explore interactions between predictor variables in those models.

Design Prospective cohort study.

Setting Tehran Lipid and Glucose Study (TLGS).

Methods A total of 6647 participants (43.4% men) aged >20 years, without T2D at baseline (1999–2001 or 2002–2005), were followed until 2012. Two series of models (with and without 2-hour postchallenge plasma glucose (2h-PCPG)) were developed using three types of DT algorithms. Model performance was assessed using sensitivity, specificity, area under the ROC curve (AUC), geometric mean (G-Mean) and F-Measure.

Primary outcome measure The primary outcome was incident T2D, defined as fasting plasma glucose (FPG) ≥7 mmol/L, 2h-PCPG ≥11.1 mmol/L or use of antidiabetic medication.

Results During a median follow-up of 9.5 years, 729 new cases of T2D were identified. The Quick Unbiased Efficient Statistical Tree (QUEST) algorithm had the highest sensitivity and G-Mean among all the models for men and women. The models that included 2h-PCPG had a sensitivity and G-Mean of 78% and 0.75 in men and 78% and 0.78 in women. Both models achieved good discrimination, with an AUC of 0.78 or higher. FPG, 2h-PCPG, waist-to-height ratio (WHtR) and mean arterial blood pressure (MAP) were the most important factors for incidence of T2D in both genders. Among men, those with an FPG≤4.9 mmol/L and 2h-PCPG≤7.7 mmol/L had the lowest risk, and those with an FPG>5.3 mmol/L and 2h-PCPG>4.4 mmol/L had the highest risk for T2D incidence. In women, those with an FPG≤5.2 mmol/L and WHtR≤0.55 had the lowest risk, and those with an FPG>5.2 mmol/L and WHtR>0.56 had the highest risk for T2D incidence.

Conclusions Our study emphasises the utility of DT for exploring interactions between predictor variables.

Diabetes
Interaction
Decision tree
Data Mining
Prediction

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/

Strengths and limitations of this study

We used a large population-based sample for our study.
The direct measurements of glucose value and anthropometric indices were used rather than self-reported information for predictor variables and outcome.
Our study proposes a new approach for detecting interactions between predictors.
There were no data available on the dietary intake among participants.
External validation was not performed for the derived prediction models.

Introduction

The prevalence of type 2 diabetes (T2D) mellitus has been increasing rapidly over the past decade. Around 366 million people worldwide had diabetes mellitus in 2011, and this number is expected to reach 552 million by 2030.1 Several risk factors, such as age, sex, ethnicity, family history, obesity and hypertension, are well documented. However, detecting the precise interaction of these and other risk factors with one another is a complex process that varies both within and across populations.2–4

During the past two decades, dozens of prediction models for diabetes have been developed using logistic or Cox regression,4,5 yet a recent systematic search of those multivariable models has shown that few reported prediction models contain interactions, and it seems that few researchers examine them.6 There are a number of reasons why interactions are seldom examined with traditional statistical methods. First, medical research generally involves many candidate predictor variables, which makes variable selection difficult; traditional statistical methods are poorly suited to this type of multiple comparison. Second, many clinical variables are not normally distributed, and different groups of participants may have markedly different degrees of variation. Third, assessing interactions with traditional regression models requires prespecification of the interaction terms; for example, in a linear model involving outcome Y and two predictor variables (x₁ and x₂), the product term x₁x₂ is the common representation of the two-way interaction effect. As the number of variables in the model increases, the number of possible interactions grows large and leads to a complicated model that can be difficult to fit and interpret.6,7 Another class of simple non-parametric regression models for explanation and prediction was introduced in 1963 and is nowadays known as ‘recursive partitioning’ or ‘decision trees’ (DT).
Many variants and extensions of the tree methods have been published in the past 50 years, which have been widely used in many fields such as machine learning, data mining and pattern recognition.8 ,9 Recursive partitioning is a statistical method for exploration of interactions or non-linear relationships among explanatory variables, identification of different subgroups, detection of the most important variables in those subgroups, and finally offering a new way to look at complex data.8 ,10 ,11 Since there will never be enough resources to implement every prevention programme for all target groups, health policymakers prefer interventions that target high-risk groups.12 Therefore, DT models might be helpful for identifying different groups which allow implementation of specific interventions for each group according to their risk probabilities (low-risk and high-risk groups).
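To make the prespecification problem concrete, the two-predictor model with a product term mentioned above can be written out, together with a count of how many two-way product terms a regression would need if every pair of predictors were considered. The coefficients β and the error term ε are notation introduced here for illustration; the values 44 and 54 are the numbers of primary predictors reported later in this paper for men and women.

```latex
% Linear model with a two-way interaction (notation assumed for illustration)
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon

% Number of distinct two-way product terms among p predictors
\binom{p}{2} = \frac{p(p-1)}{2}, \qquad
\binom{44}{2} = 946, \qquad
\binom{54}{2} = 1431
```

A tree, by contrast, does not need these terms listed in advance: an interaction shows up whenever the split chosen below a node differs between its branches.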

The aim of this study was to develop a series of classification trees for adult men and women based on three commonly used DT algorithms (Classification and Regression Tree (CART), Quick Unbiased Efficient Statistical Tree (QUEST) and commercial version (C5.0)) to gain more information on interactions between factors contributing to the incidence of T2D. We used the Tehran Lipid and Glucose Study (TLGS) database for our analysis.

Methods

Study population

The TLGS, an ongoing prospective study, has been described in detail elsewhere.13 Briefly, the baseline study (phase 1) was performed from 1999 to 2001, with follow-ups in three consecutive phases: 2002–2005 (phase 2), 2005–2008 (phase 3) and 2009–2012 (phase 4). After the cross-sectional phase (phase 1), participants were assigned to a cohort and a prospective interventional study. For this study, 10 368 participants aged ≥20 years from the first phase were selected and followed from the date of enrolment through phase 4; moreover, in the second phase, 2440 new participants entered and were followed in the next two phases (3 and 4). We excluded participants with prevalent T2D at baseline (n=1376) and those with missing data regarding fasting plasma glucose (FPG) and 2-hour postchallenge plasma glucose (2h-PCPG) (n=1122). Overall, 3663 (35%) participants were lost to follow-up and 729 new cases of T2D were identified by the end of phase 4 (figure 1). Written informed consent was obtained from each participant.
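The participant counts in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and it assumes the loss to follow-up is expressed relative to participants remaining after the two exclusions.

```python
# Counts quoted in the text; variable names are illustrative only.
enrolled_phase1 = 10368
entered_phase2 = 2440
prevalent_t2d = 1376
missing_glucose = 1122
lost_to_follow_up = 3663

# Participants remaining after the two exclusions.
eligible = enrolled_phase1 + entered_phase2 - prevalent_t2d - missing_glucose
# Participants with follow-up data, i.e. the analysed cohort.
analysed = eligible - lost_to_follow_up
# Assumption: the loss rate is taken relative to the eligible participants.
loss_rate_percent = 100 * lost_to_follow_up / eligible

print(eligible, analysed, round(loss_rate_percent, 1))
```

Under these assumptions the analysed cohort comes to 6647, which matches the final data set size reported in the paper, and the loss rate is about 35.5%, reported in the text as 35%.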

Figure 1

Flow diagram for the selection of study participants in the Tehran Lipid and Glucose Study. 2h-PCPG, 2-hour postchallenge plasma glucose; FPG, fasting plasma glucose.

Clinical, anthropometric and laboratory measurements

Information on demographics, education, smoking status, physical activity, and medical and drug history was collected by interview. For women, additional information on reproductive history, menstruation status and interventions to prevent pregnancy was collected using a pretested questionnaire. Anthropometric measures including weight, height and waist circumference (WC) were taken according to a standard protocol.14 Body mass index (BMI) was calculated as weight (kg)/height (m)². Waist-to-hip ratio (WHpR) was calculated as WC/hip circumference and waist-to-height ratio (WHtR) was calculated as WC/height. Systolic and diastolic blood pressure (SBP and DBP, respectively), and blood parameters such as FPG, 2h-PCPG, triglycerides (TGs), total cholesterol (TC) and high-density lipoprotein cholesterol (HDL-c) were measured using previously reported methods.15 The TG-to-HDL-c ratio (TG/HDL) was calculated as TG/HDL-c and the TC-to-HDL-c ratio (TC/HDL) as TC/HDL-c.
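The derived indices defined in this paragraph are simple ratios. A minimal sketch follows; the function name, argument names and example values are hypothetical, and it assumes that waist, hip and height are recorded in the same unit (cm) and that the lipid values share a common unit.

```python
def derived_indices(weight_kg, height_cm, waist_cm, hip_cm, tg, tc, hdl_c):
    """Compute the anthropometric and lipid ratios described in the text."""
    height_m = height_cm / 100
    return {
        "BMI": weight_kg / height_m ** 2,   # weight (kg) / height (m) squared
        "WHpR": waist_cm / hip_cm,          # waist-to-hip ratio
        "WHtR": waist_cm / height_cm,       # waist-to-height ratio
        "TG/HDL": tg / hdl_c,
        "TC/HDL": tc / hdl_c,
    }

# Hypothetical participant, for illustration only.
example = derived_indices(weight_kg=80, height_cm=175, waist_cm=91,
                          hip_cm=100, tg=1.8, tc=5.5, hdl_c=1.1)
print({name: round(value, 2) for name, value in example.items()})
```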

Definition of variables and outcome

Education level was categorised into five levels: illiterate, 1–5 years, 6–12 years, 13–16 years and more than 16 years of schooling. Marital status was categorised as single, married, widowed and divorced. A current smoker was defined as a person who smokes cigarettes daily or occasionally. Former smokers were defined as individuals who had smoked daily or occasionally and had quit smoking. Passive smoking was defined as exposure to secondhand cigarette smoke in the home, at work or in other environments. A family history of premature cardiovascular diseases (CVD) was considered as any experience of fatal or non-fatal myocardial infarction, stroke or sudden cardiac arrest in first-degree relatives, if it occurred before 55 years of age in male relatives and before 65 years of age in female relatives. A history of CVD was defined as previous ischaemic heart disease and/or cerebrovascular accidents. A family history of diabetes (FHD) was defined as having T2D in first-degree relatives. On the basis of their self-reported levels of leisure time physical activity, participants were categorised into two groups, in which ‘inactive’ means those doing exercise or labour less than three times a week or performing activities achieving lower than 600 MET. Mean arterial blood pressure (MAP) was obtained as ([(2×diastolic)+systolic]/3).16 Pulse pressure was defined as SBP minus DBP. Participants were grouped into two categories based on participation in the lifestyle intervention. Women were categorised into three groups on the basis of their menstruation status: having normal menstrual cycle by taking medication, normal menopause, early menopause because of surgery or other reasons. Women were also categorised into six levels according to pregnancy prevention method: use of hormonal contraceptive drugs, intrauterine devices (IUDs), using condoms, withdrawal method, tubectomy/vasectomy and not applicable.
They were also categorised into two groups based on birth history, a history of hypertension and hyperglycaemia in pregnancy. Incidence of T2D (outcome variable) was defined based on an FPG≥7.0 mmol/L or 2h-PCPG≥11.1 mmol/L or taking antidiabetic medication in all phases of the study.17 Final data sets consisted of 6647 cases (3762 women) which included 54 and 44 primary predictor variables in women and men, respectively.
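The blood pressure summaries and the outcome definition given above translate directly into code. The following is a minimal sketch; the function and argument names are hypothetical, and handling of missing measurements is deliberately left out.

```python
def mean_arterial_pressure(sbp, dbp):
    """MAP = ((2 x diastolic) + systolic) / 3, as defined in the text."""
    return (2 * dbp + sbp) / 3

def pulse_pressure(sbp, dbp):
    """Pulse pressure = SBP minus DBP."""
    return sbp - dbp

def has_t2d(fpg_mmol_l, pcpg_2h_mmol_l, on_antidiabetic_medication):
    """Outcome definition: any one of the three criteria is sufficient."""
    return (fpg_mmol_l >= 7.0
            or pcpg_2h_mmol_l >= 11.1
            or on_antidiabetic_medication)

# Hypothetical readings, for illustration only.
print(round(mean_arterial_pressure(120, 80), 1), pulse_pressure(120, 80))
print(has_t2d(6.4, 11.3, False), has_t2d(5.2, 7.0, False))
```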

Statistical methods

Data preparation

Data were prepared before analysis. Data preparation included: missing data handling, variables selection, defining the train and validation data sets and balancing the train data sets.

Missing data handling

Little's missing completely at random (MCAR) test18 on the primary set of predictor variables showed that missing data were MCAR in men (p=0.15) but not in women (p<0.001).19 We used single imputation for the missing data. All the primary variables were included in the imputation, except for the outcome variable. Continuous variables were imputed by the CART method,10 using SPSS modeler (V.14.2.0.3, IBM), and for categorical variables we applied the weighted K-Nearest Neighbor approach using RapidMiner (V.5).20

Training and validation data

The entire data sets of men and women were each divided into two sets using stratified random sampling: a training set consisting of 70% of the data for model development, and a test or validation set consisting of the remaining 30% for internal validation (figure 2).
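A stratified 70/30 split keeps the proportion of incident cases roughly equal in the training and validation sets. The sketch below is one plain-Python way to do this and is not the authors' procedure, whose software settings and random seed are not given; the function name is hypothetical. The toy labels use the counts for men reported in this paper (302 incident cases among 6647 − 3762 = 2885 men).

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, validation_indices), sampled within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within the class
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return sorted(train), sorted(validation)

labels = [1] * 302 + [0] * (2885 - 302)
train_idx, valid_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
valid_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), round(train_rate, 3), round(valid_rate, 3))
```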

Figure 2

Generation of training and validation data set diagram.

Data balancing

Most of the popular classification algorithms, such as DT, work well when the positive and negative cases are evenly distributed; problems arise when the data set is imbalanced.21 Class imbalance in medical data occurs when there are many more cases of some classes (majority or negative class) than of others (minority or positive class).22 In such cases, standard classifiers tend to achieve high overall accuracy by favouring the majority class at the expense of the minority class.23 Several solutions have been proposed for handling imbalanced data sets.21–23 In our previous work, we showed the effectiveness of the Synthetic Minority Oversampling Technique (SMOTE) for handling imbalanced data sets.24 In this study, we balanced the two training data sets of men and women using SMOTE as previously reported (figure 2).24
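The core idea of SMOTE is to create synthetic minority-class cases by interpolating between an existing minority case and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not the configuration used in this study (the number of neighbours and the amount of oversampling are not stated here); the function name, the default k and the example values are assumptions.

```python
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Other minority points, ordered by squared Euclidean distance to base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(minority[j], base)))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical (FPG in mmol/L, WHtR) pairs for a few positive cases.
positives = [(5.6, 0.58), (5.9, 0.61), (6.2, 0.55), (5.4, 0.63), (6.0, 0.57)]
new_points = smote_sketch(positives, n_new=10, k=3)
print(len(new_points))
```

In practice the predictors would usually be put on a common scale before distances are computed, and oversampling would be applied to the training set only, as described above.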

Variables selection

Variable or feature selection methods have been used since the 1970s in the fields of statistics and machine learning.25 Variable selection methods have been shown to be effective in removing redundant and irrelevant variables, improving prediction performance of learning algorithms and reducing the effects of high dimensionality in the data.26 Therefore, in order to identify the best subset of variables while retaining the predictive power of the original variables, we applied the multivariate filter approach, using correlation-based feature selection and consistency-based feature selection as two evaluation criteria in conjunction with Best First and Genetic Algorithm as two search strategies.27 Four subsets of variables were thus selected using the combinations of the two search strategies and two evaluation criteria. To arrive at the final set, the four subsets were reviewed and the variables that appeared in at least two subsets were chosen. Variable selection methods were applied on the training data sets after imputation of missing data. We used the Weka toolkit (V.3.2.) for selecting variables.
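The final step described above, keeping variables that appear in at least two of the four candidate subsets, amounts to a simple vote count. A minimal sketch follows; the function name is hypothetical and the subset contents are invented for illustration, not the subsets actually obtained in Weka.

```python
from collections import Counter

def vote_variables(subsets, min_votes=2):
    """Keep variables that appear in at least min_votes of the subsets."""
    counts = Counter(name for subset in subsets for name in set(subset))
    return sorted(name for name, votes in counts.items() if votes >= min_votes)

# Hypothetical outputs of the four criterion/search combinations.
subsets = [
    ["FPG", "2h-PCPG", "WHtR"],          # correlation-based + Best First
    ["FPG", "2h-PCPG", "MAP", "BMI"],    # correlation-based + Genetic Algorithm
    ["FPG", "WHtR", "age"],              # consistency-based + Best First
    ["FPG", "MAP", "FHD"],               # consistency-based + Genetic Algorithm
]
selected = vote_variables(subsets)
print(selected)
```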

Statistical analysis

Baseline characteristics were compared between participants with and without T2D across men and women. Also, characteristics were compared between followed up versus non-followed up participants. Comparisons were done using Student's t-test and χ² with a two-tailed p<0.05 being considered significant.

Methods for DT modelling

There are many different algorithms for fitting tree-structured models, coming from different communities.9,28 All the DT algorithms generate a set of classification rules and construct a DT. A tree has three types of nodes: the root node, internal nodes and terminal nodes. Both the root and the internal nodes are partitioned into two nodes in the next layer; however, the terminal nodes do not have offspring nodes. The root node contains the learning sample from which the tree is grown. The basic process of developing a DT includes three elements: selection of the variable used to split the data (splitting criterion), a stopping rule that decides when to stop splitting a node and mark it as terminal, and the pruning method.8
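To illustrate what a splitting criterion does, the sketch below searches for the cut-off on a single continuous predictor that minimises the weighted Gini impurity of the two child nodes, which is the kind of search commonly associated with CART. It is an illustration only: QUEST and C5.0 rely on other criteria, the settings used in this study are not described here, and the function names and toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node with binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted child impurity) of the best binary split."""
    n = len(values)
    best = (None, gini(labels))          # no split: impurity of the parent node
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2     # candidate cut between adjacent values
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical FPG values (mmol/L) and incident T2D labels.
fpg = [4.6, 4.8, 5.0, 5.1, 5.4, 5.6, 5.8, 6.1]
t2d = [0, 0, 0, 0, 1, 0, 1, 1]
threshold, impurity = best_threshold(fpg, t2d)
print(threshold, impurity)
```

Growing a tree repeats this search over all candidate variables at each node until the stopping rule applies, after which pruning removes branches that add little.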

To choose the right algorithm for our problem, we applied three types of DT algorithms which are widely used for generating a binary tree: the CART algorithm,8 QUEST29 and C5.0.30 All the DT models were built using IBM SPSS modeler 14.2.

Model evaluation

Performances of the models were evaluated on the test or validation data sets. In data mining, a classifier is usually evaluated by accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and the area under the curve (AUC). When data are imbalanced, accuracy mainly reflects performance on the majority class (negative cases). The geometric mean (G-Mean), however, indicates the balance between model performance on the negative and positive classes and avoids overfitting to the negative class.31

$$\text{G-Mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}} \qquad (1)$$

F-Measure, the harmonic mean of PPV (precision) and sensitivity (recall), is another measure that increases as precision and recall increase. A high value of F-Measure indicates that the model performs better on the positive class.31,32 We chose sensitivity and G-Mean to compare the models and select the best one.

$$\text{F-Measure} = \frac{2 \times \text{PPV} \times \text{Sensitivity}}{\text{PPV} + \text{Sensitivity}} \qquad (2)$$
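The measures described above can all be computed from the four cells of a confusion matrix. The sketch below is a minimal illustration; the function name is hypothetical and the counts are invented so that sensitivity is 78% and specificity is 72%, the values reported for model 1 in men.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, G-Mean and F-Measure from raw counts."""
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                  # precision
    g_mean = math.sqrt(sensitivity * specificity)
    f_measure = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }

# Hypothetical validation counts (not taken from the paper's tables).
metrics = classification_metrics(tp=78, fp=280, tn=720, fn=22)
print({name: round(value, 2) for name, value in metrics.items()})
```

With these counts the G-Mean rounds to 0.75, which agrees with the value reported alongside 78% sensitivity and 72% specificity for men. The F-Measure stays low because PPV is low when positives are rare, which is one reason to report it next to G-Mean.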

Results

Missing data analysis showed that about 59% and 70% of the primary variables in men and women (44 and 54 variables, respectively) had at least two missing values. The proportion of missing data per variable ranged from 0.1% to 5% in men and from 0.1% to 6% in women. Using the variable selection methods, 15 and 20 variables were selected for the model building process for men and women, respectively. The percentage of missing data for the selected variables is shown in tables 1 and 2.

Table 1

Baseline characteristics of men (TLGS 1999–2012)

Table 2

Baseline characteristics of women (TLGS 1999–2012)

Characteristics of participants

Baseline characteristics of the study population are presented in tables 1 and 2. During a median 9.5 years of follow-up (IQR 6.13–10.2 years), T2D developed in 302 men (10%) and 427 women (11%). Comparison of baseline characteristics between the followed and non-followed participants (only for selected variables) is shown in tables 3 and 4. Followed men had a higher TC/HDL (5.5 vs 5.4) but a lower age (41.8 vs 43.4 years). The proportion of individuals with low education levels (≤5 years) was lower in followed men (20.5% vs 26.2%). Followed women had lower age (39.6 vs 40.5 years), pulse pressure (39.3 vs 40.2 mm Hg) and MAP (89.2 vs 89.9 mm Hg). The proportion of illiterate women was lower in followed women (8.1% vs 13.2%).

Table 3

Baseline characteristics of followed up and non-followed up men (TLGS 1999–2012)

Table 4

Baseline characteristics of followed up and non-followed up women (TLGS 1999–2012)

Model performances

We constructed the DT models using the balanced training data sets with two sets of variables: (1) selected variables that included 2h-PCPG, and (2) selected variables without 2h-PCPG. The performance measures for the two types of DT models are shown in tables 5 and 6. Comparison between models 1 and 2 shows that removing 2h-PCPG from the variables list decreases the sensitivity of all three models by 5–10% in men and 2–5% in women. QUEST had the highest sensitivity and G-Mean among all models for both men and women; therefore, we chose it as the best DT model.

Table 5

Performances of the decision tree models for men (Tehran Lipid and Glucose Study 1999–2012)

Table 6

Performances of the decision tree models for women (Tehran Lipid and Glucose Study 1999–2012)

DT analysis in men

Figure 3 depicts the DT for model 1, including the predictor variables and the cut-off points for each predictor. It used four variables (FPG, 2h-PCPG, age and WHtR) for classification and generated seven decision rules; each rule identifies a special subgroup with a certain probability of outcome (positive or negative) for each person belonging to that subgroup. The FPG, located on the top of the tree, was the most important factor in incidence of T2D.

Figure 3

Decision tree for model 1 in men. Performance measures: sensitivity: 78%, specificity: 72%, G-Mean: 0.75, AUC: 0.78. 2h-PCPG, 2-hour postchallenge plasma glucose; AUC, area under the curve; FPG, fasting plasma glucose; G-Mean, geometric mean; WHtR, waist-to-height ratio.

Table 7 shows the seven subgroups identified by the DT of model 1. Each group was specified by a combination of variables that identified a probability for incidence of T2D. For example, group 1 (low risk) consisted of men with an FPG<4.9 mmol/L and 2h-PCPG<7.7 mmol/L who had a 10% probability for incidence of T2D in the study period. Group 7 (high risk) consisted of men with an FPG>5.3 mmol/L and 2h-PCPG>4.4 mmol/L who had a 79% probability for incidence of T2D. The observed risk pattern in each subgroup revealed the interaction between a set of variables; for example, the patterns for group 2 show that in men with an FPG of 4.9–5.3 mmol/L and 2h-PCPG<7.7 mmol/L, risk of incidence depends on the value of WHtR. There was also an interaction between FPG, 2h-PCPG and age such that age >43 years increased the risk of T2D among men who had an FPG>5.3 mmol/L and 2h-PCPG≤4.4 mmol/L (groups 3 and 6). In model 2 (without 2h-PCPG), nine subgroups were identified. The DT used four variables (FPG, WHtR, MAP and FHD) for classification (table 7). Results showed that FPG was the most important predictor for incidence of T2D; men with an FPG<4.9 mmol/L had a lower risk, but with FPG above 5.3 mmol/L, the risk of incidence depended on the WHtR and MAP.
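Each path from the root of a tree to a terminal node can be read as an if-then rule. The sketch below encodes only the two subgroups of model 1 in men whose cut-offs and probabilities are quoted in the paragraph above; the remaining groups are defined in table 7 and are deliberately not reconstructed here. The function name and return format are hypothetical.

```python
def men_model1_documented_group(fpg_mmol_l, pcpg_2h_mmol_l):
    """Return (group label, probability of incident T2D), or None if the
    participant falls in a group not spelled out in the text."""
    if fpg_mmol_l < 4.9 and pcpg_2h_mmol_l < 7.7:
        return ("group 1 (low risk)", 0.10)
    if fpg_mmol_l > 5.3 and pcpg_2h_mmol_l > 4.4:
        return ("group 7 (high risk)", 0.79)
    return None  # groups 2 to 6: see table 7

print(men_model1_documented_group(4.7, 6.0))
print(men_model1_documented_group(5.6, 6.0))
print(men_model1_documented_group(5.1, 6.0))
```

The same 2h-PCPG value of 6.0 mmol/L falls on the low-risk side of one rule and the high-risk side of the other, depending on FPG; that dependence of one predictor's cut-off on another is what the text means by an interaction.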

Table 7

Groups identified by decision tree models for men (Tehran Lipid and Glucose Study 1999–2012)

DT analysis in women

The DT created for women is shown in figure 4. The model used three variables (FPG, 2h-PCPG and WHtR) for identification of seven subgroups (table 8). Group 1 (low risk) consisted of women who had an FPG≤5.2 mmol/L and WHtR≤0.55 (12% probability for incidence of T2D). Group 7 (high risk) consisted of women who had an FPG>5.2 mmol/L and WHtR>0.52 (81% probability for incidence of T2D). The observed patterns in the subgroups show that when the FPG level is <5.2 mmol/L, WHtR and 2h-PCPG are the most important factors in incidence of T2D, whereas if FPG is >5.2 mmol/L, WHtR is the most important factor. Some types of interactions were observed between FPG, WHtR and 2h-PCPG in women; for example, the patterns in groups 4 and 7 show that in women with an FPG>5.2 mmol/L, the probability of T2D incidence is about 55 percentage points higher with a WHtR of over 0.52. In model 2 for women, in which we excluded 2h-PCPG from the variables list, nine subgroups were identified using three variables (FPG, WHtR and MAP). This model had a lower sensitivity than model 1. Different interactions were found by this model; for example, when FPG is >5.2 mmol/L, WHtR≥0.56 is the most important risk factor for T2D, whereas when FPG is <5.2 mmol/L, WHtR and MAP play an important role in T2D incidence.

Table 8

Groups identified by decision tree models for women (Tehran Lipid and Glucose Study 1999–2012)

Figure 4

Decision tree for model 1 in women. Performance measures: sensitivity: 78%, specificity: 78%, G-Mean: 0.78, AUC: 0.81. 2h-PCPG, 2-hour postchallenge plasma glucose; AUC, area under the curve; FPG, fasting plasma glucose; G-Mean, geometric mean; WHtR, waist-to-height ratio.

Discussion

In this study, we used three types of DT-based methods to provide insight into the factors that have an important role in the incidence of T2D and how these factors might interact to reveal specific subgroups. We used the more established and widely available algorithms to select the one with the best performance. Considering sensitivity and G-Mean, QUEST had the best performance in both the men's and women's data sets. Although our study focused on exploration of interactions, DT models can be used for predicting the 9-year risk of developing T2D. Also, it is possible to identify who needs more or different treatments if we take interactions into account.

Two sets of variables were used for DT development. In model (1), we used selected variables which included 2h-PCPG, and in model (2), we excluded the 2h-PCPG from the variables list. Results of QUEST showed that although four similar predictors had the highest power both in men and women, they had different interaction patterns in the two genders; for instance, women with WHtR≤0.52 had a lower risk (26%) for T2D even with an FPG level of above 5.2 mmol/L. However, in men, the results showed that when FPG is >5.3 mmol/L, there was still a 56% risk of T2D, even with a lowering of WHtR to below 0.45. A systematic review of existing evidence has shown that the mean of suggested cut-off values for WHtR in men and women, respectively, was 0.52 and 0.53 for incidence of T2D.33 However, the results of this study showed that the recommended cut-off of 0.52 for WHtR is not a safe value for decreasing the risk of T2D among men, since significant risk of T2D was observed among men with WHtR≤0.45, as we pointed out above. Therefore, men with WHtR below 0.52 should not be given false assurances about their risk of incident T2D if their FPG level is >5.3 mmol/L.

A review of current studies shows that being aged >40 years is a risk factor for developing T2D.34 The results of our study show that age ≥43 years is a risk factor for men who have an FPG level >5.3 mmol/L. Results from this study confirm previous findings about the FPG cut-off point, obtained using traditional methods; additionally, we found the FPG cut-off point for men and women separately. For instance, two published studies of TLGS have shown that individuals with FPG levels <5.1 mmol/L are very unlikely to develop T2D during 6 and 9 years of follow-up.35,36 This study shows that among men with an FPG level <4.9 mmol/L, there is only a 14% risk for T2D incidence within about 9 years. Another interesting finding of our study was the important role of MAP in incidence of T2D in men and women. There are very few studies assessing the role of MAP in T2D incidence. Based on some previous studies, hypertension has been recognised as a risk factor for incident T2D in various populations.37 The inter-related pathophysiology of hypertension and T2D is complex and not fully understood.38 Our study showed that an MAP of ≥92 mm Hg is a risk factor among men with an FPG>5.3 mmol/L even if WHtR is <0.49. In women, an MAP of ≥97 mm Hg is a risk factor when WHtR is >0.66, even if the FPG level is ≤5.2 mmol/L. These results imply that the co-occurrence of a high level of MAP and central obesity among women is a risk factor for T2D, whereas in men an increased level of FPG and MAP together is a risk factor for T2D. A simple point score system has recently been developed based on the TLGS database, including SBP, FHD, WHtR, TG/HDL-c and FPG as predictors;39 continuous variables such as FPG and WHtR were, however, categorised into three or four groups. In other words, the cut-off points were predefined for prediction of T2D. In our study, DT algorithms generated optimal cut-off points for these variables as they relate to the best classification of participants with and without T2D.

Some strengths of this study include a large population-based sample. We used direct measurements of glucose value and anthropometric indices rather than self-reported information for both predictor variables and outcomes. Applying two variable selection methods with two evaluation criteria, missing data imputation and construction of DT models for both genders are other notable strengths. We have described the methodology in detail, allowing medical researchers to perform similar studies in different domains using DT methods.

The main limitation of this study is the 35% loss to follow-up, although a number of authors have proposed a value of 50–80% as an acceptable level of follow-up.40 In this study, we found statistically significant but not clinically important differences between the followed and non-followed populations in some baseline variables. The followed men had a higher value for the TC-to-HDL ratio, but lower age. In women, age, pulse pressure and MAP were lower for the followed population. Since these factors were associated with T2D, the results may be biased towards an underestimation of the association between risk factors such as age and MAP and T2D. Additionally, we did not have data on dietary intake, which is an important factor in T2D studies. Finally, the models need to be validated on an independent population considering the ethnic and racial variations in T2D incidence.

Conclusions

DT analysis identified different interactions between predictor variables of T2D incidence in men and women. Sensitivity and G-Mean were measured on the validation data and showed acceptable performance of the DT models. Our results showed that WHtR and FPG were important risk factors in women and men, respectively.

Acknowledgments

The authors wish to acknowledge Ms Niloofar Shiva for critical editing of English grammar and syntax of the manuscript.

References

Whiting DR, Guariguata L, Weil C, et al. IDF diabetes atlas: global estimates of the prevalence of diabetes for 2011 and 2030. Diabetes Res Clin Pract 2011;94:311–21. doi:10.1016/j.diabres.2011.10.029
Hippisley-Cox J, Coupland C, Robson J, et al. Predicting risk of type 2 diabetes in England and Wales: prospective derivation and validation of QDScore. BMJ 2009;338:b880. doi:10.1136/bmj.b880
Park KS. The search for genetic risk factors of type 2 diabetes mellitus. Diabetes Metab J 2011;35:12–22. doi:10.4093/dmj.2011.35.1.12
Noble D, Mathur R, Dent T, et al. Risk models and scores for type 2 diabetes: systematic review. BMJ 2011;343:d7163. doi:10.1136/bmj.d7163
Abbasi A, Peelen LM, Corpeleijn E, et al. Prediction models for risk of developing type 2 diabetes: systematic literature search and independent external validation study. BMJ 2012;345:e5900. doi:10.1136/bmj.e5900
Moons KG, Altman DG, Reitsma JB, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med 2015;162:W1–73. doi:10.7326/M14-0698
Kleinbaum DG, Klein M. Logistic regression: a self-learning text. Springer Science & Business Media, 2010.
Han J, Kamber M, Pei J. Data mining: concepts and techniques. Elsevier, 2011.
Loh WY. Fifty years of classification and regression trees. Int Stat Rev 2014;82:329–48. doi:10.1111/insr.12016
Van Buuren S. Flexible imputation of missing data. CRC Press, 2012.
Zhang H, Singer B. Recursive partitioning and applications. Springer Science & Business Media, 2010.
Epping-Jordan JE, Galea G, Tukuitonga C, et al. Preventing chronic diseases: taking stepwise action. Lancet 2005;366:1667–71. doi:10.1016/S0140-6736(05)67342-4
Azizi F, Ghanbarian A, Momenan AA, et al. Prevention of non-communicable disease in a population in nutrition transition: Tehran Lipid and Glucose Study phase II. Trials 2009;10:5. doi:10.1186/1745-6215-10-5
Azizi F, Rahmani M, Emami H, et al. Cardiovascular risk factors in an Iranian urban population: Tehran lipid and glucose study (phase 1). Soz Praventivmed 2002;47:408–26. doi:10.1007/s000380200008
Harati H, Hadaegh F, Saadat N, et al. Population-based incidence of type 2 diabetes and its associated risk factors: results from a six-year cohort study in Iran. BMC Public Health 2009;9:186. doi:10.1186/1471-2458-9-186
Franklin SS, Gustin W, Wong ND, et al. Hemodynamic patterns of age-related changes in blood pressure: the Framingham Heart Study. Circulation 1997;96:308–15. doi:10.1161/01.CIR.96.1.308
Gavin J, Alberti K, Davidson M, et al. Report of the expert committee on the diagnosis and classification of diabetes mellitus. Diabetes Care 1997;20:1183–97. doi:10.2337/diacare.20.7.1183
Enders CK. Applied missing data analysis. Guilford Press, 2010.
Steyerberg EW. Clinical prediction models: a practical approach to development, validation, and updating. Springer Science & Business Media, 2009.
Akthar F, Hahne C. RapidMiner 5 Operator Reference (2012). https://rapidminer.com/wp-content/uploads/2013/10/RapidMiner_OperatorReference_en.pdf (accessed 12 Feb 2015).
Chawla NV, Lazarevic A, Hall LO, et al. SMOTEBoost: improving prediction of the minority class in boosting. In: Lavrač N, Gamberger D, Todorovski L, et al. Knowledge discovery in databases: PKDD 2003. Berlin: Springer, 2003:107–19.
Chawla N, Bowyer K, Hall L, et al. SMOTE: Synthetic Minority Over-Sampling Technique. J Artif Intell Res 2002;16:321–57.
López V, Fernández A, Moreno-Torres JG, et al. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Syst Appl 2012;39:6585–608.
Ramezankhani A, Pournik O, Shahrabi J, et al. The Impact of oversampling with SMOTE on the performance of 3 classifiers in prediction of type 2 diabetes. Med Decis Making 2016;36:137–44. doi:10.1177/0272989X14560647
John GH, Kohavi R, Pfleger K. eds. Irrelevant features and the subset selection problem. Machine Learning: Proceedings of the Eleventh International Conference. 1994.
Liu H, Motoda H. Computational methods of feature selection. CRC Press, 2007.
Liu H, Motoda H. Feature selection for knowledge discovery and data mining. Springer, 1998.
Rusch T, Zeileis A. To see the wood for the trees: discussion of “50 years of classification and regression trees”. Int Stat Rev 2014;82:361–7. doi:10.1111/insr.12062
Ture M, Tokatli F, Kurt I. Using Kaplan-Meier analysis together with decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) in determining recurrence-free survival of breast cancer patients. Expert Syst Appl 2009;36:2017–26.
Ville BD. Decision tree for business intelligence and data mining. SAS Publishing, 2006.
Bekkar M, Djemaa HK, Alitouche TA. Evaluation measures for models assessment over imbalanced data sets. J Info Eng Appl 2013;3:27–38.
Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett 2006;27:861–74. doi:10.1016/j.patrec.2005.10.010
Ashwell M, Gunn P, Gibson S. Waist-to-height ratio is a better screening tool than waist circumference and BMI for adult cardiometabolic risk factors: systematic review and meta-analysis. Obes Rev 2012;13:275–86. doi:10.1111/j.1467-789X.2011.00952.x
Stevens JW, Khunti K, Harvey R, et al. Preventing the progression to type 2 diabetes mellitus in adults at high risk: a systematic review and network meta-analysis of lifestyle, pharmacological and surgical interventions. Diabetes Res Clin Pract 2015;107:320–31. doi:10.1016/j.diabres.2015.01.027
Bozorgmanesh M, Hadaegh F, Saadat N, et al. Fasting glucose cutoff point: where does the risk terminate? Tehran lipid and glucose study. Acta Diabetol 2012;49:341–8. doi:10.1007/s00592-011-0298-5
Ramezankhani A, Pournik O, Shahrabi J, et al. Applying decision tree for identification of a low risk population for type 2 diabetes. Tehran Lipid and Glucose Study. Diabetes Res Clin Pract 2014;105:391–8. doi:10.1016/j.diabres.2014.07.003
Hatami M, Hadaegh F, Khalili D, et al. Family history of diabetes modifies the effect of blood pressure for incident diabetes in Middle Eastern women: Tehran Lipid and Glucose Study. J Hum Hypertens 2012;26:84–90.
Cooper-DeHoff RM, Egelund EF, Pepine CJ. Blood pressure lowering in patients with diabetes-one level might not fit all. Nat Rev Cardiol 2011;8:42–9.
Bozorgmanesh M, Hadaegh F, Ghaffari S, et al. A simple risk score effectively predicted type 2 diabetes in Iranian adult population: population-based cohort study. Eur J Public Health 2011;21:554–9. doi:10.1093/eurpub/ckq074
Kristman V, Manno M, Côté P. Loss to follow-up in cohort studies: how much is too much? Eur J Epidemiol 2004;19:751–60. doi:10.1023/B:EJEP.0000036568.02655.f8

Footnotes

Contributors FA and FH designed the study protocol, and participated in the coordination and management of the study. AR performed the statistical analysis and wrote the manuscript. EH, JS and OP participated in the statistical analysis and interpretation of data. All authors read and approved the final manuscript.
Funding This study was supported by grant number 121 from the National Research Council of the Islamic Republic of Iran.
Disclaimer The funding source had no role in the design, in the collection, analysis and interpretation of data, in the writing of the manuscript, and in the decision to submit the manuscript for publication.
Competing interests None declared.
Patient consent Obtained.
Ethics approval This study was approved by the Ethical Committee of the Research Institute for Endocrine Sciences.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement No additional data are available.
