Introduction
The development of high technologies has significantly changed the research and treatment methods in psychiatry. Advanced technologies such as social media, smartphones and wearable devices have enabled psychiatric clinicians and researchers to collect a wide range data of subjects/patients within a relatively short period of time to monitor the psychical status of clients or patients,1 and to offer more accurate and personalised treatments. While enjoying the convenience brought to us by the advanced technologies, we are facing the challenge of analysing the large data set generated from them, and making good prediction of some outcomes for a new subject. Unlike the traditional statistical methods which try to find a good fit of the data to interpret the association between the outcome and some potential features, medical researchers and clinicians are interested in the prediction of treatment methods (for example, the dosage of a drug) and treatment outcomes (eg, 5-year survival probability) given a comprehensive measurement of different features of a patient.
Machine learning (ML) takes advantage of advanced statistical methods and computer science techniques, and has been implemented to analyse ‘big data’ nowadays.2 The common types of ML techniques used in the psychiatric field include supervised learning (SL) and unsupervised learning (USL).3
SL is used for data type with a labelled response variable. The purpose of SL is to develop a model for which the outcome can be formulated as a function of the features (covariates) so that the model can make a prediction of the outcome in the future when only the features are given. For instance, suppose we are interested in identifying a patient with either major depressive disorder or no depression, based on the measurement of some factors of patients. SL methods try to build a model between the outcome (eg, depression or not) and a series of features, such as age, gender, education background, work type and so on, which are collected from different data sources. Commonly used examples of SL algorithms include logistic regression (LR) and support vector machine (SVM);4 LR was borrowed directly from traditional statistics and SVM was invented by computer scientists.5 We will discuss LR in detail in the next section.
USL is applied to data without a labelled outcome.6 The algorithms try to recognise similarities/dissimilarities between subjects through input variables (features) without the aid of a labelled outcome. This is why it is called ‘unsupervised’. One of the most commonly used USL methods is the k-means clustering which minimises within-cluster variances to partition observations into k clusters. The lack of labelling will make USL more challenging, while this could also help to reveal the underlying data structure without a possible prior bias. We will discuss the k-means clustering by concrete example in the later section.