b

DiscoverSearch
About
My stuff
Breast Cancer Diagnosis via Classification Algorithms
2018·arXiv
Abstract
Abstract

In this paper, we analyze the Wisconsin Diagnostic Breast Cancer Data using Machine Learning classification techniques, such as the SVM, Bayesian Logistic Regression (Variational Approximation), and K-Nearest-Neighbors. We describe each model, and compare their performance through different measures. We conclude that SVM has the best performance among all other classifiers, while it competes closely with the Bayesian Logistic Regression that is ranked second best method for this dataset.

Breast Cancer is the most common cancer among women, all around the world. There are two types of breast mass, ”Benign” and ”Malignant” , where benign is the non-cancerous type and malignant is the cancerous one. According to the World Health Organization, about 502,000 deaths (per year) worldwide are caused by breast cancer. The most effective way to reduce breast cancer deaths is to detect it earlier. Mammography and a Fine Needle Aspiration biopsy (FNA) are the most common diagnostic tools, which may be time consuming and cost-effective. Thus the automatic diagnosis of breast cancer becomes important, while attracting many researchers from all over the world to find an accurate and reliable diagnosis (classification) method. In Machine Learning content, there are several different methods for classification, such as SVM (Support Vector Machines) which is one of the most popular ones, Bayesian Logistic Regression (which we will use Variational Approximation here), and K-Nearest-Neighbors, etc.

SVM attempts to find the best hyperplane (by maximizing the margin) that separates the data into the two classes benign and malignant, and thus is a powerful method since it is flexible with the selection of kernel functions that map the data into higher dimension so that they are separable. Bayesian logistic regression is also another popular classification method where a prior is defined for the parameters of the model to find the posterior. After finding the posterior, we can obtain the predictive distribution and use it to predict for new test cases by defining a decision boundary. K-Nearest-Neighbors is one of the simplest classification methods, also called as the lazy learning where an object is assigned to the class most common amongst its k nearest neighbours (k being a positive integer).

Previous work on breast cancer diagnosis using classifier techniques, have shown that usually SVM has a higher accuracy performance compared to other classifiers, although in a few, Logistic Regression was also at a high level of accuracy (depending on the data set). On the other hand, although K-NN may not be the best classifier in general, but as we will see, it shows a desirable performance for this breast cancer diagnosis problem. In this paper, we have collected the Wisconsin Diagnostic Breast Cancer Data (WDBC) from the UCI Machine Learning Repository to apply the methods described above. We will see that our results are in-line with previous conclusions, although a Bayesian approach to Logistic Regression will significantly increase the performance compared to the Logistic Regression.

The Wisconsin Diagnostic Breast Cancer Data consists of 569 instances with 32 attributes such as ID, diagnosis statement (B=benign, M=malignant), and 30 real-valued features that were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. Ten real-valued features were computed for each cell nucleus:

1) radius, 2) texture, 3) perimeter, 4) area, 5) smoothness, 6) compactness, 7) concavity, 8) concave points, 9) symmetry, 10) fractal dimension.

The mean, standard error, and ”worst” or largest ( mean of the three largest values ) of these features were computed for each image, resulting 30 features. There are 357 benign and 212 malignant, which we will randomly split the 569 data into 80% training set and 20% test set to check performance.

Here we will briefly describe each classification method:

3.1 SVM (Support Vector Machines)

Given some data  D = { (xi, yi) | xi ∈ Rp, yi ∈ {−1, 1} }ni=1where  yiis indicating the class of feature vector  xi, the goal is to find a hyperplane that separates the two classes by maximizing the margin, which is the minimum distance from the data points to the hyperplane. In general, a hyperplane is defined as  w.x − b = 0where w is the normal vector to the hyperplane and w.x is the inner product of w and x. The parameter b||w||specifies the distance of the hyperplane from the origin along the normal vector w. So, we have to find the best w and b such that the margin is maximized. Hence if the training data are linearly separable, the optimizing problem will be:

image

Where the solution to w will be  w∗ = �i αiyixi.

There is a modified maximum margin optimization that allows misclassification of training points, when there exists no hyperplane that can completely separate the training set. It is called the Soft Margin method, which will choose a hyperplane that separates the data as cleanly as possible, while maximizing the distance to the nearest cleanly separated points. In this method, new parameters,  ξi, are added to measure the degree of misclassification for each data  xi, by :

image

The new optimization will become a trade off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem changes to:

image

Where the constant C controls the trade off between complexity of decision rule and frequency of error. By using Lagrange Multipliers  αi, the dual form of the previous optimization problem will be:

image

Where  k(xi, xj) = xi.xjis called the kernel function (here it’s linear). In general, kernel functions are defined as  k(xi, xj) = φ(xi).φ(xj), where  φis a function that maps the data to a higher dimensional space called the feature space, to make a better look of the data. Function  φdoesn’t have to be known since just their inner product is important in the optimization problem. Below are some possible kernel functions:

Polynomial:  k(xi, xj) = (xi.xj)d

Gaussian radial basis function:  k(xi, xj) = e−γ||xi−xj||2where  γ > 0, we can also assume  γ = 12σ2

Hyperbolic tangent:  k(xi, xj) = tanh(κxi.xj + c)for some (not every)  κ > 0and c < 0

After finding the optimum  w∗and  b∗, we can classify a new test case  x∗by  sign(w∗.φ(x) + b∗)which is equivalent to  sign(�ni=1 αiyik(xi, x∗) + b∗).

3.2 Bayesian Logistic Regression

The model known as Logistic Regression, which is a model for classification, is as below:

image

with  p(C2|φ) = 1 − p(C1|φ), where  σ(x) = 11+e−xis the logistic sigmoid function and  φis the feature space. Then, the likelihood function for the data  {φi, ti}ni=1, where  ti ∈ {0, 1}and φi = φ(xi), will be:

image

Where  t = (t1, ..., tn)T. To predict for new test cases, we need to estimate the parameter w, which is a p dimensional vector, according to p features for each training data. One way is to estimate it directly, using any iterative optimization scheme, such as Newton-Raphson, Gradient Descent, etc. Another way is to look at the bayesian approach where a prior for w is chosen to obtain the posterior. Usually Gaussian priors are considered, and since  posterior ∝ prior × likelihood, computing the posterior would be intractable (logistic sigmoid and a Gaussian). On the other hand, the posterior can be approximated using various methods, such as Laplace approximation, MCMC, etc. Here we will approximate the posterior using the variational approximation. We will assume the prior is Gaussian,  p(w) = N(w|m0, S0).

In the variational framework, we wish to maximize a lower bound on the marginal likelihood:

image

Before we proceed to the next step, note that we also have the following property for each i:

image

Where  ai = wT φi. Now to find a lower bound on p(t), from (6) and (7), we need to first use the variational lower bound for the logistic sigmoid function [1], hence:

image

where  λ(ξ) = 12ξ[σ(ξ) − 12]and  ξiare the variational parameters for each data point. Now by (6) and (8), we have :

image

where ξ  = {ξ1, ..., ξn}and h(w, ξ) = �ni=1σ(ξi)e{wT φiti− (wT φi+ξi)2 −λ(ξi)([wT φi]2−ξi2)}

Now, by taking the logarithm of both sides of (9), we will see that the right hand side will be a quadratic function of w. Hence the variational approximation to the posterior distribution of w will have a Gaussian form as below:

image

On the other hand, we need to find the variational parameters  ξiby maximizing the lower bound on the marginal likelihood given in (6). One way to approach this, is to use the EM algorithm, followed by:

Initialize all  ξis

E-Step: use these parameters to find the posterior distribution over w, as given in (10), (11)

M-Step: maximize the expected complete-data log likelihood given by  Q(ξ, ξold) =E[ln h(w, ξ)p(w)]. Which is equivalent to solving:  (ξnewi )2 = φTi (SN + µNµTN)φi

Up to now, we have assumed that hyperparameters (in prior) are known. Hyperparameters can also be unknown, so that they can be learned from the data. Hence, we will need additional priors for the hyperparameters. Assume the prior on w depends on another unknown parameter  α. Thus, we can consider the prior for w to be Gaussian again,  p(w|α) = N(w|0, α−1I), and the prior for  αto be gamma,  p(α) = Gam(α|a0, b0). Now the marginal likelihood will be:

image

Since this is analytically intractable, we will consider the variational distribution  q(w, α)and the decomposition :

image

Where

image

And

image

Since L(q) is still intractable, we will use the local variational bound on the logistic sigmoid function (just as before). Hence we will have:

image

Next, we assume that the variational distribution factorizes between parameters and hyperparameters:  q(w, α) = q(w)q(α). Now we will find each factor as below [2]:

image

Omitting the algebra, w will follow  q(w) = N(w|µN, SN)and  αwill follow  q(α) =Gam(α|aN, bN), with updated equations :

image

Also

image

where P is the dimension of feature space and  E[α] = aNbN , Ew[wT w] = Tr(SN + µµT ). We should note that the updated equations for  ξiis the same as before.

On the other hand, the variational lower bound would be:

image

With:

image

Ew[ln h(w, ξ)] =

image

3.2.1 Predictive distribution

Given a new test case (in its feature space)  φ(x∗), we can compute the predictive distribution for class  C1by marginalizing with respect to the posterior distribution for p(w|t) which is approximated with the Gaussian q(w):

image

with  p(C2|φ(x∗), t) = 1 − p(C1|φ(x∗), t). We have computed the above integral using the Monte Carlo technique, by sampling from the Gaussian distribution q(w).

3.3 K-Nearest-Neighbors

As mentioned before, in this method a test point is classified by assigning the label which is most frequent among the k (a user-defined constant) training points nearest to that test point. Euclidean distance is one of the options to use as the measure of closeness (distance metric), which we will apply here.

4.1 Error Rates

After randomly dividing the data into 80% training set and 20% test set (455 Training data, 114 Test data), we have calculated the training error rate and the test error rate for each method. We should note that for SVM, the error rates for each kernel type was resulted from the best parameters chosen from a finite set of different values, by using 10-fold cross-validation. Table 1 below are showing these error rates and the number of misclassified points.

Table 1: Testing and training error rates

image

As it is seen from the test error rates and training error rates, SVM with a Gaussian Radial Basis kernel has the smallest rate (both test and training) among the other SVMs, and K-NN with k=10 has also the smallest error rate among the other K-NNs. So to compare the performance of these methods (after selecting the best from each category, explained earlier), we see that their error rates rank from smallest to largest by: 1) SVM(Gaussian) 2) Bayesian Logistic Regression 3) K-NN(k=10) 4) Logistic Regression. To have a more accurate result, in the next section, we will look at the ROC curves and AUC to compare these 4 methods ( after selection ).

4.2 ROC curves and AUC

In classification, specifically diagnosis of cancer, it’s important that what class the test cases are assigned to. There are four possible cases: 1) True Positive[TP] ( If new data is assigned to class positive, and it actually belongs to that class) 2) False Positive[FP] ( If it is assigned to positive

image

Figure 1: ROC curves

class, but is actually in the negative class, in our case a person is diagnosed as cancerous, while being healthy) 3) True Negative[TN] ( when it is assigned to class negative and it actually belongs to that class) 4) False Negative[FN] ( It is assigned to class negative while it belongs to the positive class, in our case the person is predicted to be healthy, but actually has cancer, which is completely dangerous). In the previous section, we just considered all the false positive and false negative together, and computed the error. To conclude more accurate results, we have to consider all these cases, where in specific, ROC curves are used for showing the ”True positive Positive rate(TPR)” vs ”False Positive rate(FPR)” [TPR = T PT P +F Nand  FPR = F PF P +T N]. ROC curves represent relative tradeoffs between benefits (TP) and costs (FP). As it is seen in Figure 1, the distance of the ROC curves of SVM, Bayesian Logistic Regression, K-NN and Logistic Regression from the diagonal line are ordered from largest to smallest, respectively, showing performance (from highest to least). So again the same order of performance is resulted here with the ROC curves. Another way to measure the performance is to look at the AUC values (which are the area under the ROC curves) and also the Accuracy measurement. They are given in Table 2.

Table 2: AUC (area under ROC curves)

image

Again as it is seen from Table 2, the AUCs and accuracies are ordered from largest to smallest with the same orders of methods. The accuracy vs cut-off graphs for each method are also shown in Figure 2. Since our decision boundary was probability 0.5 ( or the sign function in SVM), we see that all methods (except Logistic Regression) have their highest accuracy (Figure 2).

Finally, to summarize the results, we started by looking at test error rates, ROC curves and at last AUC and accuracy values for each method. From the comparison, we see that SVM has the least error rate (highest Accuracy), as well as the highest AUC value, so it has the highest performance among the others (for this data set). On the other hand, Bayesian Logistic Regression is ranked second best, which has improved the performance of usual Logistic Regression. And at last, K-NN

image

Figure 2: Accuracy vs Cutoff

and Logistic Regression were ranked third and fourth respectively. But in general, they were all good classifiers since they all had high accuracy rates. The study of feature selection can be considered as a future direction for this dataset.

[1] Christopher M. Bishop, Pattern Recognition and Machine Learning, pp. 497-505 , pp. 466. Cambridge.

[2] Katie Planey, Machine Learning Approaches to Breast Cancer Diagnosis and Treatment Response Prediction. Stanford Biomedical Informatics.

[3] Gouda I. Salama, M.B.Abdelhalim, and Magdy Abd-elghany Zeid (2012), Breast Cancer Diagnosis on Three Different Data Sets Using Multi-Classfiers. Computer Science, Arab Academy for Science Technology & Maritime Transport, Cairo, Egypt.

[4] Y.Ireaneus Anna Rejani, Dr.S.Thamarai Selvi [2009], Early Detection Of Breast Cancer Using SVM Clas-sifier Technique. Noorul Islam College of Engineering, Tamilnadu, India.

[5] Lukasz Jelen , Thomas Fevens, Adam Krzyz Ak [2008], Classification Of Breast Cancer Malignancy Using Cytological Images Of Fine Needle Aspiration Biopsies . Department of Computer Science and Software Engineering Concordia University, Montreal, Quebec, Canada.

[6] Manish Sarkar and Tze-Yun Leong, Application of K-Nearest-Neighbors Algorithm on Breast Cancer Diagnosis problem. Department of Computer Science, School of Computing, The National university of Singapore.

[7] Tommi S. Jaakkola and Michael I. Jordan [1999], Bayesian Parameter Estimation Via Variational Methods. Dept. of Elec. Eng. & Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA.

[8] Tom Fawcett [2006], An Introduction To ROC Analysis. Institute for the Study of Learning and Expertise, 2164 Staunton Court, Palo Alto, CA 94306, USA.


Designed for Accessibility and to further Open Science