Closed-form solutions are sweet. No hand-wringing or hand-waving is required to make a point. Given the assumptions, the model predictions are exact, so we can readily evaluate the impact of those assumptions. And the exact solution gives us a yardstick for evaluating alternate (e.g. numerical) approaches applied to the same limiting cases. We are of course talking about the previous post, where we obtained closed-form solutions for the Naive Bayes predicted decision boundaries when the true decision boundary was a known linear or nonlinear form. As there is a good bit of ground to cover, we start in earnest with the “Next Steps” outlined there. Please review the previous post if needed, as there is little room for repetition in a 2000+ word article, which is what my posts tend to be. This one is actually somewhat longer, but mainly because of diagrams and simulation results, so it should not be too bad.
1. The Confusion Matrix
For our 2-class (C1 and “Not C1” – or C2 as we called it in the previous post) and 2-feature ([x, y]) situation, a nice way to characterize the performance of any classifier is to compute the confusion matrix shown in Figure 1 below. The true and predicted boundaries intersect and split the feature space into four zones. The data points in these zones have either been correctly (i.e. True) or incorrectly (i.e. False) identified as belonging (i.e. Positive) to C1 or not (i.e. Negative). For example, the top-right corner zone is marked as False Negative, as the prediction falsely classified it as ‘Not C1‘. The naming of the other zones follows likewise.
The areas of intersection FP and FN, and any metrics computed from them, yield significant insight into the effectiveness of the classifier. For example, a classifier that is highly specific for C1 will never place a true ‘Not C1‘ data point in a ‘C1‘ zone. Likewise, a classifier that is highly sensitive to C1 will never place a true C1 data point in a ‘Not C1‘ zone. The overall accuracy of the classifier is of course the fraction of correct predictions. All this leads to the following commonly employed definitions.
$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad \text{specificity} = \frac{TN}{TN + FP} \qquad \text{precision} = \frac{TP}{TP + FP} \qquad \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
Each classification exercise yields a value for each of these metrics. Running a series of such exercises and collecting these metrics from each run we can characterize how good the classifier is for data in this feature space. High sensitivity (small FN zone) & high specificity (small FP zone) are naturally the ideal characteristics for a classifier.
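These definitions translate directly into code. Here is a minimal helper along those lines (the counts in the example call are illustrative only, not from the simulations discussed later):

```python
def classifier_metrics(tp, fp, tn, fn):
    """Equation 1 metrics from the confusion-matrix entries (counts or zone areas)."""
    sensitivity = tp / (tp + fn)                  # true C1 points recovered as C1
    specificity = tn / (tn + fp)                  # true 'Not C1' points recovered as 'Not C1'
    precision = tp / (tp + fp)                    # predicted C1 points that are truly C1
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # overall fraction of correct predictions
    return sensitivity, specificity, precision, accuracy

# Illustrative counts only
print(classifier_metrics(tp=90, fp=5, tn=95, fn=10))
```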
2. The Areas of Intersection
With the prelims out of the way, we are ready to quantify how well the Naive Bayes classifier did for the linear and nonlinear boundaries we considered in the previous post. As shown in Figure 2 below, each value of k splits the feature space between A1 and A2, while A1 + A2 stays constant. The predicted boundary is naturally a function of this parameter k, and each such predicted boundary yields values for FN and FP, the areas of intersection we are interested in.
In the case of the nonlinear parabolic boundary we work with y = kx² rather than y = x² as we had in the previous post. Deriving the updated decision boundary is a straightforward application of the same geometric procedure detailed earlier, so we simply write down the results for both the linear and parabolic cases here for easy reference. In what follows, yp stands for the predicted boundary and yt for the true boundary.
(2)
$$y_p = \left( \frac{a \sqrt{k}}{1 + \left( \dfrac{3a}{2\sqrt{b/k}} - 1 \right) \left( \dfrac{b}{k x^2} - 1 \right)} \right)^{2} \tag{3}$$
Figure 3 illustrates the dynamic interplay and tradeoff between FP and FN as k is varied in either case. The areas of intersection FP and FN are what we are after, and they can be obtained by direct integration since we have explicit equations for yp and yt.
2.1 FP and FN for the linear case
In the linear case we know that the true and predicted boundaries intersect at the point (a/2, k/2).
(4)
Plugging in yt = kx/a and yp from Equation 2, integrating and simplifying we get:
(5)
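Equations 4 and 5 can be cross-checked numerically. Since Equation 2 is not repeated here, the sketch below keeps the predicted boundary as a callable argument and simply integrates |yt − yp| on either side of the known intersection at x = a/2; the unit-square extent and the yp used in the example call are placeholders of my own, not Equation 2:

```python
from scipy.integrate import quad

def linear_intersection_areas(yp, a=1.0, k=0.6):
    """Integrate |yt - yp| on either side of the intersection at x = a/2.

    yp is a callable giving the predicted boundary height at x, i.e. Equation 2
    (not reproduced here). Which of the two returned areas is FP and which is FN
    depends on which boundary lies on top in each sub-interval."""
    yt = lambda x: k * x / a                         # true linear boundary
    diff = lambda x: abs(yt(x) - yp(x))
    left, _ = quad(diff, 0.0, a / 2.0)               # area between the boundaries, left of a/2
    right, _ = quad(diff, a / 2.0, a)                # right of a/2, assuming x spans [0, a]
    return left, right

# Placeholder yp for illustration only (NOT Equation 2): a line pivoted about (a/2, k/2)
print(linear_intersection_areas(lambda x: 0.3 + 0.48 * (x - 0.5)))
```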
2.2 FP and FN for the parabolic case
The point of intersection here is the solution to the equation yp = kx², where yp is given by Equation 3. Unfortunately there is no clean explicit solution, but the whole thing reduces to a quadratic that needs to be solved for the root that makes sense.
$$\left( 4\sqrt{bk} - 3ak \right) x^2 \;-\; 2a\sqrt{bk}\, x \;+\; b\left( 3a - 2\sqrt{b/k} \right) = 0 \tag{6}$$
Referring to Figure 2B, the areas FP and FN are obtained as
$$FP = \int_{0}^{x^*} \left| y_t - y_p \right| \, dx \qquad FN = \int_{x^*}^{\sqrt{b/k}} \left| y_t - y_p \right| \, dx \tag{7}$$

where x* is the x-coordinate of the intersection point, i.e. the root of Equation 6 that lies in [0, √(b/k)].
The integrand and the limits are known exactly, so the integrals can be evaluated numerically to any desired precision with a code snippet such as:
```python
import numpy as np
from scipy.integrate import quad

def integrand(x, k):
    # |yt - yp| with both boundaries capped at b, the top of the feature space
    yt = min(k * x * x, b)
    temp = 1.0 / k**0.5 + (1.0 / k**0.5) * (3 * a / (2 * (b / k)**0.5) - 1.0) * (b / (k * x * x) - 1.0)
    yp = min((a / temp)**2, b)
    return abs(yt - yp)

def findIntersectionX(k):
    # Coefficients of the quadratic in Equation 6; pick the root that lies in [xmin, xmax]
    c2 = 4 * (b * k)**0.5 - 3 * a * k
    c1 = -2 * a * (b * k)**0.5
    c0 = b * (3 * a - 2 * (b / k)**0.5)
    x1 = (-c1 + (c1**2 - 4 * c2 * c0)**0.5) / (2 * c2)
    x2 = (-c1 - (c1**2 - 4 * c2 * c0)**0.5) / (2 * c2)
    if xmin <= x1 <= xmax:
        return x1
    elif xmin <= x2 <= xmax:
        return x2
    else:
        print("ERROR!")

a = 1.0
b = 1.0
k = 10.0
xmin = 0.0
xmax = (b / k)**0.5          # beyond this x the true parabola exits the top of the square

xIntersection = findIntersectionX(k)
# quad returns (value, estimated error); keep just the value
fp_area, _ = quad(integrand, xmin, xIntersection, args=(k,), epsabs=1.0e-8)
fn_area, _ = quad(integrand, xIntersection, xmax, args=(k,), epsabs=1.0e-8)
```
3. The Characteristics
With the means to get FP and FN for any k, we are ready to evaluate the classifier with respect to the sensitivity, specificity and other measures we went over in Section 1. TP and TN follow directly from the geometry.
$$TP = A_1 - FN \qquad TN = A_2 - FP \tag{8}$$
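Combining Equation 8 with the definitions in Equation 1 gives the metrics directly from the computed areas. A small helper of this sort would do (the numbers in the example call are illustrative only; fp and fn would come from a snippet like the one in Section 2.2, and A1, A2 from the geometry of the true boundary):

```python
def metrics_from_areas(fp, fn, A1, A2):
    # Equation 8: the correctly classified areas follow from the geometry
    tp = A1 - fn
    tn = A2 - fp
    # Equation 1, with areas standing in for counts
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (A1 + A2)
    return sensitivity, specificity, accuracy

# Illustrative numbers only
print(metrics_from_areas(fp=0.01, fn=0.012, A1=0.21, A2=0.79))
```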
Figure 4 shows the intersection areas, and the metrics obtained from them, as a function of A1/A2. The FP and FN areas are small fractions of the overall area, and the accuracy is excellent. There is no question that the metrics are quite good in both cases, even if the linear case seems to have an edge. The range of A1/A2 in these results is somewhat constrained (0 → 2 in the parabolic case, and 1 → ∞ in the linear case) due to the design of the derived solutions.
The observation in Figure 4B that sensitivity, specificity and accuracy go through a single point is no fluke. When sensitivity and specificity are equal,

$$\frac{TP}{TP + FN} = \frac{TN}{TN + FP} = \frac{TP + TN}{(TP + FN) + (TN + FP)}$$

where the last expression is the definition of accuracy. (When two ratios are equal, adding their numerators and denominators yields the same ratio.)
4. Numerical Simulations with SciKit
SciKit has excellent implementations of classification algorithms, including naive bayes. Gaussian Naive Bayes fits a gaussian distribution to the training data in order to estimate P(x|C1) and P(y|C1) for a new data point x, y. This is fine when the new data point is away from the decision boundary, but it will clearly introduce some error otherwise. What the consequence of this error is for classification is the question we seek to explore here with a few numerical simulations. Further, we apply a couple of competing classification techniques, namely Logistic Regression and MLPClassifier, again as implemented in SciKit. Analysis of the assumptions built into these other techniques is subject matter for a different post. Here we simply use them; SciKit makes it very easy to try things out.
Generating training and test data is a breeze, as we know the exact decision boundary. Lines 7 and 10 in the code snippet below (the np.loadtxt calls) read in the training and test data.
```python
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
import numpy as np

train = np.loadtxt(open('./train.csv', "rb"), delimiter=",", skiprows=1)
trainX = train[:, 0:2]               # x, y locations
trainY = train[:, 2]                 # class label: 1 (i.e. C1) or 0 (i.e. Not C1)
test = np.loadtxt(open('./test.csv', "rb"), delimiter=",", skiprows=1)
testX = test[:, 0:2]
testY = test[:, 2]

gnb = GaussianNB()
lr = LogisticRegression()
mlp = MLPClassifier(hidden_layer_sizes=(4,))

stats = []
for id, clf, name in [(0, lr, 'lr'), (1, gnb, 'gnb'), (2, mlp, 'mlp')]:
    clf.fit(trainX, trainY)          # train on the generated data
    predictions = clf.predict(testX)

    sk_tn, sk_fp, sk_fn, sk_tp = confusion_matrix(testY, predictions).ravel()
    sk_accuracy = accuracy_score(testY, predictions)
    sk_precision = precision_score(testY, predictions)
    sk_sensitivity = recall_score(testY, predictions)
    sk_specificity = sk_tn / (sk_tn + sk_fp)   # not provided directly by sklearn
```
We use default values for the parameters for the most part for the three classifiers initialized in Lines 14-16. The MLPClassifier uses one hidden layer with 100 neurons by default. One hidden layer is fine, but given that we have only 2 input neurons (one for x and the other for y), using 100 hidden neurons is overkill. We pick 4; we could likely do with even fewer. Line 20 (clf.fit) trains the model, and in Line 23 we obtain the confusion matrix, from which all the other metrics can of course be computed.
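For completeness, here is one way train.csv and test.csv above could be produced, since the true boundary is known exactly. This is a minimal sketch under assumptions of my own: a unit-square feature space, the linear boundary yt = kx/a, and C1 taken as the region above the line (consistent with the A1/A2 values quoted later, but the actual generation code and labeling convention may differ):

```python
import numpy as np

def generate_data(n, a=1.0, k=0.6, seed=42, fname='train.csv'):
    # Sample n points uniformly over the unit square and label them against yt = k*x/a.
    # Assumption: C1 (label 1) is the region above the true boundary.
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n)
    y = rng.uniform(0.0, 1.0, n)
    labels = (y > k * x / a).astype(int)
    np.savetxt(fname, np.column_stack((x, y, labels)),
               delimiter=",", header="x,y,label", comments="")

generate_data(20000, fname='./train.csv')
generate_data(5000, seed=7, fname='./test.csv')
```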
4.1 Error from the Gaussian Approximation in SciKit
Figures 5A and 5C show the results from SciKit’s gaussian naive bayes simulation for the linear case with k = 0.6 (i.e. A1/A2 = 2.33). The misprediction zones in red are larger than those obtained with the analytic naive bayes predictions. Logistic regression is known to be a linear classifier, so the near perfect prediction in Figure 5B is to be expected. MLP is of course the best of the lot in all cases, even with just 4 neurons in a single hidden layer. But we shall see later that this happy scenario does not hold for boundaries with more severe nonlinearities.
4.2 Classification Metrics
Figure 6 compares the performance of the three classifiers over a number of simulations where the ratio A1/A2 is varied while holding A1 + A2 at unity. This is much like Figure 4 where we compared analytic naive bayes predictions against the truth. A few easy observations are as follows.
- Naive bayes has the largest error zones (FP and FN). FP for logistic regression increases with A1/A2 even in the linear case; given that logistic regression is a linear classifier, why that happens will require some further analysis.
- In terms of the total area of mischaracterization (i.e. FP + FN), MLP, logistic regression and naive bayes rank in order of decreasing excellence.
- Depending on what is being demanded, one may opt for the classifier that excels at that criterion. For example, MLP is excellent here for specificity.
5. Are Neural Nets Better at Classification?
From the results in Sections 3 and 4, it would seem that neural networks are better at classifying data with linear or nonlinear separation boundaries. But such a generalization would likely be wrong. When we know the truth, as is the case here, we can monkey with the number of hidden layers, the number of neurons within, the activation functions, etc. to get better matches with that truth. But that result may not have any general validity beyond the scope of that specific exercise. To see this in practice, let us consider the case of a sine wave as the separation boundary and attempt to classify with the same three schemes.
(9)
As k increases, the feature space gets split into more and more non-contiguous zones for the classes, and classification gets harder. The mechanics of obtaining the classification results with SciKit are the same as in Section 4, so we go directly to the results.
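To make the mechanics concrete, the sketch below generates data for a sine-wave boundary and sweeps the number of hidden layers in the MLP, mirroring the experiments reported next. The boundary form y = 0.5 + 0.5·sin(2πkx) on the unit square with C1 above the curve, and the 10 neurons per hidden layer, are stand-ins of my own, not Equation 9 or the exact configuration behind Figure 7:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

def sine_data(n, k=4, seed=0):
    # Illustrative stand-in for Equation 9: y = 0.5 + 0.5*sin(2*pi*k*x) on the unit square,
    # with C1 (label 1) taken as the region above the curve
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n)
    y = rng.uniform(0.0, 1.0, n)
    labels = (y > 0.5 + 0.5 * np.sin(2.0 * np.pi * k * x)).astype(int)
    return np.column_stack((x, y)), labels

trainX, trainY = sine_data(20000, k=4, seed=0)
testX, testY = sine_data(5000, k=4, seed=1)

for nLayers in [1, 3, 6, 7]:
    mlp = MLPClassifier(hidden_layer_sizes=(10,) * nLayers, max_iter=2000, random_state=0)
    mlp.fit(trainX, trainY)
    tn, fp, fn, tp = confusion_matrix(testY, mlp.predict(testX)).ravel()
    print(nLayers, 'hidden layers: FP fraction', fp / len(testY), ', FN fraction', fn / len(testY))
```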
Figure 7A plots the magnitudes of the areas of intersection as the number of non-contiguous zones is varied by varying k. Clearly MLP with a single hidden layer is no better for any k: FP and FN together take up about 20-30% of the total area, so that much of the data would be incorrectly classified. This is reflected in the smaller specificity values in Figure 7B. With 6 hidden layers MLP has much smaller error zones for all k values, and thereby a much better specificity, as seen in Figure 7B.
Figures 7C – 7E show the error zones obtained in all cases when k = 4. The error zones shrink as the number of hidden layers is increased, yielding the best classification with 6 layers. But unfortunately it does worse with 7 layers, which is somewhat unexpected… the alleged alchemy at play. There is clearly more going on here that requires deeper analysis.
6. Next Steps
With that we close this somewhat long but straightforward post. For the linear and parabolic separation boundaries, we have shown that all three classifiers tested here were quite good, with MLP edging out logistic regression and naive bayes. The outperformance of MLP is interesting, but there is no clear understanding at this time as to how exactly it manages to do better. MLP can likely work well for more severe nonlinearities, but careful experimentation may be required before generalizing the obtained results. Nevertheless, the results here are encouraging enough that we would like to try out neural nets for text classification problems in future posts.