Two-class classification

In this problem we build and evaluate a two-class classifier. The data are stored as plain ASCII text and contain 400 examples with 5 features. To solve the problem we use Python 3.6 and the scikit-learn library. First, we import all the necessary libraries into the project.

In [19]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

plt.style.use('classic')
matplotlib.style.use('ggplot')   # overrides the classic style for the plots below
pd.options.display.max_columns = 100

Next, using the pandas library, we read the file into the variable "data", splitting tokens on the space separator (the file is space-separated, not comma-separated). We drop the first column, which contains only NaN values (likely an artifact of the separator), and shuffle the rows. We keep a reference to the frame in "plot_data" for the plots created later, and we store the last column, which holds the class label, in "labels".

In [20]:
pd.options.display.max_rows = 100
df = pd.read_csv('artificial.data', sep=' ', header=None)
data = df.drop([df.columns[0]], axis=1)  # first column is all NaN, so we drop it
data = data.sample(frac=1)               # shuffle the rows
plot_data = data                         # same frame, reused for the plots below
labels = data[6]                         # column 6 holds the class label
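
As a quick sanity check (the data are described above as 400 examples with 5 features, plus the label column), we can inspect the shape and the class balance:

In [ ]:
print(data.shape)             # expected (400, 6): 5 features + 1 label column
print(labels.value_counts())  # number of examples per class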

Below you can see the first five rows of the data (X):

In [21]:
data.head()
Out[21]:
1 2 3 4 5 6
53 61.839916 130.124130 928.123318 -61.119250 61.787568 1.0
202 20.708939 -122.603864 -90.456470 -11.076384 -4.516144 2.0
217 6.240046 -53.888858 356.246485 20.367080 25.138648 2.0
135 17.146790 1242.430300 -522.333543 -80.610695 -52.795863 1.0
75 52.321543 602.682416 -653.438339 25.420529 -53.011974 1.0

As you can see, the features take values over very different ranges, so we normalize them. A common method for this is the StandardScaler, but since it did not give as good results in this situation we use MinMaxScaler instead. The normalized values are written back into the "plot_data" variable so that we can plot the feature relations later.

In [22]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
plot_data.loc[:, [1, 2, 3, 4, 5]] = mms.fit_transform(plot_data.loc[:, [1, 2, 3, 4, 5]])
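
For reference, the StandardScaler mentioned above would be used in the same way; a minimal sketch on the raw frame df (shown only for comparison, since MinMaxScaler gave better results here):

In [ ]:
from sklearn.preprocessing import StandardScaler
# alternative scaling: zero mean / unit variance, applied to the raw feature columns of df
standardized = StandardScaler().fit_transform(df.loc[:, [1, 2, 3, 4, 5]])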

Using the seaborn library, we plot all pairwise relations between the features. The last column holds the labels, so we ignore it for now. Taking a closer look at the plots, we notice that for most feature pairs the classes are not separable, so we choose features 2 and 3 for the classification.

In [23]:
import seaborn as sns
sns.set()
sns.pairplot(plot_data, hue=6)  # color the points by the label column
plt.show()

We save the features that we will use for the classification in "scaled_data":

In [24]:
scaled_data = plot_data.loc[:, [2, 3]]

After that, we can look at the plots of our final features. We notice that with these two features the classes are separable.

In [25]:
plot_data = plot_data.loc[:, [2, 3, 6]]
sns.pairplot(plot_data, hue=6)
plt.show()

The classifier we use is the SVM (SVC, Support Vector Classifier, when it is applied to classification problems). The SVM is a linear classifier, but because our data are not linearly separable we use an RBF kernel. This kernel implicitly maps the problem into a higher-dimensional space where the data become linearly separable.

In [26]:
from sklearn.svm import SVC
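
To make the kernel idea concrete: the RBF kernel computes K(x, x') = exp(-gamma * ||x - x'||^2), so similar points get a value near 1 and distant points a value near 0. A minimal sketch with scikit-learn's pairwise helper and an arbitrary gamma chosen for illustration:

In [ ]:
from sklearn.metrics.pairwise import rbf_kernel
x = np.array([[0.2, 0.4]])
close_point = np.array([[0.25, 0.45]])
far_point = np.array([[0.9, 0.1]])
print(rbf_kernel(x, close_point, gamma=10))  # near 1: the points are close
print(rbf_kernel(x, far_point, gamma=10))    # near 0: the points are far apart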

To find good parameters for the SVM we use GridSearchCV, which tests different combinations of values for gamma and C. Gamma controls how far the influence of a single training example reaches, while C controls the trade-off between a smooth decision surface and classifying the training examples correctly. After fitting, grid.best_estimator_ is an SVM definition with the best parameters found within the ranges we chose.

In [27]:
#==============================================================================
#from sklearn.grid_search import GridSearchCV
#g_range = list(range(400,450))
#c_range = list(range(1,12))
#clf = SVC(gamma=66, C=1, random_state=1 ,kernel='rbf')
#param_grid = dict(gamma=g_range, C=c_range)  # n_jobs=-1 -> run in parallel
#grid=GridSearchCV(clf, param_grid, cv=10, scoring='accuracy')
#grid.fit(scaled_data, labels )
#print (grid.best_estimator_)
#print (grid.best_score_)

#==============================================================================
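
For reference, a sketch of the same search written against the current scikit-learn API (GridSearchCV now lives in sklearn.model_selection instead of sklearn.grid_search); the fit call is left commented out, mirroring the cell above:

In [ ]:
from sklearn.model_selection import GridSearchCV
param_grid = {'gamma': list(range(400, 450)), 'C': list(range(1, 12))}
grid = GridSearchCV(SVC(kernel='rbf', random_state=1), param_grid,
                    cv=10, scoring='accuracy', n_jobs=-1)  # n_jobs=-1 -> run the folds in parallel
# grid.fit(scaled_data, labels)
# print(grid.best_estimator_)
# print(grid.best_score_)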

This is the selected SVM definition:

In [28]:
clf = SVC(C=2, gamma=600, kernel='rbf', random_state=1)  # remaining parameters keep their default values

We then run a 10-fold cross-validation. For every fold we fit the classifier on the remaining folds and keep the score on the held-out fold in the cv_scores list.

In [29]:
# 10-fold cross-validation
num_folds = 10
new_train_Xfolds = np.array_split(scaled_data, num_folds)
new_train_Yfolds = np.array_split(labels, num_folds)

cv_scores = []
for j in range(num_folds):
    # train on all folds except fold j, test on fold j
    X_train_cv = np.vstack(new_train_Xfolds[0:j] + new_train_Xfolds[j+1:])
    X_test_cv = new_train_Xfolds[j].values

    y_train_cv = np.hstack(new_train_Yfolds[0:j] + new_train_Yfolds[j+1:])
    y_test_cv = new_train_Yfolds[j].values

    clf.fit(X_train_cv, y_train_cv)

    scores_training = clf.score(X_train_cv, y_train_cv)  # training accuracy
    score = clf.score(X_test_cv, y_test_cv)               # held-out fold accuracy

    cv_scores.append(score)

Finally, we print the mean of these accuracies:

In [30]:
print("10-Folds Cross Validation:", np.mean(cv_scores))
10-Folds Cross Validation: 0.99
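
As a cross-check of the manual loop, scikit-learn's built-in cross-validation helper gives the same kind of estimate; a sketch (its fold boundaries are not guaranteed to match np.array_split, so the mean may differ slightly):

In [ ]:
from sklearn.model_selection import cross_val_score
# built-in 10-fold cross-validation over the same data and classifier
sk_scores = cross_val_score(clf, scaled_data, labels, cv=10, scoring='accuracy')
print("cross_val_score mean:", np.mean(sk_scores))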