Data sets for machine learning are multi-dimensional, and the number of dimensions depends on the data domain. Data for a collaborative item-to-item recommender, for example, involves just a user and an item, and fits easily into a co-occurrence matrix. Data from medical diagnosis, on the other hand, has more than three dimensions, as do air pollution measurements, vehicle stats and many other data sets. When the number of dimensions is five or fewer, it is easy to visualise the data before deciding on an approach to machine learning.
Visualising data before applying machine learning has several advantages:
1) It helps to visually identify distinctive patterns in individual dimensions. Dimensions showing such patterns are likely to matter more in decision making.
2) It helps to shortlist candidate model(s). For example, would a Random Forest Classifier, a Support Vector Machine or a Nearest Neighbour Classifier be the better fit?
3) It helps to select a subset of dimensions (before model fitting) that are more suitable for machine learning. Not all collected dimensions turn out to be useful. Adding dimensions back in later leads to different use cases and insights.
The rest of the post is divided into:
1) Visualization
2) Prediction results using Classifiers
3) Plotting feature weights from fitted models to confirm insights from visualization.
1) Visualization
The sample data set for this post is the Breast Cancer Wisconsin (Diagnostic) Data Set from the University of California, Irvine Machine Learning Repository. The 30 dimensions are measurements of cell nuclei. How are the Benign and Malignant cell readings distributed across these dimensions? In other words, what is different between Benign and Malignant cells? Visualising the data on parallel coordinates gives a sense of its dimensions. (Red = Malignant, Blue = Benign)
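A chart like this can be reproduced in a few lines. The snippet below is a minimal sketch, assuming scikit-learn's bundled copy of the same UCI data set and pandas' parallel_coordinates helper; the min-max normalisation is an assumption to put all 30 axes on a shared scale, and the colour order is chosen to match the Red = Malignant, Blue = Benign convention above.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_breast_cancer

# scikit-learn ships a copy of the UCI Breast Cancer Wisconsin data set.
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# In this copy, target 0 = malignant and 1 = benign.
df["diagnosis"] = pd.Series(data.target).map({0: "Malignant", 1: "Benign"})

# Min-max normalise every dimension so all 30 axes share a 0-1 scale.
features = list(data.feature_names)
df[features] = (df[features] - df[features].min()) / (
    df[features].max() - df[features].min())

plt.figure(figsize=(14, 5))
parallel_coordinates(df, "diagnosis", color=("red", "blue"), alpha=0.2)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```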
As an example, the perimeter dimension in the top chart is interesting. Filtering on it looks like this.
While the perimeter dimension alone cannot drive a binary decision, the visualisation shows that the compactness and concavity dimensions can carry the decision from that point on. As you go through the visualization above and interact with it, notice how much it resembles a decision-making process (a rough programmatic version of this filter-then-inspect step is sketched below). So is this data set a good candidate for a Random Forest Classifier, a Nearest Neighbour Classifier or an SVC?
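Continuing from the normalised data frame built above, the same filter-then-inspect step can be sketched in pandas. The perimeter band below is an illustrative assumption, not a value read off the interactive chart.

```python
# Keep only readings whose (normalised) mean perimeter is in an ambiguous band.
# The band limits 0.3 and 0.6 are illustrative assumptions.
ambiguous = df[(df["mean perimeter"] > 0.3) & (df["mean perimeter"] < 0.6)]

# Within that band, compactness and concavity still separate the diagnoses.
print(ambiguous.groupby("diagnosis")[["mean compactness", "mean concavity"]].mean())
```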
2) Prediction results
Applying these classifiers to held-out test data yields different levels of prediction accuracy: the Random Forest Classifier is more accurate than the Nearest Neighbour Classifier. The result of applying these classifiers is shown below. RandomForestClassifier/GradientBoostingClassifier is the best choice, as was already clear from the visualization.
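The comparison takes only a few lines with scikit-learn. This is a minimal sketch, assuming a 70/30 train/test split, default hyperparameters and a fixed random seed; exact scores will vary with those choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# Hold out 30% of the data for testing; the split ratio is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

classifiers = {
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=42),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SVC": SVC(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy {clf.score(X_test, y_test):.3f}")
```

On a split like this, the two tree ensembles usually come out a few points ahead of the untuned nearest-neighbour and SVC baselines, consistent with the ordering described above.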
3) Confirm dimension insights from visualization
Fitted tree models have an attribute that reports the weight of each dimension. This allows us to confirm whether the dimensions we thought were important from the visualization really have predictive value.
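That attribute is feature_importances_. Below is a hedged sketch that reads it from the two ensembles fitted above and plots the weights side by side; the sorted horizontal-bar layout is a presentation choice, not something prescribed by the library.

```python
import matplotlib.pyplot as plt
import numpy as np

feature_names = np.array(load_breast_cancer().feature_names)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, name in zip(axes, ["RandomForestClassifier",
                           "GradientBoostingClassifier"]):
    # feature_importances_ holds one weight per dimension, summing to 1.
    importances = classifiers[name].feature_importances_
    order = np.argsort(importances)
    ax.barh(feature_names[order], importances[order])
    ax.set_title(name)
plt.tight_layout()
plt.show()
```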
From the two plots, notice that the radius, perimeter, concavity and concave_points dimensions carry more weight than the others in both classifiers, while the standard-error dimensions are less important.