What is Machine Learning?
Key Idea: What is Machine Learning?
Machine learning (ML) refers to a set of methods that enable a system to autonomously "learn" from large amounts of data in order to perform a task (like making predictions or decisions) without being explicitly programmed to do so.
The "machine" part of ML refers to model execution by a computational system.
The "learning" part of ML refers to the system's continuous adjustments based on the data its supplied to modify its performance.
Watch: What is Machine Learning?
Watch this episode of Crash Course Computer Science, hosted by Carrie Anne Philbin.
In this episode, Carrie Anne offers a high-level overview of machine learning and walks us through an example of a classification problem.
A few key points:
Supervised ML classifiers distill the complexity of real-world objects and phenomena into features: values that usefully characterize the things we wish to classify (see our simple classifier review below!)
The goal of ML classifiers is to maximize correct classifications while minimizing incorrect classifications.
Several ML techniques are largely rooted in the field of statistics and are built upon statistical frameworks.
A common ML methodology is the use of artificial neural networks (NN or neural nets). Neural nets are composed of artificial neurons (inspired by biological neurons) which are organized into a series of layers (input, hidden, and output).
In a classification scenario, a neural net processes inputs and propagates them forward layer by layer: each neuron weights its inputs, sums them, adds a bias, and applies an activation function (which enables more complex, nonlinear decision boundaries) until the output layer produces a classification decision.
Crash Course Computer Science, Episode #34
Cite as: Brungard B. et al. Machine Learning & Artificial Intelligence: Crash Course Computer Science #34. Vol 34. PBS Digital Studios; 2017.
Deeper Dive: Simple Classifier Explanation
Let's review the simple ML classifier example outlined by Carrie Anne.
All figures below are directly sourced or adapted from Crash Course Computer Science, Episode #34, seen above!
Plotting Carrie Anne's labeled moth data (n=200)
Here is a scatter plot of Carrie Anne's labeled moth data, with moth wingspan plotted along the x-axis and moth mass plotted along the y-axis. Observations confirmed to be Emperor moths are plotted in red and observations confirmed to be Luna moths are plotted in blue.
Clusters overlap
The species clearly form two clusters, but the clusters overlap, so it may be difficult for us to know how to best separate them. An ML classifier can help us find optimal separations!
Our first decision boundary
A classifier might decide that a good separation (or decision boundary) can be placed vertically at 45 mm on the x-axis, implying that observations with a wingspan of 45 mm or less are likely to be Emperor moths.
Our second decision boundary
The classifier may place a second decision boundary horizontally at 0.75 grams on the y-axis, implying that observations with a mass of 0.75 grams or less (in addition to a wingspan of 45 mm or less) are likely to be an Emperor moth.
The spaces created by the decision boundaries are called decision spaces.
Notice that there is no way to impose straight lines that classify all observations correctly.
Lower the decision boundary?
Lowering the wingspan decision boundary leads to more misclassifications of Emperor moths.
Increase the decision boundary?
Increasing the wingspan decision boundary leads to more misclassifications of Luna moths.
While decision boundaries in classifier models are not perfect, the goal of classifiers is to maximize correct classifications and minimize incorrect classifications.
True Positive (TP)
Using our original boundaries, we classify 86 Emperor moths correctly as Emperor moths. These are known as true positive (TP) classifications.
False Negative (FN)
We classify 14 Emperor moths incorrectly as Luna moths. These are false negative (FN) classifications.
False Positive (FP)
We classify 18 Luna moths incorrectly as Emperor moths. These are false positive (FP) classifications.
True Negative (TN)
We classify 82 Luna moths correctly as Luna moths. These are true negative (TN) classifications.
Our Confusion Matrix
The table which displays these classifications from our binary classifier forms a confusion matrix: an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes (in our case, N=2).
We can calculate a series of classifier performance metrics from our confusion matrix.
Accuracy (ACC)
(TP + TN) / (TP + TN + FP + FN) = ACC
(86 + 82) / (86 + 82 + 18 + 14) = 0.84
Precision (or Positive Predictive Value [PPV])
TP / (TP + FP) = PPV
86 / (86 + 18) = 0.83
Negative Predictive Value (NPV)
TN / (TN + FN) = NPV
82 / (82 + 14) = 0.85
Recall (or Sensitivity, True Positive Rate [TPR])
TP / (TP + FN) = TPR
86 / (86 + 14) = 0.86
Specificity (or True Negative Rate [TNR])
TN / (FP + TN) = TNR
82 / (18 + 82) = 0.82
F-Measure (or F1-Score)
2 × ((recall × precision)/(recall + precision)) = F1
2 × ((0.86 × 0.83)/(0.86 + 0.83)) = 0.84
False Positive Rate (FPR; Type I Error)
FP / (FP + TN) = FPR
18 / (18 + 82) = 0.18
False Negative Rate (FNR; Type II Error)
FN / (FN + TP) = FNR
14 / (14 + 86) = 0.14
Matthews Correlation Coefficient (MCC)
(TP × TN - FP × FN) / √((TP+FP) × (TP+FN) × (TN+FP) × (TN+FN)) = MCC
(86 × 82 - 18 × 14) / √((86+18) × (86+14) × (82+18) × (82+14)) = 0.68
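All of the metrics above can be checked with a few lines of plain Python, plugging in the counts from the moth example:

```python
import math

# Confusion-matrix counts from the moth example.
TP, FN, FP, TN = 86, 14, 18, 82

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)            # PPV
npv         = TN / (TN + FN)
recall      = TP / (TP + FN)            # sensitivity / TPR
specificity = TN / (FP + TN)            # TNR
f1  = 2 * (recall * precision) / (recall + precision)
fpr = FP / (FP + TN)
fnr = FN / (FN + TP)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"ACC={accuracy:.2f} PPV={precision:.2f} NPV={npv:.2f} TPR={recall:.2f}")
print(f"TNR={specificity:.2f} F1={f1:.2f} FPR={fpr:.2f} FNR={fnr:.2f} MCC={mcc:.2f}")
```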
Adding an unlabeled observation
Using our classifier (depicted as decision spaces and boundaries), we can add an unlabeled observation and see how our classifier would classify it.
Decision spaces help us classify
Based on our decision boundaries and decision spaces, the classifier classifies the observation as a Luna moth.
Classifier represented as decision tree
Our moth classifier could be depicted as a decision tree.
Classifier represented as "if-else" statements
The classifier could also be depicted as a series of "if-else" statements.
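As a sketch, the two decision boundaries from the example (45 mm wingspan, 0.75 g mass) translate directly into "if-else" logic:

```python
def classify_moth(wingspan_mm, mass_g):
    """Apply the two decision boundaries from the moth example."""
    if wingspan_mm <= 45 and mass_g <= 0.75:
        return "Emperor moth"
    else:
        return "Luna moth"

print(classify_moth(40, 0.60))  # inside the Emperor decision space
print(classify_moth(50, 0.90))  # inside the Luna decision space
```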
Support vector machines (SVMs) are classification techniques that use support vectors, hyperplanes, and decision boundaries to classify data.
Hyperplane: a subspace that represents the largest separation between the data classes
Support vectors: Data points which guide the position of the hyperplane (closest data points to the hyperplane)
Decision boundaries: boundaries which exist around the hyperplane to guide data point classification
SVMs can produce linear decision boundaries.
SVMs can also use a nonlinear function (a "kernel") to transform the data, leading to nonlinear decision boundaries.
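A minimal sketch of both kinds of SVM with scikit-learn, using synthetic moth-like data (the feature distributions below are invented for illustration, not taken from the video):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical features: [wingspan (mm), mass (g)] for two moth species.
emperor = np.column_stack([rng.normal(38, 4, 50), rng.normal(0.6, 0.1, 50)])
luna    = np.column_stack([rng.normal(52, 4, 50), rng.normal(0.9, 0.1, 50)])
X = np.vstack([emperor, luna])
y = np.array([0] * 50 + [1] * 50)   # 0 = Emperor, 1 = Luna

linear_svm = SVC(kernel="linear").fit(X, y)  # linear decision boundary
rbf_svm = SVC(kernel="rbf").fit(X, y)        # nonlinear (kernelized) boundary

print(linear_svm.predict([[40, 0.6]]))  # a small, light moth: class 0 (Emperor)
print(len(linear_svm.support_vectors_), "support vectors guide the hyperplane")
```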
An artificial neural network could also be used for this classification example (see Bonus Material at the end of this submodule!)
Deeper Dive: Simple Healthcare Classifier Use Case
Open-Angle Glaucoma, sourced from Mayo Clinic
To apply these concepts to the healthcare domain, let's review another simple (two-feature) classification example to predict whether an unlabeled observation is glaucomatous.
Glaucoma: a group of conditions that damage the optic nerve, often attributed to high pressure in the eye.
NOTE: While this example is tree-based and uses two features for simplicity and 2D representation, robust diagnostic models in practice often require numerous, thoughtfully chosen features that are contextually sensitive to the population of interest, along with sophisticated nonlinear approaches to classification.
We are going to use two features in our classifier: 1.) age and 2.) intraocular pressure (IOP).
Older age and elevated IOP are both risk factors for the onset of glaucoma. Normal eye pressure ranges from 12 to 22 mm Hg, while an IOP greater than 22 mm Hg indicates elevation beyond the normal range.
Plotting our patient data
Here we have a scatter plot of fictitious patient age and IOP data. Age (in years) is plotted along the x-axis and IOP (in mm Hg) is plotted along the y-axis. Patients diagnosed with glaucoma are plotted in red (n=100) and patients who do not have glaucoma are plotted in black (n=100).
Clusters overlap
Just like in the moth example, the patients clearly form two clusters with slight overlap.
Our first decision boundary
A linear classifier might place a decision boundary horizontally at around 22 mm Hg on the y-axis, implying observations with an IOP of 22 mm Hg or greater are likely glaucomatous.
Another decision boundary
A second decision boundary might be placed vertically at around 25 years on the x-axis, implying that observations older than 25 years with an IOP of 22 mm Hg or greater are likely glaucomatous.
Glaucoma classifications (positive)
Looking solely at our glaucomatous observations, we see that our classifier correctly classified 91 cases, and incorrectly classified 9 cases.
Non-glaucoma classifications (negative)
Looking solely at our non-glaucomatous observations, we see that our classifier correctly classified 90 cases, and incorrectly classified 10 cases.
Let's look at our confusion matrix-based performance metrics:
Accuracy (ACC)
(TP + TN) / (TP + TN + FP + FN) = ACC
(91 + 90) / (91 + 90 + 10 + 9) = 0.905
Precision (or Positive Predictive Value [PPV])
TP / (TP + FP) = PPV
91 / (91 + 10) = 0.901
Negative Predictive Value (NPV)
TN / (TN + FN) = NPV
90 / (90 + 9) = 0.909
Recall (or Sensitivity, True Positive Rate [TPR])
TP / (TP + FN) = TPR
91 / (91 + 9) = 0.91
Specificity (or True Negative Rate [TNR])
TN / (FP + TN) = TNR
90 / (10 + 90) = 0.90
F-Measure (or F1-Score)
2 × ((recall × precision)/(recall + precision)) = F1
2 × ((0.91 × 0.901)/(0.91 + 0.901)) = 0.905
False Positive Rate (FPR; Type I Error)
FP / (FP + TN) = FPR
10 / (10 + 90) = 0.1
False Negative Rate (FNR; Type II Error)
FN / (FN + TP) = FNR
9 / (9 + 91) = 0.09
Matthews Correlation Coefficient (MCC)
(TP × TN - FP × FN) / √((TP + FP) × (TP + FN) × (TN + FP) × (TN + FN)) = MCC
(91 × 90 - 10 × 9) / √((91 + 10) × (91 + 9) × (90 + 10) × (90 + 9)) = 0.81
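These metrics can also be reproduced with scikit-learn by reconstructing label and prediction arrays from the counts above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef

# Reconstruct labels/predictions from the counts above
# (1 = glaucomatous, 0 = non-glaucomatous).
y_true = np.array([1] * 100 + [0] * 100)
y_pred = np.array([1] * 91 + [0] * 9      # glaucomatous patients
                + [0] * 90 + [1] * 10)    # non-glaucomatous patients

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)                                # 91 9 10 90
print(round(f1_score(y_true, y_pred), 3))            # 0.905
print(round(matthews_corrcoef(y_true, y_pred), 2))   # 0.81
```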
If we were to provide an unlabeled observation to our model, the classifier would classify the observation as glaucomatous based on the decision space it falls into.
Deeper Dive: Supervised Learning
Our hypothetical classifiers above are examples of supervised models, because they used labeled data to decide which feature values to divide on in order to maximize correct classifications while minimizing misclassifications.
Let's take a closer look at supervised learning.
Crash Course Artificial Intelligence, Episode #2
Cite as: Brungard B. et al. Supervised Learning: Crash Course AI #2. Vol 2. PBS Digital Studios; 2019.
Watch this episode of Crash Course Artificial Intelligence, hosted by Jabril Ashe.
A few key points:
Supervised learning is the branch of ML which uses labeled data to train a model.
Rosenblatt’s Perceptron was the first algorithmically described supervised neural network whose conceptual underpinning continues to be used in supervised binary classification.
In a supervised binary classifier that uses a step function, the classifier receives inputs that are multiplied by their respective weights and compared to an adjustable threshold weight called the "bias." If the sum of these multiplied inputs is less than the bias, then the neuron outputs a 0. If the sum is greater than the bias, then the neuron will output a 1.
If the classification is correct, weights and biases are not updated. If the classification is incorrect, weights and biases are updated, which also updates the decision boundary.
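The perceptron-style training loop described above can be sketched in a few lines. The tiny dataset here (logical AND) is a stand-in for illustration, not the video's example:

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=0.1):
    """Train a single perceptron with a step activation (Rosenblatt-style)."""
    w = np.zeros(X.shape[1])   # weights, initialized to zero for simplicity
    b = 0.0                    # bias
    for _ in range(epochs):
        for xi, target in zip(X, y):
            output = 1 if xi @ w + b > 0 else 0   # step function
            error = target - output
            if error != 0:                        # update only on mistakes
                w += lr * error * xi
                b += lr * error
    return w, b

# Hypothetical linearly separable data: logical AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print([1 if xi @ w + b > 0 else 0 for xi in X])  # [0, 0, 0, 1]
```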
Deeper Dive: Unsupervised Learning
So far, we've mainly discussed supervised learning based on labeled training, test, and validation datasets. But what happens when our data are not labeled?
In unsupervised learning models, algorithms use unlabeled datasets to detect and choose what features to divide on.
Crash Course Artificial Intelligence, Episode #6
Cite as: Brungard B. et al. Unsupervised Learning: Crash Course AI #6. Vol 6. PBS Digital Studios; 2019.
Watch this episode of Crash Course Artificial Intelligence, hosted by Jabril Ashe.
A few key points:
Unsupervised learning is the branch of ML which uses unlabeled data to train a model.
Computational recognition of shared properties among instances in a dataset is called unsupervised clustering.
K-means is one of the simplest unsupervised clustering algorithms.
K-means relies on the following assumptions:
There are k clusters in a dataset based on pre-selected properties of interest
There is a way to compare observations
There is a way to surmise how many clusters exist in a dataset based on pre-selected properties
The mean is a valid measure of central tendency to create centroids for each predicted cluster
A great thread about k-means algorithm assumptions can be seen here.
Crash Course Statistics, Episode #37
Cite as: Brungard B. et al. Unsupervised Machine Learning: Crash Course Statistics #37. Vol 37. Complexly; 2018.
Watch this episode of Crash Course Statistics, hosted by Adriene Hill.
A few key points:
Expanding on K-means:
The first set of k clusters are determined by arbitrarily placing k centroids and treating them as the center of each cluster.
Each data point is assigned to the cluster of the centroid it is closest to (based on pre-selected properties).
The centroid (mean) is recalculated based on the datapoints now contained in each cluster and the data points are again reassigned.
The process of centroid recalculation and cluster reassignment continues until the centroids converge (i.e., the centroids and clusters stop changing).
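The loop just described can be sketched in NumPy. The two-cluster dataset is synthetic, and the initialization (sampling k data points as starting centroids) is one common choice:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: assign points to the nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary start
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged: clusters stable
            break
        centroids = new_centroids
    return labels, centroids

# Two synthetic, well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids.round(1))
```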
Because unsupervised learning does not use labels, there are no "true" values to compare to, so we are unable to use a confusion matrix to assess unsupervised clustering performance like we can for supervised classification. However, we can use other metrics, like the silhouette score to assess cluster cohesion and separation.
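A brief sketch of the silhouette score with scikit-learn, again on synthetic two-cluster data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),    # synthetic cluster A
               rng.normal(5, 0.5, (30, 2))])   # synthetic cluster B

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Silhouette ranges from -1 to +1; higher means denser, better-separated clusters.
print(round(silhouette_score(X, labels), 2))
```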
Another type of clustering often used in ML is hierarchical clustering.
Hierarchical clustering is used to create a hierarchy of clusters in an unlabeled dataset such that a tree-shaped structure (a dendrogram) is formed.
The "similarity" of clusters is determined by how far up in the dendrogram they join.
Flaws in ML: A Real-World Cancer Prediction Example
ML Systems in the Real World: Cancer Prediction
Cite as: Sculley, D. et al. Machine Learning Crash Course: ML Systems in the Real World: Cancer Prediction. Google Developers; 2020.
Review this mini-lecture offered by D. Sculley from Google's Machine Learning Crash Course.
In this mini-lecture, D. Sculley walks us through a real-world example of how label leakage impacted the generalizability of a cancer prediction model (we'll cover model generalization in Module 2).
Label leakage (also called data leakage, target leakage, or just leakage) occurs when information about the target label is inadvertently present in the training dataset, resulting in excellent model performance on test data but poor performance on new data.
A few key points:
A cancer prediction model was trained using data from patient medical records with features of interest including patient age, gender, prior conditions, hospital name, vital signs, test results, etc.
The model gave excellent performance on withheld test data; however, its performance suffered when using new patient data.
The reason for this was that one feature, hospital name (which included strings like "Beth Israel Cancer Center"), was an obviously reliable indicator for whether a patient had cancer (i.e., patients being treated at cancer centers are far more likely to have cancer than those who are not). Using this feature was a subtle way for the model to "cheat."
NOTE: Even if the string itself did not contain the word "cancer," or if the hospital feature were an anonymized integer, label leakage would still occur because the feature remains highly correlated with the label (patients at cancer treatment centers are far more likely to have cancer).
Exercise: Build a Simple Classifier
In this exercise, we are going to build a simple K-nearest neighbors (KNN) classifier using the Wisconsin Breast Cancer (Diagnostic) Dataset.
Directions:
Make a copy of this Google Colaboratory notebook in your Google Drive account (also found on GitHub)
Follow the annotated directions to generate a KNN classifier
NOTES:
Make revisions as directed in cells where you see "⬅️✏️"
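For orientation before opening the notebook, here is a minimal KNN sketch on the same dataset, which ships with scikit-learn as `load_breast_cancer`. The train/test split and k=5 are illustrative choices, not necessarily the notebook's exact settings:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# The Wisconsin Breast Cancer (Diagnostic) dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Classify each test point by majority vote among its 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(round(accuracy_score(y_test, y_pred), 2))
```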
Share: #ML4Health
Thought prompt: In what circumstances do you think a well-designed and audited ML model is appropriate to deploy in a healthcare or public health setting?
Share your thoughts on Twitter using the hashtags #MDSD4Health #ML4Health
Tag us to join the conversation! @MDSD4Health
For ideas on how to take part in the conversation, check out our Twitter Participation Guide.
Bonus Material!
Classification Using a Neural Network
Think back to Carrie Anne's moth classification example. In addition to using tree-based or SVM approaches, we could also use an artificial neural network. See a reiteration of the example with a high-level summary of the described calculations below.
Mass and wingspan values for our unlabeled moth observation comprise our input layer.
Each input is multiplied by an initially randomly set weight. The weighted inputs are summed together, then an initially randomly set bias is applied. Weights and biases are adjusted iteratively using the labeled data to gradually improve accuracy. An activation function is applied to the result.
This process is executed for all neurons in a layer, and the values propagate forward in the neural net, one layer at a time. The output with the highest value is the classification decision (Luna moth).
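A sketch of that forward pass in NumPy. The network shape (2-3-2), the weights and biases, and the ReLU activation are all invented for illustration, standing in for values a trained network would have learned:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)   # activation function

# Hypothetical weights and biases for a tiny 2-3-2 network; a trained
# network would have learned these from the labeled moth data.
W1 = np.array([[ 0.5, -0.2, 0.1],
               [-0.3,  0.8, 0.4]])     # input -> hidden weights
b1 = np.array([0.1, -0.1, 0.0])        # hidden-layer biases
W2 = np.array([[-0.5,  0.6],
               [ 0.9, -0.4],
               [ 0.3,  0.2]])          # hidden -> output weights
b2 = np.array([-0.05, 0.05])           # output-layer biases

x = np.array([55.0, 0.9])   # unlabeled moth: wingspan (mm), mass (g)

hidden = relu(x @ W1 + b1)   # weight, sum, bias, then activate
output = hidden @ W2 + b2    # one score per class
classes = ["Emperor moth", "Luna moth"]
print(classes[int(np.argmax(output))])   # highest output wins: Luna moth
```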