In supervised learning, every training sample from the dataset has a corresponding label or output value associated with it. As a result, the algorithm learns to predict labels or output values.
In data mining, supervised learning can be separated into two types of problems: classification and regression.
Classification uses an algorithm to accurately assign test data into specific categories. It recognizes specific entities within the dataset and attempts to draw some conclusions on how those entities should be labeled or defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor, and random forest, which are described in more detail below.
Regression is used to understand the relationship between dependent and independent variables. It is commonly used to make projections, such as for sales revenue for a given business. Linear regression, logistic regression, and polynomial regression are popular regression algorithms.
Neural networks
Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold), and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting their weights based on the loss function through the process of gradient descent. When the cost function is at or near zero, we can be fairly confident that the model will yield the correct answer.
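To make the idea concrete, here is a minimal sketch (not any particular library's implementation) of a one-hidden-layer network trained with gradient descent on the XOR problem; the architecture, learning rate, and iteration count are illustrative choices only.

```python
# A tiny one-hidden-layer network trained with gradient descent on XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights and biases for the hidden and output layers.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(10000):
    # Forward pass: weighted inputs plus bias, passed through an activation.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Gradients of the mean squared error loss (backpropagation).
    grad_out = (out - y) * out * (1 - out)
    grad_h = (grad_out @ W2.T) * h * (1 - h)

    # Gradient descent step on every parameter.
    W2 -= lr * h.T @ grad_out
    b2 -= lr * grad_out.sum(axis=0)
    W1 -= lr * X.T @ grad_h
    b1 -= lr * grad_h.sum(axis=0)

print(np.round(out, 2))  # should approach [0, 1, 1, 0] as the loss nears zero
```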
Naive Bayes
Naive Bayes is a classification approach that adopts the principle of class conditional independence from the Bayes Theorem. This means that the presence of one feature does not impact the presence of another in the probability of a given outcome, and each predictor has an equal effect on that result. There are three types of Naive Bayes classifiers: Multinomial Naive Bayes, Bernoulli Naive Bayes, and Gaussian Naive Bayes. This technique is primarily used in text classification, spam identification, and recommendation systems.
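As a quick illustration, the sketch below assumes scikit-learn is available and trains a Multinomial Naive Bayes spam classifier on a tiny made-up corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash offer", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)          # bag-of-words counts per message
model = MultinomialNB().fit(X, labels)       # learns per-class word probabilities

print(model.predict(vectorizer.transform(["free prize offer"])))  # likely ['spam']
```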
Linear regression
Linear regression is used to identify the relationship between a dependent variable and one or more independent variables and is typically leveraged to make predictions about future outcomes. When there is only one independent variable and one dependent variable, it is known as simple linear regression. As the number of independent variables increases, it is referred to as multiple linear regression. Both types of linear regression seek to plot a line of best fit, which is calculated through the method of least squares. However, unlike other regression models, this line is straight when plotted on a graph.
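A minimal example of the least-squares fit, using NumPy on made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # independent variable
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])       # dependent variable

# Design matrix with a column of ones so the intercept is fitted too.
A = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

print(f"y = {slope:.2f} * x + {intercept:.2f}")  # the straight line of best fit
```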
Logistic regression
While linear regression is leveraged when dependent variables are continuous, logistic regression is selected when the dependent variable is categorical, meaning it has binary outputs, such as "true" and "false" or "yes" and "no." While both regression models seek to understand relationships between data inputs, logistic regression is mainly used to solve binary classification problems, such as spam identification.
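A small sketch of binary classification with logistic regression, assuming scikit-learn is available and using invented data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.4], [0.35], [0.8], [0.9], [0.75]])  # a single feature
y = np.array([0, 0, 0, 1, 1, 1])                            # binary labels

model = LogisticRegression().fit(X, y)
print(model.predict([[0.2], [0.85]]))        # expected: [0 1]
print(model.predict_proba([[0.85]]))         # class probabilities, not just labels
```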
Support vector machine (SVM)
A support vector machine is a popular supervised learning model developed by Vladimir Vapnik, used for both data classification and regression. That said, it is typically leveraged for classification problems, constructing a hyperplane where the distance between two classes of data points is at its maximum. This hyperplane is known as the decision boundary, separating the classes of data points (e.g., oranges vs. apples) on either side of the plane.
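For illustration, the sketch below (scikit-learn assumed) fits a linear-kernel SVM that finds the maximum-margin hyperplane between two toy classes of 2-D points:

```python
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],      # class 0 (e.g., "apples")
     [6, 5], [7, 6], [6, 7]]      # class 1 (e.g., "oranges")
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[2, 2], [7, 7]]))   # expected: [0 1]
print(clf.support_vectors_)            # the points that define the decision boundary
```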
K-nearest neighbor
K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm that classifies data points based on their proximity and association to other available data. This algorithm assumes that similar data points can be found near each other. As a result, it seeks to calculate the distance between data points, usually through Euclidean distance, and then it assigns a label based on the most frequent category (for classification) or the average value (for regression) among the nearest neighbors.
Its ease of use and low calculation time make it a preferred algorithm by data scientists, but as the test dataset grows, the processing time lengthens, making it less appealing for classification tasks. KNN is typically used for recommendation engines and image recognition.
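A short KNN sketch, assuming scikit-learn and a handful of made-up 2-D points:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 2],      # class "a"
     [8, 8], [8, 9], [9, 9]]      # class "b"
y = ["a", "a", "a", "b", "b", "b"]

knn = KNeighborsClassifier(n_neighbors=3)   # Euclidean distance is the default metric
knn.fit(X, y)                               # "training" just stores the points
print(knn.predict([[2, 1], [9, 8]]))        # expected: ['a' 'b']
```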
Random forest
Random forest is another flexible supervised machine learning algorithm used for both classification and regression purposes. The "forest" references a collection of uncorrelated decision trees, which are then merged together to reduce variance and create more accurate data predictions.
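As a quick illustration, the sketch below (scikit-learn assumed) trains a forest of 100 decision trees on one of the library's bundled toy datasets:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 trees
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the merged trees' predictions
```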
In reinforcement learning, the algorithm figures out which actions to take in a situation to maximize a reward (in the form of a number) on the way to reaching a specific goal.
The Reinforcement Learning problem involves an agent exploring an unknown environment to achieve a goal. RL is based on the hypothesis that all goals can be described by the maximization of expected cumulative reward. The agent must learn to sense and perturb the state of the environment using its actions to derive maximal reward. The formal framework for RL borrows from the problem of optimal control of Markov Decision Processes (MDP).
The main elements of an RL system are listed below; the toy sketch after the list shows how they fit together:
The agent or the learner
The environment the agent interacts with
The policy that the agent follows to take actions
The reward signal that the agent observes upon taking actions
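Here is that toy sketch: tabular Q-learning on a five-cell corridor where the agent starts at cell 0 and is rewarded for reaching cell 4. The environment, reward values, and hyperparameters are invented purely for illustration.

```python
import random

n_states, actions = 5, [-1, +1]          # environment: move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration rate

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Policy: epsilon-greedy over the current Q-value estimates.
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])

        next_state = min(max(state + action, 0), n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0   # reward signal

        # Q-learning update toward the reward plus discounted future value.
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy policy should always move right (+1) toward the goal.
print([max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)])
```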
Robotics. Robots with pre-programmed behavior are useful in structured environments, such as the assembly line of an automobile manufacturing plant, where the task is repetitive in nature. In the real world, where the response of the environment to the behavior of the robot is uncertain, pre-programming accurate actions is nearly impossible. In such scenarios, RL provides an efficient way to build general-purpose robots. It has been successfully applied to robotic path planning, where a robot must find a short, smooth, and navigable path between two locations, free of collisions and compatible with the dynamics of the robot.
AlphaGo. One of the most complex strategic games is a 3,000-year-old Chinese board game called Go. Its complexity stems from the fact that there are 10^270 possible board combinations, several orders of magnitude more than the game of chess. In 2016, an RL-based Go agent called AlphaGo defeated one of the world's best human Go players. Much like a human player, it learned by experience, playing thousands of games with professional players. The latest RL-based Go agent has the capability to learn by playing against itself, an advantage that the human player doesn’t have.
Autonomous Driving. An autonomous driving system must perform multiple perception and planning tasks in an uncertain environment. Some specific tasks where RL finds application include vehicle path planning and motion prediction. Vehicle path planning requires several low and high-level policies to make decisions over varying temporal and spatial scales. Motion prediction is the task of predicting the movement of pedestrians and other vehicles, to understand how the situation might develop based on the current state of the environment.
In unsupervised learning, there are no labels for the training data. A machine learning algorithm tries to learn the underlying patterns or distributions that govern the data.
Simply put, unsupervised learning works by analyzing uncategorized, unlabeled data and finding hidden structures in it.
In supervised learning, a data scientist feeds the system with labeled data, for example, the images of cats labeled as cats, allowing it to learn by example. In unsupervised learning, a data scientist provides just the photos, and it's the system's responsibility to analyze the data and conclude whether they're the images of cats.
Unsupervised machine learning requires massive volumes of data. In most cases, the same is true for supervised learning as the model becomes more accurate with more examples.
The process of unsupervised learning begins with the data scientists training the algorithms using the training datasets. The data points in these datasets are unlabeled and uncategorized.
The algorithm’s learning goal is to identify patterns within the dataset and categorize the data points based on those patterns. In the example of cat images, the unsupervised learning algorithm can learn to identify the distinct features of cats, such as their whiskers, long tails, and retractable claws.
If you think about it, unsupervised learning is how we learn to identify and categorize things. Suppose you've never tasted ketchup or chili sauce. If you're given one unlabeled bottle of each and asked to taste them, you'll be able to differentiate between their flavors.
You'll also be able to identify the peculiarities of both the sauces (one being sour and the other spicy) even if you don't know the names of either. Tasting each a few more times will make you more familiar with the flavor. Soon, you'll be able to group dishes based on the sauce added just by tasting them.
By analyzing the taste, you can find specific features that differentiate the two sauces and group dishes. You don't need to know the sauces' names or that of the dishes to categorize them. You might even end up calling one the sweet sauce and the other hot sauce.
This is similar to how machines identify patterns and classify data points with the help of unsupervised learning. In the same example, supervised learning would be someone telling you the names of both the sauces and how they taste beforehand.
Apriori algorithm
The Apriori algorithm is built for mining frequent itemsets and association rules. It's useful for mining databases containing a large number of transactions, for example, a database containing the list of items bought by shoppers in a supermarket. It is used for identifying harmful drug interactions and in market basket analysis to find the sets of items customers are more likely to buy together.
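A minimal market-basket sketch, assuming the third-party mlxtend library is installed; the transactions are made up:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers"],
                ["bread", "milk", "beer"]]

# One-hot encode the transactions into a boolean item matrix.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets that appear in at least 40% of the baskets.
print(apriori(df, min_support=0.4, use_colnames=True))
```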
ECLAT algorithm
Equivalence Class Clustering and bottom-up Lattice Traversal, or ECLAT for short, is a data mining algorithm used to find frequent itemsets.
The Apriori algorithm uses a horizontal data format and therefore needs to scan the database multiple times to identify frequent itemsets. ECLAT, on the other hand, follows a vertical approach and is generally faster, as it needs to scan the database only once.
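A hand-rolled sketch of the ECLAT idea, using invented transactions and an arbitrary support threshold, to show the vertical TID-list representation:

```python
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"}]
min_support = 2

# Vertical format: item -> set of transaction ids containing it (one scan).
tidlists = {}
for tid, items in enumerate(transactions):
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

frequent = {frozenset([i]): tids for i, tids in tidlists.items() if len(tids) >= min_support}

# Frequent pairs come from intersecting TID-lists, not from rescanning the data.
for a, b in combinations(sorted(tidlists), 2):
    shared = tidlists[a] & tidlists[b]
    if len(shared) >= min_support:
        frequent[frozenset([a, b])] = shared

for itemset, tids in frequent.items():
    print(set(itemset), "support:", len(tids))
```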
Frequent pattern (FP) growth algorithm
The frequent pattern (FP) growth algorithm is an improved version of the Apriori algorithm. This algorithm represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree).
This tree is used for mining the most frequent patterns. While the Apriori algorithm may need to scan the database n+1 times (where n is the length of the longest frequent pattern), the FP-growth algorithm requires just two scans.
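With mlxtend (assumed installed), FP-growth can be run on the same made-up baskets used in the Apriori sketch above, with only the mining function swapped out:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers"],
                ["bread", "milk", "beer"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# FP-growth builds the FP-tree internally and returns the same frequent itemsets.
print(fpgrowth(df, min_support=0.4, use_colnames=True))
```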
K-means clustering
Many variations of the k-means algorithm are widely used in the field of data science. Simply put, the k-means clustering algorithm groups similar items into clusters. The number of clusters is represented by k. So if the value of k is 3, there will be three clusters in total.
This clustering method divides the unlabeled dataset so that each data point belongs to only a single group with similar properties. The key is to find k centers called cluster centroids.
Each cluster will have one cluster centroid, and on seeing a new data point, the algorithm will determine the closest cluster to which the data point belongs based on metrics like the Euclidean distance.
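A small k-means sketch with k = 3, assuming scikit-learn and made-up 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 1.5],      # one blob
              [8, 8], [8.5, 8], [8, 9],        # another blob
              [1, 8], [1.5, 8.5], [1, 9]])     # a third blob

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)                   # the three cluster centroids
print(kmeans.predict([[1.2, 1.3], [8, 8.5]]))    # nearest-centroid assignment for new points
```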
Principal component analysis (PCA)
Principal component analysis (PCA) is a dimensionality-reduction method generally used to reduce the dimensionality of large datasets. It does this by converting a large set of variables into a smaller one that still contains most of the information in the large dataset.
Reducing the number of variables might affect the accuracy slightly, but it could be an acceptable tradeoff for simplicity. That's because smaller datasets are easier to analyze, and machine learning algorithms don't have to sweat much to derive valuable insights.
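For example, the sketch below (scikit-learn assumed) compresses the four-feature iris dataset down to two principal components while retaining most of the variance:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)            # 150 samples, 4 variables each
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)             # 150 samples, 2 variables each

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained (close to 1 here)
```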