We are living in a truly defining era of human history, marked by the rapid evolution of computing from mainframes to personal computers and now to cloud computing. But what truly sets this period apart is the democratization of various tools, techniques, and machine learning algorithms that have followed this computing revolution. As a data scientist, I am excited about the endless possibilities that lie ahead.
Today, I can build data-crunching machines with complex algorithms for a few dollars per hour. But let me tell you, the journey to get here was not easy. I had to endure many dark days and long nights.
In this guide, I will focus on commonly used machine learning techniques and algorithms, including linear regression, logistic regression, Naive Bayes, kNN, and random forest, among others. I will cover both the theory and implementation of these algorithms in R and Python. This guide is specifically curated for beginners looking to start their data science journey and learn machine learning models from scratch.
Who can benefit the most from this guide? Anyone who is interested in machine learning, artificial intelligence, or data science will find this guide valuable. Through this guide, I will enable aspiring data scientists and machine learning enthusiasts from all over the world to work on machine-learning problems and gain experience. I will provide a high-level understanding of various machine learning algorithms, along with R and Python codes to run them.
There are three main types of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning algorithms involve predicting a target variable from a given set of predictors, while unsupervised learning algorithms do not have any target variable to predict or estimate. Reinforcement learning algorithms train machines to make specific decisions based on trial and error.
I have deliberately skipped the statistics behind these techniques and artificial neural networks, as you don’t need to understand them initially. However, if you want to equip yourself to start building a machine learning project, this guide is an excellent starting point. Let’s dive in!
Who Can Benefit the Most From This Guide?
What I am giving out today is probably the most valuable guide I have ever created.
The idea behind creating this guide is to simplify the journey of aspiring data scientists and machine learning (which is part of artificial intelligence) enthusiasts across the world. Through this guide, I will enable you to work on machine-learning problems and gain from experience. I am providing a high-level understanding of various machine learning algorithms along with R & Python codes to run them. These should be sufficient to get your hands dirty. You can also check out our Machine Learning Course.
Essentials of machine learning algorithms with implementation in R and Python
I have deliberately skipped the statistics behind these techniques and artificial neural networks, as you don’t need to understand them initially. So, if you are looking for a statistical understanding of these algorithms, you should look elsewhere. But, if you want to equip yourself to start building a machine learning project, you are in for a treat.
3 Types of Machine Learning Algorithms
Supervised Learning Algorithms
How it works: This algorithm consists of a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps input data to desired outputs. The training process continues until the model achieves the desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.
Unsupervised Learning Algorithms
How it works: In this algorithm, we do not have any target or outcome variable to predict / estimate (which is called unlabelled data). It is used for recommendation systems or clustering populations in different groups. clustering algorithms are widely used for segmenting customers into different groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means clustering.
Reinforcement Learning Algorithms
How it works: Using this algorithm, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually using trial and error. This machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov Decision Process
List of Top 10 Common Machine Learning Algorithms
Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:
- Linear Regression
- Logistic Regression
- Decision Tree
- Naive Bayes
- Random Forest
- Dimensionality Reduction Algorithms
- Gradient Boosting algorithms
It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on a continuous variable(s). Here, we establish the relationship between independent and dependent variables by fitting the best line. This best-fit line is known as the regression line and is represented by a linear equation Y= a*X + b.
The best way to understand linear regression is to relive this experience of childhood. Let us say you ask a child in fifth grade to arrange people in his class by increasing the order of weight without asking them their weights! What do you think the child will do? He/she would likely look (visually analyze) at the height and build of people and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build would be correlated to weight by a relationship, which looks like the equation above.
In this equation:
- Y – Dependent Variable
- a – Slope
- X – Independent variable
- b – Intercept
These coefficients a and b are derived based on minimizing the sum of the squared difference of distance between data points and the regression line.
Look at the below example. Here we have identified the best-fit line having linear equation y=0.2811x+13.9. Now using this equation, we can find the weight, knowing the height of a person.
Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable. And, Multiple Linear Regression(as the name suggests) is characterized by multiple (more than 1) independent variables. While finding the best-fit line, you can fit a polynomial or curvilinear regression. And these are known as polynomial or curvilinear regression.
Don’t get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values ( Binary values like 0/1, yes/no, true/false ) based on a given set of independent variable(s). In simple words, it predicts the probability of the occurrence of an event by fitting data to a logistic function. Hence, it is also known as logit regression. Since it predicts the probability, its output values lie between 0 and 1 (as expected).
Again, let us try and understand this through a simple example.
Let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios – either you solve it, or you don’t. Now imagine that you are being given a wide range of puzzles/quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this – if you are given a trigonometry-based tenth-grade problem, you are 70% likely to solve it. On the other hand, if it is a grade fifth history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you.
Coming to the math, the log odds of the outcome are modeled as a linear combination of the predictor variables.
odds= p/ (1-p) = probability of event occurrence / probability of not event occurrence
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk
Above, p is the probability of the presence of the characteristic of interest. It chooses parameters that maximize the likelihood of observing the sample values rather than that minimize the sum of squared errors (like in ordinary regression).
Now, you may ask, why take a log? For the sake of simplicity, let’s just say that this is one of the best mathematical ways to replicate a step function. I can go into more details, but that will beat the purpose of this article.
This is one of my favorite algorithms, and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes/ independent variables to make as distinct groups as possible. For more details, you can read Decision Tree Simplified.
In the image above, you can see that population is classified into four different groups based on multiple attributes to identify ‘if they will play or not’. To split the population into different heterogeneous groups, it uses various techniques like Gini, Information Gain, Chi-square, and entropy.
The best way to understand how the decision tree works, is to play Jezzball – a classic game from Microsoft (image below). Essentially, you have a room with moving walls and you need to create walls such that the maximum area gets cleared off without the balls.
So, every time you split the room with a wall, you are trying to create 2 different populations within the same room. Decision trees work in a very similar fashion by dividing a population into as different groups as possible.
Starting off with an inkling of the frequently utilized machine learning algorithms, it is safe to assume that you now possess some understanding. This article was crafted with the express purpose of facilitating your prompt commencement, furnished with R and Python codes. Should you possess a zealous aspiration to dominate machine learning algorithms, then why not begin immediately? Select some problems, cultivate a tangible comprehension of the procedure, utilize these codes, and savor the amusement!