
Introduction to Machine Learning

Assignment 1: Classifier by Hand #

In this assignment, you will learn step by step how to code a binary classifier by hand!

Don’t worry, you will be guided through it.

Introduction: calorimeter showers #

In experimental particle physics, a calorimeter is a sub-detector that measures the energy of incoming particles. At CERN's Large Hadron Collider, the giant multipurpose detectors ATLAS and CMS are both equipped with electromagnetic and hadronic calorimeters. The electromagnetic calorimeter measures the energy of incoming electrons. It is a destructive method: the energetic electron entering the calorimeter interacts with its dense material via the electromagnetic force. This eventually generates a shower of particles (an electromagnetic shower) with a characteristic average depth and width. The depth is measured along the direction of the incoming particle, and the width is the dimension perpendicular to it.

Problem? There can be noisy signals in electromagnetic calorimeters that are generated by hadrons, not electrons.

Your mission is to help physicists by coding a classifier to select electron-showers (signal) from hadron-showers (background).

To this end, you are given a dataset of shower characteristics from previous measurements where the incoming particle was known. The main features are the depth and the width.

../_images/a01_showers.png

Visualization of an electron shower (left) and hadron shower (right). From ResearchGate #

Hadron showers are on average longer in the direction of the incoming hadron (depth) and larger in the transverse direction (width).

../_images/a01_showers_distribs.png

1. Get the Data #

Download the dataset and put it on your GDrive. Open a new Jupyter-notebook from Google-Colab . To mount your drive:

For the following, import the NumPy and pandas libraries:

1.1 Get and load the data Read the dataset into a dataframe df . What are the columns? Which column stores the labels (targets)?

1.2 How many samples are there?

2. Feature Scaling #

2.1 Explain If the parameters are initialized randomly between 0 and 1 and the data are not zero-centered, what happens to the gradient descent? Explain the behaviour.

2.2 Standardization Create for each input feature an extra column in the dataframe to rescale it to a distribution of zero-mean and unit-variance. To see statistical information on a dataframe, a convenient method is:

We will take \(x_1\) and \(x_2\) in the order of the dataframe’s columns. By searching in the documentation for methods retrieving the mean and standard deviation, complete the following code:

Hint: recall Definition 12 and the equation to scale a feature according to the standardization method.

Check your results by calling df.describe() on the updated dataframe.
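As a minimal sketch of the standardization step, here is the idea on a toy dataframe. The column names `depth` and `width` and all values are assumptions made for illustration; the actual dataset may use different names.

```python
import pandas as pd

# Toy data; column names and values are hypothetical stand-ins for the real dataset.
df = pd.DataFrame({"depth": [10.0, 12.0, 14.0, 16.0],
                   "width": [1.0, 3.0, 5.0, 7.0]})

for col in ["depth", "width"]:
    mean = df[col].mean()   # sample mean of the feature
    std = df[col].std()     # sample standard deviation (pandas uses ddof=1 by default)
    df[col + "_scaled"] = (df[col] - mean) / std

# The scaled columns should show a mean close to 0 and a std close to 1.
print(df.describe())
```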

3. Data Prep #

Let’s make the dataset ready for the classifier. As seen in class, the hypothesis function in the linear assumption has the dot product \(\sum_{j=0}^n x^{(i)}_j \theta_j = x^{(i)} \theta^{\; T}\) , where by convention \(x_0 = 1\) . With 2 input features, there are three parameters to optimize: one for each feature, plus the intercept term \(\theta_0\) . To perform the dot product above, let’s add a column to the dataframe:

3.1 Adding x0 column Add a column x0 to the dataframe df filled entirely with ones.

3.2 Matrix X Create a new dataframe X that contains the x0 column and the columns of the two scaled input features.

3.3 Labels to binary The target column contains alphabetical labels. Create an extra column in your dataframe called y containing 1 if the sample is an electron shower and 0 if it is a hadron one.

Hint: you can first create an extra column full of 0, then apply a filter using the .loc property from DataFrame. It can be in the form:

Read the pandas documentation on .loc .
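A small sketch of the `.loc` filter pattern mentioned in the hint, on a toy dataframe. The label column name `type` and the string labels are assumptions; check the real column names in your dataset.

```python
import pandas as pd

# Hypothetical label column "type" with string labels; the real dataset may differ.
df = pd.DataFrame({"type": ["electron", "hadron", "electron", "hadron"]})

df["y"] = 0                                  # start with every sample marked as background
df.loc[df["type"] == "electron", "y"] = 1    # set y to 1 on the rows passing the filter

print(df)
```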

3.4 Vector y Extract from the dataframe this y column with the binary labels in a separate dataframe y .

4. DataFrames to Numpy #

The inputs are almost ready, yet there are still some steps. As we saw in Lecture 3 in Part Performance Metrics , we need to split the dataset into a training set and a testing set. We will use the very convenient train_test_split function from Scikit-Learn. Then, we convert the resulting dataframes to NumPy arrays. NumPy is a standard Python library in data science. In this assignment, you will manipulate NumPy arrays and build the foundations you need to master the programming aspect of machine learning.

Copy this code into your notebook:

4.1 Shapes Show the dimensions of the four NumPy arrays using the .shape property. Comment on the numbers with respect to the notations defined in class. Does it make sense?

4.2 Test size Looking at the shapes, explain what test_size represents.
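For illustration only (this is not the assignment's original snippet), a split followed by a conversion to NumPy could look like the sketch below. The toy dataframes, the `test_size` value and the `random_state` value are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X (with the x0 column of ones) and y; all values are made up.
X = pd.DataFrame({"x0": np.ones(10),
                  "x1": np.linspace(-1.0, 1.0, 10),
                  "x2": np.linspace(1.0, -1.0, 10)})
y = pd.DataFrame({"y": [0, 1] * 5})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Convert the four dataframes to 2D NumPy arrays.
X_train, X_test = X_train.to_numpy(), X_test.to_numpy()
y_train, y_test = y_train.to_numpy(), y_test.to_numpy()

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```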

5. Useful Functions #

We saw a lot of functions in the lectures: hypothesis, logistic, cost, etc. We will code these functions as python functions to make the code more modular and easier to read.

In the following you will work with two dimensional NumPy arrays. Make sure the objects you declare in the code have the correct dimension (number of rows and number of columns).

To help you, look at the examples below. This creates a 2 x 3 matrix:

This is a list:

This makes a 1D NumPy array:

This declares a 2D NumPy array with one row (aka a row vector):
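The four cases listed above might look like the following sketch (the variable names are illustrative, and a column vector is added for comparison):

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # 2 x 3 matrix, shape (2, 3)
a_list = [1, 2, 3]                    # plain Python list, it has no .shape attribute
array_1d = np.array([1, 2, 3])        # 1D NumPy array, shape (3,)
row_vector = np.array([[1, 2, 3]])    # 2D array with one row, shape (1, 3)
col_vector = row_vector.T             # 2D array with one column, shape (3, 1)

for name, arr in [("matrix", matrix), ("array_1d", array_1d),
                  ("row_vector", row_vector), ("col_vector", col_vector)]:
    print(name, arr.shape)
```

Note the double brackets: they are what distinguishes a 2D row vector from a 1D array.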

5.1 Linear Sum Write a function computing the linear sum of the features \(X\) with the \(\boldsymbol{\theta}\) parameters for each input sample.

What should be the dimensions of the returned object? Make sure your function returns a 2D array with the correct shape.

5.2 Logistic Function Write a function computing the logistic function:

5.3 Hypothesis Function Using the two functions above, write the hypothesis function \(h_\theta(\boldsymbol{x^{(i)}})\) :
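One possible shape for these three functions is sketched below. It assumes the parameters are stored as a row vector of shape (1, n+1) and uses the usual logistic function \(1 / (1 + e^{-z})\); the function names other than h_class are illustrative choices.

```python
import numpy as np

def linear_sum(X, thetas):
    """Return z = X . theta^T as an (m, 1) column vector.

    X has shape (m, n+1); thetas is assumed to be a row vector of shape (1, n+1).
    """
    return X @ thetas.T

def logistic(z):
    """Element-wise logistic (sigmoid) function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h_class(X, thetas):
    """Hypothesis: the logistic function applied to the linear sum."""
    return logistic(linear_sum(X, thetas))

X = np.array([[1.0, 0.5, -1.0],
              [1.0, -2.0, 0.3]])      # two samples, with x0 = 1 in the first column
thetas = np.array([[0.0, 0.0, 0.0]])  # all-zero parameters give h = 0.5 for every sample
print(h_class(X, thetas))
```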

5.4 Partial Derivatives of Cross-Entropy Cost Function In the linear assumption where \(z(\boldsymbol{x^{(i)}}) = \sum_{j=0}^n \theta_j x^{(i)}_j\) , the partial derivatives of the cross-entropy cost function are:

Write a function that takes three column vectors (m \(\times\) 1) and computes the partial derivatives \(\frac{\partial}{\partial \theta_j} J(\theta)\) for a given feature \(j\) :

Hint: perform an array operation to store all the derivatives in a column vector. Then sum over the elements of that vector. At the end your function should return a scalar, i.e. one value.

5.5 Cross-Entropy Cost Function Write a function computing the total cost from the 2D column vectors of predictions and observations:
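A hedged sketch of both functions follows. It assumes the cost is averaged over the m samples (a 1/m factor); if the lecture notes define the cost without this factor, drop the division. The function names are illustrative.

```python
import numpy as np

def derivative_cross_entropy(y_pred, y_obs, x_feature):
    """Partial derivative of the cost with respect to theta_j, returned as a scalar.

    All three inputs are (m, 1) column vectors; x_feature holds x_j for every sample.
    Assumes the cost is averaged over the m samples.
    """
    m = y_obs.shape[0]
    return float(np.sum((y_pred - y_obs) * x_feature) / m)

def cross_entropy_cost(y_pred, y_obs):
    """Average cross-entropy between predictions and binary observations."""
    m = y_obs.shape[0]
    costs = -(y_obs * np.log(y_pred) + (1 - y_obs) * np.log(1 - y_pred))
    return float(np.sum(costs) / m)

# Made-up predictions and labels for three samples.
y_pred = np.array([[0.9], [0.2], [0.6]])
y_obs = np.array([[1.0], [0.0], [1.0]])
x_j = np.array([[1.0], [1.0], [1.0]])
print(derivative_cross_entropy(y_pred, y_obs, x_j), cross_entropy_cost(y_pred, y_obs))
```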

6. Classifier #

The core of the action.

Luckily a skeleton is provided. You will have to replace the statements # ... with proper code. It will mostly consist of calling the functions you defined in the previous section.

Test your code frequently. To do so, you can assign dummy temporary values for the variables you do not use yet, so that python knows they are defined and your code can run.

If you struggle or cannot finish, summarize in your notebook your trials and investigations.

7. Plot cost versus epochs #

Use the following macro to plot the variation of the total cost vs the iteration number:

Call this macro:

You should get something like this:

../_images/a01_cost_vs_N.png

7.1: Describe the plot; what is the fundamental difference between the two series train and test?

7.2: What would it mean if there were a bigger gap between the test and training values of the cost?

8. Performance #

We will write our own functions to quantitatively assess the performance of the classifier.

Before counting the true and false predictions, we need… predictions! We already wrote a function h_class outputting a prediction as a continuous variable between 0 and 1 , equivalent to a probability. The function below calls h_class and then fills a Python list of binary predictions, so either 0 or 1. For the boundary, recall from the lecture that the sigmoid is symmetric around y = 0.5, so we will work with this boundary for now. Copy this to your notebook:

Call the function:

We will work with lists from now on, so flatten the observed test values:

8.1 Accuracy Write a function computing the accuracy of the classifier:

Call your function using the test set and print the result.
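As a minimal sketch, assuming the predictions and the observations are two Python lists of 0/1 values of equal length (the function name and toy lists are illustrative):

```python
def accuracy(y_pred, y_obs):
    """Fraction of predictions equal to the observed labels (both are lists of 0/1)."""
    n_correct = sum(1 for p, o in zip(y_pred, y_obs) if p == o)
    return n_correct / len(y_obs)

# Toy lists: three of the four predictions match the observations.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```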

Call your function, still with the test set of course, and print the result.

BRAVO! You now know the math behind a binary classifier!

X. BONUS: Decision Boundaries #

This is for advanced programmers and/or your curiosity and/or if you have the time. Bonus points will be given even if you answer with math equations only and not necessarily the associated Python code. Of course, if you succeed in writing the Python code, more bonus points for you!

Goal We want to draw on a scatter plot the lines corresponding to different decision boundaries.

Scroll down to the very end to see where we are heading.

X.0 Scatter plot The first step is to split the signal and background into two different dataframes. Using the general dataframe df defined at the beginning:

The plotting macro:

To call it:

X.1 Useful functions Recall the logistic function:

Write a function rev_sigmoid that outputs the value \(z = f(\hat{y})\) .

Write a function scale_inputs that scales a list of raw input features, either \(x_1\) or \(x_2\) , according to the standardization procedure.

Write the function unscale_inputs that does the contrary.
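A possible sketch of the three helpers is shown below. The signatures, in particular passing the mean and standard deviation explicitly, are an assumption made for illustration, not the assignment's reference solution.

```python
import math

def rev_sigmoid(y_hat):
    """Inverse of the logistic function: z = ln(y_hat / (1 - y_hat))."""
    return math.log(y_hat / (1.0 - y_hat))

def scale_inputs(values, mean, std):
    """Standardize a list of raw values with a given mean and standard deviation."""
    return [(v - mean) / std for v in values]

def unscale_inputs(scaled, mean, std):
    """Map standardized values back to the raw scale."""
    return [s * std + mean for s in scaled]

raw = [10.0, 20.0, 30.0]
scaled = scale_inputs(raw, mean=20.0, std=10.0)
print(scaled, unscale_inputs(scaled, mean=20.0, std=10.0), rev_sigmoid(0.5))
```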

X.2 Equation For a given threshold \(\hat{y}\) , write the equation of the line boundary: \(x_2 = f(\boldsymbol{\theta}, x_1, \hat{y})\) .

X.3 Coordinate points To draw a line on a plot in Matplotlib, one needs to provide the coordinates as a set of two data points.

Write a function that computes the coordinates x2_left and x2_right (associated with the values of x1_min and x1_max respectively) of a decision boundary line at a given threshold \(\hat{y}\) (recall that 0.5 is the standard one for logistic regression).

Warrior-level bonus: compute this for several thresholds, i.e. the function returns a list of line properties. Tip: it’s convenient to store the result in a dictionary. For instance you can have keys threshold , x2_left , x2_right .
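One way to organize this is sketched below, working entirely in standardized units and assuming \(\theta_2 \neq 0\). On the boundary the linear sum equals the inverse logistic of the threshold, so \(x_2 = (z - \theta_0 - \theta_1 x_1) / \theta_2\). The function names and the parameter values are hypothetical.

```python
import math

def boundary_x2(x1_scaled, thetas, threshold=0.5):
    """x2 on the decision boundary for a given x1, both in standardized units.

    thetas = (theta0, theta1, theta2); assumes theta2 != 0.
    """
    theta0, theta1, theta2 = thetas
    z = math.log(threshold / (1.0 - threshold))   # inverse logistic of the threshold
    return (z - theta0 - theta1 * x1_scaled) / theta2

def boundary_lines(x1_min, x1_max, thetas, thresholds):
    """Return one dictionary of line end points per threshold."""
    return [{"threshold": t,
             "x2_left": boundary_x2(x1_min, thetas, t),
             "x2_right": boundary_x2(x1_max, thetas, t)} for t in thresholds]

# Hypothetical parameter values, for illustration only.
lines = boundary_lines(-2.0, 2.0, thetas=(0.5, 1.0, -2.0), thresholds=[0.2, 0.5, 0.8])
for line in lines:
    print(line)
```

To draw the lines on the raw-feature scatter plot, the end points would still need to be converted back from standardized to raw units.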

X.4 Plotting the boundaries (advanced) In the scatter plot code provided, uncomment the boundary section and draw the line(s) using the Matplotlib plot function.

In the very end, this is how it would render:

../_images/a01_scatter_with_boundaries.png

Scatter plot of electron and hadron showers with decision boundary lines for various thresholds. Code will be shown while releasing solutions. #

The higher the threshold, the more the boundary line shifts downwards in the electron-dense area. Why is that the case?

  You are encouraged to work in groups of two, however submissions are individual.

If you have received help from your peers and/or have worked within a group, summarize in the header of your notebook the involved students, how they helped you and the contributions of each group member. This is to ensure fairness in the evaluation.

You can use the internet such as the official pages of relevant libraries, forum websites to debug, etc. However, using an AI such as ChatGPT would be considered cheating (and not good for you anyway to develop your programming skills).

The instructor and tutors are available throughout the week to answer your questions. Write an email with your well-articulated question(s). Put in CC your teammates if any.

Thank you and do not forget to have fun while coding!

Machine Learning Fundamentals Handbook – Key Concepts, Algorithms, and Python Code Examples

Tatev Aslanyan

If you're planning to become a Machine Learning Engineer, Data Scientist, or you want to refresh your memory before your interviews, this handbook is for you.

In it, we'll cover the key Machine Learning algorithms you'll need to know as a Data Scientist, Machine Learning Engineer, Machine Learning Researcher, and AI Engineer.

Throughout this handbook, I'll include examples for each Machine Learning algorithm with its Python code to help you understand what you're learning.

Whether you're a beginner or have some experience with Machine Learning or AI, this guide is designed to help you understand the fundamentals of Machine Learning algorithms at a high level.

As an experienced machine learning practitioner, I'm excited to share my knowledge and insights with you.

What You'll Learn

  • Chapter 1: What is Machine Learning
  • Chapter 2: Most popular Machine Learning algorithms
  • 2.1 Linear Regression and Ordinary Least Squares (OLS)
  • 2.2 Logistic Regression and MLE
  • 2.3 Linear Discriminant Analysis (LDA)
  • 2.4 Logistic Regression vs LDA
  • 2.5 Naïve Bayes
  • 2.6 Naïve Bayes vs Logistic Regression
  • 2.7 Decision Trees
  • 2.8 Bagging
  • 2.9 Random Forest
  • 2.10 Boosting or Ensemble Techniques (AdaBoost, GBM, XGBoost)
  • Chapter 3: Feature Selection
  • 3.1 Subset Selection
  • 3.2 Regularization (Ridge and Lasso)
  • 3.3 Dimensionality Reduction (PCA)
  • Chapter 4: Resampling Technique
  • 4.1 Cross Validation: (Validation Set, LOOCV, K-Fold CV)
  • 4.2 Optimal k in K-Fold CV
  • 4.5 Bootstrapping
  • Chapter 5: Optimization Techniques
  • 5.1 Optimization Techniques: Batch Gradient Descent (GD)
  • 5.2 Optimization Techniques: Stochastic Gradient Descent (SGD)
  • 5.3 Optimization Techniques: SGD with Momentum
  • 5.4 Optimization Techniques: Adam Optimiser
  • 6.1 Key Takeaways & What Comes Next
  • 6.2 About the Author - That’s Me!
  • 6.3 How Can You Dive Deeper?
  • 6.4 Connect with Me


Prerequisites

To make the most out of this handbook, it'll be helpful if you're familiar with some core ML concepts:

Basic Terminology:

  • Training Data & Test Data: Datasets used to train and evaluate models.
  • Features: Variables aiding in predictions, also called independent variables
  • Target Variable: The predicted outcome, also called dependent variable or response variable

Overfitting Problem in Machine Learning

Understanding Overfitting, how it's related to Bias-Variance Tradeoff, and how you can fix it is very important. We will look at regularization techniques in detail in this guide, too. For a detailed understanding, refer to:

1*sHhtYhaCe2Uc3IU0IgKwIQ

Foundational Readings for Beginners

If you have no prior statistical knowledge and wish to learn or refresh your understanding of essential statistical concepts, I'd recommend this article: Fundamental Statistical Concepts for Data Science

For a comprehensive guide on kickstarting a career in Data Science and AI, and insights on securing a Data Science job, you can delve into my previous handbook: Launching Your Data Science & AI Career

Tools/Languages to use in Machine Learning

As a Machine Learning Researcher or Machine Learning Engineer, there are many technical tools and programming languages you might use in your day-to-day job. But for today and for this handbook, we'll use the following programming language and tools:

  • Python Basics: Variables, data types, structures, and control mechanisms.
  • Essential Libraries: numpy , pandas , matplotlib ,   scikit-learn , xgboost
  • Environment: Familiarity with Jupyter Notebooks  or PyCharm as IDE.

Embarking on this Machine Learning journey with a solid foundation ensures a more profound and enlightening experience.

Now, shall we?

Machine Learning (ML), a branch of artificial intelligence (AI), refers to a computer's ability to autonomously learn from data patterns and make decisions without explicit programming. Machines use statistical algorithms to enhance system decision-making and task performance.

At its core, ML is a method where computers improve at tasks by learning from data. Think of it like teaching computers to make decisions by providing them examples, much like showing pictures to teach a child to recognize animals.

For instance, by analyzing buying patterns, ML algorithms can help online shopping platforms recommend products (like how Amazon suggests items you might like).

Or consider email platforms that learn to flag spam through recognizing patterns in unwanted mails. Using ML techniques, computers quietly enhance our daily digital experiences, making recommendations more accurate and safeguarding our inboxes.

On this journey, you'll unravel the fascinating world of ML, one where technology learns and grows from the information it encounters. But before doing so, let's look into some basics of Machine Learning you must know to understand any sort of Machine Learning model.

Types of Learning in Machine Learning:

There are three main ways models can learn:

  • Supervised Learning: Models predict from labeled data (you have both features and labels, X and Y)
  • Unsupervised Learning: Models identify patterns autonomously, where you don't have labeled data (you only have features and no response variable, only X)
  • Reinforcement Learning: Algorithms learn via action feedback.

Model Evaluation Metrics:

In Machine Learning, whenever you are training a model you always must evaluate it. And you'll want to use the most common type of evaluation metrics depending on the nature of your problem.

Here are the most common ML model evaluation metrics per model type:

1. Regression Metrics:

  • MAE, MSE, RMSE: Measure differences between predicted and actual values.
  • R-Squared: Indicates variance explained by the model.

2. Classification Metrics:

  • Accuracy: Percentage of correct predictions.
  • Precision, Recall, F1-Score: Assess prediction quality.
  • ROC Curve, AUC: Gauge model's discriminatory power.
  • Confusion Matrix: Compares actual vs. predicted classifications.

3. Clustering Metrics:

  • Silhouette Score: Gauges object similarity within clusters.
  • Davies-Bouldin Index: Assesses cluster separation.


Chapter 2: Most Popular Machine Learning Algorithms

In this chapter, we'll simplify the complexity of essential Machine Learning (ML) algorithms. This will be a valuable resource for roles ranging from Data Scientists and Machine Learning Engineers to AI Researchers.

We'll start with basics in 2.1 with Linear Regression and Ordinary Least Squares (OLS), then go into 2.2 which explores Logistic Regression and Maximum Likelihood Estimation (MLE).

Section 2.3 explores Linear Discriminant Analysis (LDA), which is contrasted with Logistic Regression in 2.4. We get into Naïve Bayes in 2.5, offering a comparative analysis with Logistic Regression in 2.6.

In 2.7, we go through Decision Trees, subsequently exploring ensemble methods: Bagging in 2.8, and Random Forest in 2.9. Various and popular Boosting techniques unfold in the following segments, discussing AdaBoost in 2.10, Gradient Boosting Model (GBM) in 2.11, and concluding with Extreme Gradient Boosting (XGBoost) in 2.12.

All the algorithms we'll discuss here are fundamental and popular in the field, and every Data Scientist, Machine Learning Engineer, and AI researcher must know them at least at this high level.

Note that we will not delve into unsupervised learning techniques here, or enter into granular details of each algorithm.

2.1 Linear Regression

When the relationship between two variables is linear, you can use the Linear Regression statistical method. It can help you model the impact of a unit change in one variable, the independent variable , on the values of another variable, the dependent variable .

Dependent variables are often referred to as response variables or explained variables, whereas independent variables are often referred to as regressors or explanatory variables.

When the Linear Regression model is based on a single independent variable, then the model is called Simple Linear Regression . But when the model is based on multiple independent variables, it’s referred to as Multiple Linear Regression .

Simple Linear Regression can be described by the following expression:

\(Y = \beta_0 + \beta_1 X + u\)

where Y is the dependent variable, X is the independent variable which is part of the data, β0 is the intercept which is unknown and constant, and β1 is the slope coefficient or a parameter corresponding to the variable X which is unknown and constant as well. Finally, u is the error term that the model makes when estimating the Y values.

The main idea behind linear regression is to find the best-fitting straight line, the regression line, through a set of paired ( X, Y ) data. One example of the Linear Regression application is modeling the impact of flipper length on penguins’ body mass , which is visualized below:

Image Source: The Author

Multiple Linear Regression with three independent variables can be described by the following expression:

\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + u\)

where Y is the dependent variable, X1, X2, X3 are the independent variables which are part of the data, β0 is the intercept which is unknown and constant, and β1, β2, β3 are the slope coefficients, or parameters corresponding to the variables X1, X2, X3, which are unknown and constant as well. Finally, u is the error term that the model makes when estimating the Y values.

2.1.1 Ordinary Least Squares

Ordinary least squares (OLS) is a method for estimating the unknown parameters such as β0 and β1 in a linear regression model. The method is based on the principle of least squares: it minimizes the sum of squares of the differences between the observed dependent variable and its values predicted by the linear function of the independent variable, often referred to as fitted values .

This difference between the real and predicted values of dependent variable Y is referred to as residual . What OLS does is minimize the sum of squared residuals. This optimization problem results in the following OLS estimates for the unknown parameters β0 and β1 which are also known as coefficient estimates .

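\(\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\)

\(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\)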

Once these parameters of the Simple Linear Regression model are estimated, the fitted values of the response variable can be computed as follows:

\(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\)

Standard Error

The residuals or the estimated error terms can be determined as follows:

\(\hat{u}_i = Y_i - \hat{Y}_i\)

It is important to keep in mind the difference between the error terms and residuals. Error terms are never observed, while the residuals are calculated from the data. OLS gives an estimate of the error term for each observation (the residual), but not the actual error term. So, the true error variance is still unknown.

Also, these estimates are subject to sampling uncertainty. What this means is that we will never be able to determine the true value of these parameters from sample data in an empirical application. But we can estimate the error variance by calculating the sample residual variance.
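To make the OLS estimates, fitted values and residuals concrete, here is a small NumPy sketch. It uses the standard closed-form estimates for simple linear regression (slope equal to the sample covariance of X and Y divided by the sample variance of X) and a residual variance with n − 2 degrees of freedom. The data values are made up.

```python
import numpy as np

# Made-up (X, Y) sample, for illustration only.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = X.mean(), Y.mean()
beta1_hat = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

fitted = beta0_hat + beta1_hat * X       # fitted values
residuals = Y - fitted                   # estimated error terms
# Sample residual variance with n - 2 degrees of freedom (two estimated parameters).
sigma2_hat = np.sum(residuals ** 2) / (len(X) - 2)

print(beta0_hat, beta1_hat, sigma2_hat)
```

With an intercept in the model, the residuals sum to zero up to rounding, which is a quick sanity check on the computation.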

2.1.2 OLS Assumptions

The OLS estimation method makes the following assumptions which need to be satisfied to get reliable prediction results:

  • Assumption 1 (A1): the Linearity assumption states that the model is linear in parameters.
  • A2: the Random Sample assumption states that all observations in the sample are randomly selected.
  • A3: the Exogeneity assumption states that independent variables are uncorrelated with the error terms.
  • A4: the Homoskedasticity assumption states that the variance of all error terms is constant.
  • A5: the No Perfect Multi-Collinearity assumption states that none of the independent variables is constant and there are no exact linear relationships between the independent variables.

Note that the above description for Linear Regression is from my article named Complete Guide to Linear Regression .

For a detailed article on Linear Regression, check out this post:

1*dVP5uUuCxu35duUhU6Bi2g

2.1.3 Linear Regression in Python

Imagine you have a friend, Alex, who collects stamps. Every month, Alex buys a certain number of stamps, and you notice that the amount Alex spends seems to depend on the number of stamps bought.

Now, you want to create a little tool that can predict how much Alex will spend next month based on the number of stamps bought. This is where Linear Regression comes into play.

In technical terms, we're trying to predict the dependent variable (amount spent) based on the independent variable (number of stamps bought).

Below is some simple Python code using scikit-learn to perform Linear Regression on a created dataset.

  • Sample Data : stamps_bought represents the number of stamps Alex bought each month and amount_spent represents the corresponding money spent.
  • Creating and Training Model : Using LinearRegression() from scikit-learn to create and train our model using .fit() .
  • Predictions : Use the trained model to predict the amount Alex will spend for a given number of stamps. In the code, we predict the amount for 10 stamps.
  • Plotting : We plot the original data points (in blue) and the predicted line (in red) to visually understand our model’s prediction capability.
  • Displaying Prediction : Finally, we print out the predicted spending for a specific number of stamps (10 in this case).
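A minimal sketch of the steps above is given here. The data values are invented for illustration, the variable names follow the description, and the plotting step is left out so that the snippet runs without a display.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical sample data: stamps bought per month and the money spent.
stamps_bought = np.array([1, 3, 5, 7, 9]).reshape(-1, 1)  # 2D: one feature column
amount_spent = np.array([2, 6, 8, 12, 18])

# Create and train the model.
model = LinearRegression()
model.fit(stamps_bought, amount_spent)

# Predict the spending for 10 stamps.
next_month_stamps = 10
predicted_spend = model.predict([[next_month_stamps]])[0]

print(f"Slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")
print(f"Predicted spend for {next_month_stamps} stamps: {predicted_spend:.2f}")
```

scikit-learn expects the features as a 2D array, hence the `reshape(-1, 1)` on the single feature.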


‌2.2 Logistic Regression

Another very popular Machine Learning technique is Logistic Regression which, though named regression, is actually a supervised classification technique .

Logistic regression is a Machine Learning method that models conditional probability of an event occurring or observation belonging to a certain class, based on a given dataset of independent variables.

When the relationship between two variables is linear and the dependent variable is a categorical variable, you may want to predict a variable in the form of a probability (number between 0 and 1). In these cases, Logistic Regression comes in handy.

This is because during the prediction process in Logistic Regression, the classifier predicts the probability (a value between 0 and 1) of each observation belonging to a certain class, usually one of the two classes of the dependent variable.

For instance, if you want to predict the probability or likelihood that a candidate will be elected or not during an election given the candidate's popularity score, past successes, and other descriptive variables about that candidate, you can use Logistic Regression to model this probability.

So, rather than predicting the response variable, Logistic Regression models the probability that Y belongs to a particular category.

It's similar to Linear Regression, with the difference being that instead of Y it predicts the log odds. In statistical terminology, we model the conditional distribution of the response Y , given the predictor(s) X . So LR helps to predict the probability of Y belonging to a certain class (0 or 1) given the features, P(Y|X=x) .

The name Logistic in Logistic Regression comes from the function this approach is based upon, the Logistic Function . The Logistic Function makes sure that even for very large and very small input values, the corresponding probability stays within the [0,1] bounds.

\(P(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\)

In the equation above, P(X) stands for the probability of Y belonging to a certain class (0 or 1) given the features, P(Y|X=x). X stands for the independent variable, β0 is the intercept which is unknown and constant, and β1 is the slope coefficient or parameter corresponding to the variable X, which is unknown and constant as well, similar to Linear Regression. e stands for the exponential function, exp().

Odds and Log Odds

Logistic Regression and its estimation technique, MLE, are based on the terms Odds and Log Odds. Odds is defined as follows:

\(\text{odds} = \frac{P(X)}{1 - P(X)} = e^{\beta_0 + \beta_1 X}\)

and Log Odds is defined as follows:

\(\log(\text{odds}) = \log\left(\frac{P(X)}{1 - P(X)}\right) = \beta_0 + \beta_1 X\)
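A short numeric check of how these quantities relate, taking the odds to be P(X) / (1 − P(X)) and the log odds to be its natural logarithm (the usual definitions). The coefficient values are hypothetical.

```python
import math

def logistic(z):
    """Logistic function; 1 / (1 + exp(-z)) is the same as exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

beta0, beta1 = -3.0, 0.5   # hypothetical coefficients
x = 8.0

z = beta0 + beta1 * x      # linear predictor
p = logistic(z)            # P(X)
odds = p / (1.0 - p)
log_odds = math.log(odds)  # recovers the linear predictor z

print(p, odds, log_odds)
```

The log odds come out equal to the linear predictor, which is why Logistic Regression is said to be linear in the log odds.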

2.2.1 Maximum Likelihood Estimation (MLE)

While for Linear Regression, we use OLS (Ordinary Least Squares) or LS (Least Squares) as an estimation technique, for Logistic Regression we should use another estimation technique.

We can’t use LS in Logistic Regression to find the best fitting line (to perform estimation) because the fitted values can then become very large or very small (even negative), while in the case of Logistic Regression we aim for a predicted value in [0,1].

So for Logistic Regression we use the MLE technique, where the likelihood function calculates the probability of observing the outcome given the input data and the model. This function is then optimised to find the set of parameters that results in the largest sum likelihood over the training dataset.

1*u7XLRKF3BVsvyF5zXLxEsg

The logistic function will always produce an S-shaped curve like the one above, regardless of the value of the independent variable X, resulting in a sensible estimate most of the time.

2.2.2 Logistic Regression Likelihood Function(s)

The Likelihood function can be expressed as follows:

\(L(\beta_0, \beta_1) = \prod_{i=1}^{n} P(x_i)^{y_i} \, (1 - P(x_i))^{1 - y_i}\)

So the Log Likelihood function can be expressed as follows:

\(\ell(\beta_0, \beta_1) = \log \prod_{i=1}^{n} P(x_i)^{y_i} \, (1 - P(x_i))^{1 - y_i}\)

or, after transformation from multipliers to summation, we get:

\(\ell(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i \log P(x_i) + (1 - y_i) \log(1 - P(x_i)) \right]\)

Then the idea behind the MLE is to find a set of estimates that would maximize this likelihood function.

  • Step 1: Project the data points onto a candidate line that produces a sample log (odds) value.
  • Step 2: Transform sample log (odds) to sample probabilities by using the following formula:

\(p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}\)

  • Step 3: Obtain the overall likelihood or overall log likelihood.
  • Step 4: Rotate the log (odds) line again and again, until you find the optimal log (odds) maximizing the overall likelihood
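The four steps can be sketched with plain gradient ascent on the log likelihood. This is one simple way to carry out the search, not necessarily what a given library does internally; the data, learning rate and iteration count are assumptions.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Bernoulli log likelihood of a logistic model with parameters beta."""
    z = X @ beta
    # Uses y*log(p) + (1-y)*log(1-p) = y*z - log(1 + e^z), written in a stable form.
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def fit_logistic_mle(X, y, learning_rate=0.1, n_iter=5000):
    """Maximize the log likelihood by plain gradient ascent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # steps 1-2: log odds -> probabilities
        gradient = X.T @ (y - p) / len(y)       # slope of the average log likelihood
        beta += learning_rate * gradient        # step 4: adjust the line, then repeat
    return beta

# Made-up, non-separable data; the first column of ones carries the intercept.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([0, 0, 1, 0, 0, 1, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])

beta_hat = fit_logistic_mle(X, y)
print(beta_hat, log_likelihood(beta_hat, X, y))   # step 3: overall log likelihood
```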

2.2.3 Cut off value in Logistic Regression

If you plan to use Logistic Regression to obtain a binary {0,1} value in the end, then you need a cut-off point to transform the estimated value per observation from the range [0,1] to a value of 0 or 1.

Depending on your individual case you can choose a corresponding cut-off point, but a popular cut-off point is 0.5. In this case, all observations with a predicted value smaller than 0.5 will be assigned to class 0 and observations with a predicted value larger than or equal to 0.5 will be assigned to class 1.
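In code, applying a cut-off is a single comparison. The probabilities below are made-up model outputs.

```python
import numpy as np

predicted_probabilities = np.array([0.10, 0.49, 0.50, 0.73, 0.95])  # made-up model outputs
cut_off = 0.5

# Values greater than or equal to the cut-off go to class 1, the rest to class 0.
predicted_classes = (predicted_probabilities >= cut_off).astype(int)
print(predicted_classes)
```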

2.2.4 Performance Metrics in Logistic Regression

Since Logistic Regression is a classification method, common classification metrics such as recall, precision, and the F1 measure can all be used. Another metric commonly used for assessing the performance of a Logistic Regression model is called Deviance .

2.2.5 Logistic Regression in Python

Jenny is an avid book reader. Jenny reads books of different genres and maintains a little journal where she notes down the number of pages and whether she liked the book (Yes or No).

We see a pattern: Jenny typically enjoys books that are neither too short nor too long. Now, can we predict whether Jenny will like a book based on its number of pages? This is where Logistic Regression can help us!

In technical terms, we're trying to predict a binary outcome (like/dislike) based on one independent variable (number of pages).

Here's a simplified Python example using scikit-learn to implement Logistic Regression:

  • Sample Data : pages represents the number of pages in the books Jenny has read, and likes represents whether she liked them (1 for like, 0 for dislike).
  • Creating and Training Model : We instantiate LogisticRegression() and train the model using .fit() with our data.
  • Predictions : We predict whether Jenny will like a book with a particular number of pages (260 in this example).
  • Plotting : We visualize the original data points (in blue) and the predicted probability curve (in red). The green dashed line represents the page number we’re predicting for, and the grey dashed line indicates the threshold (0.5) above which we predict a "like".
  • Displaying Prediction : We output whether Jenny will like a book of the given page number based on our model's prediction.
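A minimal sketch of the steps above follows. The journal entries are invented for illustration, the variable names follow the description, and the plotting step is left out so that the snippet runs without a display.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical journal entries: page counts and whether Jenny liked the book (1) or not (0).
pages = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500]).reshape(-1, 1)
likes = np.array([0, 1, 1, 1, 0, 0, 0, 0, 0])

# Create and train the model.
model = LogisticRegression()
model.fit(pages, likes)

# Predict for a book with 260 pages.
book_pages = 260
probability_like = model.predict_proba([[book_pages]])[0, 1]  # estimated P(like | pages)
prediction = model.predict([[book_pages]])[0]                 # class label at the 0.5 threshold

print(f"P(like) for {book_pages} pages: {probability_like:.2f}")
print("Jenny will like the book." if prediction == 1 else "Jenny will not like the book.")
```

Note that with a single feature the fitted probability is monotone in the page count, so this model can only express a preference for shorter or for longer books, not for books of medium length.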

Screenshot-2023-10-20-at-8.44.09-PM

‌2.3 Linear Discriminant Analysis (LDA)

Another classification technique, closely related to Logistic Regression, is Linear Discriminant Analysis (LDA). Where Logistic Regression is usually used to model the probability of an observation belonging to a class of an outcome variable with 2 categories, LDA is usually used to model the probability of an observation belonging to a class of an outcome variable with 3 or more categories.

LDA offers an alternative approach to model the conditional likelihood of the outcome variable given that set of predictors that addresses the issues of Logistic Regression. It models the distribution of the predictors X separately in each of the response classes (that is, given Y ), and then uses Bayes’ theorem to flip these two around into estimates for Pr(Y = k|X = x).

Note that in the case of LDA these distributions are assumed to be normal. It turns out that the model is very similar in form to logistic regression. In the equation here:

$$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

π_k represents the overall prior probability that a randomly chosen observation comes from the k th class. f_k(x) , which is equal to Pr(X = x|Y = k), is the density function of X for an observation that comes from the k th class (the density function of the predictors). It is the class-conditional likelihood, not the posterior probability.

This is the probability of X = x given that the observation is from a certain class. The quantity we want, Pr(Y = k|X = x), is the posterior probability : the probability that the observation belongs to the k th class, given the predictor value for that observation.

Assuming that f_k(x) is Normal or Gaussian, the normal density takes the following form (this is the one-dimensional setting):

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$$

where μ_k and σ_k² are the mean and variance parameters for the k th class. Assume further that σ_1² = · · · = σ_K², that is, there is a shared variance term across all K classes, which we denote by σ².

LDA then approximates the Bayes classifier by using the following estimates for π_k, μ_k, and σ²:

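$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)^2, \qquad \hat{\pi}_k = \frac{n_k}{n}$$

where n is the total number of training observations and n_k is the number of training observations in the k th class.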
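To make the estimates for π_k, μ_k, and σ² concrete, here is a small NumPy sketch for the one-dimensional case. The data is made up, and the discriminant function used for classification (x·μ_k/σ² − μ_k²/(2σ²) + log π_k) is the standard one-dimensional LDA rule, which is an addition not spelled out in the text.

```python
import numpy as np

# Hypothetical one-dimensional training data with two classes
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
n, K = len(x), len(classes)

# Plug-in estimates: class priors, class means, and the shared (pooled) variance
pi_hat = {k: np.mean(y == k) for k in classes}
mu_hat = {k: x[y == k].mean() for k in classes}
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in classes) / (n - K)

def discriminant(x0, k):
    """One-dimensional LDA discriminant score for class k at point x0."""
    return x0 * mu_hat[k] / sigma2_hat - mu_hat[k] ** 2 / (2 * sigma2_hat) + np.log(pi_hat[k])

def predict(x0):
    """Assign x0 to the class with the largest discriminant score."""
    return max(classes, key=lambda k: discriminant(x0, k))

print("priors:", pi_hat, "means:", mu_hat, "shared variance:", sigma2_hat)
print("predicted class for x = 5.0:", predict(5.0))
```

With equal priors, the decision boundary in this sketch sits halfway between the two class means.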

2.3.1 Linear Discriminant Analysis in Python

Imagine Sarah, who loves cooking and trying various fruits. She sees that the fruits she likes are typically of specific sizes and sweetness levels.

Now, Sarah is curious: can she predict whether she will like a fruit based on its size and sweetness? Let's use Linear Discriminant Analysis (LDA) to help her predict whether she'll like certain fruits or not.

In technical language, we are trying to classify the fruits (like/dislike) based on two predictor variables (size and sweetness).

  • Sample Data : fruits_features contains two features – size and sweetness of fruits, and fruits_likes represents whether Sarah likes them (1 for like, 0 for dislike).
  • Creating and Training Model : We instantiate LinearDiscriminantAnalysis() and train it using .fit() with our sample data.
  • Prediction : We predict whether Sarah will like a fruit with a particular size and sweetness level ([2.5, 6] in this example).
  • Plotting : We visualize the original data points, color-coded based on Sarah’s like (yellow) and dislike (purple), and mark the new fruit with a red 'x'.
  • Displaying Prediction : We output whether Sarah will like a fruit with the given size and sweetness level based on our model's prediction.
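A minimal sketch of these steps might look like the following. The numbers in fruits_features and fruits_likes are invented for illustration, and the plotting step is omitted.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical sample data: [size, sweetness] for each fruit
fruits_features = np.array([
    [1.0, 2.0], [1.5, 3.0], [2.0, 2.5], [1.0, 3.5],   # fruits Sarah disliked
    [3.0, 7.0], [3.5, 8.0], [4.0, 7.5], [3.0, 8.5],   # fruits Sarah liked
])
fruits_likes = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Create and train the model
model = LinearDiscriminantAnalysis()
model.fit(fruits_features, fruits_likes)

# Predict for a new fruit with size 2.5 and sweetness 6
new_fruit = np.array([[2.5, 6.0]])
prediction = model.predict(new_fruit)[0]

print("Sarah will", "like" if prediction == 1 else "not like", "this fruit.")
```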

Screenshot-2023-10-20-at-8.48.44-PM

Logistic regression is a popular approach for performing classification when there are two classes. But when the classes are well-separated or the number of classes exceeds 2, the parameter estimates for the logistic regression model are surprisingly unstable.

Unlike Logistic Regression, LDA does not suffer from this instability problem when the number of classes is more than 2. If n is small and the distribution of the predictors X is approximately normal in each of the classes, LDA is again more stable than the Logistic Regression model.

‌ 2.5 Naïve Bayes

Another classification method that relies on Bayes Rule , like LDA, is the Naïve Bayes classification approach. For more about Bayes Theorem, Bayes Rule and a corresponding example, you can read these articles .

Like Logistic Regression, you can use the Naïve Bayes approach to classify an observation into one of two classes (0 or 1).

The idea behind this method is to calculate the probability of an observation belonging to a class, given the prior probability for that class and the conditional probability of each feature value given that class. That is:

image-49

where Y stands for the class of the observation, k is the k th class, and x1, …, xn stand for feature 1 through feature n, respectively. f_k(x) = Pr(X = x|Y = k) is the class-conditional density which, as in the case of LDA, is the density function of X for an observation that comes from the k th class (the density function of the predictors).

If you compare the above expression with the one you saw for LDA, you will see some similarities.

In LDA, we make a very important and strong assumption for simplification purposes: namely, that f_k is the density function for a multivariate normal random variable with class-specific mean μ_k and shared covariance matrix Σ.

This assumption helps to replace the very challenging problem of estimating K p-dimensional density functions with a much simpler problem, which is to estimate K p-dimensional mean vectors and one (p × p)-dimensional covariance matrix.

The Naïve Bayes Classifier uses a different approach for estimating f_1(x), . . . , f_K(x). Instead of assuming that these functions belong to a particular family of distributions (for example normal or multivariate normal), we make a single assumption: within the k th class, the p predictors are independent. That is:

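$$f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$$

where f_kj is the density function of the j th predictor among observations in the k th class.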

So the Naïve Bayes classifier assumes that the value of a particular variable or feature is independent of the value of any other variable, given the class/label variable. Note that independence is a stronger condition than being uncorrelated.

For instance, a fruit may be considered to be a banana if it is yellow, oval shaped, and about 5–10 cm long. So, the Naïve Bayes classifier considers that each of these various features of fruit contribute independently to the probability that this fruit is a banana, independent of any possible correlation between the colour, shape, and length features.

Naïve Bayes Estimation

Like Logistic Regression, the Naïve Bayes classification approach uses Maximum Likelihood Estimation (MLE) as its estimation technique. There is a great article providing a detailed, concise summary of this approach with a corresponding example, which you can find here .

2.5.1 Naïve Bayes in Python

Tom is a movie enthusiast who watches films across different genres and records his feedback—whether he liked them or not. He has noticed that whether he likes a film might depend on two aspects: the movie's length and its genre. Can we predict whether Tom will like a movie based on these two characteristics using Naïve Bayes?

Technically, we want to predict a binary outcome (like/dislike) based on the independent variables (movie length and genre).

  • Sample Data : movies_features contains two features: movie length and genre (encoded as numbers), while movies_likes indicates whether Tom likes them (1 for like, 0 for dislike).
  • Creating and Training Model : We instantiate GaussianNB() (a Naïve Bayes classifier assuming Gaussian distribution of data) and train it with .fit() using our data.
  • Prediction : We predict whether Tom will like a new movie, given its length and genre code ([100, 1] in this case).
  • Plotting : We visualize the original data points, color-coded based on Tom’s like (yellow) and dislike (purple). The red 'x' represents the new movie.
  • Displaying Prediction : We print whether Tom will like a movie of the given length and genre code, as per our model's prediction.
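The steps above could be sketched like this. The contents of movies_features and movies_likes are hypothetical, and plotting is omitted. Note that GaussianNB treats the numeric genre code as a Gaussian feature, which is a simplification for a categorical variable.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical sample data: [length in minutes, genre code]
movies_features = np.array([
    [90, 0], [95, 1], [100, 0], [105, 1],     # movies Tom liked
    [150, 0], [160, 1], [170, 0], [180, 1],   # movies Tom disliked
])
movies_likes = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Create and train the model
model = GaussianNB()
model.fit(movies_features, movies_likes)

# Predict for a new movie: 100 minutes long, genre code 1
new_movie = np.array([[100, 1]])
prediction = model.predict(new_movie)[0]

print("Tom will", "like" if prediction == 1 else "not like", "this movie.")
```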

Screenshot-2023-10-20-at-8.51.54-PM

The Naïve Bayes Classifier is typically faster to train and has a higher bias and lower variance, while Logistic Regression has a lower bias and higher variance. Depending on your individual case, and the bias-variance trade-off , you can pick the corresponding approach.

image-52

Decision Trees are a supervised and non-parametric Machine Learning method used for both classification and regression purposes. The idea is to create a model that predicts the value of a target variable by learning simple decision rules from the data predictors.

Unlike Linear Regression, or Logistic Regression, Decision Trees are simple and useful model alternatives when the relationship between independent variables and dependent variable is suspected to be non-linear.

Tree-based methods stratify or segment the predictor space into smaller regions. The idea behind building Decision Trees is to divide the predictor space X_1, X_2, …, X_p into distinct and mutually exclusive regions R_1, R_2, …, R_N, where the regions are in the form of boxes or rectangles. These regions are found by recursive binary splitting, since considering every possible partition of the predictor space to minimize the RSS is computationally infeasible. This approach is often referred to as a greedy approach.

Decision trees are built by top-down splitting. So, in the beginning, all observations belong to a single region. Then, the model successively splits the predictor space. Each split is indicated via two new branches further down on the tree.

This approach is sometimes called greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

Stopping Criteria

There are some common stopping criteria used when building Decision Trees:

  • Minimum number of observations in the leaf.
  • Minimum number of samples for a node split.
  • Maximum depth of tree (vertical depth).
  • Maximum number of terminal nodes.
  • Maximum features to consider for the split.

Screenshot-2023-10-20-at-8.05.15-PM

For example, repeat this splitting process until no region contains more than 100 observations. Let's dive deeper:

1. Minimum number of observations in the leaf: If a proposed split results in a leaf node with fewer than a defined number of observations, that split might be discarded. This prevents the tree from becoming overly complex.

2. Minimum number of samples for a node split: To proceed with a node split, the node must have at least this many samples. This ensures that there's a significant amount of data to justify the split.

3. Maximum depth of tree (vertical depth): This limits how many times a tree can split. It's like telling the tree how many questions it can ask about the data before making a decision.

4. Maximum number of terminal nodes: This is the total number of end nodes (or leaves) the tree can have.

5. Maximum features to consider for the split: For each split, the algorithm considers only a subset of features. This can speed up training and help in generalization.

When building a decision tree, especially when dealing with a large number of features, the tree can become too big, with too many leaves. This will affect the interpretability of the model and might result in an overfitting problem. Therefore, picking good stopping criteria is essential for the interpretability and the performance of the model.

RSS/Gini Index/Entropy/Node Purity

When building the tree, we use RSS (for Regression Trees) and Gini Index/Entropy (for Classification Trees) for picking the predictor and value for splitting the regions. Both Gini Index and Entropy are often called Node Purity measures because they describe how pure the leaves of the tree are.

image-53

The Gini index measures the total variance across the K classes. It takes a small value when all class proportions are close to either 0 or 1. This is also why it’s called a measure of node purity: the Gini index takes small values when the nodes of the tree contain predominantly observations from the same class.

The Gini index is defined as follows:

$$G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$$

where p̂_mk represents the proportion of training observations in the m th region that are from the k th class.

Entropy is another node purity measure, and like the Gini index, the entropy will take on a small value if the m th node is pure. In fact, the Gini index and the entropy are quite similar numerically. The entropy can be expressed as follows:

$$D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$
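As a quick numerical illustration, both purity measures can be computed from the class proportions of a node. This is a sketch using the usual definitions, Gini = Σ_k p̂_mk (1 − p̂_mk) and entropy = −Σ_k p̂_mk log p̂_mk with the natural logarithm; the function names are chosen here for illustration.

```python
import numpy as np

def gini_index(proportions):
    """Gini index of a node given its class proportions (which should sum to 1)."""
    p = np.asarray(proportions, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def entropy(proportions):
    """Entropy of a node (natural log); terms with p = 0 contribute 0."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

pure_node = [1.0, 0.0, 0.0]    # all observations from one class
mixed_node = [0.5, 0.5]        # evenly split between two classes

print("pure node:  gini =", gini_index(pure_node), " entropy =", entropy(pure_node))
print("mixed node: gini =", gini_index(mixed_node), " entropy =", entropy(mixed_node))
```

Both measures are zero for a pure node and grow as the classes become more evenly mixed.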

Decision Tree Classification Example

Let’s look at an example where we have three features describing consumers' past behaviour:

  • Recency (How recent was the customer’s last purchase?)
  • Monetary (How much money did the customer spend in a given period?)
  • Frequency (How often did this customer make a purchase in a given period?)

We will use the classification version of the Decision Tree to classify customers to 1 of the 3 classes (Good: 1, Better: 2 and Best: 3), given the features describing the customer's behaviour.

In the following tree, where we use the Gini Index as a purity measure, we see that the first feature, which seems to be the most important one, is Recency. Let's look at the tree and then interpret it:

image-56

For customers who have a Recency of 202 or larger (their last purchase was more than 202 days ago), the chance of being assigned to class 1 is 93% (basically, we can label those customers as Good Class customers).

For customers with Recency less than 202 (they made a purchase recently), we look at their Monetary value, and if it's smaller than 1394, we then look at their Frequency. If the Frequency is smaller than 44, we can label this customer's class as Better (class 2). And so on.

Decision Trees Python Implementation

Alex is intrigued by the relationship between the number of hours studied and the scores obtained by students. Alex collected data from his peers about their study hours and respective test scores.

He wonders: can we predict a student's score based on the number of hours they study? Let's leverage Decision Tree Regression to uncover this.

Technically, we're predicting a continuous outcome (test score) based on an independent variable (study hours).

  • Sample Data : study_hours contains hours studied, and test_scores contains the corresponding test scores.
  • Creating and Training Model : We create a DecisionTreeRegressor with a specified maximum depth (to prevent overfitting) and train it with .fit() using our data.
  • Plotting the Decision Tree : plot_tree helps visualize the decision-making process of the model, representing splits based on study hours.
  • Prediction & Plotting : We predict the test score for a new study hour value (5.5 in this example), visualize the original data points, the decision tree’s predicted scores, and the new prediction.
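These steps could be sketched as follows. The values in study_hours and test_scores are made up, and the plot_tree and plotting steps are left out so the snippet stays short.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical sample data: hours studied and the corresponding test scores
study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float).reshape(-1, 1)
test_scores = np.array([50, 55, 60, 62, 70, 75, 80, 85, 88, 95], dtype=float)

# Create and train the model; max_depth limits the tree to reduce overfitting
model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(study_hours, test_scores)

# Predict the score for a new study-hour value
new_hours = np.array([[5.5]])
predicted_score = model.predict(new_hours)[0]

print(f"Predicted test score for 5.5 hours of study: {predicted_score:.1f}")
```

Because a regression tree predicts the average target value of the training points in a leaf, the prediction always lies within the range of the observed scores.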

Screenshot-2023-10-20-at-8.54.27-PM

The visualization depicts a decision tree model trained on study hours data. Each node represents a decision based on study hours, branching from the top root based on conditions that best forecast test scores. The process continues until reaching a maximum depth or no further meaningful splits. Leaf nodes at the bottom give final predictions, which for regression trees, are the average of target values for training instances reaching that leaf. This visualization highlights the model's predictive approach and the significant influence of study hours on test scores.

Screenshot-2023-10-20-at-8.54.43-PM

The "Study Hours vs. Test Scores" plot illustrates the correlation between study hours and corresponding test scores. Actual data points are denoted by red dots, while the model's predictions are shown as an orange step function, characteristic of regression trees. A green "x" marker highlights a prediction for a new data point, here representing a 5.5-hour study duration. The plot's design elements, such as gridlines, labels, and legends, enhance comprehension of the real versus anticipated values.

image-58

One of the biggest disadvantages of Decision Trees is their high variance. You might end up with a model and predictions that are easy to explain but misleading. This could lead to drawing incorrect conclusions and making poor business decisions.

So to reduce the variance of the Decision trees, you can use a method called Bagging. To understand what Bagging is, there are two terms you need to know:

  • Bootstrapping
  • Central Limit Theorem (CLT)

You can find more about Bootstrapping, which is a resampling technique, later in this handbook. For now, you can think of Bootstrapping as a technique that performs sampling from the original data with replacement, which creates a copy of the data very similar to but not exactly the same as the original data.

Bagging is also based on the same ideas as the CLT which is one of the most important if not the most important theorem in Statistics. You can read in more detail about CLT here .

But the idea that is also used in Bagging is that if you take the average of many samples, then the variance is significantly reduced compared to the variance of each of the individual sample based models.

So, given a set of n independent observations Z_1, …, Z_n, each with variance σ², the variance of the mean Z̄ of the observations is given by σ²/n . So averaging a set of observations reduces variance.
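A short simulation can illustrate this variance reduction. It is only a sketch: the values of sigma, n, and the number of repetitions are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 2.0        # standard deviation of each observation
n = 25             # number of observations averaged together
n_repeats = 20000  # number of simulated samples

# Each row is one sample of n independent observations
samples = rng.normal(loc=0.0, scale=sigma, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

empirical_var = sample_means.var()
theoretical_var = sigma ** 2 / n

print(f"variance of a single observation: {sigma ** 2:.3f}")
print(f"empirical variance of the mean:   {empirical_var:.3f}")
print(f"theoretical sigma^2 / n:          {theoretical_var:.3f}")
```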

For more Statistical details, check out the following tutorial:

1*5gU4KwudRqY-vP0G2UpRZA

Bagging is basically Bootstrap aggregation: it builds B trees using bootstrapped samples. Bagging can be used to improve the precision (lower the variance) of many approaches by taking repeated samples from a single training data set.

So, in Bagging, we generate B bootstrapped training samples, on which B similar (correlated) trees are built. These trees are then aggregated to calculate the predictions, by taking the average of the predictions from these B trees. Notably, each tree is built on its own bootstrap data set, independently of the other trees.

In Bagging, all p features are considered at each tree split. This results in similar trees, since every time the strongest predictors end up at the top and the weak ones at the bottom, so all of the bagged trees look quite similar to each other.

2.8.1 Bagging in Regression Trees

To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets, and average the resulting predictions. These trees are grown deep, and are not pruned. So each individual tree has high variance, but low bias. Averaging these B trees reduces the variance.

2.8.2 Bagging in Classification Trees

For a given test observation, we can record the class predicted by each of the B trees, and take a majority vote : the overall prediction is the most commonly occurring class among the B predictions.

2.8.3 OOB Out-of-Bag Error Estimation

When Bagging is applied to decision trees, there is no longer a need to apply Cross Validation to estimate the test error rate. In bagging, we repeatedly fit the trees to Bootstrapped samples, and on average each tree uses only about 2/3 of the observations. The remaining 1/3 are not used to train that tree. These are called Out-of-bag observations.

So for the i th observation there are, in total, about B/3 trees that did not use it in training. We can predict its response with each of those trees and take the average of these predictions (or the majority class). Computing this OOB prediction for every observation and comparing it with the true response gives the OOB error, which serves as an estimate of the test error rate.
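The "about one third" figure can be checked with a quick simulation. For a bootstrap sample of size n, the chance that a given observation is left out is (1 − 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sample size and number of bootstrap samples below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000   # number of observations in the training set
B = 200    # number of bootstrap samples

oob_fractions = []
for _ in range(B):
    # Draw n indices with replacement, as a bootstrap sample does
    bootstrap_indices = rng.integers(0, n, size=n)
    n_in_bag = np.unique(bootstrap_indices).size
    oob_fractions.append(1 - n_in_bag / n)

mean_oob = float(np.mean(oob_fractions))
print(f"average out-of-bag fraction: {mean_oob:.3f}")
print(f"(1 - 1/n)^n:                 {(1 - 1 / n) ** n:.3f}")
```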

2.8.4 Bagging in Python

Meet Lucy, a fitness coach who is curious about predicting her clients’ weight loss based on their daily calorie intake and workout duration. Lucy has data from past clients but recognizes that individual predictions might be prone to errors. Let's utilize Bagging to create a more stable prediction model.

Technically, we'll predict a continuous outcome (weight loss) based on two independent variables (daily calorie intake and workout duration), using Bagging to reduce variance in predictions.

True weight loss: [2.  4.5] Predicted weight loss: [3.1  3.96] Mean Squared Error: 0.75

  • Sample Data : clients_data contains daily calorie intake and workout duration, and weight_loss contains the corresponding weight loss.
  • Train-Test Split : We split the data into training and test sets to validate the model's predictive performance.
  • Creating and Training Model : We instantiate BaggingRegressor with DecisionTreeRegressor as the base estimator and train it using .fit() with our training data.
  • Prediction & Evaluation : We predict weight loss for the test data, evaluating prediction quality with Mean Squared Error (MSE).
  • Visualizing One of the Base Estimators : Optionally, visualize one tree from the ensemble to understand individual decision-making processes (keeping in mind an individual tree may not perform well, but collectively they produce stable predictions).
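A rough sketch of these steps is shown below. The numbers in clients_data and weight_loss are invented, so the output will not match the figures quoted above, and the optional tree visualization is skipped. The base tree is passed positionally because the keyword name for it differs between scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical sample data: [daily calorie intake, workout duration in minutes]
clients_data = np.array([
    [2000, 60], [2500, 45], [1800, 75], [2200, 50],
    [2100, 62], [2300, 70], [1900, 55], [2000, 65],
])
weight_loss = np.array([3.0, 2.0, 4.5, 2.5, 3.0, 3.5, 3.3, 3.1])  # in kg

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    clients_data, weight_loss, test_size=0.25, random_state=42
)

# Bagging ensemble of 10 decision trees, each fit on a bootstrap sample
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10, random_state=42)
model.fit(X_train, y_train)

# Prediction and evaluation
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("True weight loss:", y_test)
print("Predicted weight loss:", y_pred)
print(f"Mean Squared Error: {mse:.2f}")
```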

Bagging

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees.

As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

The split is allowed to use only one of those m predictors. A fresh and random sample of m predictors is taken at each split, and typically we choose m ≈ √p — that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors. This is also the reason why Random Forest is called “random”.

The main difference between bagging and random forests is the choice of predictor subset size m: with m = p, a random forest is simply bagging, while a smaller m decorrelates the trees.

Using a small value of m in building a random forest will typically be helpful when we have a large number of correlated predictors. So, if you have a problem of multicollinearity, Random Forest can help mitigate it.

So, unlike in Bagging, in the case of Random Forest not all p predictors are considered at each tree split, but only m randomly selected predictors. This results in trees that are less similar to each other, that is, decorrelated. And because averaging decorrelated trees results in smaller variance, Random Forest is typically more accurate than Bagging.

2.9.1 Random Forest Python Implementation

Noah is a botanist who has collected data about various plant species and their characteristics, such as leaf size and flower color. Noah is curious if he could predict a plant’s species based on these features.

Here, we’ll utilize Random Forest, an ensemble learning method, to help him classify plants.

Technically, we aim to classify plant species based on certain predictor variables using a Random Forest model.

  • Sample Data : plants_features contains leaf size and flower color, while plants_species indicates the species of the respective plant.
  • Train-Test Split : We separate the data into training and test sets.
  • Creating and Training Model : We instantiate RandomForestClassifier with a specified number of trees (10 in this case) and train it using .fit() with our training data.
  • Prediction & Evaluation : We predict the species for the test data and evaluate the predictions using a classification report which provides precision, recall, f1-score, and support.
  • Visualizing Feature Importances : We utilize a horizontal bar chart to display the importance of each feature in predicting the plant species. Random Forest quantifies the usefulness of features during the tree-building process, which we visualize here.
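A minimal sketch of these steps follows. The plant data is synthetic and generated here purely for illustration (two species that differ in leaf size, with a flower color code unrelated to the species), and the bar chart is replaced by printing the importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 30 plants per species; leaf size differs, flower color does not
leaf_size = np.concatenate([rng.normal(3.0, 0.3, 30), rng.normal(6.0, 0.3, 30)])
flower_color = rng.integers(0, 3, size=60)          # color encoded as 0, 1, or 2
plants_features = np.column_stack([leaf_size, flower_color])
plants_species = np.array([0] * 30 + [1] * 30)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    plants_features, plants_species, test_size=0.25, random_state=0, stratify=plants_species
)

# Random forest with 10 trees
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

# Prediction and evaluation
y_pred = model.predict(X_test)
accuracy = float((y_pred == y_test).mean())
print(classification_report(y_test, y_pred))

# Feature importances (these sum to 1)
importances = model.feature_importances_
for name, value in zip(["leaf_size", "flower_color"], importances):
    print(f"{name}: {value:.3f}")
```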

Random-Forest

‌2.10 Boosting or Ensemble Models

Like Bagging (averaging correlated Decision Trees) and Random Forest (averaging decorrelated Decision Trees), Boosting aims to improve the predictions resulting from a decision tree. Boosting is a supervised Machine Learning model that can be used for both regression and classification problems.

Unlike Bagging or Random Forest, where the trees are built independently from each other using one of the B bootstrapped samples (copies of the initial training data), in Boosting the trees are built sequentially and depend on each other. Each tree is grown using information from previously grown trees.

Boosting does not involve bootstrap sampling. Instead, each tree fits on a modified version of the original data set. It’s a method of converting weak learners into strong learners.

In boosting, each new tree is fit on a modified version of the original data set. So, unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach learns slowly.

Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response. We then add this new decision tree into the fitted function in order to update the residuals.

Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm. Now let's have a look at the 3 most popular Boosting models in Machine Learning:

2.10.1 Boosting: AdaBoost

The first Ensemble algorithm we will look into today is AdaBoost. Like in all boosting techniques, in the case of AdaBoost each tree is built using information from the previous tree, and more specifically from the observations that the previous tree did not predict well. Each tree is a weak learner called a Decision Stump: a tree with a single split, built using only a single predictor and not all predictors to perform the prediction.

So, AdaBoost combines weak learners to make classifications and each stump is made by using the previous stump’s errors. Here is the step-by-step plan for building an AdaBoost model:

  • Step 1: Initial Weight Assignment – assign equal weight to all observations in the sample where this weight represents the importance of the observations being correctly classified: 1/N (all samples are equally important at this stage).
  • Step 2: Optimal Predictor Selection – The first stump is built by obtaining the RSS (in case of regression) or Gini Index/Entropy (in case of classification) for each predictor, and picking the stump that does the best job in terms of prediction accuracy: the stump with the smallest RSS or Gini/Entropy is selected as the next tree.
  • Step 3: Computing Stumps Weight based on Stumps Total Error – The importance of this stump in the final model is then determined using the total error that this stump is making. A stump that is no better than a random flip of a coin, with total error equal to 0.5, gets weight 0. Weight = 0.5 * log((1 - Total Error) / Total Error)
  • Step 4: Updating Observation Weights – We increase the weight of the observations which have been incorrectly predicted and decrease the weight of the remaining observations, which have been correctly classified, so that the next stump will place higher importance on correctly predicting the value of the previously misclassified observations.
  • Step 5: Building the next Stump based on updated weights – Using the Weighted Gini index to choose the next stump.
  • Step 6: Combining B stumps – Then all the stumps are combined while taking into account their importance, weighted sum.
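Steps 3 and 4 can be illustrated numerically. The sketch below assumes the common exponential update, where the weights of misclassified observations are multiplied by exp(weight) and the others by exp(−weight) before renormalizing; that exact update rule, the clipping constant, and the function names are assumptions made here rather than details given in the text.

```python
import numpy as np

def stump_weight(total_error, eps=1e-10):
    """Weight = 0.5 * log((1 - Total Error) / Total Error), with clipping to avoid log(0)."""
    total_error = np.clip(total_error, eps, 1 - eps)
    return 0.5 * np.log((1 - total_error) / total_error)

def update_weights(weights, misclassified, alpha):
    """Increase weights of misclassified observations, decrease the rest, then renormalize."""
    new_weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
    return new_weights / new_weights.sum()

# Step 1: five observations, each starting with weight 1/N
weights = np.full(5, 1 / 5)

# Suppose the first stump misclassifies only the third observation
misclassified = np.array([False, False, True, False, False])
total_error = weights[misclassified].sum()      # sum of weights of the errors

alpha = stump_weight(total_error)               # Step 3
new_weights = update_weights(weights, misclassified, alpha)  # Step 4

print("stump weight:", alpha)
print("updated observation weights:", new_weights)
```

In this sketch the misclassified observation ends up carrying half of the total weight, so the next stump is pushed to classify it correctly.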

AdaBoost Python Implementation

Imagine a scenario where we aim to predict house prices based on certain features like the number of rooms and age of the house.

For this example, let's generate synthetic data where:

  • num_rooms: The number of rooms in the house.
  • house_age: The age of the house in years.
  • price: The price of the house in thousand dollars.

image-79
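One way to sketch this example is shown below. The data-generating formula for price is an assumption made here for illustration. Note also that scikit-learn's AdaBoostRegressor uses a depth-3 regression tree as its default weak learner rather than a one-split stump.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data (the relationship below is hypothetical)
n_houses = 200
num_rooms = rng.integers(2, 8, size=n_houses)       # 2 to 7 rooms
house_age = rng.integers(0, 50, size=n_houses)      # 0 to 49 years
price = 50 + 30 * num_rooms - 0.8 * house_age + rng.normal(0, 10, size=n_houses)

X = np.column_stack([num_rooms, house_age])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.2, random_state=42)

# AdaBoost with 50 sequentially built weak learners
model = AdaBoostRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Baseline for comparison: always predict the mean training price
baseline_mse = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))

print(f"AdaBoost MSE: {mse:.1f}")
print(f"Baseline MSE: {baseline_mse:.1f}")
```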

2.10.2 Boosting Algorithm: Gradient Boosting Model (GBM)

AdaBoost and Gradient Boosting are very similar to each other. But whereas AdaBoost starts the process by building a stump and then builds each subsequent stump using the errors of the previous one, Gradient Boosting starts with a single leaf instead of a tree or a stump.

The outcome corresponding to this chosen leaf is then an initial guess for the outcome variable. Like in the case of AdaBoost, Gradient Boosting uses the previous tree’s errors to build the next tree. But unlike in AdaBoost, the trees that Gradient Boosting builds are larger than a stump. Their size is controlled by a parameter that sets the maximum number of leaves.

To make sure the tree is not overfitting, Gradient Boosting uses the Learning Rate to scale the gradient contributions. Gradient Boosting is based on the idea that taking lots of small steps in the right direction (gradients) will result in lower variance (for testing data).

The major difference between the AdaBoost and Gradient Boosting algorithms is how the two identify the shortcomings of weak learners (for example, decision trees). While the AdaBoost model identifies the shortcomings by using high-weight data points, Gradient Boosting does the same by using gradients of the loss function (for example, in y = ax + b + e, the term e deserves a special mention as it is the error term).

The loss function is a measure indicating how good a model’s coefficients are at fitting the underlying data. A logical understanding of loss function would depend on what we are trying to optimise.

Early Stopping

The special process of tuning the number of iterations for an algorithm (such as GBM and Random Forest) is called “Early Stopping” – a phenomenon we touched upon when discussing the Decision Trees.

Early Stopping performs model optimisation by monitoring the model’s performance on a separate held-out (validation) data set and stopping the training procedure once the performance on that data set has not improved for a certain number of iterations.

It avoids overfitting by attempting to automatically select the inflection point where performance on the held-out data set starts to decrease while performance on the training data set continues to improve as the model starts to overfit.

In the context of GBM, early stopping can be based either on an out-of-bag sample set (“OOB”) or on cross-validation (“CV”). As mentioned earlier, the ideal time to stop training the model is when the validation error has decreased and started to stabilise, before it starts increasing due to overfitting.

To build GBM, follow this step-by-step process:

  • Step 1: Train the model on the existing data to predict the outcome variable
  • Step 2: Compute the residuals, that is, the differences between the real values and the predictions (the Pseudo Residuals)
  • Step 3: Fit a new tree that uses the existing features to predict the Pseudo Residuals as the outcome variable
  • Step 4: Use the predicted residuals to update the predictions from Step 1, scaling the contribution of this tree with a learning rate (a hyperparameter)
  • Step 5: Repeat Steps 2–4, updating the pseudo residuals and adding a tree scaled by the learning rate each time, to move slowly in the right direction until there is no longer an improvement or we reach our stopping rule

The idea is that each time we add a new scaled tree to the model, the residuals should get smaller.
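The residual-fitting loop can be sketched from scratch for the squared-error loss, where the pseudo residuals are simply the differences between the real values and the current predictions. The synthetic data, tree depth, learning rate, and number of trees below are arbitrary choices for illustration, not settings taken from the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical regression data: a noisy sine curve
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=100)

learning_rate = 0.1
n_trees = 50

# Start from a single leaf: the mean of the outcome variable
prediction = np.full_like(y, y.mean())
initial_mse = float(np.mean((y - prediction) ** 2))

trees, mse_history = [], []
for _ in range(n_trees):
    residuals = y - prediction                      # pseudo residuals (squared-error loss)
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                          # fit a small tree to the residuals
    prediction += learning_rate * tree.predict(X)   # scaled update of the predictions
    trees.append(tree)
    mse_history.append(float(np.mean((y - prediction) ** 2)))

print(f"training MSE before boosting: {initial_mse:.4f}")
print(f"training MSE after {n_trees} trees: {mse_history[-1]:.4f}")
```

On the training data the error shrinks with every added tree. Whether it also shrinks on new data is what early stopping on a held-out set is meant to check.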

At any step m, the Gradient Boosting model is an ensemble of the model from the previous step, F(m-1), plus the learning rate eta multiplied by the negative derivative of the loss function with regard to the output of the model at step m-1 (the weak learner at step m-1).

$$F_m(x) = F_{m-1}(x) - \eta \, \frac{\partial L\big(y, F_{m-1}(x)\big)}{\partial F_{m-1}(x)}$$

GBM Python Implementation

‌                                              

GBM

2.10.3 Boosting Algorithm: XGBoost

One of the most popular Boosting or Ensemble algorithms is Extreme Gradient Boosting (XGBoost).

The difference between the GBM and XGBoost is that in case of XGBoost the second-order derivatives are calculated (second-order gradients). This provides more information about the direction of gradients and how to get to the minimum of the loss function.

Remember that this is needed to identify the weak learner and improve the model by improving the weak learners.

The idea behind XGBoost is that the 2nd order derivative tends to be more precise in terms of finding the accurate direction. Unlike AdaBoost, XGBoost applies advanced regularization in the form of L1 or L2 norms to address overfitting.

Also unlike AdaBoost, XGBoost is parallelizable due to its special caching mechanism, making it convenient for handling large and complex datasets. And to speed up training, XGBoost uses an Approximate Greedy Algorithm that considers only a limited number of thresholds for splitting the nodes of the trees.

To build an XGBoost model, follow this step-by-step process:

  • Step 1: Fit a Single Decision Tree – In this step, the loss function (for example, one based on NDCG) is calculated to evaluate the model.
  • Step 2: Add the Second Tree – This tree is chosen such that, when it is added to the model, it lowers the loss function compared to the previous tree, based on the first- and second-order derivatives (where we also use the learning rate eta).
  • Step 3: Find the Direction of the Next Move – Using the first- and second-order derivatives, we can find the direction in which the loss function decreases the most. This is basically the gradient of the loss function with respect to the output of the previous model.
  • Step 4: Split the Nodes – To split the observations, XGBoost uses an Approximate Greedy Algorithm based on approximate weighted quantiles (about 3 usually), that is, quantiles that have a similar sum of weights. To find the split value of a node, it doesn't consider all the candidate thresholds but only the quantiles of that predictor.
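To make the role of the first- and second-order derivatives more concrete, here is a rough NumPy sketch (not the XGBoost library itself). It assumes squared-error loss, for which the gradient is `prediction - y` and the Hessian is 1, and it uses the standard XGBoost expressions for the optimal leaf value, -G / (H + λ), and for the gain of a split. The function names and the synthetic data are hypothetical. Because every Hessian equals 1 here, plain quantiles stand in for the weighted quantiles mentioned in Step 4.

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal leaf value from summed gradients G and Hessians H: -G / (H + lambda)."""
    return -g.sum() / (h.sum() + lam)

def structure_score(g, h, lam):
    return g.sum() ** 2 / (h.sum() + lam)

def best_split(x, g, h, lam=1.0, n_quantiles=10):
    """Evaluate only quantile-based candidate thresholds and return the one with the largest gain."""
    candidates = np.quantile(x, np.linspace(0, 1, n_quantiles + 2)[1:-1])
    best_threshold, best_gain = None, -np.inf
    for threshold in candidates:
        left = x <= threshold
        gain = 0.5 * (structure_score(g[left], h[left], lam)
                      + structure_score(g[~left], h[~left], lam)
                      - structure_score(g, h, lam))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Synthetic step-shaped data (an assumption for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=500)
y = np.where(x > 0.5, 2.0, 0.0) + rng.normal(0, 0.1, size=500)

prediction = np.zeros_like(y)
g = prediction - y        # first derivative of 1/2 * (y - prediction)^2
h = np.ones_like(y)       # second derivative of the same loss

threshold, gain = best_split(x, g, h)
left = x <= threshold
print(threshold, gain)
print(leaf_weight(g[left], h[left], 1.0), leaf_weight(g[~left], h[~left], 1.0))
```

The chosen threshold should land near the true jump at 0.5, even though only ten candidate thresholds are evaluated.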

Optimal Learning Rate can be determined by using Cross Validation & Grid Search.

Simple XGBoost Python Implementation

Imagine you have a dataset containing information about various houses and their prices. The dataset includes features like the number of bedrooms, bathrooms, the total area, the year built, and so on, and you want to predict the price of a house based on these features.

XGBoost2

Chapter 3: Feature Selection in Machine Learning

The pathway to building effective machine learning models often involves a critical question: which features should we include to generate reliable predictions while keeping the model simple and understandable? This is where subset selection plays a key role.

In Machine Learning, we often deal with a large number of features, and not all of them are important and informative for the model. Including such irrelevant variables leads to unnecessary complexity in the Machine Learning model and affects the model's interpretability as well as its performance.

By removing these unimportant variables, and selecting only relatively informative features, we can get a model which can be easier to interpret and is possibly more accurate.

Let’s look at a specific example of a Machine Learning model for simplicity's sake.

Let’s assume that we are looking at a Multiple Linear Regression model (multiple independent variables and a single response/dependent variable) with a very large number of features. This model is likely to be complex when it comes to interpreting it. On top of that, it might result in inaccurate predictions, since some of those features might be unimportant and not help to explain the response variable.

The process of selecting important variables in the model is called feature selection or variable selection. This process involves identifying a subset of the p variables that we believe to be related to the dependent or response variable. In its most exhaustive form, this means running the regression for all possible combinations of independent variables and selecting the one that results in the best performing model.

There are various approaches you can use for Feature Selection, usually broken down into the following 3 categories:

  • Subset Selection (Best Subset Selection, Step-Wise Feature Selection)
  • Regularisation Techniques (L1 Lasso, L2 Ridge Regressions)
  • Dimensionality Reduction Techniques (PCA)  

3.1 Subset Selection in Machine Learning

Subset Selection in machine learning is a technique designed to identify and use a subset of important features while omitting the rest. This helps create models that are easier to interpret and, in some cases, predict more accurately by avoiding overfitting.

Navigating through numerous features, it becomes vital to selectively choose the ones that significantly impact the predictive model. Subset selection provides a systematic approach to sifting through possible combinations of predictors. It aims to select a subset that effectively represents the data without unnecessary complexity.

  • Best Subset Selection: Examines all possible combinations and selects the most optimal set of predictors.
  • Stepwise Selection : Adds or removes predictors incrementally, which includes forward and backward stepwise selection.
  • Random Subset Selection : Chooses subsets randomly, introducing an element of randomness into model selection.

It’s a balance between using all available predictors, risking model overcomplexity and potential overfitting, and building a too-simple model that may overlook important data patterns.

In this section, we will explore these subset selection techniques. You'll learn how each approach works and affects model performance, ensuring that the models we build are reliable, simple, and effective.

3.1.1 Step-Wise Feature Selection Techniques

One of the popular subset selection techniques is the Step-Wise Feature Selection Technique. Let’s look at two different step-wise feature selection methods:

  • Forward Step-wise Selection
  • Backward Step-wise Selection

Forward Step-Wise Selection: What Forward Step-Wise Feature Selection technique does is it starts with an empty Null model with only an intercept. We then run a set of simple regressions and pick the variable which has a model with the smallest RSS (Residual Sum of Squares). Then we do the same with 2 variable regressions and continue until it’s completed.

So, Forward Step-Wise Selection begins with a model containing no predictors, and then adds predictors to the model, one at a time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

Forward Step-Wise Selection can be summarized as follows:

Step 1: Let M_0 be the null model, containing no features.

Step 2: For k = 0, …, p-1:

  • Consider all (p-k) models that contain the variables in M_k plus one additional feature or predictor.
  • Choose the best model among these p-k models, and define it as M_(k+1), using performance metrics such as RSS / R-squared.

Step 3: Select the single model with the best performance among the M_0, …, M_p models (the one with the smallest Cross Validation Error, C_p, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), or the largest adjusted R-squared is your best model M*).

So, the idea behind this selection is to start simple and increase the number of predictors in the model. For each number of predictors, consider the models obtained by adding one more variable to the previous best model and select a single best model, M_k. Then compare all these models with different numbers of predictors (the best M_k models), and select the best performing one.
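Steps 1 and 2 above can be sketched in a few lines of NumPy. This is a minimal illustration that uses RSS as the per-step criterion; the helper names `rss` and `forward_stepwise` and the synthetic data are assumptions. Step 3 (choosing among the M_k models with cross-validation, AIC, BIC, and so on) is left out.

```python
import numpy as np

def rss(X, y, features):
    """Residual sum of squares of an OLS fit using an intercept plus the given feature columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in features])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    return float(residuals @ residuals)

def forward_stepwise(X, y):
    """Return the path of models M_1, ..., M_p as (selected feature indices, RSS) pairs."""
    p = X.shape[1]
    selected, path = [], []
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # Add the single feature that lowers the RSS the most
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected = selected + [best_j]
        path.append((list(selected), rss(X, y, selected)))
    return path

# Synthetic data: only features 2 and 0 matter (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 2] - 2 * X[:, 0] + rng.normal(0, 0.1, size=200)

path = forward_stepwise(X, y)
for features, value in path:
    print(features, round(value, 2))
```

On this data the two informative features enter the model first, after which the RSS barely changes. That flattening is the signal a criterion in Step 3 would use to stop at a small model.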

When n < p, that is, when the number of observations is smaller than the number of predictors in Linear Regression, you can use this approach to select features in order for LR to work in the first place.

Backward Step-wise Feature Selection: Unlike Forward Step-wise Selection, Backward Step-wise Selection starts with the full model containing all p predictors.

Then, one at a time, the variable with the largest p-value is removed, and the best model of the reduced size is selected.

Each time, the model is fitted again to identify the least statistically significant variable, until the stopping rule is reached (for example, all p-values need to be smaller than 5%). Then we compare all these models with different numbers of predictors (the best M_k models) and select the single model with the best performance among M_0, …, M_p (the one with the smallest Cross Validation Error, C_p, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), or the largest adjusted R-squared is your best model M*).

Backward Step-Wise Feature Selection can be summarized as follows:

Step 1: Let M_p be the full model, containing all features.

Step 2: For k = p, p-1, …, 1:

  • Consider all k models that contain all but one of the predictors in M_k, for a total of k − 1 features.
  • Choose the best model among these k models, and define it as M_(k-1), using performance metrics such as RSS / R-squared.

Like Forward Step-wise Selection, the Backward Step-Wise Feature Selection technique searches through only 1 + p(p+1)/2 models, making it possible to apply in settings where p is too large for exhaustive techniques such as Best Subset Selection.

Also, Backward Step-Wise Feature Selection is not guaranteed to yield the best model containing a subset of the p predictors. It requires the number of observations or data points n to be larger than the number of model variables p, whereas Forward Step-Wise Selection can be used even when n < p.

image-65

3.2 Regularization in Machine Learning

Regularization, also known as Shrinkage, is a widely-used strategy to address the issue of overfitting in machine learning models.

The fundamental concept of regularization involves deliberately introducing a slight bias into the model, with the benefit of notably reducing its variance.

The term "Shrinkage" is derived from the method's ability to pull some of the estimated coefficients toward zero, imposing a penalty on them to prevent them from elevating the model's variance excessively.

Two prominent regularization techniques stand out in practice: Ridge Regression, which leverages the L2 norm, and Lasso Regression, employing the L1 norm.

3.2.1 Ridge Regression (L2 Regularization)

Let's explore examples of multiple linear regression, involving p independent variables or predictors used to model the dependent variable y.

It's worth remembering that Ordinary Least Squares (OLS), provided its assumptions are met, is a widely-adopted estimation technique for determining the parameters of linear regression. OLS seeks the optimal coefficients by minimizing the model's residual sum of squares (RSS). That is:

$$\text{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$

where β represents the coefficient estimates for the different variables or predictors (X).

Ridge Regression is pretty similar to OLS, except that the coefficients are estimated by minimizing a slightly different cost or loss function. Namely, the Ridge Regression coefficient estimates β̂^R are the values that minimize the following loss function:

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

where λ (lambda, which is always non-negative, λ ≥ 0) is the tuning parameter or the penalty parameter. As can be seen from this formula, in the case of Ridge, the L2 penalty or L2 norm is used.

In this way, Ridge Regression will assign a penalty to some variables shrinking their coefficients towards zero, reducing the overall model variance – but these coefficients will never become exactly zero. So, the model parameters are never set to exactly 0, which means that all p predictors of the model are still intact.
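As a small illustration of this shrinkage, the sketch below solves the ridge problem in closed form with NumPy. It centers the data so that the intercept is not penalized, which is a common convention but an assumption here; the function name `ridge_fit` and the synthetic data are also hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (Xc'Xc + lam * I) beta = Xc'yc on centered data."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - X_mean @ beta   # unpenalized intercept recovered after centering
    return intercept, beta

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 1.0 + X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.5, size=100)

for lam in (0.0, 1.0, 10.0, 100.0, 1000.0):
    _, beta = ridge_fit(X, y, lam)
    print(lam, np.round(beta, 3))
```

With λ = 0 the result matches OLS, and as λ grows the coefficients move toward zero without reaching it exactly.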

L2 Norm (Euclidean Distance)

L2 norm is a mathematical term that comes from Linear Algebra. It stands for a Euclidean norm which can be represented as follows:

$$\|\beta\|_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}$$

Tuning parameter λ: the tuning parameter λ controls the relative impact of the penalty on the regression coefficient estimates. When λ = 0, the penalty term has no effect, and ridge regression produces the ordinary least squares estimates. But as λ → ∞ (gets very large), the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates approach 0. Here's a visual representation of this:

1*2ICCHEBIlr2WkJwBdH4ZpQ

Why does Ridge Regression Work?

Ridge regression’s advantage over ordinary least squares comes from the earlier introduced bias-variance trade-off phenomenon. As λ, the penalty parameter, increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.

3.2.2 Lasso Regression (L1 Regularization)

Lasso Regression overcomes this disadvantage of Ridge Regression. Namely, the Lasso Regression coefficient estimates β̂^L_λ are the values that minimize:

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

As with Ridge Regression, the Lasso shrinks the coefficient estimates towards zero. But in the case of the Lasso, the L1 penalty or L1 norm is used, which has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.

So, like many feature selection techniques, Lasso Regression performs variable selection besides solving the overfitting problem.
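To see how the L1 penalty produces exact zeros, here is a minimal coordinate-descent sketch in NumPy. It assumes the objective RSS + λ·Σ|β_j| on centered data, which gives a soft-thresholding update with threshold λ/2. Coordinate descent is one common way to solve the Lasso, not necessarily the method the handbook had in mind, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t and return exactly zero when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize RSS + lam * sum(|beta_j|) on centered data, one coefficient at a time."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back in
            partial_residual = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data: only the first two features matter (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

beta = lasso_coordinate_descent(X, y, lam=100.0)
print(np.round(beta, 3))
```

The coefficients of the three irrelevant features come out as exact zeros, while the two informative ones are shrunk but kept. This is the variable-selection behaviour described above.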

1*xxJGK_RO3yMMk78jzXC7qw

L1 Norm (Manhattan Distance)

L1 norm is a mathematical term that comes from Linear Algebra. It stands for a Manhattan norm which can be represented as follows:

$$\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$$

Why does Lasso Regression Work?

Like Ridge Regression, Lasso Regression’s advantage over ordinary least squares comes from the earlier introduced bias-variance trade-off. As λ increases, the flexibility of the lasso regression fit decreases. This leads to decreased variance but increased bias. Additionally, Lasso also performs feature selection.

3.2.3 Lasso vs Ridge Regression

Lasso Regression shrinks the coefficient estimates towards zero and even forces some of these coefficients to be exactly equal to zero when the tuning parameter λ is sufficiently large. So, like many feature selection techniques, Lasso Regression performs variable selection besides solving the overfitting problem.

The comparison between Ridge Regression and Lasso Regression becomes clear when putting the two earlier graphs next to each other:

1*oq-2dyqDAC9T_MkUYnu61g


Chapter 4: Resampling Techniques in Machine Learning

When we have only training data and we want to make judgments about the performance of the model on unseen data, we can use Resampling Techniques to create artificial test data.

Resampling Techniques are often divided into two categories: Cross-Validation and Bootstrapping. They're usually used for the following three purposes:

  • Model Assessment: evaluate the model performance (to compute test error rate)
  • Model Variance: compute the variance of the model to check how generalizable your model is
  • Model Selection: select model flexibility

For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ.

4.1 Cross-Validation

Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to perform:

  • Model assessment: to evaluate its performance by calculating the test error rate
  • Model Selection: to select the appropriate level of flexibility.

You hold out a subset of the training observations from the fitting process, and then apply the statistical learning method to those held out observations.

CV is usually divided in the following three categories:

  • Validation Set Approach
  • K-fold Cross Validation (K-fold CV)
  • Leave One Out Cross Validation (LOOCV)

4.1.1 Validation Set Approach

This is a simple approach that randomly splits the data into training and validation sets. It is usually implemented with Sklearn’s train_test_split() function.

The model is trained on the training data (usually 80% of the data) and then used to predict the values for the hold-out or Validation Set (usually 20% of the data). The prediction error on this validation set is the estimate of the test error rate.
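The same idea can be sketched with NumPy alone. This is a stand-in for train_test_split(), not its actual implementation; it assumes an 80/20 split and a linear regression model, and the helper name `train_validation_split` and the synthetic data are hypothetical.

```python
import numpy as np

def train_validation_split(X, y, validation_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the observations as the validation set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    n_val = int(round(validation_fraction * len(y)))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 0.1, size=100)

X_train, X_val, y_train, y_val = train_validation_split(X, y)

# Fit OLS on the training part only
A_train = np.column_stack([np.ones(len(y_train)), X_train])
beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Evaluate on the held-out part: this MSE estimates the test error
A_val = np.column_stack([np.ones(len(y_val)), X_val])
validation_mse = float(np.mean((y_val - A_val @ beta) ** 2))
print(len(y_train), len(y_val), validation_mse)
```

Because the estimate depends on which 20% happens to be held out, a different seed can give a noticeably different value.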

4.1.2 Leave One Out Cross Validation (LOOCV)

LOOCV is similar to the Validation Set approach. But each time, it leaves one observation out of the training set, uses the remaining n-1 observations to train the model, and calculates the MSE for that one prediction. So, in the case of LOOCV, the model has to be fit n times (where n is the number of observations in the data).

This process is repeated for all observations, so n MSEs are calculated. The mean of these MSEs is the Cross-Validation error rate and can be expressed as follows:

$$CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} MSE_i$$

4.1.3 K-fold Cross Validation (K-fold CV)

K-Fold CV is the middle ground between the Validation Set approach (high variance and high bias, but computationally efficient) and LOOCV (low bias but high variance, and computationally inefficient).

In K-Fold CV, the data is randomly split into K equally sized samples (K folds). Then each time, one fold is used for validation and the rest for training, so the model is fit K times. The mean of the K MSEs forms the Cross-Validation test error rate.
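A minimal NumPy sketch of this procedure for a linear regression model is shown below. The function name `k_fold_cv_mse` and the synthetic data are assumptions. Setting k equal to the number of observations turns the same function into LOOCV.

```python
import numpy as np

def k_fold_cv_mse(X, y, k, seed=0):
    """Average validation MSE of an OLS fit over k randomly assigned folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    fold_mse = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False            # hold this fold out of training
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        fold_mse.append(np.mean((y[val_idx] - A_val @ beta) ** 2))
    return float(np.mean(fold_mse))

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 0.1, size=60)

cv5 = k_fold_cv_mse(X, y, k=5)
loocv = k_fold_cv_mse(X, y, k=len(y))   # K = N: every fold holds exactly one observation
print(cv5, loocv)
```

The 5-fold run fits the model 5 times, while the LOOCV run fits it 60 times, which is where the difference in computational cost comes from.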

The K-fold CV error can be expressed as follows. Note that LOOCV is the special case of K-fold CV where K = N:

$$CV_{(K)} = \frac{1}{K} \sum_{i=1}^{K} MSE_i$$

4.2 Selecting Optimal k in K-fold CV

The choice of k in K-fold is a matter of Bias-Variance Trade-Off and the efficiency of the model. Usually, K-Fold CV and LOOCV provide similar results and their performance can be evaluated using simulated data.

However, LOOCV has lower bias (unbiased) compared to K-fold CV because LOOCV uses more training data than K-fold CV does. But LOOCV has higher variance than K-fold does because LOOCV is fitting the model on almost identical data for each item and the outcomes are highly correlated compared to the outcomes of K-Fold which are less correlated.

Since the mean of highly correlated outcomes has higher variance than the one of less correlated outcomes, the LOOCV variance is higher.

  • Large K (up to K = N, LOOCV): the larger the K, the higher the variance and the lower the bias
  • Small K (for example K = 2): the smaller the K, the lower the variance and the higher the bias

Taking this information into account, we can calculate the performance of the model for various values of K, let's say K = 3, 5, 6, 7, …, 10 (or the Type I, Type II, and total classification error in the case of a classification model). Then the K of the best performing model can be chosen as the optimal K, using the idea of the ROC curve (classification case) or the Elbow method (regression case).

image-69

4.3 Bootstrapping

Bootstrapping is another very popular resampling technique that is used for various purposes. One of them is to effectively estimate the variability of the estimates/models or to create artificial samples from an existing sample and improve model performance (like in the case of Bagging or Random Forest).

It is used in many situations where it's hard or even impossible to directly compute the standard deviation of a quantity of interest.

  • It's a very useful way to quantify the uncertainty associated with the statistical learning method and obtain the standard errors/measure of variability.
  • It's less needed for Linear Regression, since standard R/Python output already provides these results (SE of coefficients).

Bootstrapping is extremely handy for other methods as well where variability is more difficult to quantify. The bootstrap sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set.

So, Bootstrapping takes the original training sample and resamples from it with replacement, resulting in B different samples. Then, for each of these simulated samples, the coefficient estimate is computed. Finally, by taking the mean of these B coefficient estimates and applying the common formula for the standard deviation to them, we calculate the Standard Error of the bootstrapped estimate.
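The procedure can be sketched as follows for the coefficients of a simple linear regression. The function name `bootstrap_se`, the choice B = 500, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def bootstrap_se(X, y, n_boot=500, seed=0):
    """Bootstrap mean and standard error of OLS coefficients (intercept first)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    estimates = np.empty((n_boot, A.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # sample n row indices with replacement
        beta, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        estimates[b] = beta
    # The standard deviation of the B estimates is the bootstrap standard error
    return estimates.mean(axis=0), estimates.std(axis=0, ddof=1)

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=200)

boot_mean, boot_se = bootstrap_se(x, y)
print(boot_mean, boot_se)
```

For this setup, the textbook standard error of the slope is roughly 1/sqrt(200) ≈ 0.07, so the bootstrap value should land in the same neighbourhood.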


Chapter 5: Optimization Techniques

Knowing the fundamentals of Machine Learning models and learning how to train them is definitely a big part of becoming a technical Data Scientist. But that’s only a part of the job.

In order to use a Machine Learning model to solve a business problem, you need to optimize it after you have established its baseline. That is, you need to optimize the set of hyperparameters in your Machine Learning model to find the set of optimal parameters that result in the best performing model (all things being equal).

So, to optimize or to tune your Machine Learning model, you need to perform hyperparameter optimization. By finding the optimal combination of hyperparameter values, we can decrease the errors the model produces and build the most accurate model.

A model hyperparameter is a configuration value that is external to the model. Its value cannot be estimated from data, but rather should be specified in advance, before the model is trained. Examples are k in k-Nearest Neighbors (kNN) or the number of hidden layers in Neural Networks.

Hyperparameter optimization methods are usually categorized into:

  • Exhaustive Search or Brute Force Approach (like Grid Search)
  • Gradient Descent (Batch GD, SGD, SGD with Momentum, Adam)
  • Genetic Algorithms

In this handbook, I will discuss only the first two types of optimisation techniques.

5.1 Brute Force Approach (Grid Search)

Exhaustive Search (often referred to as Grid Search or the Brute Force Approach) is the process of looking for the optimal hyperparameters by checking each of the candidate values for the hyperparameters and computing the model error rate.

Once we create the list of possible values for each of the hyperparameters, for every possible combination of hyper parameter values, we calculate the model error rate and compare it to the current optimal model (one with minimum error rate). During each iteration, the optimal model is updated if the new parameter values result in lower error rate.
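The loop described above can be sketched with `itertools.product`. The model here (ridge regression on polynomial features, with the hyperparameters `degree` and `lam`), the grid values, the validation split, and the synthetic data are all assumptions chosen only to keep the example small.

```python
import itertools
import numpy as np

def poly_features(x, degree):
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def validation_mse(x_train, y_train, x_val, y_val, degree, lam):
    """Fit ridge regression on polynomial features and return the validation MSE."""
    X_train, X_val = poly_features(x_train, degree), poly_features(x_val, degree)
    mu, y_mu = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - mu
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(degree), Xc.T @ (y_train - y_mu))
    prediction = y_mu + (X_val - mu) @ beta
    return float(np.mean((y_val - prediction) ** 2))

# Synthetic quadratic data (an assumption for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1 + 2 * x - 3 * x ** 2 + rng.normal(0, 0.3, size=100)
x_train, y_train, x_val, y_val = x[:80], y[:80], x[80:], y[80:]

grid = {"degree": [1, 2, 3, 5], "lam": [0.01, 0.1, 1.0, 10.0]}
best_params, best_error = None, np.inf
errors = {}
for degree, lam in itertools.product(grid["degree"], grid["lam"]):
    error = validation_mse(x_train, y_train, x_val, y_val, degree, lam)
    errors[(degree, lam)] = error
    if error < best_error:                    # keep the combination with the lowest error so far
        best_params, best_error = (degree, lam), error

print(best_params, best_error)
```

Even this tiny grid needs 16 model fits. The count is the product of the number of candidate values per hyperparameter, which is why exhaustive search becomes slow so quickly.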

The optimisation method is simple. For instance, if you are working with a K-means clustering algorithm, you can manually search for the right number of clusters. But if there are hundreds or thousands of possible combination of hyperparameter values that you have to consider, the model can take hours or days to train – and it becomes incredibly heavy and slow. So most of the time, brute-force search is inefficient.


When it comes to Gradient Descent-type optimisation techniques, variants such as Batch Gradient Descent, Stochastic Gradient Descent, and so on differ in the amount of data used to compute the gradient of the Loss or Cost function.

Let's define this Loss Function by J(θ) where θ (theta) represents the parameter we want to optimize.

The amount of data used is a trade-off between the accuracy of the parameter update and the time it takes to perform such an update. Namely, the larger the data sample we use, the more accurate we can expect the parameter adjustment to be, but the slower the process will be.

The opposite holds true as well. The smaller the data sample, the less accurate will be the adjustments in the parameter but the process will be much faster.

5.2 Gradient Descent Optimization (GD)

The Batch Gradient Descent algorithm (often just referred to as Gradient Descent or GD), computes the gradient of the Loss Function J(θ) with respect to the target parameter using the entire training data.

We do this by first predicting the values for all observations in each iteration, and comparing them to the given value in the training data. These two values are used to calculate the prediction error term per observation which is then used to update the model parameters. This process continues until the model converges.

The gradient or the first order derivative of the loss function can be expressed as follows:

$$\nabla J(\theta) = \frac{\partial J(\theta)}{\partial \theta}$$

Then, this gradient is used to update the previous iteration's value of the target parameter. That is:

$$\theta = \theta - \eta \, \nabla J(\theta)$$

  • θ : This represents the parameter(s) or weight(s) of a model that you are trying to optimize. In many contexts, especially in neural networks, θ can be a vector containing many individual weights.
  • η : This is the learning rate. It's a hyperparameter that dictates the step size at each iteration while moving towards a minimum of the cost function. A smaller learning rate might make the optimization more precise but could also slow down the convergence process, while a larger learning rate might speed up convergence but risks overshooting the minimum. It can take values in [0, 1] but is usually a number between 0.001 and 0.04.
  • ∇ J ( θ ): This is the gradient of the cost function J with respect to the parameter θ. It indicates the direction and magnitude of the steepest increase of J . By subtracting this from the current parameter value (multiplied by the learning rate), we adjust θ in the direction of the steepest decrease of J .
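As a concrete sketch of this update rule, the code below runs batch gradient descent on a linear regression problem. It assumes the loss J(θ) = (1/2n)·Σ(xᵢᵀθ − yᵢ)², whose gradient is Xᵀ(Xθ − y)/n. The function name, the learning rate of 0.04, the iteration count, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.04, n_iter=1000):
    """Minimize J(theta) = mean squared error / 2 using the full dataset at every step."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iter):
        error = X @ theta - y                       # prediction error for all observations
        history.append(float(error @ error) / (2 * len(y)))
        gradient = X.T @ error / len(y)             # gradient of J at the current theta
        theta = theta - eta * gradient              # theta = theta - eta * grad J(theta)
    return theta, history

# Synthetic data with an intercept column (an assumption for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=200)

theta, history = batch_gradient_descent(X, y)
print(theta, history[0], history[-1])
```

Every iteration touches all 200 rows. On a dataset with millions of rows, each single update would need a full pass over the data.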

There are two major disadvantages to GD which make this optimization technique less popular, especially when dealing with large and complex datasets. Since the entire training data has to be used and stored in each iteration, the computation time can be very large, resulting in an incredibly slow process. On top of that, storing that much data results in memory issues, making GD computationally heavy and slow.

image-80

5.3 Stochastic Gradient Descent (SGD)

The Stochastic Gradient Descent (SGD) method, also known as Incremental Gradient Descent, is an iterative approach for solving optimisation problems with a differentiable objective function, exactly like GD.

But unlike GD, SGD doesn’t use the entire batch of training data to update the parameter value in each iteration. The SGD method is often referred to as the stochastic approximation of gradient descent, which aims to find the extreme or zero points of a stochastic model containing parameters that cannot be directly estimated.

SGD minimises this cost function by sweeping through the data in the training dataset and updating the values of the parameters in every iteration.

In SGD, all model parameters are improved in each iteration step with only one training sample. So, instead of going through all training samples at once to modify the model parameters, the SGD algorithm improves the parameters by looking at a single, randomly sampled training example (hence the name Stochastic). That is:

$$\theta = \theta - \eta \, \nabla J\big(\theta, x^{(i)}, y^{(i)}\big)$$

  • η : This is the learning rate. It's a hyperparameter that dictates the step size at each iteration while moving towards a minimum of the cost function. A smaller learning rate might make the optimization more precise but could also slow down the convergence process, while a larger learning rate might speed up convergence but risks overshooting the minimum.
  • ∇ J ( θ , x ( i ), y ( i )): This is the gradient of the cost function J with respect to the parameter θ for a given input x ( i ) and its corresponding target output y ( i ). It indicates the direction and magnitude of the steepest increase of J . By subtracting this from the current parameter value (multiplied by the learning rate), we adjust θ in the direction of the steepest decrease of J .
  • x ( i ): This represents the ith input data sample from your dataset.
  • y ( i ): This is the true target output for the ith input data sample.

In the context of Stochastic Gradient Descent (SGD), the update rule applies to individual data samples x ( i ) and y ( i ) rather than the entire dataset, which would be the case for batch Gradient Descent.
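Here is a minimal sketch of that per-sample update for linear regression. It assumes the per-sample loss ½(xᵢᵀθ − yᵢ)², whose gradient is (xᵢᵀθ − yᵢ)·xᵢ. The function name `sgd`, the learning rate, the number of epochs, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=50, seed=0):
    """Update theta after every single observation, visiting the data in random order."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            gradient_i = (X[i] @ theta - y[i]) * X[i]   # gradient for sample i only
            theta = theta - eta * gradient_i
    return theta

# Synthetic data with an intercept column (an assumption for illustration only)
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=200)

theta = sgd(X, y)
print(theta)
```

Each update costs one row instead of the full dataset. The price is a noisier path: the estimate ends up close to the true values (1, 2) but keeps jittering slightly around them.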

This single step improves the speed of the process of finding the global minimum of the optimization problem, and this is what differentiates SGD from GD. So, SGD consistently adjusts the parameters in an attempt to move in the direction of the global minimum of the objective function.

SGD addresses the slow computation time issue of GD, because it scales well with both big data and with a size of the model. But even though SGD method itself is simple and fast, it is known as a “bad optimizer” because it's prone to finding a local optimum instead of a global optimum.


image-73

5.4 SGD with Momentum

When the error function is complex and non-convex, instead of finding the global optimum, the SGD algorithm mistakenly moves in the direction of numerous local minima. This results in higher computation time.

In order to address this issue and further improve the SGD algorithm, various methods have been introduced. One popular way of escaping a local minimum and moving in the right direction, toward a global minimum, is SGD with Momentum .

The goal of the SGD method with momentum is to accelerate gradient vectors in the direction of the global minimum, resulting in faster convergence.

The idea behind the momentum is that the model parameters are learned by using the directions and values of previous parameter adjustments. Also, the adjustment values are calculated in such a way that more recent adjustments are weighted heavier (they get larger weights) compared to the very early adjustments (they get smaller weights).

The reason for this difference is that with the SGD method we do not determine the exact derivative of the loss function, but we estimate it on a small batch. Since the gradient is noisy, it is likely that it will not always move in the optimal direction.

The momentum helps then to estimate those derivatives more accurately, resulting in better direction choices when moving towards the global minimum.
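The effect of momentum can be sketched on a toy problem. The code below uses the classical momentum update, v = γ·v + η·∇J(θ) followed by θ = θ − v, on a ravine-shaped quadratic J(θ) = ½(θ₁² + 50·θ₂²). To keep the comparison deterministic, it uses the exact gradient rather than a noisy mini-batch gradient. That simplification, along with the function names, the starting point, and the values of η and γ, is an assumption.

```python
import numpy as np

CURVATURE = np.array([1.0, 50.0])   # one flat direction and one steep direction (a ravine)

def gradient(theta):
    return CURVATURE * theta

def steps_to_converge(eta, gamma, tol=1e-3, max_steps=10_000):
    """Count updates until ||theta|| < tol; gamma = 0 gives plain gradient descent."""
    theta = np.array([10.0, 1.0])
    velocity = np.zeros_like(theta)
    for step in range(1, max_steps + 1):
        # Exponentially decaying sum of past gradients: recent ones weigh the most
        velocity = gamma * velocity + eta * gradient(theta)
        theta = theta - velocity
        if np.linalg.norm(theta) < tol:
            return step
    return max_steps

steps_plain = steps_to_converge(eta=0.02, gamma=0.0)
steps_momentum = steps_to_converge(eta=0.02, gamma=0.9)
print(steps_plain, steps_momentum)
```

With the same learning rate, the momentum run needs clearly fewer updates, because the accumulated velocity keeps pushing along the flat direction of the ravine where plain gradient steps are tiny.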

Another reason for the difference in the performance of classical SGD and SGD with momentum lies in the area referred to as Pathological Curvature, also called the ravine area .

Pathological Curvature or Ravine Area can be represented by the following graph. The orange line represents the path taken by the method based on the gradient, while the dark blue line represents the ideal path toward the global optimum.

1*kJS9IPV1DcZWkQ4b8QEB8w

To visualise the difference between the SGD and SGD Momentum, let's look at the following figure.

1*aM92FlJ8zn1-ao6Z6ynzEg

On the left-hand side is the SGD method without Momentum. On the right-hand side is SGD with Momentum. The orange pattern represents the path of the gradient in search of the global minimum.

1*amVpAKdAsDXA1R-XHPfztw

5.5 Adam Optimizer

Another popular technique for enhancing SGD optimization procedure is the Adaptive Moment Estimation (Adam) introduced by Kingma and Ba (2015). Adam is the extended version of the SGD with the momentum method.

The main difference compared to the SGD with momentum, which uses a single learning rate for all parameter updates, is that the Adam algorithm defines different learning rates for different parameters.

The algorithm calculates individual adaptive learning rates for each parameter based on estimates of the first two moments of the gradients (the mean and the uncentered variance of the gradient of the Loss function).

So, each parameter has a unique learning rate, which is updated using the exponentially decaying averages of the first moments (the mean) and second moments (the uncentered variance) of the gradients.
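The update can be sketched in NumPy as follows, using the standard form of Adam with bias-corrected moment estimates and the commonly used defaults β₁ = 0.9, β₂ = 0.999 and ε = 1e-8. The toy quadratic objective, the learning rate of 0.05, the step count, and the function names are assumptions for illustration.

```python
import numpy as np

CURVATURE = np.array([1.0, 50.0])   # toy ravine-shaped quadratic

def loss(theta):
    return 0.5 * float(np.sum(CURVATURE * theta ** 2))

def gradient(theta):
    return CURVATURE * theta

def adam(theta, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=2000):
    m = np.zeros_like(theta)   # decaying average of gradients (first moment)
    v = np.zeros_like(theta)   # decaying average of squared gradients (second moment)
    for t in range(1, n_steps + 1):
        g = gradient(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        # Element-wise division gives every parameter its own effective step size
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta_start = np.array([10.0, 1.0])
theta_final = adam(theta_start)
print(loss(theta_start), loss(theta_final))
```

Because the step is divided by the square root of the second-moment estimate separately for each coordinate, the steep and the flat direction each get their own effective step size, even though a single η is passed in.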

image-89

Key Takeaways & What Comes Next

In this handbook, we've covered the essentials and beyond in machine learning. From the basics to advanced techniques, we've unpacked popular ML algorithms used globally in tech and the key optimization methods that power them.

While learning about each concept, we saw some practical examples and Python code, ensuring that you're not just understanding the theory but also its application.

Your Machine Learning journey is ongoing, and this guide is your reference. It's not a one-time read – it's a resource to revisit as you progress and flourish in this field. With this knowledge, you're ready to tackle most of the real-world ML challenges confidently at a high level. But this is just the beginning.

About the Author — That’s Me!

I am Tatev, a Senior Machine Learning and AI Researcher. I have had the privilege of working in Data Science across numerous countries, including the US, UK, Canada, and the Netherlands.

With an MSc and BSc in Econometrics under my belt, my journey in Machine Learning and AI has been nothing short of incredible. Drawing from my technical studies during my Bachelor's and Master's, along with over 5 years of hands-on experience in the Data Science industry, in Machine Learning and AI, I've gathered this high-level summary of ML topics to share with you.

How Can You Dive Deeper?

After studying this guide, if you're keen to dive even deeper and structured learning is your style, consider joining us at LunarTech. Follow the course "Fundamentals to Machine Learning," a comprehensive program that offers an in-depth understanding of the theory, hands-on practical implementation, extensive practice material, and tailored interview preparation to set you up for success at your own pace.

This course is also a part of The Ultimate Data Science Bootcamp, which has earned the recognition of being one of the Best Data Science Bootcamps of 2023 and has been featured in esteemed publications like Forbes, Yahoo, Entrepreneur and more. This is your chance to be a part of a community that thrives on innovation and knowledge. You can enroll for a Free Trial of The Ultimate Data Science Bootcamp at LunarTech.


Connect with Me:


  • Follow me on LinkedIn for a ton of Free Resources in ML and AI
  • Visit my Personal Website
  • Subscribe to The Data Science and AI Newsletter


Want to discover everything about a career in Data Science, Machine Learning and AI, and learn how to secure a Data Science job? Download this FREE Data Science and AI Career Handbook

Thank you for choosing this guide as your learning companion. As you continue to explore the vast field of machine learning, I hope you do so with confidence, precision, and an innovative spirit. Best wishes in all your future endeavors!

Co-founder of LunarTech, I harness the power of Statistics, Machine Learning, and Artificial Intelligence to deliver transformative solutions. Applied Data Scientist, MSc/BSc Econometrics.


MGMT 4190/6560 Introduction to Machine Learning Applications @Rensselaer


Assignment 1

Before you start working on this assignment, please click File -> Save a copy in Drive.

Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You can speak with others regarding the assignment but all work must be your own.

This is a 30 point assignment.

Before you begin

Please work through each of these notebooks, which will give you some understanding of the Google Colab environment.

Working with Notebooks in Colaboratory

Overview of Colaboratory

Guide to Markdown

Importing libraries and installing dependencies

Saving and loading notebooks in GitHub

Working with Data

Some of this is a bit more advanced, but at this point just make sure you know where the code is for how to upload and download a file.

Loading data: Drive, Sheets, and Google Cloud Storage

Run these Cells

This will set up the automated testing environment on Colab.

Question 1.

In the next cell:

a. Assign the value for x to 150

b. Set the value for y to 13 times x

c. Set the value for z to y divided by x squared.
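Part (c) can be read in two ways, so the sketch below computes both and keeps them in separate variables. Which reading the automated test expects is not stated here; the name `z_alt` is made up for this illustration.

```python
x = 150
y = 13 * x

# Reading 1: y divided by (x squared).
z = y / x ** 2

# Reading 2: (y divided by x), then squared. Hypothetical alternative.
z_alt = (y / x) ** 2

print(x, y, z, z_alt)
```

In Python, `**` binds more tightly than `/`, so `y / x ** 2` already means "y divided by x squared" without extra parentheses.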

Question 2.

Packages are a really important component of most programming languages.

In the overview, you learned about tab completion as a way to explore Python objects. This can be really useful. Let’s use it to find the formula for the factorial of 15. Assign the result to the variable m.
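As a sketch of where tab completion might lead: typing `math.` and pressing Tab in a notebook lists the module's functions, one of which is `factorial`. Using the standard-library `math` module is an assumption here; the assignment does not name the package.

```python
import math

# math.factorial returns an exact integer.
m = math.factorial(15)
print(m)  # 1307674368000
```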

Question 3.

Markdown is a useful aspect of Jupyter Notebooks.


Double click on cell below to open it for markdown editing. There is no test for this question.

Header

For the above header, make it an h1 tag using markdown.

Sub-Header

For the above sub-header, make it an h5 tag using markdown.

https://tw.rpi.edu//images/rpi-logo-red.jpg (Embed this image)

Question 4.

Installing Packages

Python packages are an important part of data science and critical to leveraging the broader Python ecosystem.

You typically have two options when installing a package. You can install it with Conda or pip .

The ! in a Jupyter notebook means that the line is processed on the command line and not by the Python interpreter.

If you try to import something and get an error, it is usually a sign that you need to install a package.
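One way to check for a package before importing it is sketched below, using the standard-library `importlib`. The helper name `is_installed` and the deliberately fake package name are made up for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


print(is_installed("json"))
print(is_installed("a_package_that_does_not_exist_12345"))
```

If the check returns False, running `!pip install <package>` in a notebook cell is the usual next step.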

Install the fastparquet Package to be able to work with Parquet Files

CSV (comma-delimited) files are great for humans to read and understand.

For “big data” though, CSV isn’t a great long-term storage option (inefficient/slow).

Parquet is a type of columnar storage format. It makes dealing with lots of columns fast.

fastparquet is a Python package for dealing with Parquet files.

Apache Spark also natively reads Parquet Files.

Look here for instructions on installing the fastparquet package.

Show All Columns in a Pandas Dataframe

Notice there is a ... which indicates you are only seeing some of the columns, and the output has been truncated.

Read this article and find how to show all the columns of a pandas dataframe.
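One commonly used approach, shown as a sketch and not necessarily the one in the linked article, is the `display.max_columns` option. The 30-column dataframe below is made up so that truncation would normally occur.

```python
import pandas as pd

# None removes the limit on how many columns are displayed.
pd.set_option("display.max_columns", None)

# Hypothetical wide dataframe, used only to demonstrate the setting.
wide = pd.DataFrame([[0] * 30], columns=[f"col_{i}" for i in range(30)])
print(wide)
```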

Question 5.

Importing CSV into a Pandas Dataframe

Comma delimited files are a common way of transmitting data.

Data for different columns is separated by a comma.

It is possible to open a csv in different ways, but Pandas is the easiest.

Data structured like a CSV is extremely common and is known as tabular data.

Pandas will give access to many useful methods for working with data.

pandas is often imported as the abbreviated pd .

You can also get help by using a ? after the function name. For example, to find the docstring for the read_csv function you could execute:

pd.read_csv? or

help(pd.read_csv)
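To see `read_csv` in action without needing a file, the sketch below reads CSV text from an in-memory buffer. The column names and values are made up for illustration; with a real file you would pass its path instead of the buffer.

```python
import io

import pandas as pd

# Made-up CSV content: a header row followed by two data rows.
csv_text = "name,score\nada,90\ngrace,95\n"

# read_csv accepts a path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)
```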

Get CSVs from the Web/Github.

You can also get a CSV directly from a web url.

View this file in your web browser. You won’t be able to load this into pandas. https://github.com/rpi-techfundamentals/introml_website_fall_2020/blob/master/files/webfile.csv

To get the link you can load, you need to click on the raw button. That should lead to this url:

https://raw.githubusercontent.com/rpi-techfundamentals/introml_website_fall_2020/master/files/webfile.csv
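The relationship between the two URLs above is mechanical, so it can be sketched as a small string helper. The function name `github_blob_to_raw` is made up for this illustration, and it only handles URLs of the `github.com/<owner>/<repo>/blob/...` form shown here.

```python
def github_blob_to_raw(url):
    """Convert a github.com 'blob' page URL into its raw-content URL."""
    prefix = "https://github.com/"
    if not url.startswith(prefix):
        raise ValueError("expected a https://github.com/ URL")
    # Split into owner, repo, the literal 'blob', and the remaining path.
    owner, repo, marker, rest = url[len(prefix):].split("/", 3)
    if marker != "blob":
        raise ValueError("expected a /blob/ URL")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"


page_url = (
    "https://github.com/rpi-techfundamentals/"
    "introml_website_fall_2020/blob/master/files/webfile.csv"
)
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```

The resulting URL is the kind that `pd.read_csv` can load directly.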

MAKE SURE THAT THIS ENTIRE NOTEBOOK RUNS WITHOUT ERRORS. TO TEST THIS DO RUNTIME –> RESTART AND RUN ALL

It should run without errors.

Click File -> Download .ipynb to download the assignment. Then upload it to Assignment 1 in the LMS.

This work is licensed under the Creative Commons Attribution 4.0 International license agreement.


Machine Learning: Assignment 1

Rahmati Ba

Packages: An introduction to the diverse array of software packages employed in machine learning.

ReLU Activation: An elucidation of the Rectified Linear Unit (ReLU) activation function utilized in neural networks.

Softmax Function: An overview of the softmax function, particularly its significance in multiclass classification tasks.

Exercise 1: Instructions for implementing the softmax function.

Neural Networks: A comprehensive examination of neural networks encompassing problem statements, datasets, model representations, and TensorFlow implementations.

4.1 Problem Statement: Articulating the specific task or problem being tackled.

4.2 Dataset: Describing the dataset employed for training and evaluation purposes.

4.3 Model Representation: Clarifying how the neural network is structured and depicted.

4.4 TensorFlow Model Implementation: Elaborating on the process of implementing the neural network using TensorFlow.

4.5 Softmax Placement: A discourse on the optimal placement of the softmax function within the neural network architecture.

Exercise 2: Hands-on practice concerning softmax function placement.
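For orientation, the two functions named in this outline can be sketched in plain Python. This is a generic illustration rather than the assignment's own solution, and subtracting the maximum logit is a standard numerical-stability choice assumed here.

```python
import math


def relu(z):
    """Rectified Linear Unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)


def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs)
print(relu(-2.0), relu(3.5))
```

Softmax preserves the ordering of the scores, so the largest logit always receives the largest probability.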

Machine Learning: Assignment 2

Use a neural network to recognize the handwritten digits 0–9.

2 — ReLU Activation

4.1 Problem Statement

Clarifying the particular task or issue under consideration.

Exercise 2: Softmax placement

Machine Learning Assignment 3 — Decision Tree

Implement a decision tree from scratch and apply it to the task of classifying whether a mushroom is edible or poisonous.
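The central quantity in such a tree is the information gain of a split. Below is a minimal sketch for binary labels; encoding edible as 1 and poisonous as 0 is an assumption, and the function names are made up for illustration.

```python
import math


def entropy(labels):
    """Entropy in bits of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p == 0.0 or p == 1.0:
        return 0.0  # a pure node has no uncertainty
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = [1, 1, 0, 0]
perfect = information_gain(parent, [1, 1], [0, 0])  # children are pure
useless = information_gain(parent, [1, 0], [1, 0])  # children as mixed as parent
print(perfect, useless)
```

Choosing the best split then amounts to evaluating this gain for each candidate feature and keeping the largest.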

4.2 Split dataset

4.3 Calculate information gain

4.4 Get best split

5 — Building the tree

Assignment 4 — K-means Clustering

Implement the K-means algorithm and use it for image compression.
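As a rough sketch of the algorithm's two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points), here is a plain-Python version. The function names, the four sample points and the fixed starting centroids are assumptions for illustration.

```python
def assign_clusters(points, centroids):
    """For each point, return the index of the nearest centroid."""
    assignments = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        assignments.append(dists.index(min(dists)))
    return assignments


def update_centroids(points, assignments, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == j]
        if not members:
            new_centroids.append(old)  # leave an empty cluster where it was
            continue
        mean = tuple(
            sum(m[d] for m in members) / len(members) for d in range(len(old))
        )
        new_centroids.append(mean)
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Alternate the two steps for a fixed number of iterations."""
    for _ in range(iterations):
        assignments = assign_clusters(points, centroids)
        centroids = update_centroids(points, assignments, centroids)
    return centroids, assign_clusters(points, centroids)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids, labels)
```

For image compression, the points would be pixel colors and each pixel would be replaced by its assigned centroid color.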

1 — Implementing K-means

2 — K-means on a sample dataset

3 — Random initialization

4 — Image compression with K-means


Assignments


There will be one homework (HW) for each topical unit of the course, due about a week after we finish that unit.

These are intended to build your conceptual analysis skills plus your implementation skills in Python.

  • HW0 : Numerical Programming Fundamentals
  • HW1 : Regression, Cross-Validation, and Regularization
  • HW2 : Evaluating Binary Classifiers and Implementing Logistic Regression
  • HW3 : Neural Networks and Stochastic Gradient Descent
  • HW4 : Trees
  • HW5 : Kernel Methods and PCA

After completing each unit, there will be a 20 minute quiz (taken online via gradescope).

Each quiz will be designed to assess your conceptual understanding about each unit.

Probably 10 questions. Most questions will be true/false or multiple choice, with perhaps 1-3 short answer questions.

You can view the conceptual questions in each unit's in-class demos/labs and homework as good practice for the corresponding quiz.

There will be three larger "projects" throughout the semester:

  • Project A: Classifying Images with Feature Transformations
  • Project B: Classifying Sentiment from Text Reviews
  • Project C: Recommendation Systems for Movies

Projects are meant to be open-ended and encourage creativity. They are meant to be case studies of applications of the ML concepts from class to three "real world" use cases: image classification, text classification, and recommendations of movies to users.

Each project will be due approximately 4 weeks after being handed out. Start early! Do not wait until the last few days.

Projects will generally be centered around a particular methodology for solving a specific task and involve significant programming (with some combination of developing core methods from scratch or using existing libraries). You will need to consider some conceptual issues, write a program to solve the task, and evaluate your program through experiments to compare the performance of different algorithms and methods.

Your main deliverable will be a short report (2-4 pages), describing your approach and providing several figures/tables to explain your results to the reader.

You’ll be assessed on effort, the sophistication of your technical approach, the clarity of your explanations, the evidence that you present to support your evaluative claims, and the performance of your implementation. A high-performing approach with little explanation will receive little credit, while a careful set of experiments that illuminate why a particular direction turned out to be a dead end may receive close to full credit.


Spring 2024



Assignment #1

Released on Wednesday 01/24/2024

Late Policy

  • You have 8 free late days.
  • You can use late days for assignments. A late day extends the deadline 24 hours.
  • Once you have used all 8 late days, the penalty is 10% for each additional late day.
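The arithmetic of this policy can be sketched as below. The policy does not say whether the penalty is counted per assignment or cumulatively, so this sketch assumes a single running count of late days. The function name and the cap at 100% are also assumptions.

```python
def late_penalty(total_late_days_used, free_days=8, penalty_per_day=0.10):
    """Fraction of credit lost after the free late days are used up."""
    extra_days = max(0, total_late_days_used - free_days)
    # Cap at 1.0 so the penalty never exceeds full credit (assumed).
    return min(1.0, extra_days * penalty_per_day)


for days in (3, 8, 10):
    print(days, late_penalty(days))
```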

DeepLearning.AI

Deep Learning Specialization

Become a Machine Learning expert. Master the fundamentals of deep learning and break into AI. Recently updated with cutting-edge techniques!


Instructors: Andrew Ng +2 more


Financial aid available

853,422 already enrolled

Specialization - 5 course series

(133,305 reviews)

Recommended experience

Intermediate level

Intermediate Python skills: basic programming, understanding of for loops, if/else statements, data structures

A basic grasp of linear algebra & ML

What you'll learn

Build and train deep neural networks, identify key architecture parameters, implement vectorized neural networks, and apply deep learning to applications

Train and develop test sets, analyze bias/variance for DL applications, use standard techniques and optimization algorithms, and build neural networks in TensorFlow

Build a CNN and apply it to detection and recognition tasks, use neural style transfer to generate art, and apply algorithms to image and video data

Build and train RNNs, work with NLP and Word Embeddings, and use HuggingFace tokenizers and transformer models to perform NER and Question Answering

Skills you'll gain

  • Recurrent Neural Network
  • Convolutional Neural Network
  • Artificial Neural Network
  • Transformers

Details to know


Add to your LinkedIn profile

See how employees at top companies are mastering in-demand skills


Advance your subject-matter expertise

  • Learn in-demand skills from university and industry experts
  • Master a subject or tool with hands-on projects
  • Develop a deep understanding of key concepts
  • Earn a career certificate from DeepLearning.AI


Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV

Share it on social media and in your performance review


The Deep Learning Specialization is a foundational program that will help you understand the capabilities, challenges, and consequences of deep learning and prepare you to participate in the development of leading-edge AI technology.

In this Specialization, you will build and train neural network architectures such as Convolutional Neural Networks, Recurrent Neural Networks, LSTMs, Transformers, and learn how to make them better with strategies such as Dropout, BatchNorm, Xavier/He initialization, and more. Get ready to master theoretical concepts and their industry applications using Python and TensorFlow and tackle real-world cases such as speech recognition, music synthesis, chatbots, machine translation, natural language processing, and more.

AI is transforming many industries. The Deep Learning Specialization provides a pathway for you to take the definitive step in the world of AI by helping you gain the knowledge and skills to level up your career. Along the way, you will also get career advice from deep learning experts from industry and academia.

Applied Learning Project

By the end you’ll be able to:

• Build and train deep neural networks, implement vectorized neural networks, identify architecture parameters, and apply DL to your applications

• Use best practices to train and develop test sets and analyze bias/variance for building DL applications, use standard NN techniques, apply optimization algorithms, and implement a neural network in TensorFlow

• Use strategies for reducing errors in ML systems, understand complex ML settings, and apply end-to-end, transfer, and multi-task learning

• Build a Convolutional Neural Network, apply it to visual detection and recognition tasks, use neural style transfer to generate art, and apply these algorithms to image, video, and other 2D/3D data

• Build and train Recurrent Neural Networks and their variants (GRUs, LSTMs), apply RNNs to character-level language modeling, work with NLP and Word Embeddings, and use HuggingFace tokenizers and transformers to perform Named Entity Recognition and Question Answering

Neural Networks and Deep Learning

In the first course of the Deep Learning Specialization, you will study the foundational concept of neural networks and deep learning.

By the end, you will be familiar with the significant technological trends driving the rise of deep learning; build, train, and apply fully connected deep neural networks; implement efficient (vectorized) neural networks; identify key parameters in a neural network’s architecture; and apply deep learning to your own applications.

Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

In the second course of the Deep Learning Specialization, you will open the deep learning black box to understand the processes that drive performance and generate good results systematically.

By the end, you will learn the best practices to train and develop test sets and analyze bias/variance for building deep learning applications; be able to use standard neural network techniques such as initialization, L2 and dropout regularization, hyperparameter tuning, batch normalization, and gradient checking; implement and apply a variety of optimization algorithms, such as mini-batch gradient descent, Momentum, RMSprop and Adam, and check for their convergence; and implement a neural network in TensorFlow.

Structuring Machine Learning Projects

In the third course of the Deep Learning Specialization, you will learn how to build a successful machine learning project and get to practice decision-making as a machine learning project leader.

By the end, you will be able to diagnose errors in a machine learning system; prioritize strategies for reducing errors; understand complex ML settings, such as mismatched training/test sets, and comparing to and/or surpassing human-level performance; and apply end-to-end learning, transfer learning, and multi-task learning. This is also a standalone course for learners who have basic machine learning knowledge. This course draws on Andrew Ng’s experience building and shipping many deep learning products. If you aspire to become a technical leader who can set the direction for an AI team, this course provides the "industry experience" that you might otherwise get only after years of ML work experience.

Convolutional Neural Networks

In the fourth course of the Deep Learning Specialization, you will understand how computer vision has evolved and become familiar with its exciting applications such as autonomous driving, face recognition, reading radiology images, and more.

By the end, you will be able to build a convolutional neural network, including recent variations such as residual networks; apply convolutional networks to visual detection and recognition tasks; and use neural style transfer to generate art and apply these algorithms to a variety of image, video, and other 2D or 3D data.

Sequence Models

In the fifth course of the Deep Learning Specialization, you will become familiar with sequence models and their exciting applications such as speech recognition, music synthesis, chatbots, machine translation, natural language processing (NLP), and more.

By the end, you will be able to build and train Recurrent Neural Networks (RNNs) and commonly-used variants such as GRUs and LSTMs; apply RNNs to Character-level Language Modeling; gain experience with natural language processing and Word Embeddings; and use HuggingFace tokenizers and transformer models to solve different NLP tasks such as NER and Question Answering.


DeepLearning.AI is an education technology company that develops a global community of AI talent. DeepLearning.AI's expert-led educational experiences provide AI practitioners and non-technical professionals with the necessary tools to go all the way from foundational basics to advanced application, empowering them to build an AI-powered future.

Get a head start on your degree

When you complete this Specialization, you can earn college credit if you are admitted and enroll in one of the following online degree programs.¹

Ball State University

Master of Science in Computer Science

Degree · 24 months

Illinois Tech

Bachelor of Information Technology

University of North Texas

Bachelor of Applied Arts and Sciences

Degree · 15+ hours of study/wk per course

Master of Data Science

Degree · 12-15 months

Master of Science in Data Science

University of Massachusetts Global

Bachelor of Arts in Psychology

International Institute of Information Technology, Hyderabad

Master of Science in Information Technology

Degree · 2-4 years

¹Each university determines the number of pre-approved prior learning credits that may count towards the degree requirements according to institutional policies.


Degree credit eligible

This Specialization has ACE® recommendation. It is eligible for college credit at participating U.S. colleges and universities. Note: The decision to accept specific credit recommendations is up to each institution. Learn more


Frequently asked questions

What is deep learning? Why is it relevant?

Deep Learning is a subset of machine learning where artificial neural networks, algorithms based on the structure and functioning of the human brain, learn from large amounts of data to create patterns for decision-making. Neural networks with various (deep) layers enable learning through performing tasks repeatedly and tweaking them a little to improve the outcome. 

Over the last few years, the availability of computing power and the amount of data being generated have led to an increase in deep learning capabilities. Today, deep learning engineers are highly sought after, and deep learning has become one of the most in-demand technical skills as it provides you with the toolbox to build robust AI systems that just weren’t possible a few years ago. Mastering deep learning opens up numerous career opportunities.

What is the Deep Learning Specialization about?

The Deep Learning Specialization is a foundational program that will help you understand the capabilities, challenges, and consequences of deep learning and prepare you to participate in the development of leading-edge AI technology. In this Specialization, you will build and train neural network architectures such as Convolutional Neural Networks, Recurrent Neural Networks, LSTMs, Transformers, and learn how to make them better with strategies such as Dropout, BatchNorm, Xavier/He initialization, and more. Get ready to master theoretical concepts and their industry applications using Python and TensorFlow and tackle real-world cases such as speech recognition, music synthesis, chatbots, machine translation, natural language processing, and more. AI is transforming many industries. The Deep Learning Specialization provides a pathway for you to take the definitive step in the world of AI by helping you gain the knowledge and skills to level up your career. Along the way, you will also get career advice from deep learning experts from industry and academia.

What will I be able to do after completing the Deep Learning Specialization?

By the end of the Deep Learning Specialization, you will be able to:

1. Build and train deep neural networks, implement vectorized neural networks, identify architecture parameters, and apply DL to your applications.
2. Use best practices to train and develop test sets and analyze bias/variance for building DL applications, use standard NN techniques, apply optimization algorithms, and implement a neural network in TensorFlow.
3. Use strategies for reducing errors in ML systems, understand complex ML settings, and apply end-to-end, transfer, and multi-task learning.
4. Build a Convolutional Neural Network, apply it to visual detection and recognition tasks, use neural style transfer to generate art, and apply these algorithms to image, video, and other 2D/3D data.
5. Build and train Recurrent Neural Networks and their variants (GRUs, LSTMs), apply RNNs to character-level language modeling, work with NLP and Word Embeddings, and use HuggingFace tokenizers and transformers to perform Named Entity Recognition and Question Answering.

What background knowledge is necessary for the Deep Learning Specialization?

Learners should have intermediate Python experience (e.g., basic programming skills, understanding of for loops, if/else statements, data structures such as lists and dictionaries).

Recommended: 

Learners should have a basic knowledge of linear algebra (matrix-vector operations and notation).

Learners should have an understanding of machine learning concepts (how to represent data, what an ML model does, etc.)

Who is the Deep Learning Specialization for?

The Deep Learning Specialization is for early-career software engineers or technical professionals looking to master fundamental concepts and gain practical machine learning and deep learning skills.

How long does it take to complete the Deep Learning Specialization?

The Deep Learning Specialization consists of five courses. At the rate of 5 hours a week, it typically takes 5 weeks to complete each course except course 3, which takes about 4 weeks.

Who is the Deep Learning Specialization by?

The Deep Learning Specialization has been created by Andrew Ng, Kian Katanforoosh, and Younes Bensouda Mourri. 

Andrew Ng is Founder of DeepLearning.AI, General Partner at AI Fund, Chairman and Co-Founder of Coursera, and an Adjunct Professor at Stanford University. As a pioneer in machine learning and online education, Dr. Ng has changed countless lives through his work in AI, authoring or co-authoring over 100 research papers in machine learning, robotics, and related fields. Previously, he was chief scientist at Baidu, the founding lead of the Google Brain team, and the co-founder of Coursera – the world's largest MOOC platform.

Kian Katanforoosh is the co-founder and CEO of Workera and a lecturer in the Computer Science department at Stanford University. Workera allows data scientists, machine learning engineers, and software engineers to assess their skills against industry standards and receive a personalized learning path. Kian is also the recipient of Stanford’s Walter J. Gores award (Stanford’s highest teaching award) and the Centennial Award for Excellence in teaching.

Younes Bensouda Mourri completed his Bachelor's in Applied Mathematics and Computer Science and Master's in Statistics from Stanford University. Younes helped create 3 AI courses at Stanford - Applied Machine Learning, Deep Learning, and Teaching AI - and taught two of them for a few years.

Is this a standalone course or a Specialization?

The Deep Learning Specialization is made up of 5 courses.

Do I need to take the courses in a specific order?

We recommend taking the courses in the prescribed order for a logical and thorough learning experience. Course 3 can also be taken as a standalone course.

Can I apply for financial aid?

Yes, Coursera provides financial aid to learners who cannot afford the fee.

Can I audit the Deep Learning Specialization?

You can audit the courses in the Deep Learning Specialization for free. 

Note that you will not receive a certificate at the end of the course if you choose to audit it for free instead of purchasing it.

How do I get a receipt to get this reimbursed by my employer?

Go to your Coursera account. 

Click on My Purchases and find the relevant course or Specialization.

Click Email Receipt and wait up to 24 hours to receive the receipt. 

You can read more about it here.

I want to purchase this Specialization for my employees! How can I do that?

Visit coursera.org/business for more information, to pick a plan, and to contact Coursera. For each plan, you decide the number of courses every member can enroll in and the collection of courses they can choose from.

The Deep Learning Specialization was updated in April 2021. What is different in the new version?

All existing assignments and autograders have been refactored and updated to TensorFlow 2 across Courses 1, 2, 4, and 5.

Three new network architectures are presented with new lectures and programming assignments:

Course 4 includes MobileNet (transfer learning) and U-Net (semantic segmentation).

Course 5, once updated, will include Transformers (Network Architecture, Named Entity Recognition, Question Answering).

For a detailed list of changes, please check out the DLS Changelog.

I’m currently enrolled in one or more courses in the Deep Learning Specialization. What does this mean for me?

• Your certificates will carry over for any courses you’ve already completed.

• If your subscription is currently active, you can access the updated labs and submit assignments without paying for the month again.

• If you go to the Specialization, you will see the original version of the lecture videos and assignments. You can complete the original version if so desired (this is not recommended).

• If you would like to update to the new material, reset your deadlines Opens in a new tab . If you’re in the middle of a course, you will lose your notebook work when you reset your deadlines . Please save your work by downloading your existing notebooks before switching to the new version.

• If you do not see the option to reset deadlines, contact Coursera via the Learner Help Center.

I’ve already completed one or more courses in the Deep Learning Specialization but don’t have an active subscription. What does this mean for me?

• If your subscription is currently inactive, you will need to pay again to access the labs and submit assignments for those courses.

Can I get college credit for taking the Deep Learning Specialization?

Those planning to attend a degree program can utilize ACE®️ recommendations, the industry standard for translating workplace learning to college credit. Learners can earn a recommendation of 10 college credits for completing the Deep Learning Specialization. This aims to help open up additional pathways to learners who are interested in higher education, and prepare them for entry-level jobs.

To share proof of completion with schools, certificate graduates will receive an email prompting them to claim their Credly badge, which contains the ACE®️ credit recommendation. Once claimed, they will receive a competency-based transcript that signifies the credit recommendation, which can be shared directly with a school from the Credly platform. Please note that the decision to accept specific credit recommendations is up to each institution and is not guaranteed.

How do I pursue the ACE credit recommendation?

To share proof of completion with schools, certificate graduates will receive an email prompting them to claim their Credly badge, which contains the ACE®️ credit recommendation. Once claimed, they will receive a competency-based transcript that signifies the credit recommendation, which can be shared directly with a school from the Credly platform. Please note that the decision to accept specific credit recommendations is up to each institution and is not guaranteed.

How do I know which colleges and universities grant credit for the Deep Learning Specialization?

The Deep Learning Specialization is eligible for college credit at participating colleges and universities nationwide. The decision to accept specific credit recommendations is up to each institution and is not guaranteed. Read more about ACE Credit College & University Partnerships here.

Is this course really 100% online? Do I need to attend any classes in person?

This course is completely online, so there’s no need to show up to a classroom in person. You can access your lectures, readings and assignments anytime and anywhere via the web or your mobile device.

What is the refund policy?

If you subscribed, you get a 7-day free trial during which you can cancel at no penalty. After that, we don’t give refunds, but you can cancel your subscription at any time. See our full refund policy.

Can I just enroll in a single course?

Yes! To get started, click the course card that interests you and enroll. You can enroll and complete the course to earn a shareable certificate, or you can audit it to view the course materials for free. When you subscribe to a course that is part of a Specialization, you’re automatically subscribed to the full Specialization. Visit your learner dashboard to track your progress.

Is financial aid available?

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.

Can I take the course for free?

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. If you only want to read and view the course content, you can audit the course for free. If you cannot afford the fee, you can apply for financial aid.

Will I earn university credit for completing the Specialization?

This Specialization doesn't carry university credit, but some universities may choose to accept Specialization Certificates for credit. Check with your institution to learn more.

Course Info

  • Prof. Philippe Rigollet

Departments

  • Mathematics

As Taught In

  • Algorithms and Data Structures
  • Artificial Intelligence
  • Data Mining
  • Applied Mathematics
  • Discrete Mathematics
  • Probability and Statistics

Mathematics of Machine Learning Assignment 1

This resource contains information regarding Mathematics of machine learning assignment 1.

IMAGES

  1. Assignment 1 CS 7641 Machine Learning

  2. Assignment 1

  3. Assignment 1 Machine Learning

  4. Assignment 1 Introduction to Machine Learning.pdf

  5. Machine Learning Assignment 1.pdf

  6. Introduction to Machine Learning

VIDEO

  1. NPTEL Introduction to Machine Learning

  2. Assignment 7

  3. Machine Learning For Soil And Crop Management Week 10 Quiz Assignment Solution

  4. NPTEL Introduction to Machine Learning

  5. Machine Learning

  6. Machine Learning

COMMENTS

  1. applied-machine-learning-in-python/Assignment+1.ipynb at master

    Solutions to the 'Applied Machine Learning In Python' Coursera course exercises - amirkeren/applied-machine-learning-in-python

  2. Lab 1: Machine Learning with Python

    scikit-learn #. One of the most prominent Python libraries for machine learning: Contains many state-of-the-art machine learning algorithms. Builds on numpy (fast), implements advanced techniques. Wide range of evaluation measures and techniques. Offers comprehensive documentation about each algorithm.

  3. Assignment 1

    Assignment 1 - Introduction to Machine Learning. For this assignment, you will be using the Breast Cancer Wisconsin (Diagnostic) Database to create a classifier that can help diagnose patients. First, read through the description of the dataset (below). :Number of Instances: 569.

  4. Applied Machine Learning| Assignment 1 solution |week 1| ML ...

    Coursera Applied Machine Learning, Week 1. Key Concepts: Understand basic machine learning concepts and workflow. Distinguish between different types of machine lear...

  5. Assignment 1: Classifier By Hand

    Assignment 1: Classifier By Hand. In this assignment, you learn step by step how to code a binary classifier by hand! Don't worry, it will be guided through. Introduction: calorimeter showers. A calorimeter in the context of experimental particle physics is a sub-detector aiming at measuring the energy of incoming particles.

  6. PDF 10-701 Machine Learning: Assignment 1

    10-701 Machine Learning: Assignment 1 Due on February 20, 2014 at 12 noon Barnabas Poczos, Aarti Singh Instructions: Failure to follow these directions may result in loss of points. Your solutions for this assignment need to be in a pdf format and should be submitted to the blackboard and a webpage (to be specified later) for peer-reviewing.

  7. PDF CSE 446: Machine Learning Assignment 1

    CSE 446: Machine Learning Assignment 1 Due: February 3rd, 2020 9:30am Instructions Read all instructions in this section thoroughly. Collaboration: Make certain that you understand the course collaboration policy, described on the course website. You must complete this assignment individually; you are not allowed to collaborate with anyone else.

  8. Introduction to Machine Learning Course by Duke University

    Simple Introduction to Machine Learning. Module 1 • 7 hours to complete. The focus of this module is to introduce the concepts of machine learning with as little mathematics as possible. We will introduce basic concepts in machine learning, including logistic regression, a simple but widely employed machine learning (ML) method.

  9. Machine Learning Fundamentals Handbook

    In it, we'll cover the key Machine Learning algorithms you'll need to know as a Data Scientist, Machine Learning Engineer, Machine Learning Researcher, ... Step 1: Initial Weight Assignment - assign equal weight to all observations in the sample where this weight represents the importance of the observations being correctly classified: ...

  10. Assignment 1

    This can be really useful. Let's use it to find the formula for the factorial of 15. Assign the results to the variable m. #we have to first import the math function to use tab completion. import math. #Assign the result to the variable m. Press tab after the period to show available functions m = math. m.

  11. Machine Learning: Assignment 1

    Machine Learning: Assignment 1. Packages: An introduction to the diverse array of software packages employed in machine learning. ReLU Activation: An elucidation of the Rectified Linear Unit (ReLU ...

  12. coursera-applied-machine-learning-with-python/Assignment+1.py ...

  13. Train Your First ML Model

    💻 For real-time updates on events, connections & resources, join our community on WhatsApp: https://jvn.io/wTBMmV0 Assignment 1 of the Machine Learning with ...

  14. Assignments

    Used with permission.) Assignment 2 (PDF) Assignment 2 Solution (PDF) (Courtesy of William Perry. Used with permission.) Assignment 3 (PDF) Assignment 3 Solution (PDF) (Courtesy of William Perry. Used with permission.) This section provides three assignments for the course along with solutions.

  15. What Is Machine Learning? Definition, Types, and Examples

    Machine learning definition. Machine learning is a subfield of artificial intelligence (AI) that uses algorithms trained on data sets to create self-learning models that are capable of predicting outcomes and classifying information without human intervention. Machine learning is used today for a wide range of commercial purposes, including ...

  16. Assignments

    After completing each unit, there will be a 20 minute quiz (taken online via gradescope). Each quiz will be designed to assess your conceptual understanding about each unit. Probably 10 questions. Most questions will be true/false or multiple choice, with perhaps 1-3 short answer questions. You can view the conceptual questions in each unit's ...

  17. Assignment 1 CS 7641 Machine Learning

    Assignment 1: CS7641 - Machine Learning Saad Khan September 18, 2015 1 Introduction. I intend to apply supervised learning algorithms to classify the quality of wine samples as being of high or low quality and to segregate type 2 diabetic patients from the ones with no symptoms. The algorithms I will be implementing for this analysis are ...

  18. Assignment #1

    Machine learning with kernel methods Spring 2024: MSc Mathematics, Vision, Machine Learning (MVA) MSc Mathematics, Machine Learning, and the Humanities (MASH) Main Navigation. Home Schedule Lectures Assignments Assignment #1. Released on Wednesday 01/24/2024. Due Date: Feb 07. MVA/MASH ENS Paris Saclay/Dauphine, PSL University ...

  19. Applied-Machine-Learning-in-Python--University-of-Michigan ...

    Course materials for the Coursera MOOC: Applied Machine Learning in Python from University of Michigan - afghaniiit/Applied-Machine-Learning-in-Python--University-of-Michigan---Coursera

  20. Deep Learning Specialization [5 courses] (DeepLearning.AI)

    Deep Learning is a subset of machine learning where artificial neural networks, algorithms based on the structure and functioning of the human brain, learn from large amounts of data to create patterns for decision-making. ... All existing assignments and autograders have been refactored and updated to TensorFlow 2 across Courses 1, 2, 4, and 5.

  21. Assignment

    Assignment 1 Introduction to Machine Learning Prof. B. Ravindran. Which of the following is a supervised learning problem? (a) Grouping related documents from an unannotated corpus. (b) Predicting credit approval based on historical data. (c) Predicting if a new image has cat or dog based on the historical data of other images of cats and dogs ...

  22. Mathematics of Machine Learning Assignment 1

    This resource contains information regarding Mathematics of machine learning assignment 1. Resource Type: Assignments. pdf. 129 kB Mathematics of Machine Learning Assignment 1 Download File DOWNLOAD. Course Info Instructor Prof. Philippe Rigollet; Departments ...

  23. Unveiling Machine Learning in Self-Driving Cars: Efficiency

    Computer-science document from University of the People, 1 page, Greetings dear Masimbakutendaishe Ngara, Excellent Work on Highlighting Machine Learning's Role in Self-Driving Cars! Your discussion assignment effectively tackles the question of how computers can learn and adapt in the context of self-driving cars. You