# [Coursera] Machine Learning

**Introduction**

**Machine learning**

- Grew out of work in AI
- New capability for computers

Examples:

- Database mining
  - Large datasets from growth of automation/web.
  - E.g. web click data, medical records, biology, engineering.
- Applications we can't program by hand
  - E.g. autonomous helicopter, handwriting recognition, most of Natural Language Processing (NLP), computer vision.
- Self-customizing programs
  - E.g. Amazon, Netflix product recommendations.
- Understanding human learning (brain, real AI).

**Machine Learning Definition**

- Arthur Samuel (1959). Machine learning: field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell (1998). Well-posed learning problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

E.g. suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam.

- T: classifying emails as spam or not spam.
- E: watching you label emails as spam or not spam.
- P: the number (or fraction) of emails correctly classified as spam/not spam.
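As a rough illustration of the performance measure P in the spam example, here is a minimal Python sketch that computes the fraction of emails classified correctly. The function name `fraction_correct` and the sample labels are hypothetical and only serve the illustration.

```python
def fraction_correct(predicted, actual):
    """Performance measure P: fraction of emails whose predicted label matches the user's label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one labeled email")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Hypothetical labels: True means "spam", False means "not spam".
user_labels = [True, False, False, True, False]      # experience E
filter_guesses = [True, False, True, True, False]    # output of task T

print(fraction_correct(filter_guesses, user_labels))  # 4 of 5 labels match
```

If the program learns, this fraction should go up as it sees more labeled emails.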

**Machine learning algorithms:**

- Supervised learning
- Unsupervised learning

Others: reinforcement learning, recommender systems. The course also covers practical advice for applying learning algorithms.

**Supervised Learning**

- Supervised learning: "right answers" given
- Regression: predict continuous valued output
- Classification: discrete valued output (0 or 1)

In **supervised learning**, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into **regression** and **classification** problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

**Unsupervised Learning**

**Unsupervised learning** allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.

We can derive this structure by **clustering** the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.
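To make the idea of clustering concrete, here is a minimal sketch of one possible clustering technique, a basic k-means loop on one-dimensional data. The notes do not name a specific algorithm, so the choice of k-means, the function name `kmeans_1d`, the starting centers, and the sample points are all assumptions.

```python
def kmeans_1d(points, centers, iterations=10):
    """Group 1-D points around the given starting centers (a basic k-means loop)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

Note that no "right answers" are supplied anywhere: the grouping comes only from the distances between the points.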

**Model and Cost Function**

**Model Representation**

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

[Figure: the supervised learning process]

When the target variable that we’re trying to predict is *continuous*, such as in our housing example, we call the learning problem a *regression* problem. When y can take on only a small number of *discrete* values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a *classification* problem.
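The difference between the two problem types can be sketched with two tiny hypotheses on the housing example. The parameter values, the threshold, and the function names below are made up for illustration. In practice they would be learned from a training set.

```python
def predict_price(living_area, theta0=50.0, theta1=0.5):
    """Regression: map the input to a continuous value (hypothetical linear rule)."""
    return theta0 + theta1 * living_area


def predict_dwelling(living_area, threshold=100.0):
    """Classification: map the input to one of a small number of discrete categories."""
    return "house" if living_area >= threshold else "apartment"


print(predict_price(120.0))     # a real number
print(predict_dwelling(120.0))  # one of two labels
print(predict_dwelling(60.0))
```

Both functions play the role of h, and only the type of the output differs.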

**Cost Function**

We can measure the accuracy of our hypothesis function by using a cost function. It takes an average (actually a fancier version of an average) of the squared differences between the hypothesis's predictions on the inputs x and the actual outputs y.

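$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2$$

Here m is the number of training examples and (x^{(i)}, y^{(i)}) is the i-th example.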

This function is otherwise called the "squared error function" or "mean squared error". The mean is halved as a convenience for computing gradient descent, since the derivative of the square term cancels out the 1/2 factor.
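The halved mean squared error can be sketched in a few lines of Python. This sketch assumes a linear hypothesis h(x) = theta0 + theta1 * x. The function names and the toy data are hypothetical.

```python
def hypothesis(x, theta0, theta1):
    """Assumed linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Halved mean of the squared differences between predictions and targets."""
    m = len(xs)
    total = sum((hypothesis(x, theta0, theta1) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Toy training set lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

print(cost(0.0, 1.0, xs, ys))  # perfect fit, so the cost is zero
print(cost(0.0, 0.0, xs, ys))  # worse fit, so the cost is larger
```

The better the parameters fit the data, the smaller the value returned by `cost`.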

**Gradient Descent**

We have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis.

Imagine that we graph the cost function as a function of the hypothesis's parameters theta0 and theta1. We are not graphing x and y themselves, but the range of parameter values of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put theta0 on the x axis and theta1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.

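[Figure: surface plot of the cost function, with theta0 on the x axis, theta1 on the y axis, and the cost on the z axis]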

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, that is, when its value is the minimum.

The way we do this is by taking the derivative (the slope of the tangent line to the function) of our cost function. The slope of the tangent is the derivative at that point, and it gives us a direction to move in. We take steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter alpha, which is called the learning rate.
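The procedure can be sketched as follows. This is a minimal sketch, assuming a linear hypothesis h(x) = theta0 + theta1 * x, the halved mean squared error as the cost, starting values of zero, and a fixed number of steps. The function name, the default values, and the toy data are hypothetical.

```python
def gradient_descent(xs, ys, alpha=0.1, steps=5000):
    """Repeatedly step theta0 and theta1 against the slope of the cost function."""
    theta0, theta1 = 0.0, 0.0  # assumed starting point
    m = len(xs)
    for _ in range(steps):
        # Prediction errors of the current hypothesis on every training example.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the halved mean squared error;
        # the 1/2 has cancelled against the 2 from the square.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Both gradients were computed before either parameter changes,
        # so the two parameters are updated simultaneously.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # should end up close to 1 and 2
```

A larger alpha takes bigger steps and may overshoot the minimum. A smaller alpha is safer but needs more steps to reach the bottom.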