Sunday, 22 October 2017


Machine Learning (ML) and Artificial Intelligence     (AI): ML Algorithms: Part- Eleven

by
Dr. RGS Asthana
Senior Member IEEE

Figure 1: Big Data, cloud technology fuelling development of ML Algorithms [36]
Summary
ML helps us create models that can accurately answer what-if questions about things of interest, based on available data.
In this paper we discuss ML algorithms such as linear regression, logistic regression, clustering techniques (k-means and hierarchical), decision trees, neural networks, the Naïve Bayes classifier, support vector machines, the backpropagation algorithm and deep learning methods.
Prerequisite
Read articles [1] to [16].
Keywords
Prelude
Machine Learning (ML) is the same as Machine Intelligence (MI). Humans have a great capability of learning from experience. If we can somehow inculcate this capability of learning from experience in machines/computers, then we have intelligence in machines. An ML algorithm does not instruct the computer in detail how to identify a cat in a photo; instead, the computer learns to do this on its own by using a suitable ML algorithm. The experience in machines is, in fact, inculcated through training data.
Machines can learn in four ways: supervised (S), semi-supervised (SS), unsupervised (U) and reinforcement (R) learning. In S learning you need data with the ground truth, i.e., the desired outcome is known for every input; for example, images are categorized into cats and dogs, and during training the algorithm finds features which help to distinguish the data. We then show the model unknown data and it has to make a prediction. In SS learning, we know the outcome for only a tiny subset of the whole data; this reflects a very common practical situation. In U learning, data points have no labels associated with them, and the goal of a U learning algorithm is to organize the data in some way or to identify and describe its underlying structure. This can mean grouping it into clusters or finding different ways of looking at complex data so that it appears simpler or more organized. R learning is quite close to SS learning. A reinforcement (R) learning algorithm [29] chooses an action in response to each data point and receives a reward signal a short time later, indicating the quality of the decision, i.e., how good the decision was on a correct/incorrect scale. Based on this, the algorithm adopts the strategy which yields the highest reward. R learning is the problem of getting an agent to act in the world so as to maximize its rewards, as shown in figure 2. Consider teaching a cat a new trick: you cannot tell the cat what to do, but you can reward or punish it if it does the right or wrong thing. It has to figure out what it did that earned the reward or punishment, which is known as the credit assignment problem. We can use a similar method to train computers to do many tasks, such as playing backgammon or chess, scheduling jobs, and controlling robot limbs. R learning is also a natural fit for Internet of Things [16] applications. In brief, an agent (or even a human being) executes an action based on an observation and learns to repeat only those actions that bring a reward and not a penalty. S learning implemented using a neural net can be thought of as a problem leading to memorization, whereas R learning is a brute-force propagation of outcomes into knowledge about states and actions, i.e., reasoning.
Figure 2: Reinforcement learning, a powerful paradigm of AI
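To make the reward-driven loop concrete, here is a minimal Q-learning sketch; the five-state "corridor" environment, its reward of +1 at the goal state, and all hyper-parameter values are illustrative assumptions, not taken from this article.

```python
# A minimal Q-learning sketch of the reinforcement-learning loop described above.
# The 5-state "corridor" environment and its rewards are illustrative assumptions.
import random

n_states, n_actions = 5, 2          # states 0..4, actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration

def step(state, action):
    """Move left/right; reaching state 4 gives a reward of +1 and ends the episode."""
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == n_states - 1 else 0.0
    return nxt, reward, nxt == n_states - 1

for episode in range(1000):
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        action = random.randrange(n_actions) if random.random() < epsilon \
                 else max(range(n_actions), key=lambda a: Q[state][a])
        nxt, reward, done = step(state, action)
        # credit assignment: propagate the (possibly delayed) reward back into Q
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt

print(Q)   # after training, "right" typically has the higher value in the non-terminal states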
ML, AI, Mobile Technology, Big Data, 3D Printing and Robotics are playing a significant role (see figure 1). What really makes healthcare different from other disciplines? Healthcare may often have very little labeled data (e.g., clinical NLP). This may prompt the use of semi-supervised learning algorithms, i.e., keeping a human in the loop (HITL). Sometimes we have only a small number of samples (e.g., for a rare disease) and we need to learn as much as possible from other data (e.g., EHR data of healthy patients). We may have lots of missing data, possibly at varying time intervals, and may only get censored labels. Another, more important, problem we need to solve is that ML-based algorithms do not give the reason for arriving at a particular decision. It is therefore pertinent to model the problem keeping these aspects in view, and this may be the reason for HITL in the solution.
ML-based solutions are good at prediction, and diagnosis too is a prediction in a way. We therefore describe ML-based diagnosis and treatment systems. The only thing necessary for such systems to give better predictions is training on substantial data. The areas where ML/AI-based systems have an impact in healthcare are: on-line consultations, health assistance and medication management, personal genetics, development of the drugs of the future, discovering new diseases, persistent care, discovering new clinical pathways and, last but not least, robotics in healthcare.
ML algorithms:
Linear regression – The idea is to fit a line to the data points. You can begin by moving the line in an arbitrary direction and minimizing an error function computed by adding up the distance from each point to the line. In the least squares approach, since we really do not like negative numbers, we use the square of the distance: the error function is the sum of squared distances from each point to the line, and we keep adjusting the line by trial and error until this error is minimized. Gradient descent can also be used when the aim is to draw a line or curve to separate or split data: we give small penalties to points which are classified correctly and big penalties to points which are classified wrongly, i.e., we make the error function continuous, and we then define a probability which varies from 0 to 1. For this we define an activation function that maps 0 to 0.5, huge positive numbers close to +1 and huge negative numbers close to 0, which is nothing but the sigmoid function f(x) = 1/(1 + e^(-x)). We then move the line until we reach a point where the maximum number of points are correctly classified, i.e., we minimize the error function. To turn the product of probabilities into an addition, we take the negative log, and this gives us our new error computation. In fact, predicting a continuous value is generally referred to as a regression problem; an example could be autonomous driving.
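As a concrete illustration of minimizing a squared-error function by gradient descent, here is a small sketch; the toy data points and the learning rate are assumptions chosen only for demonstration.

```python
# Fitting y = w*x + b by gradient descent on the mean squared error.
# The toy data points are an illustrative assumption.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])        # roughly y = 2x

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_hat = w * x + b
    error = y_hat - y
    # gradients of the squared-error cost with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # w ends up close to 2, the slope of the toy data
```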
Logistic Regression - It is the go-to method for binary classification, i.e., problems with two class values. We describe the logistic regression algorithm for ML below.
The coefficients (Beta values b) of the logistic regression algorithm must be estimated from your training data. This is done using maximum-likelihood estimation.
The best coefficients would result in a model that predicts a value very close to 1 (e.g. male) for the default class and a value very close to 0 (e.g. female) for the other class. The intuition behind maximum likelihood for logistic regression is that a search procedure seeks values for the coefficients (Beta values) that minimize the error between the probabilities predicted by the model and those in the data (e.g. a probability of 1 if the data point belongs to the primary class).
It is enough to say that a minimization algorithm is used to find the best values of the coefficients for your training data. When you are learning logistic regression, you can implement it yourself from scratch using the much simpler gradient descent algorithm.
Making predictions with a logistic regression model is as simple as plugging in numbers into the logistic regression equation and calculating a result.
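A minimal sketch of the above, assuming a tiny one-feature dataset: the sigmoid of a weighted sum gives the predicted probability, and gradient descent (a simple stand-in for full maximum-likelihood optimization, as suggested above) estimates the coefficients.

```python
# Logistic regression: sigmoid of a weighted sum, coefficients fitted by gradient descent.
# The toy one-feature dataset and learning rate are illustrative assumptions.
import numpy as np

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])   # one input feature
y = np.array([0, 0, 0, 1, 1, 1])                           # two classes

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    p = sigmoid(X @ w + b)            # predicted probability of class 1
    grad_w = X.T @ (p - y) / len(y)   # gradient of the negative log-likelihood
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

# prediction: plug numbers into the logistic equation and read off the probability
p_new = sigmoid(np.array([[0.5], [4.0]]) @ w + b)
print(p_new)   # low probability of class 1 for x = 0.5, high for x = 4.0
```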
Data preparation: It is a very important step before trying any type of classification.
·   Binary Output Variable:   It predicts the probability of an instance belonging to the default class, which can be mapped into a 0 or 1 classification.
·   Remove Noise: Logistic regression assumes no error in the output variable (y); consider removing outliers and possibly misclassified instances from your training data.
·   Gaussian distribution: Logistic regression is a linear algorithm (with a non-linear transform on output); it assumes a linear relationship between the inputs and the output, so input transforms that better expose this relationship can give a more accurate model.
·   Remove Correlated Inputs: Like linear regression, the model can overfit if you have multiple highly correlated inputs. Consider calculating the pairwise correlations between all inputs and removing highly correlated inputs.
·   Fail to Converge: It is possible for the maximum-likelihood estimation process that learns the coefficients to fail to converge.
Clustering [33] & [34]: Clustering is a very important family of unsupervised machine learning algorithms and a proven way to group a population of data points such that points with similar characteristics are put in the same group, or cluster. There are mainly two types of clustering: in hard clustering, each data point either belongs to a cluster or it does not; in soft clustering, a probability or likelihood of the data point belonging to a particular cluster is assigned instead. Its applications include areas such as:
  • Recommendation engines e.g. to suggest movies one may like
  • Market segmentation
  • Social network analysis
  • Search result grouping
  • Medical imaging
  • Image segmentation
  • Behavioral segmentation:
1.    Segment by purchase history
2.    Segment by activities on application, website, or platform
3.    Define personas based on interests
4.    Create profiles based on activity monitoring
  • Inventory categorization:
1.    Group inventory by sales activity
2.    Group inventory by manufacturing metrics
  • Sorting sensor measurements:
1.    Detect activity types in motion sensors
2.    Group images
3.    Separate audio
  • Detecting bots or anomalies:
1.    Separate valid activity groups from bots
2.    Group valid activity to clean up outlier detection
K-means clustering
K-means clustering is a type of unsupervised learning and is generally used when the resulting categories or groups in the data are unknown.
K-means is an iterative clustering algorithm. It attempts to reach a locally optimal assignment in every iteration; see the steps given below (a short code sketch follows them):
Steps
1: Choose the desired number of clusters K;
2: Allocate randomly each data point to a cluster;
3: Compute cluster centroids; 
4: Re-assign each point to the closest cluster centroid;
5: Re-compute the cluster centroids; and
6: Repeat steps 4 and 5 until no improvements are seen.  
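A compact NumPy sketch of these steps; the randomly generated two-cluster data and the choice K = 2 are illustrative assumptions.

```python
# A compact NumPy sketch of the k-means steps listed above.
# The random 2-D data and K = 2 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
K = 2                                                    # step 1: choose K

labels = rng.integers(0, K, len(X))                      # step 2: random assignment
for _ in range(100):                                     # steps 3-6: iterate
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = distances.argmin(axis=1)                # re-assign to closest centroid
    if np.array_equal(new_labels, labels):               # stop when nothing changes
        break
    labels = new_labels

print(centroids)   # roughly the two cluster centres near (0, 0) and (5, 5)
```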
Hierarchical clustering (or linkage clustering) forms a hierarchy of clusters. It starts with each point as a separate cluster and works by joining the two closest clusters at each step until everything is in one big cluster.
We can easily choose the number of clusters afterwards by cutting the tree diagram (dendrogram) horizontally wherever we find suitable. It is also repeatable, but has higher (quadratic) complexity.
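A short sketch using SciPy's hierarchical-clustering routines; the toy data and the choice of Ward linkage are assumptions made for illustration.

```python
# Hierarchical (linkage) clustering with SciPy: build the full merge tree,
# then "cut" it to obtain a chosen number of clusters. Toy data is an assumption.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

Z = linkage(X, method="ward")                      # start with singletons, merge closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into at most 3 clusters
print(labels)
```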
If data are not labeled, S learning is not possible, and a U learning approach is required, which attempts to find a natural clustering of the data into groups and then maps new data to these groups. The clustering algorithm which provides an improvement to support vector machines is called support vector clustering; it is often used in industrial applications, either when data are not labeled or when only some data are labeled, as a preprocessing step for a classification pass.
Figure 3: Decision tree (Image taken from Wikipedia)
Decision tree - A decision tree for a plane crash is drawn with its root at the top. In figure 3, the bold text represents a condition or internal node, based on which the tree splits into branches, also referred to as edges. The end of a branch that does not split any further is the decision, or leaf; in this case, whether the passenger died or survived, represented as red and green text respectively.
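For illustration, a small scikit-learn decision tree on a made-up passenger-style dataset; the features (age, sex encoded as 0/1), labels and depth limit are assumptions and not the data behind figure 3.

```python
# A decision tree classifier with scikit-learn; the tiny passenger-style dataset
# and its labels are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[22, 1], [38, 0], [26, 0], [35, 0], [54, 1], [2, 1], [27, 1], [14, 0]]
y = [0, 1, 1, 1, 0, 1, 0, 1]          # 0 = died, 1 = survived

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "sex"]))   # internal nodes (conditions) and leaves
print(tree.predict([[30, 0]]))                            # classify a new passenger
```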
Neural networks [30] - If we are unable to define a single line to separate points into two categories, we may either use a higher-order function or multiple lines to define a region or a plane. In such scenarios neural nets have a role. A hidden layer in a neural net only means that it is neither an input nor an output layer.

Figure 4: Linearly separable data
Two classes are linearly separable iff they can be separated by a linear combination of attributes, i.e., a threshold in 1-D, a line in 2-D, a plane in 3-D, or a hyper-plane in higher dimensions (see figure 4).
The kernel trick is a way of computing the dot product of two vectors x and y in some other feature space while reducing overall computation. The kernel trick is interesting because the need to compute the mapping explicitly may never arise: if our algorithm can be expressed purely in terms of inner products between two vectors, all we need to do is replace this inner product with the inner product from some other suitable space. That is where the “trick” resides: wherever a dot product is used, it is replaced with a kernel function.
The kernel function denotes an inner product in feature space and is usually written as K(x, y) = <φ(x), φ(y)>. Using the kernel function, the algorithm can be carried into a higher-dimensional space without explicitly mapping the input points into this space. This is highly desirable, as sometimes our higher-dimensional feature space could even be infinite-dimensional and thus infeasible to compute in directly.
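A small numerical check of this idea, using the degree-2 polynomial kernel K(x, y) = (x · y)^2 and its explicit feature map; the example vectors are arbitrary assumptions.

```python
# Numerically checking the kernel trick: for the degree-2 polynomial kernel
# K(x, y) = (x . y)^2, an explicit feature map phi exists with K(x, y) = <phi(x), phi(y)>.
import numpy as np

def phi(v):
    """Explicit map of a 2-D input into a 3-D feature space."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, y):
    return np.dot(x, y) ** 2          # computed entirely in the original 2-D space

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(poly_kernel(x, y))              # 16.0
print(np.dot(phi(x), phi(y)))         # 16.0 as well: same inner product, no explicit mapping needed
```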
Kernel functions [32] are sometimes also called “generalized dot products”. ML algorithms model problems in an attempt to solve them efficiently; e.g., by performing a non-linear transformation we can map 2-D data into a 3-D space and avoid using a curve or a complex decision boundary, using a hyper-plane instead (compare figures 13 and 14). Tanh is another popular activation function, with range -1 to +1, i.e., it is zero-centered. Both the sigmoid and tanh functions suffer from the vanishing-gradient problem. The ReLU function is easier to compute than the sigmoid and tanh activation functions. The ReLU function is invariably used today for the hidden layers, while the output layer still often uses a sigmoid or tanh function.


Figure 5: Sigmoid function S(z) = 1/(1 + e^(-z)), which is non-linear in nature, monotonically increasing and also continuously differentiable

Figure 6: Rectified Linear Units function R(x) = max(0, x) (the step function is similar to R(x))
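For reference, a minimal sketch of the three activation functions discussed above (sigmoid, tanh and ReLU); the sample input values are arbitrary.

```python
# The three activation functions discussed above, side by side.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # range (0, 1), saturates for large |z|

def tanh(z):
    return np.tanh(z)                     # range (-1, 1), zero-centered, also saturates

def relu(z):
    return np.maximum(0.0, z)             # cheap to compute, no saturation for z > 0

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```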

Naïve Bayes classifier [35]

It is good to know Bayes’ theorem, which works on conditional probability: the probability that something will happen, given that something else has already occurred. Using conditional probability, we can calculate the probability of an event using prior knowledge.
Below is the formula for calculating the conditional probability.
P(H | E) = P(E | H) * P(H) / P(E)
Where
  • P (H) is the probability that hypothesis H is true. This is known as the prior probability.
  • P (E) is the probability of the evidence.
  • P (E|H) is the probability of the evidence given that the hypothesis is true.
  • P (H|E) is the probability of the hypothesis given that the evidence is observed.
The Naive Bayes classifier uses Bayes’ theorem and predicts membership probabilities for each class, i.e., the probability that a given record or data point belongs to a particular class; the class with the highest posterior probability is chosen. This is also known as the Maximum A Posteriori (MAP) decision. The MAP for a hypothesis is:
MAP(H) = max(P(H|E)) = max(P(E|H) * P(H) / P(E)) = max(P(E|H) * P(H))
Where P (E) is evidence probability, and is used only to normalize the result.
The Naive Bayes classifier assumes that the correlation among the features is zero: the presence or absence of a feature does not influence the presence or absence of any other feature.

As an example, we consider three classes of animal type: Parrot, Dog and Fish. We also consider four predictor features: Swim, Wings, Green Color and Dangerous Teeth.

According to the data provided, for parrots: 10% of parrots can swim, all parrots have wings, 80% of parrots are green and 0% of parrots have dangerous teeth.

As per data for Dogs, 90% dogs can swim, 0% dogs have wings, 0% dogs are of Green color and 100% dogs have Dangerous Teeth.
Data for fishes show that 100% can swim, 0% have wings, 20% fishes are of Green color and only 10% fish have Dangerous Teeth.
We will demonstrate the Naive Bayes approach using above example.
For Hypothesis testing for the animal to be a Dog:
P (Dog | Swim, Green, Teeth) = P (Swim |Dog) * P (Green |Dog) * P (Teeth |Dog) * P (Dog) / P (Swim, Green, Teeth)
= 0.9 * 0 * 1 * 0.333 / P (Swim, Green, Teeth) = 0

For Hypothesis testing for the animal to be a Parrot:

P (Parrot| Swim, Green, Teeth) = P (Swim |Parrot) * P (Green |Parrot)* P (Teeth| Parrot) * P (Parrot) / P (Swim, Green, Teeth)
= 0.1 * 0.80 * 0 *0.333 / P (Swim, Green, Teeth) = 0

For Hypothesis testing for the animal to be a Fish:

P (Fish |Swim, Green, Teeth) = P (Swim |Fish) * P (Green |Fish) * P (Teeth |Fish) *P (Fish) / P (Swim, Green, Teeth)
= 1 * 0.2 * 0.1 * 0.333 / P (Swim, Green, Teeth) = 0.00666 / P (Swim, Green, Teeth)
The denominator of all the above calculations is the same, i.e., P (Swim, Green, Teeth). P (Fish | Swim, Green, Teeth) is the only value greater than 0, so using Naive Bayes we predict that the class of this record is Fish.
As the computed probability values are very low, P (Swim, Green, Teeth) is used only to normalize them.
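The same calculation can be reproduced in a few lines of code; the class priors of 1/3 follow the three-class example above, and the likelihood table mirrors the percentages given in the text.

```python
# The Parrot/Dog/Fish example above, computed directly.
likelihoods = {                     # P(feature = yes | class), from the text
    "Parrot": {"Swim": 0.10, "Wings": 1.0, "Green": 0.80, "Teeth": 0.0},
    "Dog":    {"Swim": 0.90, "Wings": 0.0, "Green": 0.00, "Teeth": 1.0},
    "Fish":   {"Swim": 1.00, "Wings": 0.0, "Green": 0.20, "Teeth": 0.1},
}
prior = 1.0 / 3.0
observed = ["Swim", "Green", "Teeth"]   # the record: it swims, is green, has teeth

scores = {}
for animal, p in likelihoods.items():
    score = prior
    for feature in observed:
        score *= p[feature]             # naive independence assumption
    scores[animal] = score              # proportional to P(class | evidence)

print(scores)                            # Fish has the only non-zero score
print(max(scores, key=scores.get))       # -> 'Fish'
```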
Support Vector Machines (SVM) - SVM is a supervised (S) binary classification ML algorithm [36]. Each data item is plotted as a point in n-dimensional space (where n denotes the number of features). Classification is then carried out by finding the hyper-plane that distinguishes the two classes [37]. Support vectors are the coordinates of the individual observations closest to the boundary. The support vector machine chooses the best (maximum-margin) hyper-plane (see figure 7): the separating hyper-plane/line that is farthest from the nearest points of the two classes.
Figure 7: Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors [37, 40].
Large Margin Decision Boundary [40]: The separator has to be as far as possible from the nearest points of both classes, which means that we have to maximize the margin. We can normalize the equation of the separator so that the distance at the support vectors is 1 or -1, using r = (w^T x + b)/||w||. The width of the optimal margin is then m = 2/||w|| (see figure 7). This means that maximizing the margin is the same as minimizing the norm of the weights.
Calculating the decision boundary: Given a set of examples {x1, x2, ..., xn} with class labels yi ∈ {+1, -1}, a decision boundary that classifies the examples correctly satisfies yi(w^T xi + b) ≥ 1 for all i. This redefines the problem of learning the weights as an optimization problem (see [40] for details).
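A short scikit-learn sketch of a (nearly) hard-margin linear SVM on made-up separable data; the data and the large C value are assumptions. It prints the margin width 2/||w|| and the support vectors discussed above.

```python
# A linear SVM on linearly separable toy data with scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C: (almost) hard-margin SVM
w, b = clf.coef_[0], clf.intercept_[0]

print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)      # the points lying on the margin
```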
Solving XOR through a Neural Net
The XOR network (see figure 8) opened the door to far more interesting neural network and ML designs. An implementation of the XOR function using logic gates follows from the truth table below, and a network version is shown in figure 10.

Figure 8: XOR problem [39]
Truth Table
Input A      Input B      Output X
0            0            0
0            1            1
1            0            1
1            1            0
Figure 9 shows the four points from the truth table; no single linear function can separate the red and blue points.

Figure 9: Plotting of XOR points (truth table)
It is obvious that the points are not linearly separable. We could also use the Rectified Linear Units function R(x) (see figure 6) instead of the sigmoid.

Figure 10: Implementation of the XOR function using a neural net with one hidden layer. ‘h1’ gives the output of a logical ‘OR’ function; ‘h2’ gives the complement of a logical ‘AND’ (a NAND). In other words, ‘h1’ and ‘h2’ each correspond to one hyper-plane. Sigma is, in fact, a sigmoid function, and ‘b’, ‘b1’ and ‘b2’ are the biases for the output unit and the hidden-layer elements ‘h1’ and ‘h2’ respectively.
Inputs (A, B)    Output of element h1 (figure 10)    Output of element h2           Final output X
(0, 0)           Sigma(20*0+20*0-10) = 0             Sigma(-20*0-20*0+30) = 1       Sigma(20*0+20*1-30) = 0
(1, 1)           Sigma(20*1+20*1-10) = 1             Sigma(-20*1-20*1+30) = 0       Sigma(20*1+20*0-30) = 0
(0, 1)           Sigma(20*0+20*1-10) = 1             Sigma(-20*0-20*1+30) = 1       Sigma(20*1+20*1-30) = 1
(1, 0)           Sigma(20*1+20*0-10) = 1             Sigma(-20*1-20*0+30) = 1       Sigma(20*1+20*1-30) = 1
Output ‘X’ corresponds to XOR function (see truth table).
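The same table can be verified in a few lines of code using the weights and biases shown in figure 10.

```python
# Verifying the hand-set XOR network of figure 10: h1 acts like OR, h2 like NAND,
# and the output unit combines them (effectively an AND), giving XOR.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_net(a, b):
    h1 = sigma(20 * a + 20 * b - 10)        # close to OR(a, b)
    h2 = sigma(-20 * a - 20 * b + 30)       # close to NOT AND(a, b)
    return sigma(20 * h1 + 20 * h2 - 30)    # close to AND(h1, h2) = XOR(a, b)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(xor_net(a, b)))        # prints 0, 1, 1, 0: the XOR truth table
```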
Backpropagation in ANN
The backpropagation algorithm trains a given feed-forward multilayer neural network on a given set of input patterns with known classifications. When each entry of the sample data set is presented to the network, the network examines its output response to that input pattern. Training is done using an S learning method: the error function is computed from the ANN's output and the known expected output given in the data set, and this error is then used to modify the network's internal state. In fact, the backpropagation algorithm is a way of computing the weight updates [44].
The following steps are part of any backpropagation algorithm:
Initialize Network: Each neuron (also called a unit in an ANN) has a set of weights that needs to be maintained: one weight for each input connection plus an additional weight for the bias. We generally initialize the network weights to small random numbers, say in the range 0 to 1.
Forward Propagate: We can compute an output from an ANN by propagating an input signal through each layer until the output layer produces its values. This is referred to as forward-propagation, and it has three distinct parts:
Neuron Activation - The input could be a row from our training dataset, as in the case of the hidden layer, or the outputs from each neuron in the hidden layer, in the case of the output layer. Neuron activation is calculated, layer by layer, as a weighted sum of the inputs, just as in linear regression:
hj = Σi wij * xi + b, where hj is the jth hidden-layer unit, wij are the weights, xi are the inputs and b is the bias for the layer.
We then apply the transfer (activation) function and repeat the same for the next layer.
Neuron Transfer - Once a neuron is activated, we need to transfer the activation to see what the neuron output actually is.
The transfer function used may be the sigmoid activation function (see figure 5). Recently, the Rectified Linear Units transfer function (see figure 6) has become popular, particularly with deep learning networks.
Forward Propagation - Forward propagation is applied to one row of data from our dataset at a time, passing it through the ANN.
Back Propagate Error: The error is computed using the expected outputs given in the data and the actual outputs forward-propagated from the ANN. Training involves multiple iterations of exposing the training dataset to the network, back-propagating the error and then modifying the network weights.
The error is then computed and propagated backward through the ANN from the output layer to the hidden layers.
Transfer Derivative – We need three pieces of calculus before we go forward:
1.    Derivative: d/dx (x^n) = n*x^(n-1); e.g., if the equation of a curve is y = x^2, then its derivative is f'(x) = 2x.
2.    Partial derivative (example): if f(x, y) = y^3 + 3x*y, then ∂f/∂x = 3y and ∂f/∂y = 3y^2 + 3x, i.e., we treat all other variables as constants.
3.    Chain rule: d/dx [f(g(x))] = f'(g(x)) * g'(x). For example, if f(x) = 2x and g(x) = x^2, then f(g(x)) = 2x^2 and d/dx [f(g(x))] = f'(g(x)) * g'(x) = 2 * 2x = 4x.
Given an output value from a unit, we need to compute slope. The network is trained using gradient descent.  The first step is to calculate the error for each output neuron; this will give us our error signal (input) to propagate backwards through ANN.  
Update Weights – We need to move opposite to the derivative. Once errors have been calculated for each unit in the network via the back-propagation method, layer by layer, they can be used to modify the ANN unit weights.

The gradient descent update rule is: weight = weight - α * error * input, where 'weight' is an ANN unit weight, the learning rate 'α' (alpha) is the fraction by which an ANN unit weight may change in each iteration (a parameter we are required to supply), the error term is computed by the back-propagation procedure for each unit of the ANN, and input is the input value that caused the error (see figure 11). Back-propagation of errors is used to optimize and update weights during gradient descent. Please note that back-propagation is a recursive process.
A few words on the learning rate, because it is one of the important hyper-parameters ("settings" of an ANN) that one has control over. Too high a learning rate can mean that one never gets to the minimum one is searching for, as one may jump over it. Similarly, if the learning rate is set very low, the ANN may take a long time to get to the right weights, or may get stuck in a local minimum. It may be a good idea to arrive at the right learning rate by trying several values and picking the one that works best for your ANN and dataset. A neural net, or ANN, can be a massive composite function, and the chain rule may be used to reduce computation. A similar method is used for the bias weight, except that either there is no input term, or the input is the fixed value of 1.0.

Figure 11: Weight-reduction process; if we repeat the process enough times, we find ourselves nearly at the bottom of the curve and much closer to the optimal weight configuration for the ANN [45]. We need to use a differentiable, non-linear function so that its derivative can be computed.
Remember that the input for the output layer is a collection of outputs from the hidden layer. Now we know how to update network weights, we need to figure out how to do it repeatedly.
Training the Network: As stated before, the ANN is updated using stochastic gradient descent. This involves first looping over a fixed number of epochs and, within each epoch, updating the network for each row in the training dataset. Because updates are made for each training pattern, this type of learning is called on-line learning. If errors were accumulated across an epoch before modifying the weights, this would be called batch learning or batch gradient descent.


Figure 12: An MLP with two hidden layers is an S learning network. Each time data is processed by a layer, it gets multiplied by the interconnection weights, summed, passed through a non-linear activation function and then sent to the next layer. Finally the data is processed one last time within the output layer to produce the neural network output [39]. Here y is the output unit and x1, x2, ..., xn are the inputs.
Predict: Making predictions with a trained neural network is easy enough: we already know how to forward-propagate an input pattern to get an output, and that is all we need to do to make a prediction. Figure 12 shows an MLP with bias. In fact, an ANN can be trained to realize any non-linear function.
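Putting the whole procedure together, here is a minimal NumPy sketch that initializes a small network, forward-propagates, back-propagates the error, updates the weights by gradient descent and then predicts. The hidden-layer size, learning rate, epoch count and the XOR training data are illustrative choices, not the article's own implementation.

```python
# A minimal backpropagation training loop for a 2-4-1 network on the XOR data.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 1. Initialize network: small random weights, one hidden layer of 4 units
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))
alpha = 0.5                                   # learning rate

for epoch in range(20000):
    # 2. Forward propagate: activation (weighted sum) then transfer (sigmoid)
    h = sigmoid(X @ W1 + b1)                  # hidden-layer outputs
    out = sigmoid(h @ W2 + b2)                # network output

    # 3. Back-propagate the error (transfer derivative of sigmoid is s * (1 - s))
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # 4. Update weights: move against the gradient
    W2 -= alpha * h.T @ d_out
    b2 -= alpha * d_out.sum(axis=0, keepdims=True)
    W1 -= alpha * X.T @ d_h
    b1 -= alpha * d_h.sum(axis=0, keepdims=True)

# 5. Predict: one more forward pass
pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(np.round(pred).ravel())                 # typically [0. 1. 1. 0.] after training
```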
Deep learning [31] - The term 'deep learning' is derived from the "deep" neural nets built by layering many networks on top of each other [13]. Due to the increasing power and falling price of computer servers and the advent of cloud computing, machines with enough processing power to run such networks are now readily available. You no longer need to own the infrastructure: with the democratization of data and serverless environments becoming a common way of processing data, you pay only for actual use, by the minute.
Deep learning models typically use back-propagation with gradient descent. In ML, this feed-forward architecture is known as the multilayer perceptron. The difference between the ANN [2] and the perceptron is that the ANN uses a non-linear activation function, such as the sigmoid shown in figure 5 (the latest choice is ReLU, which is also non-linear), whereas the perceptron uses the step function, and this non-linearity gives the ANN its great power.
ANNs [2] are very flexible yet powerful deep learning models and can model any complex function. If our data is projected into a higher-dimensional space by carrying out a non-linear transformation, the data can become linearly separable. The green hyper-plane is then the new decision boundary, as shown in figure 13. This is equivalent to drawing a complex decision boundary in the original input space (see figure 14).

Figure 13: Separation boundary is plane
Further, the deepness of a network is said to be directly proportional to the number of hidden layers in it.
There are two areas, viz. military and healthcare, where we have reached AGI level [13]. In the Iraq and Afghanistan wars the USA used stealth aircraft and drones which had human-in-the-loop (HITL) capability only for the 'kill' command, though the technology did not even need this. ML algorithms are very good at analyzing images, even better than human beings, particularly as they can process thousands of images per second; for this reason they are used to identify even very small tumors in images. A doctor is used for finally reviewing the images, but even this is not strictly necessary. HITL is required only until the neural net weights are set, i.e., during the training phase; after that, the use of HITL becomes voluntary.


Figure 14: Complex decision boundary required to separate the points
Convolutional neural nets are less useful if the data cannot be made to look like an image, as these nets only capture local spatial patterns.
Reinforcement learning needs no initial learning material as long as some feedback mechanism is established to collect data while the system is running.
In reinforcement learning, a computer is able to assign a value to each right or wrong turn that a rat might make on its way out of a maze. Each value is stored, and all these values are updated as the system learns.
Limitations of reinforcement learning: It is often too memory-expensive to store the value of every state, as the problems can be pretty complex. Solving this led researchers to look into areas such as decision trees and neural networks to make the process computationally practical. In recent years, the deep learning concept (see figure 2) has been used to locate and recognize patterns in data, whether the data refers to the turns in a maze, the positions on a Go board, or the pixels shown on a screen, with a suitable reward or penalty given for each move.
A number of industrial-robot makers use this approach to train their robots to perform new tasks without manual programming. Reinforcement learning is also used by Alphabet to make its data centers more energy-efficient: a reinforcement-learning algorithm can learn from data and suggest, say, how and when to operate the cooling systems to save energy.
This software's remarkably humanlike behavior shows its power in self-driving cars. A specific algorithm is needed for highway-merging software; it was demoed in Barcelona by Mobileye, an Israeli automotive company that makes vehicle-safety systems used by dozens of carmakers, including BMW. Google and Uber say they are also testing reinforcement learning for their self-driving vehicles.
Way forward
There will be effort and progress towards achieving AGI and ASI levels. This in turn means the development of ML systems without HITL. We will see more AI-based systems playing against AI [13] and achieving new breakthroughs, particularly in the healthcare industry, where there are problems humans wish to solve; in other areas, however, humans are likely to be more careful because of unknown risk factors.
A good example of a system without HITL is a CyberKnife-like solution [38], developed in the first decade of the 21st century. It is a non-invasive treatment: the CyberKnife system enables radiation oncologists to deliver high doses of radiation with pinpoint accuracy to a broad range of tumors in any part of the body, and a patient's tumor may be treated within, say, a week. The GammaKnife [41] is a treatment for adults and children with small to medium brain tumors, a nerve condition that causes chronic pain, and other neurological conditions. In 2007, UCSF acquired the Leksell Gamma Knife Perfexion [42], which offers extreme accuracy, efficiency and outstanding therapeutic response.
CyberKnife and GammaKnife technologies are both used to treat cancerous as well as non-cancerous tumors [43], but GammaKnife is limited to treatment above the ear and in the cervical spine, whereas CyberKnife is a dedicated robotic system for SRS and stereotactic body radiotherapy (SBRT), capable of treating cancer throughout the entire body. GammaKnife has been in use since the 1950s, yet CyberKnife provides equivalent results for certain tumors and better outcomes for others. Further, CyberKnife has been FDA-cleared since 2001 for the treatment of tumors throughout the entire body. Both technologies are strong candidates for AI (with or without HITL) to deliver, hopefully, an even better performance.
References
[1] Progress and Perils of Artificial Intelligence (AI) http://newblogrgs10.blogspot.in/2017/04/progress-and-perils-of-artificial_5.html
[2] Invited Chapter 6 - Evolutionary Algorithms and Neural Networks, Pages 111-136, R.G.S. Asthana in book, Soft Computing and Intelligent Systems (Theory and Applications), Academic Press Series in Engineering, Edited by:Naresh K. Sinha, Madan M. Gupta and Lotfi A. Zadeh ISBN: 978-0-12-646490-0
[3] Future 2030 by Dr. RGS Asthana, Senior Member IEEE
[4] Machine Learning (ML) and Artificial Intelligence (AI) – Part 1, by Dr. RGS Asthana, Senior Member IEEE
[5] Machine Learning (ML) and Artificial Intelligence (AI) – Part Two, by Dr. RGS Asthana, Senior Member IEEE
[6] Machine Learning (ML) and Artificial Intelligence (AI): Cognitive Services and Robotics – Part Three by Dr. RGS Asthana, Senior Member IEEE
[7] Machine Learning (ML) and Artificial Intelligence (AI):  Big Data and 3 D Printing – Part four by Dr. RGS Asthana, Senior Member, IEEE.
[8] Machine Learning (ML) and Artificial Intelligence (AI):  Drones and Self-driving Cars– Part Five by, Dr. RGS Asthana, Senior Member IEEE
[9] Machine Learning (ML) and Artificial Intelligence (AI): Healthcare– Part Six by, Dr. RGS Asthana, Senior Member IEEE

[10] Machine Learning (ML) and Artificial Intelligence     (AI):  Will AI/ML intelligence surpass humans? Part Seven by Dr. RGS Asthana, Senior Member IEEE

[11] Machine Learning (ML) and Artificial Intelligence     (AI): Impact of AI/ML in Healthcare: Part-Eight by Dr. RGS Asthana, Senior Member IEEE
[12] Machine Learning (ML) and Artificial Intelligence     (AI): Big data & Data Science (DS) and their importance: Part-Nine by Dr. RGS Asthana,  Senior Member IEEE
[13] Machine Learning (ML) and Artificial Intelligence     (AI): Super-Intelligence - Are we afraid?: Part-ten; by Dr. RGS Asthana, Senior Member IEEE.
[14] Deep mind website
[15] IBM Watson Website
[16] Internet of Things (IoT)
[17] How to use ML in Mobile
[19] Product recommendation versus Product discovery
[20] Our product categorization just took a quantum leap with AI and Machine Learning
[21] How can e-commerce retailers leverage predictive analytics to make smarter, quicker decisions about marketing strategy?
[22] Fraud detection and prevention
[23] Is the future of ecommerce predictive analytics?
[24] How to use ML in mobile applications?
[25] Phone apps driven by Artificial Intelligence
[26] Niki Web-site
[27] ios based apple app store - itunes
[28] Google play website
[29] Friendly Introduction to Machine Learning
[30] Neural Networks
[31] Applied Deep Learning - Part 1: Artificial Neural Networks

[32] Kernel Functions for Machine Learning Applications

[33] An Introduction to Clustering and different methods of clustering
[34] Clustering Algorithms: From Start to State Of The Art

[35] How the Naive Bayes classifier works in ML

[36] The 10 Algorithms Machine Learning Engineers Need to Know 

[37] Understanding Support Vector Machine algorithm from examples (along with code)

[38] Website: Cyber-knife

[39] Introduction: The XOR Problem

[40]  06 svm.pdf
[41] UCSF Medical Centre: Gamma Knife

[42] Leksell Gamma Knife® Perfexion™

[43] Cyber knife versus Gamma Knife

[44] How to Implement the Backpropagation Algorithm From Scratch In Python

[45] Neural Networks & the Backpropagation Algorithm, explained