Wednesday, 16 August 2017

Machine Learning (ML) and Artificial Intelligence (AI): Big data & Data Science (DS) and their importance: Part-Nine by Dr. RGS Asthana Senior Member IEEE

Figure 1: Machine Learning [20]
Summary
Big Data analytics is a well-established technology, and many tools and software packages are available to handle Big Data. AI and machine learning depend on large amounts of data, yet big data is hard to organize and analyze. ML can work on Big Data to find hidden patterns that are almost impossible for humans to find simply because the data is so vast. The convergence of big data and ML/AI can therefore produce interesting results for organizations.
We define big data, Data Science (DS) and ML/AI, and describe their role and importance in solving an organization’s problems. A list of general problems that can be solved using ML/AI is also given. We then cover the key areas of Big Data, DS and ML/AI and what they mean. The closing section, Way forward, outlines what we expect in the near future.
Prerequisite
Read articles [1] to [14].
Keywords
Prelude
AI refers to approaches for developing systems that not only do intelligent things but can also learn from experience. ML is a subset of AI. The meeting of big data with ML/AI [21] has given us an opportunity to handle large volumes of data, part of which is made available for research in anonymized form, and ML algorithms can build predictive models from it easily. This approach has empowered companies to derive business value from their data and analytics capabilities. Earlier constraints, namely limited data availability, small sample sizes, and an inability to analyze massive amounts of data in real time, have now been overcome thanks to these technologies.
However, the first major task is pre-processing, because this data is raw. The second major task is to combine various databases, such as shopping behavior, daily exercise and diet information, and perhaps personal genomes and occasional blood test data. Pathway Genomics [15, 16], an IBM-backed company, is developing a simple blood test to determine whether early prediction of certain cancers is possible. The company produces actionable and accurate genetic information for physicians as well as their patients to sustain health and wellness. Its AI and deep learning powered mobile health applications use personal genetic information to give users personalized health and wellness guidance.
Big Data [17] - Big data may comprise structured as well as unstructured data. The main sources are social sites, YouTube and online commerce sites that hold a lot of data about their customers. In fact, big data may include employee information, company purchase and sales records, business transactions, an organization’s historical records, social media [22], etc.
Typically, big data is defined by factors such as Volume, Velocity, Variety, Complexity and Variability.
Volume: The data stored is both structured and unstructured, and because the cost of storage is now negligible, all collected data can be retained. Cloud technologies are well developed and play a major role in this.
Velocity: Smart sensors, smart metering and RFID tags make it necessary to deal with a huge data influx in almost real time.
Variety: The data may be structured or unstructured, e.g. texts, emails, video, audio and financial transactions. In fact, data comes in every conceivable form.
Complexity: It is necessary to deduce the connections and hierarchy between multiple data sets. Without such organization, torrents of incoming data will flood your system and memory without serving any useful purpose.
Variability: As the data sources vary, the data flow rates also vary. Unstructured data gives a very inconsistent data flow. Big data may have multiple data sources, such as social media, streaming unstructured data and/or public sources; the CIA World Factbook, Facebook and the European Union Open Data Portal are among the most common ones. Traditional data storage such as an RDBMS is not suited to handling such huge quantities of data, so a different storage and computing paradigm is necessary. Hadoop, Spark and NoSQL are examples of such tools.
Data now resides in silos where accountability, focus, and mission are clear. Serverless, micro-service architectures make it easy to access, analyze, and manage data without racking servers, configuring virtual machines, or even paying by the hour. Going serverless allows data owners to focus on their data application and pay only for actual use, billed by the minute rather than by the hour as before.
Data Science (DS) - DS, also known as data-driven science [34], is an interdisciplinary field about scientific methods, processes, and systems for extracting knowledge or insights from data in various forms, either structured or unstructured. It is a broad discipline similar to data mining and overlaps with fields such as statistics, IoT [14], operations research, applied mathematics, and ML/AI and deep learning. DS includes anything related to the cleansing, preparation, and analysis of structured and unstructured data. As in any field, data scientists may also borrow appropriate techniques and algorithms from other fields to handle very large unstructured data sets in automated ways, i.e. without a human-in-the-loop (HITL) [25], not only to perform transactions in real time but also to make predictions.
DS is, in fact, a much wider discipline than ML. In DS, data may or may not originate automatically from a machine: survey data, for example, is collected manually, and clinical trials may involve only a specific type of small data. In particular, DS covers aspects such as data integration, distributed architecture, automating ML, data visualization, dashboards and BI, and automated data-driven decisions [33].
Analytics will also improve as more data becomes available and better algorithms are developed. Before training data is fed into ML systems, it must first be cleansed or pre-processed, i.e. all errors in format, duplications, etc. must be fixed. Making sense of data, or making it useful for solving a specific problem, is the job of the data scientist.
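The cleansing step described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the records, field names and the list of accepted date formats are made up for the example and do not come from any real data set.

```python
from datetime import datetime

# Hypothetical raw records; names, dates and amounts are invented for illustration.
RAW_RECORDS = [
    {"customer": " Alice ", "date": "2017-08-16", "amount": "120.50"},
    {"customer": "alice", "date": "16/08/2017", "amount": "120.50"},  # same sale, other format
    {"customer": "Bob", "date": "2017-08-15", "amount": "n/a"},       # unusable amount
    {"customer": "Carol", "date": "15-08-2017", "amount": "75"},
]

# Date formats we assume may appear in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def parse_date(text):
    """Try each assumed format; return an ISO date string, or None if none fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(records):
    """Normalize formats, drop rows that cannot be parsed, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        name = rec["customer"].strip().title()
        date = parse_date(rec["date"])
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # skip rows whose amount is not a number
        if date is None:
            continue  # skip rows whose date matches no known format
        key = (name, date, amount)
        if key in seen:
            continue  # duplicate once the formats are normalized
        seen.add(key)
        cleaned.append({"customer": name, "date": date, "amount": amount})
    return cleaned


CLEANED = clean(RAW_RECORDS)
for row in CLEANED:
    print(row)
```

Only after normalization do the two Alice rows become recognizable as the same record, which is why format fixes usually come before duplicate removal.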
ML/AI - The success of ML/AI depends on data labeling and analysis. In brief, ML [20] is an application of AI (see figure 1) that gives systems the ability to learn and improve automatically from experience without being explicitly programmed. In other words, ML is a data analysis method that automates model building, so the model is derived from data rather than written by hand as a set of instructions. ML offers deeper insight into collected data and helps find hidden patterns. The main goal of ML when used on big data is to identify useful patterns in it, and ML algorithms can also predict future patterns. AI applications based mainly on ML depend on data to develop more predictive models.
ML can also be applied to smaller datasets; however, the outcome may not be very accurate because there is less to learn from. Over the years, ML has evolved from simple analytical algorithms to the automatic application of customized algorithms and mathematical calculations to big data. Speed and iteration are distinctive qualities of ML.
According to IDC [30], there will be about 44 zettabytes of data by 2020. Here, ML/AI may play a big role because it adds an intelligence layer to big data, so complex analytical tasks can be handled much faster than humans can manage. ML/AI may take over routine jobs, but on an optimistic view there will always be work for good intellectuals, irrespective of how smart machines get.
IoT [14] will also swell with the introduction of new toys, car accessories, and home and security devices, but it will also open access for hackers, and this could become a cause for worry. In deep learning you do not tell machines what to do; instead, the machines determine what to do based on the data. This approach has Stephen Hawking and Elon Musk worried [1], as the idea may eventually lead to the development of thinking machines that surpass humans and, owing to vast connectivity, may take over the world and threaten our very existence. Facebook recently [26] shut down an app based on AI bots after the bots started ignoring their programmed instructions and began communicating in a language unintelligible to humans instead of English, which was supposed to be the set method of communication.
With advances in data processing and cloud applications, every platform is becoming available in the cloud. A number of ML platforms are also offered by tech companies such as Baidu, Amazon, Microsoft, IBM, Google, and many others. Baidu Research [32] works on ML/AI technologies in areas such as image recognition, speech recognition, high performance computing, natural language processing (NLP) and deep learning.
Problems ML/AI can solve  
In brief, some of the problems that can be solved through ML/AI (see figure 2) are given below:
·   Automate Repetitive Tasks
·   Fraud detection
·   Recommendation Systems
·   Real-time ads on mobile devices
·   Credit scoring and best offer ads
·   Web pages
·   Web search results
·   New pricing models
·   Email spam filtering
·   Prediction of equipment failure
·   Detection of network intrusion(s)
·   Image recognition
·   Pattern recognition
·   Speech processing
·   Sentiment analysis from texts
·   Self-Driving Cars (SDCs) - by analyzing data collected by the numerous cameras mounted on them, SDCs can read their environment in real time and react much like a human driver, making trips not only faster and safer but also more efficient.

Figure 2: AI and intelligence [22]
·   Healthcare: Analyzing vast amounts of medical data coming from a number of locations and demographics can determine which conditions improve the effectiveness of certain treatments and which do not.
·   Natural Language Processing (NLP)
·   Machine Translation
·   Financial management and investment planning
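To make one item from the list concrete, here is a rough sketch of email spam filtering with a naive Bayes classifier, a common technique for this task. The six training messages are invented and far too few for real use; the function names are ours, not those of any particular library.

```python
import math
from collections import Counter

# Tiny made-up training set; a real filter would learn from many thousands of messages.
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap loans win big", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
    ("lunch on monday", "ham"),
]


def train(examples):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability, using add-one smoothing."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the prior probability of the class.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            count = word_counts[label][word]
            # Add-one smoothing keeps unseen words from zeroing the probability.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


MODEL = train(TRAINING)
print(classify("win a free prize", MODEL))
print(classify("monday project meeting", MODEL))
```

The same counting idea extends to sentiment analysis from texts by swapping the spam/ham labels for positive/negative ones.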
Congregating Big Data, DS and ML/AI
The convergence of big data and ML/AI has changed how businesses value their data and use it to stay ahead of their competitors. The importance of big data will grow further in the coming years, particularly as ML/AI is implemented at a rapid pace to take over laborious, routine, risky and mission-oriented jobs, freeing employees not only to perform more human job functions but also to concentrate on developing new profitable schemes for the business.
Although big data analytics is an established and independent field, both Big Data and ML/AI are influential on their own [18]. Consider what happens when big data and ML/AI are merged. The availability of greater volumes and sources of data enables capabilities in ML/AI that remained dormant because of limited data availability, small sample sizes, and an inability to analyze massive amounts of data in almost real time.
Big data empowers data scientists to access and work with massive sets of data without restriction and hesitation. Organizations can now load all of the data and let the data itself point the direction and tell the story.  Big data enables an environment that encourages data discovery through iteration.
Big Data analytics is used at Intermountain Healthcare [19] to obtain better health outcomes and thereby improve its health services, particularly in cardiovascular care, endocrinology, surgery, and other areas. The company has lowered not only its infection rates but also its operating and total costs. Consider the further benefits if this company applies ML/AI to its big data.
Role of ML and Big Data, DS in an organization
The role of ML/AI methods is to map onto the big data or DS methods deployed in an organization to fulfill its needs. The most common ML/AI methods include:
·   Supervised learning: generally predictive.
·   Unsupervised learning: generally exploratory; it may deploy clustering and associated algorithms as the main tool.
·   Semi-supervised learning: may involve a human in the loop (HITL) [25].
·   Deep learning: neural nets with hidden layers trained by back propagation, mainly using gradient descent methods, which pick up regularities such as the common clusters of visible (activated) or invisible (deactivated) pixels across multiple training images.
·   Reinforcement learning: based on rewards and penalties awarded according to the actual outcomes of the actions a policy takes; it is highly compute-intensive.
Q-learning, a model-free reinforcement learning technique, can help find an optimal action-selection policy for any finite Markov decision process (MDP) [24].
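Q-learning is easiest to see on a toy problem. The sketch below uses a made-up MDP: five states on a line, where the agent starts at state 0 and earns a reward only on reaching state 4. The learning rate, discount factor and exploration rate are assumed values chosen for illustration, not recommendations.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal (terminal)
ACTIONS = (-1, +1)    # move left or move right
ALPHA = 0.5           # learning rate (assumed value)
GAMMA = 0.9           # discount factor (assumed value)
EPSILON = 0.2         # probability of taking a random exploratory action (assumed value)


def step(state, action):
    """Apply an action; the reward is 1.0 only when the goal state is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, seed=0):
    """Learn a table of action values q[state][action_index] by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                a = rng.randrange(2)  # explore
            else:
                best = max(q[state])
                candidates = [i for i in range(2) if q[state][i] == best]
                a = rng.choice(candidates)  # exploit, breaking ties at random
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward reward + discounted best next value.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q


Q_TABLE = train()
# Greedy policy for the non-terminal states: the action with the highest learned value.
POLICY = [ACTIONS[row.index(max(row))] for row in Q_TABLE[:-1]]
print(POLICY)
```

The method is model-free in the sense that `train` never inspects the rules inside `step`; it only observes the states and rewards that come back.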
Every business organization today needs someone like a data analyst or a data scientist to look after its data [36]. This person should be capable of pulling real trends out of the data. Strong business acumen and communication skills are therefore required, because a data scientist or analyst needs to communicate with business and IT leaders in a way that can influence how the organization approaches a business challenge.
Infosys Nia [27] - The company’s next-generation AI platform combines the big data/analytics, ML/AI, knowledge management, and cognitive automation capabilities of Mana, its first-generation AI platform; the end-to-end Robotic Process Automation (RPA) capabilities of AssistEdge [28]; the advanced, high-performance ML capabilities of Skytree [29]; and optical character recognition (OCR), natural language processing (NLP) capabilities and infrastructure management services. To build more accurate models in a shorter amount of time, you need high-performance algorithms and must utilize all of your data; the Skytree platform is built for algorithmic speed and scales to perform ML on massive amounts of structured and/or unstructured data. Supply chains as well as manufacturing [31] nowadays use machines and robotics that can think and function autonomously, or with minimal supervised programming, to perform specific actions.
Big data, DS and ML/AI
Big data and DS let ML evolve and adjust to the everyday data analysis requirements of an organization [22]. ML is not limited by human thinking or speed and is thus able to mine valuable information from big data without the preconceptions a human analyst might bring, although its results are only as unbiased as the data it learns from. ML is able to find patterns in chunks of unstructured data collected from everyday activities and transactions, making big data useful for the organization.
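One simple way ML finds patterns without being told what to look for is clustering. The following is a bare-bones k-means sketch; the customer figures (visits per month, average spend) are invented, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic.

```python
# Hypothetical customers as (visits per month, average spend); values are invented.
POINTS = [(1, 20), (9, 200), (2, 25), (10, 220), (1, 30), (8, 210)]


def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean_x = sum(p[0] for p in cluster) / len(cluster)
                mean_y = sum(p[1] for p in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        centroids = new_centroids
    return centroids, clusters


CENTROIDS, CLUSTERS = kmeans(POINTS, k=2)
for centroid, cluster in zip(CENTROIDS, CLUSTERS):
    print(centroid, cluster)
```

No labels were supplied, yet the occasional low spenders and the frequent high spenders end up in separate groups, which is the kind of hidden structure the text refers to.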

Figure 3: Success of DS Solutions: A Venn Diagram showing relationship between DS, IT and Business Skills [33]
The Venn diagram in figure 3 shows the relationship between the fields of DS, IT and business skills.
Eighty-eight percent of Data Scientists have a Master’s Degree, and 46% have PhDs [35]. Other skills data scientists need include:
·  In-depth knowledge of SAS and/or R: For DS, R is generally preferred.
·  Python coding: Python is the most common coding language used in DS, along with Java, Perl and C/C++.
·  Hadoop platform: Although not always a requirement, knowing the Hadoop platform is still preferred for the field. Experience in Hive or Pig is a huge plus.
·  SQL database/coding: Though NoSQL and Hadoop are the major focus for data scientists, preferred candidates can write and execute complex queries in SQL.
·  Working with unstructured data: Ability to work with unstructured data from, say, social media, video feeds like YouTube, audio, or other sources.
·  Typical salary of a data scientist is USD 120,000+.
For Big Data roles, the following skills may be needed:
·  Analytical skills: The ability to make sense of the enormous amounts of data. 
·  Creativity:   Ability to create new methods to gather, interpret, and analyze a data strategy.
·  Mathematics and statistical skills: Good, old fashioned “number crunching” is necessary.
·  Computer science: Programming experience is preferred, as the professional may be required to develop algorithms.
·  Business skills: Big Data professionals should have an understanding of the business objectives that are in place and must have ideas about how to increase the enterprise profits.
Most jobs that specifically have "ML" in the title seem to be looking for CS people with some experience in ML. "Data scientist" jobs seem to fall into one of two categories: (1) "data analyst" jobs, which look for people with some background in data analysis and often with R/SAS/SPSS experience; and (2) "computational statistician" jobs, which look for Python and database experience together with a good statistics background.
Way forward
As oxygen is vital to the existence of living organisms, so the availability of vast amounts of data, or Big Data, is vital to the success of ML/AI algorithms. The more data there is, the better and more accurate the results. ML/AI algorithms derive patterns from big data that a human cannot locate. With its many successes, ML is perceived as the new management tool, and companies like Coca-Cola, MetLife and Netflix have been using Big Data analytics for a long time and benefiting monetarily from it. Scaled-up algorithms, such as recurrent neural networks and deep learning, are powering the breakthrough of ML/AI [21].
At present, ML/AI algorithms reach conclusions without offering a valid explanation. Work is needed so that ML/AI algorithms can provide a valid explanation for each action they suggest; this will improve the adoption of ML/AI in many vital fields such as healthcare.
Congregating big data, DS and ML/AI may improve business performance as well as our lives. We do, however, need real people in real-world companies to help combine the right technologies into useful platforms.
Cloud computing, Big Data and DS are enabling technologies, and riding on them ML/AI has been galloping at a tremendous pace over the last few years and may reach the state of General AI, which can perform any intellectual task that a human can, sooner than expected. The use of Big Data and DS technologies enables capabilities in ML/AI [38] that remained dormant for decades because of limited data availability, small sample sizes, and an inability to analyze massive amounts of data in almost real time. The global AI market is expected to grow at 36% annually, reaching a valuation of $3 trillion by 2025 from $126 billion in 2015 [37]. Big Data and DS are more or less independent technologies and have strengths of their own. DS allows data to be cleansed and prepared for any desired application. The meeting of big data and DS with AI has emerged as the single most important development shaping how organizations derive greater business value from their data and analytics capabilities.
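The market figures quoted from [37] can be sanity-checked with compound-growth arithmetic. The short sketch below uses only the three numbers given in the text and assumes ten full years of growth between 2015 and 2025.

```python
START_VALUE_BN = 126.0   # USD billions in 2015, as quoted in the text
END_VALUE_BN = 3000.0    # USD billions in 2025, as quoted in the text
QUOTED_RATE = 0.36       # annual growth rate, as quoted in the text
YEARS = 2025 - 2015      # assumes ten full years of compounding

# Value reached if the 2015 figure compounds at the quoted rate.
projected_bn = START_VALUE_BN * (1 + QUOTED_RATE) ** YEARS

# Annual rate that would connect the two quoted endpoints exactly.
implied_rate = (END_VALUE_BN / START_VALUE_BN) ** (1 / YEARS) - 1

print(f"Projected 2025 value at 36% a year: ${projected_bn / 1000:.2f} trillion")
print(f"Annual rate implied by the endpoints: {implied_rate:.1%}")
```

Compounding at 36% gives roughly $2.7 trillion, and the endpoints imply a rate a little above 37%, so the quoted figures are consistent with each other once rounding is allowed for.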
References
[1] Progress and Perils of Artificial Intelligence (AI) 

[2] Invited Chapter 6 - Evolutionary Algorithms and Neural Networks, Pages 111-136, R.G.S. Asthana, in the book Soft Computing and Intelligent Systems (Theory and Applications), Academic Press Series in Engineering, edited by Naresh K. Sinha, Madan M. Gupta and Lotfi A. Zadeh, ISBN: 978-0-12-646490-0

http://www.sciencedirect.com/science/book/9780126464900

[3] Future 2030 by Dr. RGS Asthana, Senior Member IEEE

[4] Machine Learning (ML) and Artificial Intelligence (AI) – Part 1, by Dr. RGS Asthana, Senior Member IEEE

[5] Machine Learning (ML) and Artificial Intelligence (AI) – Part Two, by Dr. RGS Asthana, Senior Member IEEE

[6] Machine Learning (ML) and Artificial Intelligence (AI): Cognitive Services and Robotics – Part Three by Dr. RGS Asthana, Senior Member IEEE

[7] Machine Learning (ML) and Artificial Intelligence (AI):  Big Data and 3 D Printing – Part four by Dr. RGS Asthana, Senior Member, IEEE.

[8] Machine Learning (ML) and Artificial Intelligence (AI): Drones and Self-driving Cars – Part Five by Dr. RGS Asthana, Senior Member IEEE
[9] Machine Learning (ML) and Artificial Intelligence (AI): Healthcare – Part Six by Dr. RGS Asthana, Senior Member IEEE

[10] Machine Learning (ML) and Artificial Intelligence (AI): Will AI/ML intelligence surpass humans? Part Seven by Dr. RGS Asthana, Senior Member IEEE

http://newblogrgs10.blogspot.in/2017/06/machine-learning-ml-and-artificial.html

[11] Machine Learning (ML) and Artificial Intelligence (AI): Impact of AI/ML in Healthcare: Part-Eight by Dr. RGS Asthana, Senior Member IEEE

[12] DeepMind website
[13] IBM Watson Website
[14] Internet of Things (IoT)

[15] How ML, Big Data and AI are changing healthcare forever

[16] Pathway Genomics pathway.com
https://www.cbinsights.com/company/pathway-genomics

[17] What is difference between Big Data and Machine Learning?

[18] How Big Data Is Empowering AI and Machine Learning at Scale

[19] Using Data to Improve Health Outcomes
[20] What is Machine Learning? A definition
[21] How Big Data Is Empowering AI and Machine Learning at Scale

[22] Overview of Artificial Intelligence and Role of Natural Language Processing in Big Data

[23] 5 big data trends that will shape AI in 2017
[24] Q-learning
[25] Human-in-the-loop
[26] Facebook researchers shut down AI bots that started speaking in a language unintelligible to humans
[27] Infosys Launches Infosys Nia™ - The Next Generation Integrated Artificial Intelligence Platform
[28] Edgeverve website
[30] Why Big Data and AI Need Each Other -- and You need them both
[31] How Big Data Drives AI
[32] About Baidu Research
[33] Difference between ML, DS, AI, Deep Learning, and Statistics
[34] DS
[35] DS vs. Big Data vs. Data Analytics
[36] What is the role of DS for Big Data
[37] Hindustan Times dated Aug. 16, 2017 page 12: Why we should not be afraid of AI
[38] How Big Data is empowering AI and ML at scale