Machine Learning (ML) and Artificial Intelligence (AI): Big data & Data Science (DS) and
their importance: Part-Nine
by
Dr. RGS Asthana
Senior Member IEEE
Figure 1: Machine Learning [20]
Summary
Big Data analytics is a well-established technology, and many tools and
software packages on the market handle Big Data. AI and machine learning depend
on large amounts of data, but big data is hard to organize and analyze.
ML is a technology that can work on Big Data to find hidden patterns that are
almost impossible for humans to find simply because the data is so vast.
Bringing big data and ML/AI together can therefore produce interesting results
for organizations.
We define big data, Data Science (DS) and ML/AI and describe their role and
importance in solving an organization's problems. A list of general problems
that can be solved using ML/AI is also given. We cover the key areas of Big
Data, DS and ML/AI and what they mean. The Way forward section describes what
we expect in the near future.
Prerequisite
Read articles [1] to [14]
Keywords
Machine Learning (ML) Tools, Artificial Intelligence (AI), Neural Networks,
Internet of Things (IoT), Data Science (DS), Deep Mind, IBM's Watson
Prelude
AI refers to approaches for developing systems that not only do intelligent
things but can also learn from experience. ML is a subset of AI. The meeting of
big data with ML/AI [21] has given us an opportunity to handle large amounts of
data, some of which is made available for research in anonymized form, and ML
algorithms can build predictive models from it easily. This approach has
empowered companies to derive business value from their data and analytics
capabilities. Earlier limitations of data availability, limited sample sizes,
and an inability to analyze massive amounts of data in real time have now been
overcome thanks to these technologies.
However, the first major task is that this data is raw and needs to be
pre-processed. The second major task is to combine various databases such as
shopping behavior, daily exercise and diet information, perhaps personal
genomes, and occasional blood test data. Pathway Genomics [15, 16], an
IBM-backed company, is developing a simple blood test to determine whether early
prediction of certain cancers is possible. The company produces actionable and
accurate genetic information for physicians as well as their patients to
sustain health and wellness. Its mobile health applications, powered by AI and
deep learning, use personal genetic information to give users personalized
health and wellness guidance.
Big Data [17] - Big data may comprise structured as well as unstructured data.
The main sources of data are social sites, YouTube and online commerce sites,
which hold a lot of data about their customers. In fact, big data may include
information about employees, company purchases, sales records, business
transactions, the previous records of organizations, social media [22], etc.
Typically, big data is defined by factors such as Volume, Velocity, Variety,
Complexity and Variability.
Volume: The data is both structured and unstructured, and because the cost of
storage is now negligible, all of the collected data can be kept. Cloud
technologies are now well developed and play a major role in this.
Velocity: Smart sensors, smart metering and RFID tags make it necessary to deal
with a huge data influx in almost real time.
Variety: The data may be structured or unstructured, e.g. text, emails, video,
audio and financial transactions. In fact, data comes in every conceivable form.
Complexity: It is necessary to deduce connections and a hierarchy among multiple
data sets. Without such organization, torrents of incoming data will flood your
system and memory without serving any useful purpose.
Variability: As the data sources vary, the data flow rates also vary.
Unstructured data gives a very inconsistent data flow. Big data may have
multiple data sources, such as social media, streaming unstructured data and
public sources; the CIA World Factbook, Facebook and the European Union Open
Data Portal are among the most common ones. Traditional data storage such as an
RDBMS is not suited to handling such huge quantities of data, so a different
storage and computing paradigm is necessary. Hadoop, Spark and NoSQL are
examples of such tools.
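The computing paradigm behind tools such as Hadoop can be illustrated with a
minimal, single-process sketch of the map, shuffle and reduce steps. The
function names and sample records below are hypothetical and chosen only for
illustration; this is not the Hadoop or Spark API, only the idea of splitting
work into independent map tasks whose outputs are grouped by key and then
reduced.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (word, 1) pairs for one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records):
    pairs = []
    for record in records:  # in a real cluster, records are mapped on many machines
        pairs.extend(map_phase(record))
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical sample records standing in for a large, distributed data set.
sample_records = ["big data needs new tools", "new tools handle big data"]
counts = word_count(sample_records)
```

Because each record is mapped independently, the map step can in principle be
spread across many machines, which is what makes the approach suitable for
data volumes that a single RDBMS server cannot hold.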
Data now resides in silos where accountability, focus, and mission are clear.
Serverless, microservice architectures make it easy to access, analyze, and
manage data without racking servers, configuring virtual machines, or even
paying by the hour. Going serverless allows data owners to focus on their data
application and pay only for use by the minute, not by the hour as before.
Data Science (DS) - DS, also known as data-driven science [34], is an
interdisciplinary field about scientific methods, processes, and systems to
extract knowledge or insights from data in various forms, either structured or
unstructured. It is a broad discipline similar to data mining and overlaps with
fields such as statistics, IoT [14], operations research, applied mathematics
and also ML/AI and deep learning. Dealing with both unstructured and structured
data, DS includes anything related to data cleansing, preparation, and
analysis. As in any field, data scientists may also use appropriate techniques
and algorithms from other fields to handle very large unstructured data sets in
automated ways, i.e. without a human in the loop (HITL) [25], not only to
perform transactions in real time but also to make predictions.
DS is, in fact, a much wider discipline than ML. In DS, data may or may not
originate automatically from a machine: survey data, for example, is collected
manually, and clinical trials may involve only a specific type of small data.
In particular, DS covers aspects such as data integration, distributed
architecture, automating ML, data visualization, dashboards and BI, and
automated data-driven decisions [33].
Analytics will also improve as more and more data becomes available and better
algorithms are developed. To feed proper training data into ML systems, the
data must first be cleansed or pre-processed, i.e. all errors in format,
duplications, etc. are fixed. Making sense of data, or making it useful for
solving a specific problem, is the job of the data scientist.
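As an illustration of the cleansing step, here is a minimal sketch that
normalizes formats and removes duplicates before data is used for training.
The record fields (name, date, amount), the accepted date formats and the
sample rows are all assumptions made for this example, not part of any
particular tool.

```python
from datetime import datetime


def parse_date(text):
    """Try a few assumed date formats; return an ISO date string or None."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def cleanse(records):
    """Normalize formats, drop rows that cannot be repaired, remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        name = " ".join(rec["name"].split()).title()
        date = parse_date(rec["date"])
        try:
            amount = round(float(str(rec["amount"]).replace(",", "")), 2)
        except ValueError:
            amount = None
        if date is None or amount is None:
            continue  # format error that cannot be fixed automatically
        key = (name, date, amount)
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        cleaned.append({"name": name, "date": date, "amount": amount})
    return cleaned


# Hypothetical raw rows with inconsistent formats, a duplicate and two bad rows.
raw = [
    {"name": "  alice  smith", "date": "2017-08-16", "amount": "1,200.50"},
    {"name": "Alice Smith", "date": "16/08/2017", "amount": 1200.5},
    {"name": "bob jones", "date": "not a date", "amount": "30"},
    {"name": "Carol White", "date": "01-09-2017", "amount": "n/a"},
    {"name": "dave  brown", "date": "02/09/2017", "amount": "75"},
]
clean = cleanse(raw)
```

In practice, rows that fail parsing would usually be logged and reviewed
rather than silently dropped; the sketch drops them only to stay short.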
ML/AI - The success of ML/AI depends on data labeling and analysis. In brief,
ML [20] is an application of AI (see figure 1) that gives systems the ability
to automatically learn and improve from experience without being explicitly
programmed. In other words, ML is a data analysis method that automates model
building: the algorithm, itself a set of instructions, extrapolates a model
from the data. ML offers deeper insight into collected data and helps find
hidden patterns. The main goal of ML when used on big data is to identify
useful patterns in it, and ML algorithms can predict future patterns too.
ML/AI applications depend on data to develop more predictive models.
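To make "learning from data without being explicitly programmed" concrete, here
is a minimal sketch of one of the simplest predictive models, a straight line
fitted by ordinary least squares. The observations (hours of machine use versus
units produced) are hypothetical values invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Hypothetical observations: hours of machine use vs. units produced.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [5.1, 6.9, 9.2, 10.8, 13.0]

slope, intercept = fit_line(hours, units)   # parameters learned from the data
predicted = slope * 6.0 + intercept         # prediction for an unseen input
```

Nobody wrote the rule "about two units per hour" into the program; the slope
and intercept are estimated from the observations, and more observations
generally give a more reliable estimate.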
ML can also be applied to smaller datasets; however, the outcome may not be very
accurate because there is less to learn from. Over the years, ML has evolved
from simple analytical algorithms to the automatic application of customized
algorithms and mathematical calculations to big data. Speed and iteration are
distinctive qualities of ML.
By 2020 there will be about 44 zettabytes of data, according to IDC [30]. Here,
ML/AI may play a big role, as it adds an intelligence layer to big data, so
complex analytical tasks can be handled much faster than humans can manage.
ML/AI may take over only routine jobs, and on an optimistic view there will
always be work for good intellectuals, irrespective of how smart machines get.
IoT [14] will also swell with the introduction of new toys, car accessories,
and home and security products, but it will also open access for hackers, and
this could become a cause for worry. In deep learning, you do not tell machines
what to do; the machines determine what to do based on the data. This approach
got Stephen Hawking and Elon Musk worried [1], as the idea may eventually lead
to the development of thinking machines that surpass humans and, due to vast
connectivity, may take over the world and threaten our very existence. Facebook
recently [26] shut down an AI bot-based app in which the bots started ignoring
their programmed instructions and began speaking in a language unintelligible
to humans instead of English, which was supposed to be the set method of
communication.
With advances in data processing and cloud applications, every platform is
becoming available in the cloud. A number of ML platforms are also offered by
tech companies, e.g. Baidu, Amazon, Microsoft, IBM, Google, and many others.
Baidu Research [32] works on ML/AI technologies in areas such as image
recognition, speech recognition, high performance computing, natural language
processing (NLP) and deep learning.
Problems ML/AI can solve
In brief, certain problems that can be solved through ML/AI (see figure 2) are
given below:
· Automate repetitive tasks
· Fraud detection
· Recommendation systems
· Real-time ads on mobile devices
· Credit scoring and best offer ads
· Web pages
· Web search results
· New pricing models
· Email spam filtering
· Prediction of equipment failure
· Detection of network intrusion(s)
· Image recognition
· Pattern recognition
· Speech processing
· Sentiment analysis from texts
· Self-Driving Cars (SDCs): ML can analyze data collected by the numerous
cameras mounted on SDCs, so the vehicles can read their environment in real
time and react much like a human driver, making trips not only faster and safer
but also more efficient.
· Healthcare: Analyzing vast amounts of medical data coming from a number of
locations and demographics helps determine which conditions improve the
effectiveness of certain treatments and which do not.
· Natural Language Processing (NLP)
· Machine Translation
· Financial management and investment planning
Figure 2: AI and intelligence [22]
Congregating Big Data, DS and ML/AI
The convergence of big data and ML/AI has changed how businesses value their
data and use it to stay ahead of their competitors. The importance of big data
will mature further in the coming years, particularly as ML/AI is implemented
at a rapid pace to take over laborious, routine, risky and mission-oriented
jobs, freeing employees not only to do more human job functions but also to
concentrate on developing new profitable schemes for the business.
Big data analytics is an established and independent field, and both Big Data
and ML/AI are influential on their own [18]. Consider what happens when big
data and ML/AI are merged: the availability of greater volumes and sources of
data enables capabilities in ML/AI that had remained dormant due to lack of
data availability, limited sample sizes, and an inability to analyze massive
amounts of data in almost real time.
Big data empowers data scientists to access and work
with massive sets of data without restriction and hesitation. Organizations can
now load all of the data and let
the data itself point the direction and tell the story. Big data enables an environment that
encourages data discovery through iteration.
Big Data analytics is used at Intermountain Healthcare [19] to obtain better
health outcomes, improving its health services particularly in cardiovascular
care, endocrinology, surgery, and other areas. The company has not only lowered
its infection rates but also lowered its operating as well as total costs.
Think of the benefits if this company uses ML/AI on big data.
Role of ML and Big Data, DS in an organization
The role of ML/AI methods is to map onto the big data or DS methods deployed in
an organization to fulfill its needs. The most common ML/AI methods include
supervised learning, which is generally predictive; unsupervised learning,
which is generally exploratory and may deploy clustering and associated
algorithms as its main tool; semi-supervised learning, which uses HITL [25];
deep learning, which works mainly with gradient descent methods on neural nets
with hidden layers and back propagation to determine regularities, such as the
common clusters of visible (activated) or invisible (deactivated) pixels across
multiple training images; and finally reinforcement learning, which is based on
penalties and rewards awarded according to the actual outcomes of applying
negative or positive policies, but which is highly compute intensive.
Q-learning, a model-free reinforcement learning technique, can help find an
optimal action-selection policy for any finite Markov decision process (MDP)
[24].
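The reward-and-penalty idea behind Q-learning can be sketched on a toy problem.
The example below is a minimal tabular Q-learning loop, assuming a hypothetical
five-state corridor in which the agent starts at the left end and receives a
reward of 1 only on reaching the right end; the environment, the constants and
the function names are all invented for illustration.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal (terminal)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # learning rate, discount, exploration


def step(state, action):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q, state, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q, state, rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy policy for the non-terminal states, read off the learned table.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

The agent is never told that moving right is correct; it only observes rewards,
and the learned table ends up preferring the action that leads toward the goal.
The technique is "model-free" in that the update uses observed transitions
only and never consults the rules inside `step`.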
Every business organization today needs someone like a data analyst or a data
scientist to look after its data [36]. This person should be capable of really
pulling out trends from data. Strong business acumen and communication ability
are therefore required in a data scientist or analyst, who needs to communicate
with business and IT leaders in a way that can influence how an organization
approaches a business challenge.
Infosys Nia [27] - The company's next-generation AI platform combines the big
data/analytics, ML/AI, knowledge management, and cognitive automation
capabilities of Mana, its first-generation AI platform; the end-to-end Robotic
Process Automation (RPA) capabilities of AssistEdge [28]; the advanced,
high-performance ML capabilities of Skytree [29]; and optical character
recognition (OCR), natural language processing (NLP) capabilities and
infrastructure management services. Building more accurate models in a shorter
amount of time requires high-performance algorithms and the use of all of your
data; the Skytree platform is built for algorithmic speed and scales to perform
ML on massive amounts of data, structured and/or unstructured. Supply chains as
well as manufacturing [31] nowadays use machines and robotics that can think
and function autonomously, or with minimal supervised programming, to perform
specific actions.
Big data, DS and ML/AI
Big data and DS let ML evolve and adjust to the everyday data analysis
requirements of an organization [22]. ML is not limited by human thinking or
speed and is thus able to mine valuable information from big data without human
bias. ML can find patterns in chunks of unstructured data collected from
everyday activities and transactions, making big data useful for the
organization.
Figure 3: Success of DS Solutions: A
Venn Diagram showing relationship between DS, IT and Business Skills [33]
The Venn diagram in figure 3 shows the relationship between the fields of DS,
IT and Business Skills.
Eighty-eight percent of data scientists have a Master's degree, and 46% have
PhDs [35]. Other skills data scientists need include:
· In-depth knowledge of SAS and/or R: For DS, R is generally preferred.
· Python coding: Python is the most common coding language used in DS, along
with Java, Perl and C/C++.
· Hadoop platform: Although not always a requirement, knowing the Hadoop
platform is still preferred for the field. Experience in Hive or Pig is a huge
plus.
· SQL database/coding: Though NoSQL and Hadoop are the major focus for data
scientists, preferred candidates can write and execute complex queries in SQL.
· Working with unstructured data: The ability to work with unstructured data
from, say, social media, video feeds such as YouTube, audio, or other sources.
· The typical salary of a data scientist is USD 120,000+.
For Big Data roles, the following skills may be needed:
· Analytical skills: The ability to make sense of enormous amounts of data.
· Creativity: The ability to create new methods to gather, interpret, and
analyze data as part of a data strategy.
· Mathematics and statistical skills: Good, old-fashioned "number crunching" is
necessary.
· Computer science: Programming experience is preferred, as professionals may
be required to develop algorithms.
· Business skills: Big Data professionals should understand the business
objectives that are in place and must have ideas about how to increase the
enterprise's profits.
Most jobs that specifically have "ML" in the title seem to be looking for CS
people with some experience in ML. "Data scientist" jobs seem to fall into one
of two categories: (1) "data analyst" jobs that look for people with some
background in data analysis, often with R/SAS/SPSS; and (2) "computational
statistician" jobs that look for Python and database experience with a good
statistics background.
Way forward
Just as oxygen is vital to the existence of living organisms, the availability
of vast amounts of data, or Big Data, is vital to the success of ML/AI
algorithms. The more data, the better the results and their accuracy. ML/AI
algorithms derive patterns from big data that a human cannot locate. With its
considerable success, ML is perceived as the new management tool, and companies
like Coca-Cola, MetLife and Netflix have been using Big Data analytics for a
long time and benefiting monetarily from it. Scaled-up algorithms, such as
recurrent neural networks and deep learning, are powering the breakthrough of
ML/AI [21].
At present, ML/AI algorithms reach conclusions without a valid explanation.
Work has to be done so that ML/AI algorithms give a valid explanation for each
action they suggest; this will improve the adoption of ML/AI in many vital
fields like healthcare.
Congregating big data, DS and ML/AI may improve business performance as well as
our lives. We do, however, need real people in real-world companies to help
develop the right combination of technologies into useful platforms.
Cloud computing, Big Data and DS are enabling technologies, and riding on them
ML/AI has been galloping at a tremendous pace in the last few years and may
reach the state of General AI, which can perform any intellectual task that a
human can do, sooner than expected. The use of Big Data and DS technologies
enables capabilities in ML/AI [38] that remained dormant for decades due to
lack of data availability, limited sample sizes, and an inability to analyze
massive amounts of data in almost real time. The global AI market is expected
to grow at 36% annually, reaching a valuation of $3 trillion by 2025 from $126
billion in 2015 [37]. Big Data and DS are more or less independent technologies
and have strengths of their own. DS allows cleansing of data and prepares it
for any desired application. The meeting of big data and DS with AI has emerged
as the single most important development shaping how organizations derive
greater business value from their data and analytics capabilities.
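As a quick consistency check of the market figures quoted from [37], compound
growth can be computed directly. The sketch below assumes simple annual
compounding over the ten years from 2015 to 2025; the function names are
hypothetical.

```python
def compound(start, annual_rate, years):
    """Value after compounding `start` at `annual_rate` for `years` years."""
    return start * (1.0 + annual_rate) ** years


def implied_rate(start, end, years):
    """Annual growth rate needed to go from `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


# Amounts are in billions of USD.
projected_2025 = compound(126.0, 0.36, 10)
rate_for_3_trillion = implied_rate(126.0, 3000.0, 10)
```

Under these assumptions, 36% annual growth takes $126 billion to roughly $2.7
trillion, and reaching $3 trillion would need about 37% per year, so the quoted
figures agree to within rounding.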
References
[1] Progress and Perils of Artificial Intelligence (AI)
[2] Invited Chapter 6 - Evolutionary Algorithms and Neural Networks, Pages 111-136, R.G.S. Asthana, in book Soft Computing and Intelligent Systems (Theory and Applications), Academic Press Series in Engineering, Edited by: Naresh K. Sinha, Madan M. Gupta and Lotfi A. Zadeh, ISBN: 978-0-12-646490-0, http://www.sciencedirect.com/science/book/9780126464900
[3] Future 2030 by Dr. RGS Asthana, Senior Member IEEE
[4] Machine Learning (ML) and Artificial Intelligence (AI) – Part 1, by Dr. RGS Asthana, Senior Member IEEE
[5] Machine Learning (ML) and Artificial Intelligence (AI) – Part Two, by Dr. RGS Asthana, Senior Member IEEE
[6] Machine Learning (ML) and Artificial Intelligence (AI): Cognitive Services and Robotics – Part Three by Dr. RGS Asthana, Senior Member IEEE
[7] Machine Learning (ML) and Artificial Intelligence (AI): Big Data and 3 D Printing – Part Four by Dr. RGS Asthana, Senior Member IEEE
[8] Machine Learning (ML) and Artificial Intelligence (AI): Drones and Self-driving Cars – Part Five by Dr. RGS Asthana, Senior Member IEEE
[9] Machine Learning (ML) and Artificial Intelligence (AI): Healthcare – Part Six by Dr. RGS Asthana, Senior Member IEEE
[10] Machine Learning (ML) and Artificial Intelligence (AI): Will AI/ML intelligence surpass humans? Part Seven by Dr. RGS Asthana, Senior Member IEEE, http://newblogrgs10.blogspot.in/2017/06/machine-learning-ml-and-artificial.html
[11] Machine Learning (ML) and Artificial Intelligence (AI): Impact of AI/ML in Healthcare: Part-Eight by Dr. RGS Asthana, Senior Member IEEE
[12] Deep mind website
[13] IBM Watson Website
[14] Internet of Things (IoT)
[15] How ML, Big Data and AI are changing healthcare forever
[16] Pathway Genomics, pathway.com, https://www.cbinsights.com/company/pathway-genomics
[17] What is difference between Big Data and Machine Learning?
[18] How Big Data Is Empowering AI and Machine Learning at Scale
[19] Using Data to Improve Health Outcomes
[20] What is Machine Learning? A definition
[21] How Big Data Is Empowering AI and Machine Learning at Scale
[22] Overview of Artificial Intelligence and Role of Natural Language Processing in Big Data
[23] 5 big data trends that will shape AI in 2017
[24] Q-learning
[25] Human-in-the-loop
[26] Facebook researchers shut down AI bots that started speaking in a language unintelligible to humans
[27] Infosys Launches Infosys Nia™ - The Next Generation Integrated Artificial Intelligence Platform
[28] Edgeverve website
[30] Why Big Data and AI Need Each Other -- and You need them both
[31] How Big Data Drives AI
[32] About Baidu Research
[33] Difference between ML, DS, AI, Deep Learning, and Statistics
[34] DS
[35] DS vs. Big Data vs. Data Analytics
[36] What is the role of DS for Big Data
[37] Hindustan Times dated Aug. 16, 2017 page 12: Why we should not be afraid of AI
[38] How Big Data is empowering AI and ML at scale