Machine Learning Advancements and Applications: A COIT20249 Report
This report delves into the multifaceted field of machine learning, synthesizing insights from various research papers and publications. It explores diverse facets of ML, encompassing neural networks, quantum machine learning, and parameter server frameworks. The report examines the evolution of machine learning algorithms, highlighting their applications in various domains, including computer vision, speech recognition, and natural language processing. It also addresses critical issues such as adversarial examples and distributed machine learning, alongside the challenges and opportunities presented by these advancements. The report covers the applications of machine learning in different fields like healthcare and finance, and offers a comprehensive overview of the current state and future prospects of the field. The analysis includes a discussion of probabilistic machine learning, graph-based frameworks, and the impact of machine learning on high-performance computing and data analysis, providing a holistic view of the subject.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016).
Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium
on Operating Systems Design and Implementation (OSDI 16) (pp. 265-283). Retrieved
from https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
Many learning algorithms train a set of parameters using some variant of stochastic gradient
descent (SGD), which entails computing the gradients of a loss function with respect to those
parameters and then updating the parameters based on those gradients. TensorFlow includes a
user-level library that differentiates a symbolic expression for a loss function and produces a
new symbolic expression representing the gradients. For example, given a neural network
defined as a composition of layers and a loss function, the library will automatically derive the
backpropagation code. The differentiation algorithm performs a breadth-first search to identify
all of the backward paths from the target operation (e.g., a loss function) to a set of parameters,
and sums the partial gradients that each path contributes.
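As a minimal sketch of the gradient-then-update loop described above, the following toy example uses the TensorFlow 2 eager API (tf.GradientTape) rather than the symbolic graph differentiation the paper describes; the linear model and data are illustrative only.
```python
import tensorflow as tf

# Toy linear model y = w*x + b trained with plain SGD.
# The tape records the forward computation and then differentiates the loss
# with respect to the trainable parameters, mirroring the description above.
w = tf.Variable(0.0)
b = tf.Variable(0.0)
xs = tf.constant([0.0, 1.0, 2.0, 3.0])
ys = tf.constant([1.0, 3.0, 5.0, 7.0])   # ideal fit: y = 2x + 1
learning_rate = 0.05

for step in range(200):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(w * xs + b - ys))
    grad_w, grad_b = tape.gradient(loss, [w, b])
    # SGD update: parameter <- parameter - learning_rate * gradient
    w.assign_sub(learning_rate * grad_w)
    b.assign_sub(learning_rate * grad_b)

print(float(w), float(b))  # approaches 2.0 and 1.0
```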
Biamonte, J., Wittek, P., Pancotti, N., Rebentrost, P., Wiebe, N., & Lloyd, S. (2017).
Quantum machine learning. Nature, 549(7671), 195. Retrieved from
https://arxiv.org/pdf/1611.09347.pdf
Quantum machine learning software makes use of quantum algorithms as part of a larger
implementation. Analysing the steps that quantum algorithms prescribe, it becomes clear that
they have the potential to outperform classical algorithms for specific problems. This
potential is known as quantum speedup. Classical machine learning and data analysis can be
divided into several categories. First, computers can be used to perform ‘classic’ data analysis
methods such as least squares regression, polynomial interpolation, and data analysis.
Machine learning protocols can be supervised or unsupervised. In supervised learning, the
training data is divided into labelled categories, such as samples of handwritten digits
together with the actual number the handwritten digit is supposed to represent, and the job of
the machine is to learn how to assign labels to data outside the training set. In unsupervised
learning, the training set is unlabelled: the goal of the machine is to find the natural
categories into which the training data falls (e.g., different types of photos on the internet)
and then to categorize data outside of the training set. Finally, there are machine learning
tasks, such as playing Go, that involve combinations of supervised and unsupervised
learning, together with training sets that may be generated by the machine itself.
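To make the supervised/unsupervised distinction concrete, here is a small scikit-learn sketch on the bundled handwritten-digits data; the choice of models and dataset is illustrative and not taken from the paper.
```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

digits = load_digits()                       # small labelled handwritten-digit set
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Supervised: learn a mapping from images to their digit labels.
clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)
print("supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels are given; the algorithm groups the images
# into 10 clusters that ideally correspond to the digit classes.
km = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = km.fit_predict(digits.data)
print("first ten cluster assignments:", cluster_ids[:10])
```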
Carrasquilla, J., & Melko, R. G. (2017). Machine learning phases of matter. Nature
Physics, 13(5), 431. Retrieved from https://arxiv.org/pdf/1605.01735.pdf
We have shown that neural network technology, developed for engineering applications such
as computer vision and natural language processing, can be used to encode phases of matter
and discriminate phase transitions in correlated many-body systems. In particular, we have
argued that neural networks encode information about conventional ordered phases by
learning the order parameter of the phase, without knowledge of the energy or locality
conditions of the Hamiltonian. Furthermore, we have shown that neural networks can encode
basic information about the ground states of unconventional disordered models.
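The following toy sketch illustrates the idea of a network rediscovering an order parameter; it uses synthetic "ordered" and "disordered" spin configurations rather than the Monte Carlo Ising data used in the paper, and the model sizes are arbitrary.
```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_spins, n_samples = 100, 500

# Toy "ordered" configurations: all spins aligned (randomly up or down),
# with a small fraction flipped to mimic thermal noise.
sign = rng.choice([-1.0, 1.0], size=(n_samples, 1))
ordered = np.repeat(sign, n_spins, axis=1)
ordered[rng.random((n_samples, n_spins)) < 0.05] *= -1

# Toy "disordered" configurations: spins chosen independently at random.
disordered = rng.choice([-1.0, 1.0], size=(n_samples, n_spins))

X = np.vstack([ordered, disordered])
y = np.array([1] * n_samples + [0] * n_samples)   # 1 = ordered, 0 = disordered

# A small feed-forward network effectively learns the magnetisation
# (the order parameter) as its internal feature for separating the phases.
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))
```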
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., ... & Zhang, Z. (2015). Mxnet: A
flexible and efficient machine learning library for heterogeneous distributed
systems. arXiv preprint arXiv:1512.01274. Retrieved from
https://arxiv.org/pdf/1512.01274.pdf
The scale and complexity of machine learning (ML) algorithms are becoming increasingly
large. The rise of structural and computational complexity poses interesting challenges to ML
system design and implementation. The combination of the programming paradigm and
execution model yields a large design space, some of which are more interesting (and valid)
than others. Related to the issue of programming paradigms is how the computation is carried
out. Execution can be concrete, where the result is returned right away on the same thread, or
asynchronous/delayed, where the statements are first gathered and transformed into a dataflow
graph as an intermediate representation before being released to the available devices. These
two execution models have different implications for how inherent parallelism is discovered.
Concrete execution is restrictive (e.g., a parallelized matrix multiplication), whereas
asynchronous/delayed execution additionally identifies all of the parallelism within the scope
of an instance of a dataflow graph automatically.
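The contrast between concrete and delayed execution can be sketched as follows; this is not MXNet's actual API, just a minimal illustration of the idea of building a dataflow graph and evaluating it later.
```python
import numpy as np

# Concrete (imperative) execution: each statement runs immediately.
a = np.ones((2, 2))
b = a * 2
c = b + 1            # results exist as soon as each line executes

# Delayed (declarative) execution: statements only build a graph node;
# nothing is computed until the graph is explicitly evaluated, which lets
# a scheduler discover parallelism across independent nodes.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self):
        args = [i.eval() if isinstance(i, Node) else i for i in self.inputs]
        return self.op(*args)

A = Node(lambda: np.ones((2, 2)))
B = Node(lambda x: x * 2, A)
C = Node(lambda x: x + 1, B)

print(C.eval())      # the whole graph is executed only here
```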
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., ... & Temam, O. (2014,
December). Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th
Annual IEEE/ACM International Symposium on Microarchitecture (pp. 609-622). IEEE
Computer Society. Retrieved from
http://pages.saclay.inria.fr/olivier.temam/files/eval/supercomputer.pdf
Many companies are deploying services, either for consumers or industry, which are largely
based on machine-learning algorithms for sophisticated processing of large amounts of data.
Machine-Learning algorithms have become ubiquitous in a very broad range of applications
and cloud services; examples include speech recognition. It is probably not an exaggeration to
say that machine-learning applications are in the process of displacing scientific computing as
the major driver for high-performance computing. From a machine-learning perspective,
there is a significant trend towards increasingly large neural networks. Remarkably, at the
same time as this profound shift in applications is occurring, two simultaneous, albeit
apparently unrelated, transformations are taking place in the machine-learning and hardware
domains. Our community is well aware of the trend towards heterogeneous
computing where architecture specialization is seen as a promising path to achieve high
performance at low energy.
Ghahramani, Z. (2015). Probabilistic machine learning and artificial
intelligence. Nature, 521(7553), 452. Retrieved from
https://www.repository.cam.ac.uk/bitstream/handle/1810/248538/Ghahramani
%25202015%2520Nature.pdf?sequence=1
The key idea behind the probabilistic framework for machine learning is that learning can be
thought of as inferring plausible models to explain observed data. A machine can use such
models to make predictions about future data, and decisions that are rational given these
predictions. Uncertainty plays a fundamental role in all of this. Observed data can be
consistent with many models, and therefore which model is appropriate given the data is
uncertain. Similarly, predictions about future data and the future consequences of actions are
uncertain. Probability theory provides a framework for modelling uncertainty. Data are the
key ingredient of all machine learning systems. But data, even so-called “Big Data”, are
useless on their own until one extracts knowledge or inferences from them. Probabilistic
modelling also has some conceptual advantages over alternatives as a normative theory for
learning in artificially intelligent (AI) systems.
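As a tiny, self-contained illustration of "learning as inference under uncertainty", the sketch below performs a Beta-Binomial posterior update for a biased coin; the example and numbers are illustrative, not drawn from the paper.
```python
import numpy as np

# Bayesian coin-flip example: learning is treated as updating beliefs
# (a posterior distribution) about a model parameter after seeing data.
rng = np.random.default_rng(0)
true_bias = 0.7
flips = rng.random(50) < true_bias          # observed data: 50 coin flips
heads, tails = flips.sum(), (~flips).sum()

# A Beta(1, 1) (uniform) prior combined with a Binomial likelihood gives a
# Beta(1 + heads, 1 + tails) posterior in closed form.
alpha, beta = 1 + heads, 1 + tails
posterior_mean = alpha / (alpha + beta)
posterior_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

print("posterior mean:", posterior_mean)        # estimate of the coin's bias
print("posterior std :", posterior_var ** 0.5)  # remaining uncertainty
```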
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and
prospects. Science, 349(6245), 255-260. Retrieved from
http://www.cs.cmu.edu/~tom/pubs/Science-ML-2015.pdf
Machine learning is a discipline focused on two interrelated questions: How can one
construct computer systems that automatically improve through experience? and What are the
fundamental statistical-computational-information-theoretic laws that govern all learning
systems, including computers, humans, and organizations? The study of machine learning is
important both for addressing these fundamental scientific and engineering questions and for
the highly practical computer software it has produced and fielded across many applications.
Within artificial intelligence (AI), machine learning has emerged as the method of choice for
developing practical software for computer vision, speech recognition, natural language
processing, robot control, and other applications. Many developers of AI systems now
recognize that, for many applications, it can be far easier to train a system by showing it
examples of desired input-output behaviour than to program it manually by anticipating the
desired response for all possible inputs. The effect of machine learning has also been felt
broadly across computer science and across a range of industries concerned with data-
intensive issues, such as consumer services, the diagnosis of faults in complex systems, and
the control of logistics chains.
Kurakin, A., Goodfellow, I., & Bengio, S. (2016). Adversarial machine learning at
scale. arXiv preprint arXiv:1611.01236. Retrieved from
https://arxiv.org/pdf/1611.01236.pdf
Adversarial examples are malicious inputs designed to fool machine learning models. They
often transfer from one model to another, allowing attackers to mount black-box attacks
without knowledge of the target model’s parameters. Adversarial training is the process of
explicitly training a model on adversarial examples, in order to make it more robust to attack
or to reduce its test error on clean inputs. So far, adversarial training has primarily been
applied to small problems. Adversarial examples pose potential security threats for practical
machine learning applications.
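The sketch below shows the fast gradient sign method (FGSM), the basic attack studied in this line of work, using TensorFlow; `model`, `image`, `label`, and `epsilon` are placeholders for any trained Keras classifier and its input, not objects defined in the paper.
```python
import tensorflow as tf

def fgsm_example(model, image, label, epsilon=0.01):
    """Return an adversarial version of `image` using the fast gradient
    sign method: take a small step in the direction that increases the loss."""
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = tf.keras.losses.sparse_categorical_crossentropy(label, prediction)
    gradient = tape.gradient(loss, image)
    adversarial = image + epsilon * tf.sign(gradient)
    # Keep pixel values in the valid range.
    return tf.clip_by_value(adversarial, 0.0, 1.0)
```
Adversarial training then simply mixes such perturbed inputs into the training batches.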
Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., ... & Su, B.
Y. (2014). Scaling distributed machine learning with the parameter server. In 11th
USENIX Symposium on Operating Systems Design and Implementation (OSDI
14) (pp. 583-598). Retrieved from
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf
We propose a parameter server framework for distributed machine learning problems. Both
data and workloads are distributed over worker nodes, while the server nodes maintain
globally shared parameters, represented as dense or sparse vectors and matrices. The
framework manages asynchronous data communication between nodes, and supports flexible
consistency models, elastic scalability, and continuous fault tolerance. Distributed
optimization and inference are becoming a prerequisite for solving large-scale machine
learning problems. At scale, no single machine can solve these problems sufficiently rapidly,
due to the growth of data and the resulting model complexity, which often manifests itself in an
increased number of parameters. Implementing an efficient distributed algorithm, however, is
not easy: both intensive computational workloads and the volume of data communication
demand careful system design. Sharing imposes three challenges: accessing the parameters
requires an enormous amount of network bandwidth; many machine learning algorithms are
sequential, and the resulting barriers hurt performance when the cost of synchronization and
machine latency is high; and, at scale, fault tolerance is critical, because learning tasks are often
performed in a cloud environment where machines can be unreliable and jobs can be preempted.
Our parameter server provides five key features:
• Efficient communication: the asynchronous communication model does not block computation (unless requested) and is optimized for machine learning tasks to reduce network traffic and overhead.
• Flexible consistency models: relaxed consistency further hides synchronization cost and latency; the algorithm designer can balance algorithmic convergence rate and system efficiency, and the best trade-off depends on the data, algorithm, and hardware.
• Elastic scalability: new nodes can be added without restarting the running framework.
• Fault tolerance and durability: recovery from and repair of non-catastrophic machine failures within 1 s, without interrupting computation; vector clocks ensure well-defined behaviour after network partition and failure.
• Ease of use: the globally shared parameters are represented as (potentially sparse) vectors and matrices to facilitate development of machine learning applications, and the linear algebra data types come with high-performance multi-threaded libraries.
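To make the push/pull pattern concrete, the following is a toy single-process simulation of a parameter server with three workers; it is an illustration of the idea only, not the authors' system or API, and the linear-regression task is invented for the sketch.
```python
import numpy as np

class ParameterServer:
    """Toy in-process stand-in for the server nodes: holds the globally
    shared parameters and applies gradient updates pushed by workers."""
    def __init__(self, dim, learning_rate=0.1):
        self.w = np.zeros(dim)
        self.lr = learning_rate

    def pull(self):
        return self.w.copy()

    def push(self, gradient):
        self.w -= self.lr * gradient

def worker_gradient(w, X, y):
    # Gradient of mean squared error for a linear model on this worker's shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(300, 3))
y = X @ true_w

server = ParameterServer(dim=3)
shards = np.array_split(np.arange(300), 3)       # data partitioned across 3 workers
for epoch in range(200):
    for shard in shards:                          # each worker: pull, compute, push
        w = server.pull()
        server.push(worker_gradient(w, X[shard], y[shard]))

print(server.pull())   # close to [1.0, -2.0, 0.5]
```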
Lison, P. (2015). An introduction to machine learning. Language Technology Group
(LTG), 1, 35. Retrieved from http://folk.uio.no/plison/pdfs/talks/machinelearning.pdf
General idea: collect data for the problem, then use this data to learn how to solve the task.
Key advantages: machine learning can robustly solve complex tasks, relies on real-world data
instead of pure intuition, and can adapt to new situations (by collecting more data). Certain
tasks are extremely difficult to program by hand: spam filtering, face recognition, machine
translation, speech recognition, data mining, and robot motion. In supervised learning, the
training data are encoded as pairs (i, o), where the “correct” output o is often manually
annotated (e.g., spam filtering, machine translation, face recognition). In unsupervised
learning, we do not have access to any output value o; we simply have a collection of input
examples i. In reinforcement learning, we do not have direct access to “the” correct output o
for an input i.
Low, Y., Gonzalez, J. E., Kyrola, A., Bickson, D., Guestrin, C. E., & Hellerstein, J.
(2014). Graphlab: A new framework for parallel machine learning. arXiv preprint
arXiv:1408.2041. Retrieved from https://arxiv.org/ftp/arxiv/papers/1408/1408.2041.pdf
Designing and implementing efficient, provably correct parallel machine learning (ML)
algorithms is challenging. Existing high-level parallel abstractions like MapReduce are
insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts
repeatedly solving the same design challenges. By targeting common patterns in ML, we
developed GraphLab, which improves upon abstractions like MapReduce by compactly
expressing asynchronous iterative algorithms with sparse computational dependencies while
ensuring data consistency and achieving a high degree of parallel performance. Exponential
gains in hardware technology have enabled sophisticated machine learning (ML) techniques
to be applied to increasingly challenging real-world problems. However, recent developments
in computer architecture have shifted the focus away from frequency scaling and towards
parallel scaling, threatening the future of sequential ML algorithms. In order to benefit from
future trends in processor technology and to be able to apply rich structured models to rapidly
scaling real-world problems, the ML community must directly confront the challenges of
parallelism. We identified several limitations in applying existing parallel abstractions like
MapReduce to Machine Learning (ML) problems.
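The kind of sparsely dependent, iterative vertex-update workload GraphLab targets can be illustrated with a tiny PageRank-style loop in plain Python; the graph and damping factor are invented for the sketch, and this is not GraphLab's actual API.
```python
# Toy iterative vertex-update computation (PageRank-style) on a sparse graph.
edges = {                       # adjacency list: node -> outgoing neighbours
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
nodes = list(edges)
rank = {n: 1.0 / len(nodes) for n in nodes}
damping = 0.85

for _ in range(50):                         # each sweep updates every vertex
    incoming = {n: 0.0 for n in nodes}
    for src, outs in edges.items():
        for dst in outs:
            incoming[dst] += rank[src] / len(outs)
    rank = {n: (1 - damping) / len(nodes) + damping * incoming[n] for n in nodes}

print(rank)
```
Each vertex update depends only on its neighbours, which is exactly the sparse dependency structure that GraphLab exploits to run such updates asynchronously and in parallel.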
Marsland, S. (2014). Machine learning: an algorithmic perspective. Chapman and
Hall/CRC. Retrieved from
http://dai.fmph.uniba.sk/courses/ICI/References/marsland.machine-
learning.2ed.2015.pdf
Machine learning, then, is about making computers modify or adapt their actions (whether
these actions are making predictions, or controlling a robot) so that these actions get more
accurate, where accuracy is measured by how well the chosen actions reflect the correct ones.
The computational complexity of the machine learning methods will also be of interest to us
since what we are producing is algorithms. It is particularly important because we might want
to use some of the methods on very large datasets, so algorithms that have high-degree
polynomial complexity in the size of the dataset (or worse) will be a problem. The size and
complexity of these datasets mean that humans are unable to extract useful information from
them. Even the way that the data is stored works against us. There is one other thing that can
help if the number of dimensions is not too much larger than three, which is to use glyphs
that use other representations, such as the size or colour of the data points, to convey
information about some other dimension; but this does not help if the dataset has 100
dimensions. Machine learning methods are used in many of the software programs that we
use, such as Microsoft’s infamous paperclip in Office (maybe not the most positive example),
spam filters, voice recognition software, and lots of computer games.
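A small matplotlib sketch of the glyph idea mentioned above: two extra dimensions are encoded as the colour and size of the scatter points. The data here are synthetic and purely illustrative.
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)
third = rng.random(100)          # encoded as marker colour
fourth = rng.random(100)         # encoded as marker size

plt.scatter(x, y, c=third, s=40 + 200 * fourth, cmap="viridis", alpha=0.7)
plt.colorbar(label="third dimension (colour)")
plt.xlabel("dimension 1")
plt.ylabel("dimension 2")
plt.title("Glyphs: size and colour carry two extra dimensions")
plt.show()
```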
• Supervised learning: a training set of examples with the correct responses (targets) is provided and, based on this training set, the algorithm generalises to respond correctly to all possible inputs. This is also called learning from exemplars.
• Unsupervised learning: correct responses are not provided; instead, the algorithm tries to identify similarities between the inputs so that inputs that have something in common are categorised together. The statistical approach to unsupervised learning is known as density estimation.
• Reinforcement learning: this sits somewhere between supervised and unsupervised learning. The algorithm gets told when the answer is wrong, but does not get told how to correct it; it has to explore and try out different possibilities until it works out how to get the answer right. Reinforcement learning is sometimes called learning with a critic because of this monitor that scores the answer but does not suggest improvements.
• Evolutionary learning: biological evolution can be seen as a learning process: biological organisms adapt to improve their survival rates and chances of having offspring in their environment. We can model this in a computer using an idea of fitness, which corresponds to a score for how good the current solution is (a toy sketch follows below).
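The following toy genetic-algorithm sketch illustrates fitness-based selection and mutation; the one-dimensional objective and all parameters are invented for the illustration.
```python
import numpy as np

# Toy evolutionary search: maximise fitness(x) = -(x - 3)^2 over real x.
rng = np.random.default_rng(0)

def fitness(x):
    return -(x - 3.0) ** 2          # best possible solution is x = 3

population = rng.uniform(-10, 10, size=50)
for generation in range(100):
    scores = fitness(population)
    # Selection: keep the fittest half of the population.
    survivors = population[np.argsort(scores)[-25:]]
    # Reproduction with mutation: copy survivors and add small random changes.
    children = survivors + rng.normal(scale=0.3, size=survivors.shape)
    population = np.concatenate([survivors, children])

best = population[np.argmax(fitness(population))]
print(best)    # close to 3.0
```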
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., ... & Xin, D.
(2016). Mllib: Machine learning in apache spark. The Journal of Machine Learning
Research, 17(1), 1235-1241. Retrieved from http://www.jmlr.org/papers/volume17/15-
237/15-237.pdf
Modern datasets are rapidly growing in size and complexity, and there is a pressing need to
develop solutions to harness this wealth of data using statistical methods. Spark is one of
several ‘next generation’ dataflow engines that generalize MapReduce: it is a fault-tolerant and
general-purpose cluster computing system providing APIs in Java, Scala, Python, and R,
along with an optimized engine that supports general execution graphs. Moreover, Spark is
efficient at iterative computations and is thus well-suited for the development of large-scale
machine learning applications. Supported Methods and Utilities: MLlib provides fast,
distributed implementations of common learning algorithms, including (but not limited to):
various linear models, naive Bayes, and ensembles of decision trees for classification and
regression problems; alternating least squares with explicit and implicit feedback for
collaborative filtering; and k-means clustering and principal component analysis for
clustering and dimensionality reduction. Algorithmic Optimizations: MLlib includes many
optimizations to support efficient distributed learning and prediction. Integration: MLlib
benefits from the various components within the Spark ecosystem. At the lowest level, Spark
core provides a general execution engine with over 80 operators for transforming data, e.g.,
for data cleaning and featurization.
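A minimal PySpark sketch of fitting one of these distributed learners, assuming a local Spark installation; the four-row training set is a toy stand-in, not data from the paper.
```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Tiny toy training set: each row holds a label and a feature vector.
train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
     (0.0, Vectors.dense([2.0, 1.0, -1.0])),
     (0.0, Vectors.dense([2.0, 1.3, 1.0])),
     (1.0, Vectors.dense([0.0, 1.2, -0.5]))],
    ["label", "features"])

# Distributed logistic regression from spark.ml; Spark parallelises the
# iterative optimisation across the cluster (here, local threads).
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
print(model.coefficients, model.intercept)

spark.stop()
```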
Obermeyer, Z., & Emanuel, E. J. (2016). Predicting the future—big data, machine
learning, and clinical medicine. The New England journal of medicine, 375(13), 1216.
Retrieved from https://sci-hub.tw/10.1056/NEJMp1606181
It’s essential to remember, however, that data by themselves are useless. To be useful, data
must be analysed, interpreted, and acted on. Thus, it is algorithms — not data sets — that will
prove transformative. First, it’s important to understand what machine learning is not. Most
computer-based algorithms in medicine are “expert systems” — rule sets encoding
knowledge on a given topic, which are applied to draw conclusions. Machine learning has
become ubiquitous and indispensable for solving complex problems in most sciences.
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017,
April). Practical black-box attacks against machine learning. In Proceedings of the 2017
ACM on Asia conference on computer and communications security (pp. 506-519). ACM.
Retrieved from https://arxiv.org/pdf/1602.02697.pdf
Machine learning (ML) models, e.g., deep neural networks (DNNs), are vulnerable to
adversarial examples: malicious inputs modified to yield erroneous model outputs, while
appearing unmodified to human observers. Potential attacks include having malicious content
like malware identified as legitimate or controlling vehicle behaviour. Yet, all existing
adversarial example attacks require knowledge of either the model internals or its training
data. A classifier is a ML model that learns a mapping between inputs and a set of classes.
For instance, a malware detector is a classifier taking executables as inputs and assigning
them to the benign or malware class. Our threat model thus corresponds to the real-world
scenario of users interacting with classifiers hosted remotely by a third-party keeping the
model internals secret. A deep neural network (DNN) is an ML technique that uses a
hierarchical composition of n parametric functions to model an input x. Each function f_i, for
i ∈ 1..n, is modelled using a layer of neurons, which are elementary computing units applying
an activation function to the previous layer’s weighted representation of the input to generate
a new representation. The two types of defence strategies are: (1) reactive where one seeks to
detect adversarial examples, and (2) proactive where one makes the model itself more robust.
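The "hierarchical composition of parametric functions" view of a DNN can be written out directly; the sketch below is a plain NumPy forward pass with illustrative layer sizes and a ReLU activation, not the networks used in the paper.
```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    """One parametric function f_i: an affine map followed by a ReLU activation."""
    return np.maximum(0.0, x @ W + b)

# A DNN modelled as the composition f3(f2(f1(x))), each f_i with its own parameters.
sizes = [4, 8, 8, 3]                       # input dim 4, two hidden layers, 3 classes
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for W, b in params[:-1]:
        h = layer(h, W, b)
    logits = h @ params[-1][0] + params[-1][1]      # final layer: no ReLU
    # Softmax turns the last representation into class probabilities.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(2, 4))                # a batch of two inputs
print(forward(x))                          # each row sums to 1
```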
Quinlan, J. R. (2014). C4.5: Programs for machine learning. Elsevier. Retrieved from
https://books.google.co.in/books?
hl=en&lr=&id=b3ujBQAAQBAJ&oi=fnd&pg=PP1&dq=Quinlan,+J.+R.+(2014).
+C4.+5:+programs+for+machine+learning.
+Elsevier.&ots=sQ7vTSEoF3&sig=SzEFOdpuli8tlXIZi6Wa3NnukVo&redir_esc=y#v=o
nepage&q=Quinlan%2C%20J.%20R.%20(2014).%20C4.%205%3A%20programs
%20for%20machine%20learning.%20Elsevier.&f=false
Most applications of artificial intelligence to tasks of practical importance are based on
constructing a model of the knowledge used by a human expert. In such an application, the
primary inputs are the details of a proposed transaction, and the classes correspond to
recommendations to approve or decline it. The key requirements are an attribute-value
description, predefined classes, discrete classes, sufficient data, and logical classification
models.
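A small decision-tree sketch of mapping attribute-value descriptions to discrete classes; note that scikit-learn grows CART trees rather than C4.5 trees, and the transaction attributes and labels below are hypothetical.
```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical attribute-value descriptions of proposed transactions:
# [amount_in_thousands, applicant_income_in_thousands, has_prior_default]
X = [[5, 40, 0], [50, 35, 1], [20, 120, 0], [80, 60, 1], [10, 25, 0], [60, 45, 0]]
y = ["approve", "decline", "approve", "decline", "approve", "decline"]

# Learn a discrete classification model from labelled examples.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["amount", "income", "prior_default"]))
print(tree.predict([[15, 90, 0]]))         # classify a new transaction
```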
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical machine
learning tools and techniques. Morgan Kaufmann. Retrieved from
http://thuvien.thanglong.edu.vn:8081/dspace/bitstream/DHTL_123456789/4050/1/Data
%20mining-1.pdf
Machine learning provides the technical basis of data mining. It is used to extract information
from the raw data in databases: information that is expressed in a comprehensible form and
can be used for a variety of purposes. The process is one of abstraction: taking the data, warts
and all, and inferring whatever structure underlies it.
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for
benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Retrieved
from https://arxiv.org/pdf/1708.07747.pdf
We provide some classification results to form a benchmark on this data set. All algorithms
are repeated 5 times by shuffling the training data, and the average accuracy on the test set is
reported. Fashion-MNIST is based on the assortment on Zalando’s website. Every product on
Zalando has a set of pictures shot by professional photographers, demonstrating different
aspects of the product, i.e. front and back looks, details, and looks with a model and in an
outfit. The original picture has a light-gray background (hexadecimal color #fdfdfd) and is
stored as a 762 × 1000 JPEG. For efficiently serving different front-end components, the
original picture is resampled at multiple resolutions, e.g. large, medium, small, thumbnail and
tiny. The conversion pipeline (sketched below) is:
1. Convert the input to a PNG image.
2. Trim any edges that are close to the colour of the corner pixels; “closeness” is defined as a distance within 5% of the maximum possible intensity in RGB space.
3. Resize the longest edge of the image to 28 pixels by subsampling, i.e. some rows and columns are skipped over.
4. Sharpen pixels using a Gaussian operator with a radius and standard deviation of 1.0, with increasing effect near outlines.
5. Extend the shortest edge to 28 pixels and put the image at the centre of the canvas.
6. Negate the intensities of the image.
7. Convert the image to 8-bit grayscale pixels.
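A rough Pillow approximation of steps 2 to 7 above; the trimming heuristic and sharpening settings only approximate the paper's exact conversion, and "input.jpg" is a placeholder path.
```python
from PIL import Image, ImageChops, ImageFilter, ImageOps

img = Image.open("input.jpg").convert("RGB")      # placeholder input picture

# 2. Trim edges close to the corner colour.
background = Image.new("RGB", img.size, img.getpixel((0, 0)))
bbox = ImageChops.difference(img, background).getbbox()
if bbox:
    img = img.crop(bbox)

# 3. Resize so the longest edge is 28 pixels.
img.thumbnail((28, 28))

# 4. Sharpen (a rough stand-in for the Gaussian sharpening step).
img = img.filter(ImageFilter.UnsharpMask(radius=1.0))

# 5. Centre the image on a 28x28 canvas.
canvas = Image.new("RGB", (28, 28), (255, 255, 255))
canvas.paste(img, ((28 - img.width) // 2, (28 - img.height) // 2))

# 6-7. Negate intensities and convert to 8-bit grayscale, then save as PNG.
result = ImageOps.invert(canvas).convert("L")
result.save("fashion_mnist_style.png")
```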
Xingjian, S. H. I., Chen, Z., Wang, H., Yeung, D. Y., Wong, W. K., & Woo, W. C.
(2015). Convolutional LSTM network: A machine learning approach for precipitation
nowcasting. In Advances in neural information processing systems (pp. 802-810).
Retrieved from https://papers.nips.cc/paper/5955-convolutional-lstm-network-a-
machine-learning-approach-for-precipitation-nowcasting.pdf
The success of existing optical-flow-based methods for precipitation nowcasting is limited
because the flow estimation step and the radar echo extrapolation step are separated, and it is
challenging to determine the model parameters to give good prediction performance. These
technical issues may be
addressed by viewing the problem from the machine learning perspective. However, such
learning problems, regardless of their exact applications, are nontrivial in the first place due
to the high dimensionality of the spatiotemporal sequences especially when multi-step
predictions have to be made, unless the spatiotemporal structure of the data is captured well
by the prediction model.
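A minimal Keras sketch of a ConvLSTM-based sequence-to-frame model for spatiotemporal prediction; the input dimensions, layer sizes, and loss are illustrative assumptions, not the configuration used in the paper.
```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Input: a sequence of 10 radar-echo-like frames of 64x64 pixels, 1 channel.
model = models.Sequential([
    layers.Input(shape=(10, 64, 64, 1)),
    # ConvLSTM layers capture spatial structure (convolution) and temporal
    # dynamics (LSTM state) of the spatiotemporal sequence at the same time.
    layers.ConvLSTM2D(32, kernel_size=(3, 3), padding="same", return_sequences=True),
    layers.ConvLSTM2D(32, kernel_size=(3, 3), padding="same", return_sequences=False),
    # Predict the next frame from the final ConvLSTM state.
    layers.Conv2D(1, kernel_size=(3, 3), padding="same", activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```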