Report on Coherent Patient Selection in Clinical Trials Using NLP

COHERENT SELECTION FOR CLINICAL TRIALS
[Author Name(s), First M. Last, Omit Titles and Degrees]
[Institutional Affiliation(s)]

Abstract
Identification of patients who satisfy certain criteria that allow them to be placed in clinical trials forms a fundamental aspect of medical research. This task aims to identify whether a patient meets, does not meet, or possibly meets a selected set of eligibility criteria based on their longitudinal records. The eligibility criteria come from real clinical trials and focus on patients' medications, past medical histories, and whether certain events have occurred in a specified timeframe in the patients' records. This task uses data from the 2014 i2b2/UTHealth Shared Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data. A literature review of various machine learning techniques is also provided to offer insights into these techniques and their applicability.
Introduction
Identification of patients who satisfy certain criteria that allow them to be placed in clinical trials forms a fundamental aspect of medical research. Finding patients for clinical trials is challenging owing to the sophisticated nature of medical research criteria, which are not easily translatable into a database query but instead require examination of the clinical narratives found in the patients' records (Liu & Motoda, 2012). This tends to take a lot of time, especially for medical researchers who intend to recruit patients, and thus researchers are usually limited to patients who are directed towards a certain trial or who seek a trial on their own. Recruitment from particular places or by particular people can result in selection bias towards certain populations, which in turn can bias the results of the study (Robert, 2014). Developing NLP systems that can automatically assess whether a patient is eligible for a study can both reduce the time it takes to recruit patients and help remove bias from clinical trials.

However, matching patients to selection criteria is not a trivial task for machines, due to the complexity the criteria often exhibit. This shared task aims to identify whether a patient meets, does not meet, or possibly meets a selected set of eligibility criteria based on their longitudinal records. The eligibility criteria come from real clinical trials and focus on patients' medications, past medical histories, and whether certain events have occurred in a specified timeframe in the patients' records (Alpaydin, 2014). This task uses data from the 2014 i2b2/UTHealth Shared Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data, with tasks on de-identification and heart disease risk factors. The data are composed of 202 sets of longitudinal patient records, annotated by medical professionals to determine whether each patient matches a list of 13 selection criteria. These criteria include determining whether the patient has taken a dietary supplement (excluding Vitamin D) in the past 2 months, whether the patient has a major diabetes-related complication, and whether the patient has advanced cardiovascular disease.
All the files have been annotated at the document level to indicate whether the patient meets or does not meet each criterion. The gold standard annotations provide the category of each patient for each criterion (Alpaydin, 2014). Participants will be evaluated on the predicted category of each patient in the held-out test data. The data for this task are provided by Partners HealthCare. All records have been fully de-identified and manually annotated for whether they meet, possibly meet, or do not meet the clinical trial eligibility criteria. The evaluation for both NLP tasks will be conducted using withheld test data, and the participating teams are asked to stop development as soon as they download the test data. Each team is allowed to upload (through the task website) up to three system runs for each of these tracks. System output is to be
submitted in the exact format of the ground truth annotations, which will be provided by the organizers (Paik, 2013).
Participants are asked to submit a 500-word abstract describing their methodologies. Abstracts may also include a graphical summary of the proposed architecture. The authors of either top-performing systems or particularly novel approaches will be invited to present or demonstrate their systems at the workshop (Liu & Motoda, 2012). A special issue of a journal will be organized following the workshop.
UMLS MetaMap
MetaMap is a highly configurable program developed for mapping biomedical text to the UMLS Metathesaurus or, equivalently, for discovering the Metathesaurus concepts referred to in a text. MetaMap uses a knowledge-intensive approach based on symbolic natural language processing as well as computational linguistic techniques (Alpaydin, 2014). Apart from its applications in information retrieval and data mining, MetaMap is acknowledged as one of the foundations of the Medical Text Indexer (MTI) of the National Library of Medicine. The Medical Text Indexer is applied in both semi-automatic and fully automatic indexing of the biomedical literature at the National Library of Medicine.
An improved version of MetaMap, called MetaMap2016 V2, is available and comes with numerous new special-purpose features aimed at improving performance on specific input types. JSON output is also provided in addition to the XML output. The benefits that come with MetaMap2016 V2 include:
Suppression of numerical concepts: Some numerical concepts of certain Semantic Types have been found to add little value to a biomedical named entity recognition application. Through MetaMap2016 V2 such unnecessary and irrelevant concepts are automatically suppressed (Goeuriot et al., 2014).
JSON output generation: MetaMap2016 V2 is able to produce JSON output.
Processing of data in tables: Through MetaMap2016 V2 it is possible to identify UMLS concepts found in tabular data in a better and more efficient way.
Improved conjunction handling: MetaMap2016 V2 provides improved handling of conjunctions.
Bigram
Also called a digram, a bigram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables or words. It is an n-gram for n=2. The frequency distribution of bigrams in a string is applied in simple statistical analysis of text in various applications, among them speech recognition, computational linguistics and cryptography. Gappy bigrams, or skipping bigrams, are word pairs that allow gaps (perhaps by avoiding connecting words, or by allowing some simulation of dependencies, as in a dependency grammar) (Rocktäschel, Weidlich & Leser, 2012).
Bigrams are mainly used to provide the conditional probability of a token given the preceding token, by applying the relation of conditional probability: P(Wn | Wn-1) = P(Wn-1, Wn) / P(Wn-1).
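As a quick illustration of this relation, the Python sketch below (using only the standard library and a made-up toy sentence, so the tokens and numbers are illustrative assumptions) counts bigrams and estimates the conditional probability of a token given its predecessor.

from collections import Counter

# Toy sentence; in the clinical setting the tokens would come from a patient record.
tokens = "the patient reports the pain improved after the treatment".split()

bigrams = list(zip(tokens, tokens[1:]))   # adjacent token pairs
bigram_counts = Counter(bigrams)          # frequency of each bigram
unigram_counts = Counter(tokens)          # frequency of each token

def conditional_probability(prev_word, word):
    # Estimate P(word | prev_word) = count(prev_word, word) / count(prev_word).
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_counts.most_common(3))
print(conditional_probability("the", "patient"))  # 1/3 for this toy sentence

The same counts, normalized in this way, are what a bigram language model for speech recognition relies on.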

Applications
They are applied in one of the most successful language models for speech recognition, where they act as a special case of the n-gram.
Bigram frequency attacks can be used in cryptography to solve cryptograms (Del, López, Benítez & Herrera, 2014).
They provide one of the approaches to statistical language identification.
Bigrams are involved in some of the activities of recreational linguistics, or logology.
tf-idf
In information retrieval, tf-idf or TFIDF, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is normally adopted as a weighting factor in information retrieval searches, user modeling and text mining (Alpaydin, 2014). The tf-idf value increases proportionally with the number of times a word appears in a document and is offset by how frequent the word is in the corpus. This helps to adjust for the fact that some words generally appear more frequently than others.
The aim of using tf-idf rather than the raw frequency of occurrence of a token in a given document is to scale down the effect of tokens that occur very frequently in a specific corpus and that are therefore empirically less informative than features occurring in a small fraction of the training corpus (Alpaydin, 2014).
Tf-idf is computed using the formula tf-idf(d, t) = tf(t) * idf(d, t), while the idf is computed as idf(d, t) = log(n / df(d, t)) + 1, and this applies when
smooth_idf=False, where n is the total number of documents and df(d, t) is the document frequency; the document frequency is the number of documents that contain the term t. The 1 is added to the idf in the equation so that terms with zero idf, that is, terms occurring in all the documents of a training set, are not entirely ignored (Witten et al., 2016). It should be noted that the idf formula given here differs from the standard textbook notation, which in most cases defines idf as idf(d, t) = log(n / (df(d, t) + 1)). When smooth_idf=True, which is the default, 1 is added to both the numerator and the denominator of the idf, as if an extra document containing every term in the collection had been seen exactly once, thereby preventing zero divisions: idf(d, t) = log((1 + n) / (1 + df(d, t))) + 1 (Paik, 2013).
The formulas used to compute tf and idf depend on parameter settings that correspond to the SMART notation used in IR, as follows: tf is by default n (natural), or l (logarithmic) when sublinear_tf=True; idf is t when use_idf is given, and n (none) otherwise; normalization is c (cosine) when norm='l2', and n (none) when norm=None (Paik, 2013).
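The parameter names above match scikit-learn's TfidfVectorizer, so the hedged sketch below (with a tiny made-up corpus) shows how those settings can be exercised and how the smoothed idf formula can be reproduced by hand; the documents themselves are invented for illustration.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus.
docs = [
    "patient denies chest pain",
    "patient reports chest pain and shortness of breath",
    "no history of diabetes",
]

# Settings mirroring the SMART-style options discussed above.
vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, norm="l2")
X = vectorizer.fit_transform(docs)               # rows are documents, columns are vocabulary terms

# Reproduce the smoothed idf by hand: idf(t) = log((1 + n) / (1 + df(t))) + 1
n = len(docs)
df = (X.toarray() > 0).sum(axis=0)               # document frequency of each term
manual_idf = np.log((1 + n) / (1 + df)) + 1
print(np.allclose(manual_idf, vectorizer.idf_))  # expected: True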
There is a significant difference between sentiment analysis and tf-idf: even though both are treated as text classification techniques, they have distinct goals. While sentiment analysis aims at classifying documents according to opinion, such as negative or positive, tf-idf is used to characterize documents by the weight of the terms within them.

Applications of tf-idf
This algorithm tends to be most useful where there is a large set of documents to be characterized. It is simple in that one does not need to train a model ahead of time, and it automatically accounts for variations in document length (Rocktäschel et al., 2012).
Vector Space Model Representation
Also called the term vector model, the vector space model is an algebraic model used for representing text documents (or objects in general) as vectors of identifiers, for example index terms. The vector space model is applied in information filtering, indexing, retrieval and relevancy ranking (Alpaydin, 2014). The model was first applied in the SMART Information Retrieval System.
In this model, documents and queries are represented as vectors, e.g. dj = (w1,j, w2,j, ..., wt,j) and q = (w1,q, w2,q, ..., wt,q).
Each dimension corresponds to a separate term, and if a term appears in a document then its value in the vector is non-zero. There are several ways of computing these values, also known as weights (Paik, 2013), and tf-idf weighting is one of the best-known schemes. The definition of a term depends on the application; terms are typically single words, keywords or longer phrases. If words are chosen as the terms, the dimensionality of the vector is the size of the vocabulary, that is, the number of distinct words in the corpus.

Applications
Using the assumption of document similarity, it is possible to compute the relevance ranking of documents in a keyword search. This is achieved by comparing the angle of deviation between each document vector and the vector of the original query, where the query is represented as the same kind of vector as the documents (Alpaydin, 2014).
In practice it is easier to compute the cosine of the angle between the vectors than the angle itself:
cos θ = (d2 · q) / (||d2|| ||q||)
where d2 · q is the dot product of the document vector d2 and the query vector q, ||d2|| is the norm of vector d2, and ||q|| is the norm of vector q. In general the norm of a vector is calculated as ||q|| = sqrt(q1^2 + q2^2 + ... + qn^2) (Trstenjak, Mikac & Donko, 2014).
Because all the vectors considered by this model have nonnegative elements, a cosine value of zero means that the query and document vectors are orthogonal and have no match, i.e. none of the query terms appear in the document under consideration.
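A minimal sketch of this cosine ranking, assuming plain NumPy arrays standing in for tf-idf weighted document and query vectors (the numbers are made up):

import numpy as np

# Made-up tf-idf weighted vectors over a four-term vocabulary.
d2 = np.array([0.0, 0.8, 0.3, 0.5])   # document vector
q = np.array([0.0, 0.6, 0.0, 0.4])    # query vector

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||); returns 0.0 if either norm is zero.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

print(cosine_similarity(d2, q))   # higher values indicate a closer match to the query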

The vector space model can be divided into three stages. The first stage is document indexing, in which the content-bearing terms are extracted from the document text. The second stage involves weighting the indexed terms so that only documents relevant to the user are retrieved. The third and last stage involves ranking the documents with respect to the query according to a similarity measure (Alpaydin, 2014).
Advantages
It does not restrict term weights to binary values.
It is a simple model based purely on linear algebra.
It permits partial matching.
It permits computing a continuous degree of similarity between queries and documents.
It enables ranking of documents according to their possible relevance.
Limitations
Its term weighting is intuitive rather than formal.
It is poor at representing long documents because of their poor similarity values.
The order in which terms appear in the document is lost in the vector space representation (Chandrashekar & Sahin, 2014).
It makes the theoretical assumption that terms are statistically independent.
It is semantically sensitive, i.e. documents with a similar context but a distinct term vocabulary cannot be associated, resulting in false negative matches.
Feature Selection
Feature selection forms an integral step in data processing that is carried out just before the application of a learning algorithm. Computational complexity is the main issue taken into consideration when a feature selection method is proposed. In most cases, a fast feature selection process is unable to search through the whole space of feature subsets, and classification accuracy is therefore reduced (Alpaydin, 2014). Also known as variable selection, variable subset selection or attribute selection, feature selection is the process by which a subset of appropriate and relevant features is selected for use in model construction.
The main role of feature selection is the elimination of irrelevant and redundant features. Irrelevant features are features that provide no useful information about the data, while redundant features are those that provide no additional information beyond the currently selected features. In other words, redundant features offer information that is of importance to the data set, but the same information is already provided by the currently selected features (Chandrashekar & Sahin, 2014).

An example is the year of birth and the age, which provide the same information about a person. Redundant and irrelevant features have the potential to lower the learning accuracy and the quality of the model achieved by the learning algorithm. Numerous proposals have been made in an attempt to apply learning algorithms more accurately and efficiently (Alpaydin, 2014); such proposals reduce the dimensionality, for example Relief, CFS and FOCUS. By removing irrelevant information and minimizing noise, the accuracy and efficiency of learning algorithms can be significantly improved. Feature selection has attracted special interest in areas of research that involve high-dimensional datasets, such as text processing, combinatorial chemistry and gene expression analysis.
An evaluation measure and a search technique are the two requirements for a feature selection algorithm. The search technique proposes new feature subsets and includes approaches such as genetic algorithms, best-first search, greedy forward selection, simulated annealing, greedy backward elimination and exhaustive search (Liu & Motoda, 2012). An evaluation measure, on the other hand, is used to score the various feature subsets; some of the most common evaluation measures include error probability, entropy, correlation, inter-class distance and mutual information. The feature selection process is summarized in the diagram below.

[Figure: Feature selection flow]
A feature selection algorithm searches over the possible feature subsets of a concept in order to come up with the optimal subset. This process may be computationally intensive and hence calls for some stopping criterion (Panahiazar et al., 2014). The stopping criterion normally depends on conditions such as the number of iterations or an evaluation threshold; an example is forcing the feature selection process to halt once a certain number of iterations has been reached, as in the sketch below.
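As a rough illustration of the search-plus-stopping-criterion idea, the sketch below performs greedy forward selection; the evaluate(subset) scoring function and the iteration cap are illustrative assumptions rather than anything prescribed by the report (in practice evaluate might be cross-validated accuracy or one of the evaluation measures listed above).

def greedy_forward_selection(all_features, evaluate, max_iterations=10):
    # Greedy forward search: repeatedly add the single feature that most improves
    # evaluate(subset), stopping at the iteration cap or when nothing improves.
    selected = []
    best_score = float("-inf")
    for _ in range(max_iterations):                      # stopping criterion: iteration cap
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        score, feature = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:                          # stopping criterion: no improvement
            break
        selected.append(feature)
        best_score = score
    return selected, best_score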
There are four main reasons for applying feature selection techniques:
To avoid the curse of dimensionality
To reduce training times (Liu & Motoda, 2012)
To simplify models, making them easier for users or researchers to interpret
To improve generalization by reducing overfitting, which is essentially the reduction of variance
Correlation-Based Feature Selection (CFS) with Greedy Hill-Climbing Search
Correlation-based feature selection evaluates feature subsets on the basis of the hypothesis that "good feature subsets contain features that are highly correlated with the classification, yet uncorrelated with each other."
The equation shown below gives the merit of a feature subset S containing a total of k features:
Merit_S(k) = (k * r_cf) / sqrt(k + k(k - 1) * r_ff)
in which r_cf is the mean value of the feature-classification correlations and r_ff is the mean value of all the feature-feature correlations. The correlation feature selection criterion is defined by
CFS = max over S_k of [ (r_cf1 + r_cf2 + ... + r_cfk) / sqrt(k + 2(r_f1f2 + ... + r_fifj + ... + r_fkf1)) ]
in which the r_cfi and r_fifj variables are referred to as correlations, even though they are not necessarily Pearson correlation coefficients (Bache & Lichman, 2013). Dr. Mark Hall's dissertation adopts neither of these and instead uses various measures of relatedness: relief, minimum description length and symmetrical uncertainty. Assuming that xi is the set-membership indicator function for a feature fi, the above can be rewritten as an optimization problem:
CFS = max over x in {0,1}^n of [ (a1*x1 + ... + an*xn)^2 / (x1 + ... + xn + sum over i≠j of 2*bij*xi*xj) ]
The above combinatorial problems are mixed 0-1 linear programming problems, which the branch-and-bound algorithm can be used to solve.
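A minimal sketch of the merit score above, assuming Pearson correlation as the (interchangeable) correlation measure and made-up NumPy data; the feature values and labels are invented for illustration only.

import numpy as np

def cfs_merit(X, y, subset):
    # Merit_S = k * r_cf / sqrt(k + k(k-1) * r_ff), with r_cf the mean |correlation|
    # between each selected feature and the class, and r_ff the mean |correlation|
    # between pairs of selected features.
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = np.mean([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in subset])
    if k > 1:
        pairs = [(i, j) for a, i in enumerate(subset) for j in subset[a + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) for i, j in pairs])
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Toy, made-up data: two class-related features and one noise feature.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200).astype(float)
X = np.column_stack([y + rng.normal(0, 0.5, 200),
                     y + rng.normal(0, 1.0, 200),
                     rng.normal(size=200)])
print(cfs_merit(X, y, [0]), cfs_merit(X, y, [0, 1]), cfs_merit(X, y, [0, 2]))

This merit function can be plugged into the greedy forward search sketched earlier, with hill climbing stopping as soon as adding any remaining feature no longer improves the merit.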
Machine Learning Overview
average mean of all the correlations between features and feature. The criterion of the correlation
feature selection is defined by
In which the variables are known as the correlations even though they are not
necessarily the coefficient of Pearson correlation Bache & Lichman (2013). A dissertation by Dr.
Mark adopts neither of these and instead uses various measures that show relationship, relief,
minimum description length and symmetrical uncertainty. Assuming the xi is the indicator
function of the set membership for a feature fi then an optimization problem can be achieved
through rewriting the above equation as:
The above combinatorial problems are combined 0-1 problems of linear programming which the
branch and branch algorithm can be used in solving.
Machine Learning Overview

For obvious reasons, machine learning has become one of the most widely discussed topics, partly because it offers the ability to automatically gain deep insights, build high-performing predictive data models, and discover and recognize unknown patterns without the need for explicit programming instructions (Panahiazar, Taslimitehrani, Jadhav & Pathak, 2014). A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. In less formal language, machine learning can be described as a subtopic of computer science that is often known as predictive analytics or predictive learning. The goal of machine learning is to construct new algorithms or leverage existing ones so as to learn from data and build generalizable models that offer accurate predictions, or to find patterns, particularly in new, unseen data of a similar nature (Panahiazar et al., 2014).
The process of Machine Learning
As hinted at in the definition of machine learning, it leverages algorithms to automatically model and find patterns in data, in most cases with the aim of predicting some target output, also called the response. The algorithms are mainly based on mathematical optimization and statistics (Bache & Lichman, 2013). Optimization is the process of finding the smallest or greatest value (the minimum or maximum) of a function, in most cases referred to as a cost function, or a loss function in the case of minimization. Gradient descent is one of the most commonly used optimization algorithms, and the normal equation is another optimization approach that has gained popularity in recent years. In summary, machine learning revolves around automatically learning a highly accurate predictive model or classifier. It
also involves finding unknown patterns in data by leveraging learning algorithms as well as optimization techniques.
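For a concrete sense of the optimization step, the sketch below runs gradient descent on a simple mean squared error cost; the data, learning rate and iteration count are made-up illustrative choices, not values taken from the report.

import numpy as np

# Made-up 1-D regression data: y is roughly 3*x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 3 * x + 2 + rng.normal(0, 0.1, 50)

w, b = 0.0, 0.0   # parameters to learn
lr = 0.5          # learning rate (step size)

for _ in range(500):
    pred = w * x + b
    # Gradients of the mean squared error cost with respect to w and b.
    grad_w = 2 * np.mean((pred - y) * x)
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)       # expected to approach roughly 3 and 2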
Types of Learning
Machine learning is primarily categorized into supervised, semi-supervised and unsupervised learning. In supervised learning, the response variable being modeled is contained in the data, and the goal is to predict the class or value of data that has not yet been seen. Unsupervised learning, on the other hand, entails learning from a data set that has no response variable or label, and is thus more about finding patterns than making predictions (Panahiazar et al., 2014).
Goals and Outputs of Machine Learning
The following output types are the main uses of machine learning algorithms:
Recommendation systems
Clustering
Regression: univariate and multivariate
Two-class and multi-class classification
Anomaly detection
Each output uses a specific algorithm. Clustering is an unsupervised technique used to discover the structure and composition of a given set of data. It is the process of grouping data into clusters to find out which groupings, if any, may be derived from them (Panahiazar et al., 2014). Each cluster is characterized by a cluster centroid and a set of data points, where the cluster centroid is the average
of all the data points contained in the cluster across all the features. Classification problems entail the placement of a data point, also called an observation, into a pre-defined class or category; sometimes a classification problem simply assigns a class to a data point, while in other cases the goal is to estimate the probability that the data point belongs to each of the given classes.
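A brief sketch of the clustering idea using scikit-learn's KMeans on made-up two-dimensional points; the fitted cluster_centers_ are exactly the per-cluster averages described above.

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two loose groups.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
                    rng.normal([5, 5], 0.5, (20, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)   # each centroid is the mean of the points assigned to its cluster
print(kmeans.labels_[:5])        # cluster assignments for the first few points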
Regression, on the other hand, means that a model assigns a continuous response to a data observation rather than a discrete class (Panahiazar et al., 2014). At other times, regression is used synonymously with an algorithm applied to classification problems, or to the prediction of a discrete categorical response, for example ham or spam. Logistic regression offers an excellent example: it is used to predict the probability of a specific discrete value.
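As a small illustration of predicting class probabilities with logistic regression, the sketch below uses scikit-learn's LogisticRegression with made-up word-count style features standing in for a ham/spam problem.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up feature vectors (e.g. counts of suspicious words) and ham(0)/spam(1) labels.
X = np.array([[0, 1], [1, 0], [4, 5], [5, 6], [0, 0], [6, 4]])
y = np.array([0, 0, 1, 1, 0, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[5, 5]]))        # predicted class for a new message
print(clf.predict_proba([[5, 5]]))  # estimated probability of each class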
At times, anomalies indicate a real problem that is not easily explained, for example a manufacturing defect. In such cases, anomaly detection provides a measure of quality control as well as insight into whether the steps taken to reduce defects have been effective. In both cases, there are times when finding the anomalous values is beneficial, hence the use of certain machine learning algorithms (Chandrashekar & Sahin, 2014).
Recommendation systems, also called recommendation engines, are a type of information filtering system aimed at providing recommendations in numerous applications, among them books, articles, movies, restaurants and products. The two most common approaches adopted are content-based filtering and collaborative filtering (Panahiazar et al., 2014).

Machine Learning Algorithms
Some of the machine learning algorithms are as shown below:
Supervised regression
Poisson regression
Simple and multiple linear regression
Ordinal regression
Nearest neighbor methods
Decision tree or forest regression
Artificial neural networks
Anomaly detection
Principal component analysis
Support vector machines
Supervised two-class and multi-class classification
Perceptron methods
Bayesian classifiers
One-vs-all multiclass
Artificial neural networks (Bache & Lichman, 2013)
Multinomial and logistic regressions
Support vector machines
Nearest neighbor methods
Decision jungles, decision trees and forests

Unsupervised
Hierarchical clustering
K-means clustering
Naive Bayes Classifier Learning
In machine learning, Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. Naive Bayes is one of the techniques used to construct classifiers, which are models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from a finite set (Bache & Lichman, 2013). It is a family of algorithms working on a common principle: all Naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class.
For example, a fruit may be considered to be an apple if it is red, round and approximately 10 cm in diameter. A Naive Bayes classifier considers each of these characteristics to contribute independently to the probability that the fruit is an apple, regardless of any possible correlations between the color, diameter and roundness features (Chandrashekar & Sahin, 2014). For some types of probability model, Naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for Naive Bayes models uses the method of maximum likelihood (Bache & Lichman, 2013). This means that it is possible to work
with the Naive Bayes model without necessarily accepting Bayesian probability or using any Bayesian methods.
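A hedged sketch of the fruit example above using scikit-learn's GaussianNB; the feature values and labels are invented for illustration, with color and shape encoded as simple 0/1 indicators.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Features: [is_red, is_round, diameter_cm]; labels: "apple" or "other" (made-up data).
X = np.array([[1, 1, 10], [1, 1, 9], [0, 1, 9], [1, 0, 12], [0, 0, 20], [0, 1, 3]])
y = np.array(["apple", "apple", "apple", "other", "other", "other"])

clf = GaussianNB().fit(X, y)
print(clf.predict([[1, 1, 10]]))         # most probable class for a red, round, 10 cm fruit
print(clf.predict_proba([[1, 1, 10]]))   # per-class probabilities under the independence assumption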
Bayes’ Theorem
Selecting the best hypothesis (h) given the data (d) is of central interest in machine learning. In a classification problem, the hypothesis (h) could be the class to be assigned to a new data instance (d). Using prior knowledge offers one of the simplest ways of picking the most probable hypothesis given the data (Chandrashekar & Sahin, 2014). Bayes' theorem provides a way of calculating the probability of a hypothesis from prior knowledge. Bayes' theorem states that:
P(h|d) = P(d|h) * P(h) / P(d)
where P(h|d) is the probability of the hypothesis h given the data d; this is also called the posterior probability.
P(d|h) is the probability of the data d given that the hypothesis h was true.
P(h) is the probability of hypothesis h being true regardless of the data; this is called the prior probability of h (Alpaydin, 2014).
P(d) is the probability of the data regardless of the hypothesis.
As can be observed, the interest here is in determining the posterior probability P(h|d) from P(d|h), P(h) and P(d). According to Bache & Lichman (2013), after the posterior probabilities of a number of different hypotheses have been calculated, a selection can be made of the hypothesis that has
the greatest probability, which will be the maximum probable hypothesis and may be referred to as the maximum a posteriori (MAP) hypothesis. This is mathematically expressed as
MAP(h) = max P(h|d)
or
MAP(h) = max (P(d|h) × P(h)) / P(d)   (Rocktäschel, Weidlich & Leser, 2012)
or
MAP(h) = max P(d|h) × P(h)
P(d) in this calculation serves as a normalizing term that scales the result to a proper probability. Because it is constant across hypotheses, it can be dropped when the focus is only on the most probable hypothesis and is applied only when the probabilities need to be normalized (Tuarob, Bhatia, Mitra & Giles, 2013).
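As a hedged illustration of the MAP calculation, the short Python snippet below uses made-up prior and likelihood values for two candidate hypotheses, multiplies P(d|h) by P(h) for each, and drops P(d) exactly as described above because it is constant across the hypotheses.

# Illustrative MAP selection with assumed probabilities (not taken from the study).
priors      = {"meets_criteria": 0.3, "does_not_meet": 0.7}   # P(h)
likelihoods = {"meets_criteria": 0.8, "does_not_meet": 0.2}   # P(d|h) for the observed record d

# Unnormalized posteriors P(d|h) * P(h); P(d) is dropped since it is the same for every h.
scores = {h: likelihoods[h] * priors[h] for h in priors}

map_hypothesis = max(scores, key=scores.get)
print(scores)           # {'meets_criteria': 0.24, 'does_not_meet': 0.14}
print(map_hypothesis)   # 'meets_criteria'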
Random Forest Machine Learning
Random Forests are supervised ensemble learning models used for classification and regression. Ensemble learning combines numerous machine learning models and thereby achieves better performance (Alpaydin, 2014). The logic behind this approach is that each individual model used in the ensemble is weak, and therefore not effective when employed on its own, but strength is gained by aggregating multiple learning models together. In the case of the Random Forest, a large
number of Decision Trees, which serve as the weak learners, is incorporated and their outputs are aggregated; the result is the strong combined ensemble.
The "random" in Random Forests comes from the fact that each individual decision tree is trained by the algorithm on a different subset of the training data (Alpaydin, 2014). In addition, each node of every decision tree is split using an attribute selected at random from the data. By introducing this randomness, the algorithm generates models that are largely uncorrelated with one another. As a result, the errors of individual trees tend to be spread evenly across the whole model and are largely cancelled out by the majority-voting strategy of the Random Forest. Just as a natural forest becomes more robust as the number of trees grows, a Random Forest generally becomes more accurate as more trees are added (Iqbal, Ahmed & Abu-Rub, 2012).
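A minimal sketch of this idea, assuming the scikit-learn library rather than any tool actually adopted in the study, is given below; the synthetic data and the choice of 100 trees are illustrative assumptions.

# Minimal Random Forest sketch with scikit-learn (synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number of trees; more trees generally give a more stable forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(accuracy_score(y_test, forest.predict(X_test)))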
The concept of decision tree
The decision tree concept is essentially a rule-based system. When the decision tree algorithm is given a training data set with features and targets, it builds a set of rules, and that same set of rules is then applied to make predictions on the test data set (Alpaydin, 2014).
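The rule-based character of a single decision tree can be illustrated with the short sketch below, which fits a shallow tree on the standard iris dataset (an assumed stand-in for real data) and prints the learned rules before reusing them for prediction.

# A single decision tree is essentially a learned set of if/then rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Print the rules the tree has learned from the training data.
print(export_text(tree, feature_names=list(data.feature_names)))

# The same rules are then applied to unseen data.
print(tree.predict(data.data[:3]))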
Random forest algorithm pseudocode can be divided into two distinct stages:
Random forest creation pseudocode
Pseudocode to perform prediction from the generated random forest classifier
Random Forest Pseudocode
Below is the pseudocode procedure:
1. Select k features at random from the total of m features, with the condition that k << m.
2. Among the k features, calculate the node d using the best split point.
3. Split the node into daughter nodes using the best split (Panahiazar et al., 2014).
4. Repeat steps 1 to 3 until l nodes have been reached.
5. Repeat steps 1 to 4 n times to create n trees; this builds the forest.
Every random forest algorithm starts by selecting k features from the m features available in total. As can be seen in the procedure, both the features and the observations are sampled randomly (Quinlan, 2014). The next phase uses the k randomly selected features to obtain the root node by means of the best-split approach. This step is followed closely by the calculation of the daughter nodes, which uses the same best-split approach. These first three stages are repeated until a tree with a root node is obtained, in which the leaf nodes are the targets. The final stage repeats the first four stages to create n randomly built trees; together these randomly created trees form the random forest.
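One possible, simplified rendering of this creation procedure in Python is sketched below; the dataset, the number of trees and the use of scikit-learn's max_features option to perform the per-split random feature selection are all assumptions made for illustration.

# Simplified random forest creation: n trees, each on a bootstrap sample,
# with a random subset of features considered at every split (k << m).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=25, random_state=1)
n_trees = 10
rng = np.random.default_rng(1)

forest = []
for i in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))        # bootstrap sample of the observations
    tree = DecisionTreeClassifier(max_features="sqrt",  # roughly k = sqrt(m) features tried per split
                                  random_state=i)
    tree.fit(X[rows], y[rows])
    forest.append(tree)

print(len(forest), "trees built")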
Random forest prediction pseudocode
The pseudocode below is used to perform prediction with the trained random forest algorithm:
1. Take the test features and use the rules of each randomly created decision tree to predict the outcome, storing all of the predicted outcomes (Chandrashekar & Sahin, 2014).
2. Calculate the votes for each predicted target.
3. Take the predicted target that received the highest number of votes as the final prediction from the random forest algorithm.
Using the trained random forest algorithm to perform prediction requires passing the test features through the rules set up for each of the randomly created trees. Suppose, for example, that 100 random decision trees were built; the analysis then proceeds as follows (Chandrashekar & Sahin, 2014). The first thing to understand is that each decision tree may predict a different target for the same test features. The votes for each predicted target are then counted. Assuming the 100 decision trees predict three unique targets named x, y and z, the vote for x is simply the number of trees out of the 100 whose prediction is x, and the same applies to the other two targets, y and z (Quinlan, 2014). If x received the highest number of votes, say 60 of the 100 decision trees predicted it, then x becomes the target and the random forest returns x as the final outcome or predicted target. This concept is referred to as majority voting.
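A self-contained sketch of this majority-voting step is given below; it mirrors the x, y, z example with a small forest of 100 trees built as in the creation sketch, and the dataset and tree count are again illustrative assumptions.

# Majority voting: each tree votes for a target and the most common vote wins.
from collections import Counter
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=25, random_state=1)
rng = np.random.default_rng(1)

# Build 100 trees on bootstrap samples, as in the creation sketch above.
forest = []
for i in range(100):
    rows = rng.integers(0, len(X), size=len(X))
    forest.append(DecisionTreeClassifier(max_features="sqrt", random_state=i).fit(X[rows], y[rows]))

# Each tree casts one vote for a single test sample; the majority vote is the final prediction.
votes = [tree.predict(X[:1])[0] for tree in forest]
print(Counter(votes))
print(Counter(votes).most_common(1)[0][0])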
Applications of random forest algorithm
There are numerous applications of the random forest algorithm, among them banking, e-commerce, medicine and the stock market. The random forest algorithm has two main applications in the banking
sector: detecting fraudulent customers and identifying loyal customers. A loyal customer in this case is a customer who pays well, takes large loans and reliably pays the loan interest to the bank (Ferrucci et al., 2013). The growth and development of any bank is directly influenced by the availability of loyal customers. Using the customer's details, the bank analyzes the customer closely to establish his or her pattern. In the same way, it is equally important to identify a customer who brings little profit, if any, to the bank.
Such a customer is one who does not take loans and who, if he did, would not pay the interest reliably. By being able to identify such customers before loans are advanced to them, the bank can reject their loan applications (Panahiazar et al., 2014). The random forest algorithm is again applicable here, in identifying the non-profitable customers. In the stock market, the random forest algorithm is applied to identify the behavior of a stock and the anticipated profit or loss when a particular stock is purchased.
In the medical field, the random forest algorithm helps to identify the right combination of components to be used in validating a drug. It is equally important in identifying disease through the analysis of patients' medical records (Chandrashekar & Sahin, 2014). In e-commerce, the random forest algorithm is used in small segments of the recommendation engine to estimate the chance that a customer will prefer the recommended products, based on similar types of customers. High-end GPU systems are typically needed to run the random forest algorithm on very large datasets, but where GPU systems are unavailable the machine learning models can be run on a desktop hosted in the cloud (Ferrucci et al., 2013).
Advantages of Random forest algorithm
The same random forest algorithm can be used for both regression and classification tasks.
The random forest algorithm greatly reduces the problem of overfitting in classification problems (Mikolov, Chen, Corrado & Dean, 2013).
The random forest algorithm can be used for feature engineering, for example to rank the importance of features.
Information Gain
Information gain is a synonym for the Kullback-Leibler divergence, which is a non-symmetric measure of the degree of divergence between two probability functions P and Q. The Kullback-Leibler divergence is the expected logarithmic difference between the probability models P and Q; when the probability functions P and Q are equal, the Kullback-Leibler divergence is zero (Quinlan, 2014). Information gain may also be defined in terms of mutual information, where it refers to the reduction in entropy achievable by learning the value of a variable A, as shown:
IG(S, A) = H(S) − Σ_i (|S_i| / |S|) × H(S_i)
where H(S) is the entropy of the given dataset S and H(S_i) is the entropy of the i-th subset produced by partitioning S on the feature A. Information gain helps to rank the various features in machine learning: the feature with the highest information gain is ranked above the other features, since it has the stronger power when it comes to classifying the data (Ferrucci et al., 2013). Information gain can also be defined for a set of features that
are considered jointly, as the entropy reduction achieved by observing a joint feature set F. The definition is similar to the one above:
IG(S, F) = H(S) − Σ_i (|S_i| / |S|) × H(S_i)
where H(S_i) is the entropy of the i-th subset produced by partitioning S on all of the features in the joint feature set F.
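A small, self-contained Python sketch of these calculations is given below; the toy feature and label values are assumptions chosen only to show how H(S), the subset entropies and the resulting information gain are computed.

# Entropy and information gain for a toy dataset (illustrative values only).
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum(p * log2(p)) over the class proportions in S.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # IG(S, A) = H(S) - sum(|S_i|/|S| * H(S_i)) over the subsets S_i induced by feature A.
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy example: whether a record mentions a medication (yes/no) versus trial eligibility.
mentions_med = ["yes", "yes", "no", "no", "yes", "no"]
eligible     = ["met", "met", "not met", "not met", "met", "not met"]
print(information_gain(mentions_med, eligible))   # 1.0: this feature separates the classes perfectly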
Methods
First Method
The study proposes to concatenate the text of every patient's records and to extract bigrams from the concatenated text. Filters will then be used to retain as features only those bigrams found to occur more than five times across the entire set of patients' records. The tf-idf of the retained features will then be calculated, and the tf-idf results will be represented using the Vector Space Model. Finally, Weka tools will be used to select features and train the model (Quinlan, 2014).
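As an illustration only, the sketch below mirrors this first pipeline in Python with scikit-learn's TfidfVectorizer, whereas the study itself intends to use Weka; the patient texts are placeholders, and the min_df parameter merely stands in for the frequency filter described above.

# First method (sketch): bigram features + tf-idf + vector space representation.
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder texts; in the study each string would be one patient's concatenated records.
patient_texts = [
    "patient on daily aspirin with history of diabetes mellitus",
    "no history of diabetes mellitus, patient denies aspirin use",
    "history of myocardial infarction, currently on aspirin",
]

# ngram_range=(2, 2) extracts bigrams; min_df plays the role of the frequency filter.
vectorizer = TfidfVectorizer(ngram_range=(2, 2), min_df=2)
tfidf_matrix = vectorizer.fit_transform(patient_texts)   # vector space model: one row per patient

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())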
Second Method
The second method will make use of UMLS MetaMap. From the MetaMap output, the concepts present in the individual text of every patient's records will be derived. Filters will then be used to retain as features only those concepts found to occur more than five times across the entire set of patients' records. After that, the tf-idf of every feature will be calculated and the tf-idf results will be represented using the Vector Space Model. Finally, Weka tools will be used to select features and train the model (Ferrucci et al., 2013).
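A comparable sketch for the second pipeline is shown below; it assumes the UMLS concepts have already been extracted by MetaMap and stored as one list of concept identifiers per patient (the identifiers shown are placeholders), and it reuses the same tf-idf and vector space steps on those concepts instead of bigrams.

# Second method (sketch): tf-idf over MetaMap-derived concepts instead of bigrams.
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder concept lists; in the study these would come from running MetaMap on each record.
patient_concepts = [
    ["C0004057", "C0011849", "C0027051"],   # e.g. aspirin, diabetes mellitus, myocardial infarction
    ["C0011849", "C0020538"],               # e.g. diabetes mellitus, hypertension
    ["C0004057", "C0020538", "C0027051"],
]

# Treat each concept list as a pre-tokenised document.
vectorizer = TfidfVectorizer(analyzer=lambda concepts: concepts, min_df=2)
tfidf_matrix = vectorizer.fit_transform(patient_concepts)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())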
Conclusion
This task aimed to identify whether a patient meets, does not meet, or possibly meets a selected
set of eligibility criteria based on their longitudinal records. The eligibility criteria come from
real clinical trials and focus on patients’ medications, past medical histories, and whether certain
events have occurred in a specified timeframe in the patients’ records. This task uses data from
the 2014 i2b2/UTHealth Shared-Tasks and Workshop on Challenges in Natural Language
Processing for Clinical Data. A literature review on the various machine learning languages and
techniques is also provided to offer insights into these various techniques and their applicability.
Different machine learning languages and techniques have different advantages and disadvantages, as has been seen in the above discussion; these should be taken into consideration before a choice is made of a machine learning language, technique or algorithm for identifying patients who meet the eligibility criteria for clinical trials.
References
Robert, C. (2014). Machine learning, a probabilistic perspective.
Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (2013). Machine learning: An artificial intelligence approach. Springer Science & Business Media.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical machine
learning tools and techniques. Morgan Kaufmann.
Quinlan, J. R. (2014). C4.5: Programs for machine learning. Elsevier.
Alpaydin, E. (2014). Introduction to machine learning. MIT Press.
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., & Kudlur, M. (2016, November). TensorFlow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).
Bache, K., & Lichman, M. (2013). UCI machine learning repository.
Liu, H., & Motoda, H. (2012). Feature selection for knowledge discovery and data mining (Vol. 454). Springer Science & Business Media.
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.
Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. Journal of Machine Learning Research, 13(Jan), 27-66.
Panahiazar, M., Taslimitehrani, V., Jadhav, A., & Pathak, J. (2014, October). Empowering personalized medicine with big data and semantic web technology: promises, challenges, and use cases. In Big Data (Big Data), 2014 IEEE International Conference on (pp. 790-795). IEEE.
Goeuriot, L., Kelly, L., Li, W., Palotti, J., Pecina, P., Zuccon, G., & Mueller, H. (2014, September). ShARe/CLEF eHealth evaluation lab 2014, task 3: User-centred health information retrieval. In Proceedings of CLEF 2014.
Rocktäschel, T., Weidlich, M., & Leser, U. (2012). ChemSpot: a hybrid system for chemical
named entity recognition. Bioinformatics, 28(12), 1633-1640.
Ferrucci, D., Levas, A., Bagchi, S., Gondek, D., & Mueller, E. T. (2013). Watson: Beyond Jeopardy! Artificial Intelligence, 199, 93-105.
Paik, J. H. (2013, July). A novel TF-IDF weighting scheme for effective ranking. In Proceedings
of the 36th international ACM SIGIR conference on Research and development in information
retrieval (pp. 343-352). ACM.
Hong, T. P., Lin, C. W., Yang, K. T., & Wang, S. L. (2013). Using TF-IDF to hide sensitive
itemsets. Applied Intelligence, 38(4), 502-510.
Trstenjak, B., Mikac, S., & Donko, D. (2014). KNN with TF-IDF based Framework for Text
Categorization. Procedia Engineering, 69, 1356-1364.
Del Río, S., López, V., Benítez, J. M., & Herrera, F. (2014). On the use of MapReduce for
imbalanced big data using Random Forest. Information Sciences, 285, 112-137.
Tuarob, S., Bhatia, S., Mitra, P., & Giles, C. L. (2013, August). Automatic detection of
pseudocodes in scholarly documents using machine learning. In Document Analysis and
Recognition (ICDAR), 2013 12th International Conference on (pp. 738-742). IEEE.
Iqbal, A., Ahmed, S. M., & Abu-Rub, H. (2012). Space vector PWM technique for a three-to-
five-phase matrix converter. IEEE Transactions on Industry Applications, 48(2), 697-707.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.