Knowledge Engineering: Rapid Miner

Table of Contents
Introduction
Part A - Business Case
1. Tweet Data Description
2. Knowledge Creation Techniques
2.1 Classification – Decision Tree
2.2 Clustering – SOM (Self Organizing Map)
2.3 Association Rule Mining – FP Growth Algorithm
3. External Source Knowledge Creations
Part – B Knowledge Creations
Conclusion
References
Introduction
In today’s world of the Internet and social media, people feel connected to the world and their loved ones by being on social platforms like Facebook, Twitter, Instagram and WhatsApp. The intention of this whole project is to use the Rapid Miner tool to evaluate the online activity of users on the basis of their tweet data. As Twitter, Flickr, Facebook and other social media platforms cover more and more applications, tools and websites, people have started using these platforms to share their feelings and experiences, gather information and news, and convey their opinions to family, friends and peers. We shall focus on one of the most popular social media and information platforms in use today: Twitter. It is an online news, information and social networking platform. Account holders post “tweets”, which are interactive and can carry attached photos, links and services. Twitter is a real-time social media service and records every post (tweet) a user makes on the platform. With the daily updates from all its users and the posts they keep tweeting, Twitter has become a “dynamic data streaming” system.
To understand and procure in-depth knowledge of the activity of online users who tweet and regularly use Twitter, a knowledge engineer makes use of knowledge creation techniques and knowledge representation to analyse these data and information. The different data sets and information packets are available from online pages, the Twitter API and similar sources. Our focus in this project is therefore to acquire in-depth knowledge of the users’ online activities by using their tweets and Twitter data and evaluating them (Challenges with Big Data Analytics, 2015).
Part A - Business Case
1. Tweet Data Description
The Twitter websites and servers collect the tweet data and information generated by a user with every tweet, and this data is used in the project. To understand and acquire in-depth knowledge of the user’s online activity, we shall evaluate the data generated by the user’s tweets. The data contains the attributes listed below, followed by a small loading sketch:
ID
Time
Content
UserId
UserHomeTown
Location
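As a point of reference for the operators discussed next, the small sketch below loads a tiny example of this layout into a pandas DataFrame; the rows and values are invented for illustration and are not the project's actual tweet export.

```python
# A hypothetical two-row sample of the tweet data layout described above.
import pandas as pd

tweets = pd.DataFrame([
    {"ID": 1, "Time": "2017-03-06 09:15", "Content": "Morning commute tweet",
     "UserId": 101, "UserHomeTown": "London", "Location": "UK"},
    {"ID": 2, "Time": "2017-03-06 22:40", "Content": "Late night thoughts",
     "UserId": 102, "UserHomeTown": "Paris", "Location": "France"},
])
print(tweets.dtypes)   # column types as loaded
print(tweets.head())
```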
2. Knowledge Creation Techniques
We shall now use knowledge creation techniques and methods to evaluate and study the user’s tweet data (C and Babu, 2016). The three knowledge creation techniques used for this process, all applied with the Rapid Miner tool, are:
1. Classification using a decision tree
2. Clustering using a Self Organizing Map (SOM)
3. Association rule mining using the FP Growth algorithm
We shall now discuss and study each of these in detail.
2.1 Classification – Decision Tree
The decision tree created for the tweet data is shown in the figure below. After creating the decision tree, we use the following operators in Rapid Miner:
a) Select Attributes
b) Decision Tree
c) Performance
The Select Attributes operator selects the subset of attributes of an example set and removes all the other attributes; to make attribute selection easy, the operator provides various filter types. The decision tree above is used for classification and regression (Comparative Study of K-NN, Naive Bayes and Decision Tree Classification Techniques, 2016). It is a tree-like set of nodes used to decide which class a value belongs to, or to estimate a numerical target value. For classification, rules are generated that separate the different classes; these rules are linked so that the error is minimised in an optimal way for the chosen parameter criteria.
The decision tree model predicts the class label attribute from the values of the examples that reach a leaf during generation; for numerical targets, the values of a leaf are averaged to obtain the prediction. To generate the decision tree model, the operator takes the selected training set as input and delivers a decision tree model and an example set as output: the model is delivered through the model output port, while the example set is passed through unchanged. A minimal sketch of the same pipeline outside RapidMiner is given below.
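As an illustration only, the following scikit-learn sketch mirrors the Select Attributes, Decision Tree and model-inspection steps described above; the column names and example values are assumptions standing in for the project's actual tweet data, and scikit-learn stands in for RapidMiner's own operators.

```python
# A minimal sketch (scikit-learn, not RapidMiner) of the same pipeline:
# select a subset of attributes, fit a decision tree, and print the tree.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical tweet-derived table; values are invented for illustration.
tweets = pd.DataFrame({
    "UserId":       [101, 102, 103, 104, 105, 106],
    "HourOfDay":    [9, 22, 10, 23, 8, 21],
    "UserHomeTown": ["London", "Paris", "London", "Paris", "London", "Paris"],
})

# "Select Attributes": keep only the predictor columns.
X = tweets[["UserId", "HourOfDay"]]
y = tweets["UserHomeTown"]          # class label attribute

# "Decision Tree": fit the classifier on the selected attributes.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```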
Performance evaluation of the selected models is carried out with the Performance operator, which delivers a performance vector, that is, a list of performance criteria values. The operator determines the learning task automatically and calculates the most common criteria for that particular category. For a polynominal classification task, the criteria used in the performance vector are accuracy and kappa statistics. A performance vector can optionally be supplied as a parameter, labelled data is expected as the example set input, and accuracy and kappa statistics are the resulting outputs. The models use the accuracy and kappa values of the performance vector to judge how well the output values are predicted.
Accuracy
The accuracy parameter displays the percentage of correct predictions. The image below shows the accuracy value obtained for the decision tree model, which is 0.0%.
Kappa
For the given tweet data, the kappa parameter measures the proportion of correct predictions corrected for chance agreement. The kappa value for the created decision tree model is 0.0, as displayed below.
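For reference, the short sketch below shows how these two criteria are computed outside RapidMiner, using scikit-learn on a made-up pair of true and predicted label lists; it illustrates the calculation only and does not reproduce the 0.0 values reported above.

```python
# Accuracy and Cohen's kappa on hypothetical labels (illustration only).
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = ["London", "Paris", "London", "Paris", "London", "Paris"]  # actual classes
y_pred = ["London", "London", "London", "Paris", "Paris", "Paris"]  # model predictions

# Accuracy: share of correct predictions.
print(accuracy_score(y_true, y_pred))      # 4 of 6 correct -> about 0.67

# Kappa: agreement corrected for chance, kappa = (p_o - p_e) / (1 - p_e),
# where p_o is the observed agreement and p_e the agreement expected by chance.
print(cohen_kappa_score(y_true, y_pred))   # about 0.33
```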
2.2 Clustering – SOM (Self Organizing Map)
Below is the image of the tweet data’s self-organizing map,
The following operators, based on the self-organizing map, are used in Rapid Miner:
Select Attributes
Self Organizing Map
Performance
The Select Attributes operator again selects the subset of attributes of the example set, removes all the other attributes, and provides different filter types to ease attribute selection (Repin et al., 2019). The Self Organizing Map operator performs dimensionality reduction of the provided tweet data once the required number of dimensions is specified. A SOM is a type of artificial neural network trained to produce a low-dimensional, discretized representation of the input space, known as a “map”. Also popularly called a Kohonen map, it provides useful low-dimensional views of high-dimensional data. Training and mapping are the two modes of operation of the SOM: the training mode builds the map from the input examples, while the mapping mode automatically classifies a new input vector. The nodes are the components of the SOM. The operator takes the example set as input and delivers an example set and a pre-processing model as output. Below is the image representation of the self-organizing map.
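As a rough, self-contained illustration of what the SOM does (not RapidMiner's implementation), the sketch below trains a tiny SOM in NumPy on a synthetic numeric matrix and maps each row to a position on a 2-D grid, which is the reduced representation; the grid size, learning rate and the data itself are arbitrary assumptions.

```python
# A minimal NumPy sketch of a Self-Organizing Map used for dimensionality reduction.
import numpy as np

def train_som(X, grid=(5, 5), epochs=50, lr0=0.5, sigma0=2.0, seed=42):
    """Train a small SOM; returns the weight grid of shape grid[0] x grid[1] x n_features."""
    rng = np.random.default_rng(seed)
    weights = rng.random((grid[0], grid[1], X.shape[1]))
    # Grid coordinates of every node, used for neighbourhood distances.
    coords = np.array([[i, j] for i in range(grid[0]) for j in range(grid[1])])
    coords = coords.reshape(grid[0], grid[1], 2)
    for epoch in range(epochs):
        lr = lr0 * np.exp(-epoch / epochs)        # decaying learning rate
        sigma = sigma0 * np.exp(-epoch / epochs)  # shrinking neighbourhood
        for x in X[rng.permutation(len(X))]:
            # Best-matching unit: node whose weight vector is closest to x.
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Pull every node towards x, weighted by its grid distance to the BMU.
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
            influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
            weights += lr * influence[..., None] * (x - weights)
    return weights

def map_to_grid(X, weights):
    """Return the 2-D grid coordinate (the reduced representation) of each row of X."""
    positions = []
    for x in X:
        dists = np.linalg.norm(weights - x, axis=2)
        positions.append(np.unravel_index(np.argmin(dists), dists.shape))
    return np.array(positions)

# Synthetic stand-in for numeric tweet features: 200 rows, 6 attributes.
X = np.random.default_rng(0).random((200, 6))
som_weights = train_som(X)
print(map_to_grid(X, som_weights)[:5])   # first five rows reduced to grid positions
```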
Below is the representation of the output of the self-organizing map and its statistical information.
Below is the chart visualization output of the self-organizing map, together with its statistics.
Below is the representation of the SOM dimensionality reduction model and its output,
Performance evaluation of the chosen model is again carried out by the Performance operator, which delivers a list of the performance criteria values as a performance vector. Accuracy and kappa statistics are among the criteria of the performance vector for a polynominal classification task. A performance vector can optionally be supplied as a parameter, labelled data is expected as the example set input, and accuracy and kappa statistics are the outputs.
Accuracy
The accuracy parameter displays the percentage of correct predictions. The accuracy value calculated for the created self-organizing map model is 0.0%, as represented below.
Kappa
For the given tweet data and information, the kappa parameter measures the proportion of correct predictions corrected for chance agreement. The kappa value for the created self-organizing map model is 0.0, as represented below.
2.3 Association Rule Mining – FP Growth Algorithm
Below is the representation of association rule mining with the FP Growth algorithm on the given tweet data. We use the following operators in Rapid Miner for the created FP Growth process:
Select Attributes
Discretize
Performance
FP Growth
Text to Nominal
The Select Attributes operator is used to carry out the following tasks in the project:
To select the subset of attributes of an example set
To remove all the other attributes
To provide various filter types that make attribute selection easy
The Text to Nominal operator changes the type of the selected text attributes to nominal; as the FP Growth algorithm accepts only nominal attributes, it maps the values of these attributes to their corresponding nominal values. The Discretize operator likewise changes the type of the selected numerical attributes to nominal. The FP Growth operator is then used to calculate all the frequently occurring item sets in the example set; it takes the example set as input and delivers the example set and the frequent item sets as output. The frequent item sets are used to create the association rules. A minimal sketch of the frequent-item-set step outside RapidMiner is given below, followed by the output of the FP Growth algorithm.
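Since the FP Growth operator is configured in the RapidMiner GUI, the following sketch uses the open-source mlxtend implementation of FP-Growth on a made-up nominal table purely to illustrate the frequent-item-set step; the column names, values and support threshold are assumptions, not the project's data.

```python
# A minimal FP-Growth sketch using mlxtend on invented nominal tweet attributes.
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

tweets = pd.DataFrame({
    "UserHomeTown": ["London", "London", "Paris", "London", "Paris"],
    "Location":     ["UK", "UK", "France", "UK", "France"],
})

# One-hot encode the nominal values, since this implementation expects boolean
# columns (mirroring the Text to Nominal / Discretize preparation above).
onehot = pd.get_dummies(tweets).astype(bool)

# Frequent item sets with a minimum support of 40%; association rules would
# then be generated from these sets.
frequent_sets = fpgrowth(onehot, min_support=0.4, use_colnames=True)
print(frequent_sets.sort_values("support", ascending=False))
```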
Frequent Item Sets
Below is the representation of the output of the frequent item sets of the tweet data.
Example Set
Below is the representation of the output of the example set of the tweet data.
Below is the representation of the statistical information about the created association rules.
Below is the representation of the chart visualization of the output of FP growth,
We now use the Performance operator to analyse the performance of the created model; it delivers a list of the performance criteria values. Accuracy and kappa statistics are the criteria of the performance vector for a polynominal classification task. Labelled data is expected as the example set input, a performance vector can optionally be supplied as a parameter, and accuracy and kappa statistics are the provided outputs.
Accuracy
The accuracy parameter displays the percentage of correct predictions. The accuracy value for the FP Growth association rule model is 0.0%, as represented below.
Kappa
On the provided tweet data, the kappa parameter measures the proportion of correct predictions corrected for chance agreement. The kappa value for the FP Growth association rule model is 0.0, as represented below.
We have thus used the Performance operator to analyse and evaluate the models, and on the basis of the overall behaviour we can trust the outcome and its results. For the created models, this is the operator that was used to deliver the evaluation results.
3. External Source Knowledge Creations
In today’s world of digital connectivity, decision making and intelligence gathering are largely based on social media, which constitutes the new age of spreading and acquiring information, as represented by the paper discussed here. As a primary and fundamental source of news and information, Twitter has become a very popular and common social media platform for knowledge discovery, through its tweets, messages, linked pictures and videos, and URL links (Singh, 2018). Researchers evaluate and analyse the large amount of data and information carried by the tweets via the application programming interface (TOFAN, 2014), and information from the Twitter API is used to perform the knowledge discovery. For knowledge acquisition and data study, the most useful keys of a tweet are listed below; a minimal extraction sketch follows the list.
Source - the website or application used to create the tweet.
Coordinates, place, geo - the location of the tweet's author.
Lang - the language of the tweet.
id - the unique identification of the tweet.
Text - the text of the tweet.
user - contains the following fields:
o name - the user name
o friends count - the number of friends
o location - the actual location
o description - additional information about the user
o time zone - the time zone of the computer or mobile device
o Created at - the account creation date and time
o Followers count - the number of followers
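As an illustration only (the dictionary below is invented and abbreviated, not a real API response), the sketch shows how these keys can be read from a tweet JSON object.

```python
# Extracting the keys described above from a hypothetical tweet JSON object.
import json

raw = '''{
  "id": 123456789,
  "text": "Learning Rapid Miner today!",
  "lang": "en",
  "source": "Twitter Web App",
  "coordinates": null,
  "user": {
    "name": "example_user",
    "friends_count": 120,
    "followers_count": 340,
    "location": "London",
    "description": "Data mining student",
    "time_zone": "Europe/London",
    "created_at": "Mon Mar 06 12:00:00 +0000 2017"
  }
}'''

tweet = json.loads(raw)
user = tweet["user"]
print(tweet["id"], tweet["lang"], tweet["source"])
print(user["name"], user["location"], user["followers_count"], user["created_at"])
```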
To analyse and evaluate the tweet data, the following data analysis methods were used in this academic paper:
Network information - network information, based on various types of relatedness, is used to connect the user accounts.
Sentiment analysis - used to determine whether tweets express negative or positive sentiment and to discuss current issues and complaints.
Maximum entropy - performs classification by relying on a probability distribution estimation technique.
Text mining - used to derive information from the text in the data set.
Collecting user information - used to describe the user information.
Geolocated information - characterizes the diversity of tweet users in terms of their locations.
Naive Bayes - finds the maximum probability of any word belonging to a particular given or predefined class.
Random Forest - performs classification starting from the root node and moving incrementally downward until a leaf node is reached.
In this academic paper, accuracy is defined on the basis of the data analysis as the number of correctly predicted cases against all predicted cases; because the predicted cases match the actual cases precisely, the paper reports 100% accuracy.
To understand and gain in-depth knowledge of user online activity, we have likewise used tweet data in our project to analyse and evaluate the tweets. For knowledge acquisition and data study, our tweet data has the following most important and useful keys:
ID - the unique identification of the tweet.
Time - the time at which the tweet was posted.
Content - the text of the tweet.
UserId - the unique identification of the user (from_user) who posted the tweet.
UserHomeTown - the home town of the user who posted the tweet.
Location - the actual location.
We have used the following three knowledge creation techniques:
Classification using a decision tree
Clustering using a Self Organizing Map (SOM)
Association rule mining using FP Growth, all with the Rapid Miner tool.
Based on our data analysis and evaluation, our research defines accuracy in the same way, as the correctly predicted cases against all predicted cases. As the predicted cases are not precisely the same as the actual cases, our model has 0.0% accuracy.
Part – B Knowledge Creations
In this part, we discuss the provided data set of diabetes patients and the interesting patterns observed in it. The provided diabetes patient information is obtained from two sources: paper records and automatic electronic recording devices. The automatic recording device had an internal clock to timestamp events, whereas the paper records only provided “logical time” slots, namely breakfast, lunch, dinner and bedtime. For paper records, fixed times were assigned to breakfast (08:00), lunch (12:00), dinner (18:00) and bedtime (22:00); thus paper records have fictitious uniform recording times whereas electronic records have more realistic timestamps. Diabetes files consist of four fields per record; each field is separated by a tab and each record is separated by a newline.
File names and format:
Date in MM-DD-YYYY format
Time in XX:YY format
Code
Value
A minimal parsing sketch is given after the code table below.
The Code field is deciphered as follows:
33 = Regular insulin dose
34 = NPH insulin dose
35 = Ultra Lente insulin dose
48 = Unspecified blood glucose measurement
57 = Unspecified blood glucose measurement
58 = Pre-breakfast blood glucose measurement
59 = Post-breakfast blood glucose measurement
60 = Pre-lunch blood glucose measurement
61 = Post-lunch blood glucose measurement
62 = Pre-supper blood glucose measurement
63 = Post-supper blood glucose measurement
64 = Pre-snack blood glucose measurement
65 = Hypoglycemic symptoms
66 = Typical meal ingestion
67 = More-than-usual meal ingestion
68 = Less-than-usual meal ingestion
69 = Typical exercise activity
70 = More-than-usual exercise activity
71 = Less-than-usual exercise activity
72 = Unspecified special event
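A minimal parsing sketch follows, assuming a tab-separated file laid out exactly as described above; the file name and the reduced CODE_MEANINGS dictionary are illustrative, not part of the original data set description.

```python
# Parse one diabetes record file (tab-separated: Date, Time, Code, Value).
import pandas as pd

CODE_MEANINGS = {          # a small subset of the code table above
    33: "Regular insulin dose",
    34: "NPH insulin dose",
    58: "Pre-breakfast blood glucose measurement",
    62: "Pre-supper blood glucose measurement",
    65: "Hypoglycemic symptoms",
}

records = pd.read_csv(
    "data-01",                              # hypothetical file name
    sep="\t",
    names=["Date", "Time", "Code", "Value"],
    dtype={"Date": str, "Time": str},
)

# Combine the date and clock time into one timestamp per record.
records["Timestamp"] = pd.to_datetime(
    records["Date"] + " " + records["Time"], format="%m-%d-%Y %H:%M"
)
records["Meaning"] = records["Code"].map(CODE_MEANINGS)
print(records.head())
```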
Conclusion
On the basis of the tweet data, this project successfully analysed and evaluated the users’ online activity by using the Rapid Miner tool. Twitter was selected as the main social media resource for this project. The study of the Twitter data was used to understand and gain in-depth knowledge of the users’ online activity through knowledge creation and representation and by providing detailed information. The Twitter API, tweets, online pages and more were the information sources used in this project study. To analyse the tweet data and gain in-depth knowledge of the users’ online activity, we used the entire gathered Twitter data of the users. We have also discussed and evaluated in detail the three knowledge creation techniques used in this project, namely:
1. Classification using a decision tree
2. Clustering using a Self Organizing Map (SOM)
3. Association rule mining using FP Growth, all within the Rapid Miner tool.
We have also used case studies from various academic research projects to compare and contrast each of the knowledge categories that we discussed, studied and analysed in detail.
References
C, K. and Babu, V. (2016). A Survey on Issues of Decision Tree and Non-Decision Tree Algorithms. International Journal of Artificial Intelligence and Applications for Smart Devices, 4(1), pp.9-32.
Challenges with Big Data Analytics. (2015). International Journal of Science and Research (IJSR), 4(12), pp.778-780.
Comparative Study of K-NN, Naive Bayes and Decision Tree Classification Techniques. (2016). International Journal of Science and Research (IJSR), 5(1), pp.1842-1845.
Repin, M., Pampou, S., Garty, G. and Brenner, D. (2019). RABiT-II: A Fully-Automated Micronucleus Assay System with Shortened Time to Result. Radiation Research, 191(3), p.232.
Singh, A. (2018). Customized biomedical informatics. Big Data Analytics, 3(1).
Tofan, C. (2014). Optimization Techniques of Decision Making - Decision Tree. Advances in Social Sciences Research Journal, 1(5), pp.142-148.