This document discusses the use of Rapid Miner in the field of Knowledge Engineering. It explores knowledge creation techniques including classification, clustering and association rule mining, presents a business case, and discusses knowledge creation from external sources.
University Semester – KNOWLEDGE ENGINEERING – RAPID MINER
Student ID:
Student Name:
Submission Date:
Table of Contents

Introduction
Part A - Business Case
1. Tweet Data Description
2. Knowledge Creation Techniques
2.1 Classification – Decision Tree
2.2 Clustering – SOM (Self Organizing Map)
2.3 Association Rule Mining – FP Growth Algorithm
3. External Source Knowledge Creations
Part B - Knowledge Creations
Conclusion
References
Introduction

In today's world of Internet and social media awareness, people feel connected to the world and their loved ones by being on social platforms such as Facebook, Twitter, Instagram and WhatsApp. The intention of this project is to use the Rapid Miner tool to evaluate the online activity of users on the basis of their tweet data. As Twitter, Flickr, Facebook and other social media platforms cover more and more applications, tools and websites, people have started using these platforms to share their feelings and experiences, gather information and news, and convey their opinions to their family, friends and peers.

We shall focus on one of the most popular social media and information platforms in use today: Twitter. Twitter is an online news, information and social networking platform. Account holders post interactive "tweets" to which they can also attach photos, links and other content. Twitter is a real-time social media service that records everything a user posts (tweets) on the platform. With daily updates from all its users and the posts they keep tweeting, Twitter has become a "dynamic data streaming" system. To understand and acquire in-depth knowledge of the activity of online users who tweet and regularly use Twitter, a knowledge engineer makes use of knowledge creation techniques and knowledge representation to analyse these data and information. These data sets and information packets are available from online pages, the Twitter API and similar sources. The focus of this project is therefore to acquire in-depth knowledge of users' online activities by evaluating their tweets and Twitter data (Challenges with Big Data Analytics, 2015).

Part A - Business Case

1. Tweet Data Description

The Twitter websites and servers collect all the tweet data and information generated by a user's tweets, and this data is used in this project. To understand and acquire in-depth knowledge of the user's online activity, we shall evaluate the data generated by the user's tweets. The data contains the following attributes:

ID
Time
Content
UserId
UserHomeTown
Location

2. Knowledge Creation Techniques

We shall now use knowledge creation techniques and methods to evaluate and study the user's tweet data (C and Babu, 2016). The three knowledge creation techniques used for this process are:
1. Classification as decision tree
2. Clustering as Self Organizing Map (SOM)
3. Association rule mining as FP Growth
carried out using the Rapid Miner tool. We shall now discuss and study each of these in detail.

2.1 Classification – Decision Tree

Below is the image of the decision tree for the tweet data.

After creating the decision tree, we use the following operators in Rapid Miner:
a) Select Attributes
b) Decision Tree
c) Performance

The Select Attributes operator selects the subset of attributes of an example set and removes the other attributes; to make attribute selection easy, the operator provides various filter types. The decision tree is used for both classification and regression (Comparative Study of K-NN, Naive Bayes and Decision Tree Classification Techniques, 2016).
A decision tree is a tree-like set of nodes used to estimate a numerical target value or to decide which class a value belongs to. Classification rules are generated to separate the different classes and to handle regression of values; the splits are linked so that the error is minimised in an optimal way for the chosen parameter criteria. To predict the class label attribute, the decision tree model depends on the examples that reach each leaf during tree generation; for numerical targets, the values in a leaf are averaged to obtain the predicted value. To generate the decision tree model, the selected training set is taken as input, and the outputs are a decision tree model and an example set. The decision tree model is delivered through the model output port, while the example set taken as input is passed through the output port unchanged.

Performance evaluation of the selected model is carried out with the Performance operator, which delivers a performance vector containing a list of performance criteria values. The operator automatically determines the learning task type and calculates the most common criteria for that category. For a polynominal classification task, criteria such as accuracy and Kappa statistics are included in the performance vector. The operator expects labelled data as an example set on its input, optionally takes an existing performance vector as a second input, and outputs accuracy and Kappa statistics, which are used to judge how well the model predicts the output values.

Accuracy

The accuracy parameter displays the percentage of correct predictions. The image below shows the accuracy value for the created decision tree model, which is 0.0%.

Kappa

For the given tweet data, the Kappa parameter measures the agreement between predicted and actual classes corrected for chance, rather than the simple percentage of correct predictions. The Kappa value for the created decision tree model is 0.0, as displayed below.
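To make the Select Attributes, Decision Tree and Performance chain more concrete, here is a minimal sketch of a comparable workflow in Python with pandas and scikit-learn rather than Rapid Miner. The file name "tweets.csv", the choice of feature columns and the use of UserHomeTown as the label are illustrative assumptions, not part of the original process.

```python
# A rough Python equivalent of the Select Attributes -> Decision Tree -> Performance
# chain described above. Assumes pandas and scikit-learn are installed and that the
# tweet data has been exported to a CSV file named "tweets.csv" (hypothetical name)
# with a nominal label column called "UserHomeTown" (also an assumption).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score

df = pd.read_csv("tweets.csv")  # hypothetical export of the tweet data

# "Select Attributes": keep a subset of columns and drop the rest
features = pd.get_dummies(df[["UserId", "Location"]].astype(str))  # one-hot encode nominal attributes
label = df["UserHomeTown"]

X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.3, random_state=42)

# "Decision Tree": train the model on the selected training set
tree = DecisionTreeClassifier(criterion="gini", max_depth=10, random_state=42)
tree.fit(X_train, y_train)

# "Performance": accuracy and Kappa, the same criteria reported by the performance vector
predictions = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
print("Kappa:   ", cohen_kappa_score(y_test, predictions))
```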
2.2 Clustering – SOM (Self Organizing Map)

Below is the image of the tweet data's self-organizing map.
The following operators, based on the self-organizing map, are used in Rapid Miner:
Select Attributes
Self Organizing Map
Performance

The Select Attributes operator creates the subset of attributes used from the example set; it provides different filter types and removes all the other attributes, easing the attribute selection process (Repin et al., 2019). The Self Organizing Map operator performs dimensionality reduction of the provided tweet data down to a specified number of dimensions. A self-organizing map is a type of artificial neural network that is trained to produce a low-dimensional, discretized representation of the input space of the training data, known as a "map". Also popularly called a Kohonen map, it provides useful visualization by giving low-dimensional views of high-dimensional data. Training and mapping are the two modes of operation for a SOM: the training mode builds the map from the input examples, while the mapping mode automatically classifies a new input vector. The components of the SOM are called "nodes". The operator takes an example set as input and delivers an example set together with a pre-processing model as output.
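As an illustration of the same idea outside Rapid Miner, the following sketch performs SOM training and mapping in Python, assuming the third-party MiniSom package (pip install minisom). The file name, feature columns and map size are illustrative assumptions.

```python
# A minimal sketch of SOM-based dimensionality reduction in Python, assuming the
# third-party MiniSom package rather than the RapidMiner operator. Column names
# and the file name are illustrative assumptions only.
import numpy as np
import pandas as pd
from minisom import MiniSom
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("tweets.csv")  # hypothetical export of the tweet data
numeric = pd.get_dummies(df[["UserId", "Location"]].astype(str)).to_numpy(dtype=float)
data = MinMaxScaler().fit_transform(numeric)  # SOMs work best on scaled inputs

# Train a 10 x 10 map: each node holds a weight vector with the same length as the input
som = MiniSom(x=10, y=10, input_len=data.shape[1], sigma=1.0, learning_rate=0.5, random_seed=42)
som.random_weights_init(data)
som.train_random(data, num_iteration=1000)  # training mode: build the map

# Mapping mode: each example is reduced to the 2-D coordinates of its best-matching node
coords = np.array([som.winner(row) for row in data])
print(coords[:5])  # first five tweets as (x, y) grid positions
```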
Below is the image representation of the self-organizing map.

Below is the output of the self-organizing map and its statistical information.

Below is the statistics chart visualization output for the self-organizing map.
Below is the representation of the SOM dimensionality reduction model and its output.

The performance evaluation of the chosen model is carried out by the Performance operator, which delivers a performance vector listing all the performance criteria values. Accuracy and Kappa statistics are among the criteria of the performance vector for a polynominal classification task. The operator optionally takes an existing performance vector as input, expects labelled data as an example set, and outputs accuracy and Kappa statistics.

Accuracy
The accuracy parameter displays the percentage of correct predictions. The accuracy value calculated for the created self-organizing map model is 0.0%, as represented below.

Kappa

For the given tweet data, the Kappa parameter measures the agreement between predicted and actual classes corrected for chance. The Kappa value for the created self-organizing map model is 0.0, as represented below.

2.3 Association Rule Mining – FP Growth Algorithm
Below is the representation of association rule mining with the FP Growth algorithm on the given tweet data.

We shall use the following operators in Rapid Miner for the FP Growth process:
Select Attributes
Text to Nominal
Discretize
FP Growth
Performance

The Select Attributes operator carries out the following tasks in the project:
- selecting the subset of attributes of an example set
- removing all the other attributes
- providing various filter types to make attribute selection easy

The Text to Nominal operator changes the type of the selected text attributes to nominal; because the FP Growth algorithm only accepts nominal attributes, it maps the values of these attributes to their corresponding nominal values. Likewise, the Discretize operator changes the type of selected numerical attributes to nominal. The FP Growth algorithm is then used to calculate all the frequently occurring item sets in the example set; it takes the example set as input and outputs the example set together with the frequent item sets. The frequent item sets are used to create the association rules.
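For comparison, here is a minimal sketch of the same FP Growth idea in Python, assuming the third-party mlxtend package (pip install mlxtend). Treating each tweet's words as one transaction, as well as the file name, column name and thresholds, are illustrative assumptions rather than values taken from the Rapid Miner process.

```python
# A minimal sketch of FP-Growth frequent item set mining and association rules in
# Python, assuming pandas and mlxtend instead of the RapidMiner operator.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

df = pd.read_csv("tweets.csv")  # hypothetical export of the tweet data

# Treat each tweet's words as one "transaction", mirroring the nominal encoding step
transactions = df["Content"].astype(str).str.lower().str.split().tolist()
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions), columns=encoder.columns_)

# Frequent item sets, then association rules derived from them
frequent_sets = fpgrowth(onehot, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_sets, metric="confidence", min_threshold=0.6)
print(frequent_sets.head())
print(rules[["antecedents", "consequents", "support", "confidence"]].head())
```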
Below is the representation of the output of the FP Growth algorithm.

Frequent Item Sets

Below is the representation of the frequent item sets output for the tweet data.

Example Set

Below is the representation of the example set output for the tweet data.

Below is the representation of the statistical information about the created association rules.
Below is the representation of the chart visualization of the FP Growth output.

We now use the Performance operator to analyse the performance of the created model. It delivers a list of all the performance criteria values. Accuracy and Kappa statistics are the criteria of the performance vector for a polynominal classification task. The operator expects labelled data as an example set on its input, optionally takes an existing performance vector, and outputs accuracy and Kappa statistics.

Accuracy
The accuracy parameter displays the percentage of correct predictions. The accuracy value for the created FP Growth association rule model is 0.0%, as represented below.

Kappa

On the provided tweet data, the Kappa parameter measures the agreement between predicted and actual classes corrected for chance. The Kappa value for the created FP Growth association rule model is 0.0, as represented below.
We have used the Performance operator to analyse and evaluate the created models, and on the basis of its overall results we can reasonably trust the outcome. For the created models, this is the operator that was used to report the effective results.

3. External Source Knowledge Creations

In today's world of digital connectivity, decision making and intelligence gathering are largely based on social media, which constitutes a new age of information spreading and acquisition, as represented by the paper reviewed here. As a primary and fundamental source of news and information, Twitter has become a very popular and common social media platform for knowledge discovery, through its tweets, messages, linked pictures/videos and URL links (Singh, 2018). Researchers evaluate and analyse the large amount of data and information in tweets through the application programming interface (Tofan, 2014): information from the Twitter API is used to perform the knowledge discovery. For knowledge acquisition and data study, tweets provide the following useful keys:

source - specifies the website or application used to create the tweet
coordinates, place, geo - specify the location of the tweet's author
lang - specifies the language of the tweet
id - specifies the unique identification of the tweet
text - specifies the text of the tweet
user - contains the following fields:
  name - the user name
  friends_count - the number of friends
  location - the actual location
  description - additional information about the user
  time_zone - the time zone of the user's computer or mobile
  created_at - the account creation date and time
  followers_count - the number of followers
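As a small illustration of these keys, the following sketch extracts them from a tweet object. The JSON document below is invented for demonstration; it only mirrors the field names listed above and is not real Twitter data.

```python
# A minimal sketch of pulling the keys listed above out of a tweet object returned
# by the Twitter API. The sample dictionary is a made-up illustration of the JSON
# structure, not real data; field names follow the key list described in the paper.
import json

raw = """{
  "id": 123456789,
  "text": "Learning Rapid Miner for knowledge engineering",
  "lang": "en",
  "source": "Twitter Web App",
  "coordinates": null,
  "user": {
    "name": "example_user",
    "friends_count": 120,
    "followers_count": 250,
    "location": "Melbourne",
    "description": "Student",
    "time_zone": "Australia/Melbourne",
    "created_at": "Mon Jan 01 00:00:00 +0000 2018"
  }
}"""

tweet = json.loads(raw)

# Flatten the fields used for knowledge acquisition into one record
record = {
    "id": tweet["id"],
    "text": tweet["text"],
    "lang": tweet["lang"],
    "source": tweet["source"],
    "user_name": tweet["user"]["name"],
    "followers": tweet["user"]["followers_count"],
    "location": tweet["user"]["location"],
}
print(record)
```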
To analyse and evaluate the tweet data, the following data analysis methods were used in the academic paper:

Network information - network information, based on various types of relatedness, is used to connect the user accounts.
Sentiment analysis - sentiment analysis is used to classify tweets as expressing negative or positive opinions and to discuss current issues and complaints.
Maximum Entropy - to perform classification, maximum entropy relies on a probability distribution estimation technique.
Text mining - text mining is used to derive information from the text of the tweets.
Collecting user information - used to collect and describe information about the users.
Geolocated information - characterises the diversity of tweet users in terms of their locations.
Naive Bayes - used to find the maximum probability of a word belonging to a particular predefined class; a small illustrative sketch of this approach is given at the end of this section.
Random Forest - used to perform classification, starting from the root node and moving incrementally downward until a leaf node is reached.

Based on its data analysis, the academic paper defines accuracy as the number of correctly predicted cases against all predicted cases; because its predicted cases match the actual cases precisely, it reports 100% accuracy. To understand and gain in-depth knowledge of user online activity, our project has likewise used tweet data to analyse and evaluate the tweets. For knowledge acquisition and data study, our tweet data has the following important and useful keys:

ID - specifies the unique identification of the tweet
Time - specifies the time at which the tweet was posted
Content - specifies the text of the tweet
UserId - specifies the unique identification of the user who posted the tweet (from_user)
UserHomeTown - specifies the home town of the user
Location - specifies the actual location

We have used the three knowledge creation techniques, namely:
Classification as decision tree
Clustering as Self Organizing Map (SOM)
Association rule mining as FP Growth
by using the Rapid Miner tool.
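As promised above, here is a small illustrative sketch of the Naive Bayes text-classification approach, assuming scikit-learn. The training tweets and labels are invented for demonstration and are not taken from the reviewed paper or from our tweet data.

```python
# An illustrative sketch of the Naive Bayes / text mining approach mentioned above,
# assuming scikit-learn. The tiny training set below is invented for demonstration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_tweets = [
    "I love this new phone, great battery life",
    "fantastic service, very happy today",
    "this is terrible, worst experience ever",
    "so disappointed and angry about the delay",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words text mining followed by a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_tweets, train_labels)

print(model.predict(["really happy with the great battery"]))  # prints ['positive']
print(model.predict(["angry about this terrible delay"]))      # prints ['negative']
```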
Our own analysis, using the same definition of accuracy as the number of correctly predicted cases against all predicted cases, shows that the predicted cases are not precisely the same as the actual cases, so it has 0.0% accuracy.

Part B - Knowledge Creations

In this part, we discuss the provided dataset of diabetes patients and the interesting patterns observed in it. The provided diabetes patient information is obtained from two sources: paper records and automatic electronic recording devices. The automatic recording device had an internal clock to timestamp events, whereas the paper records only provided "logical time" slots (breakfast, lunch, dinner and bedtime). For paper records, fixed times were assigned to breakfast (08:00), lunch (12:00), dinner (18:00) and bedtime (22:00). Thus paper records have fictitious uniform recording times, whereas electronic records have more realistic timestamps.

Diabetes files consist of four fields per record. Each field is separated by a tab and each record is separated by a newline.

File names and format:
Date in MM-DD-YYYY format
Time in XX:YY format (hours:minutes)
Code
Value

The Code field is deciphered as follows (a parsing sketch is given after this list):
33 = Regular insulin dose
34 = NPH insulin dose
35 = UltraLente insulin dose
48 = Unspecified blood glucose measurement
57 = Unspecified blood glucose measurement
58 = Pre-breakfast blood glucose measurement
59 = Post-breakfast blood glucose measurement
60 = Pre-lunch blood glucose measurement
61 = Post-lunch blood glucose measurement
62 = Pre-supper blood glucose measurement
63 = Post-supper blood glucose measurement
64 = Pre-snack blood glucose measurement
65 = Hypoglycemic symptoms
66 = Typical meal ingestion
67 = More-than-usual meal ingestion
68 = Less-than-usual meal ingestion
69 = Typical exercise activity
70 = More-than-usual exercise activity
71 = Less-than-usual exercise activity
72 = Unspecified special event
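As a brief illustration of the file format just described, the following sketch reads one patient file and attaches a human-readable label to each code, assuming pandas. The file name "data-01" and the partial code dictionary are placeholders for the provided patient files and the full code table above.

```python
# A minimal sketch of reading one of the tab-separated diabetes record files and
# labelling each code. "data-01" is a placeholder name for one patient file.
import pandas as pd

CODE_LABELS = {
    33: "Regular insulin dose",
    34: "NPH insulin dose",
    35: "UltraLente insulin dose",
    58: "Pre-breakfast blood glucose measurement",
    60: "Pre-lunch blood glucose measurement",
    62: "Pre-supper blood glucose measurement",
    65: "Hypoglycemic symptoms",
    # ... remaining codes from the table above
}

# Four tab-separated fields per record: Date, Time, Code, Value
records = pd.read_csv(
    "data-01", sep="\t", header=None,
    names=["Date", "Time", "Code", "Value"],
)
records["Timestamp"] = pd.to_datetime(
    records["Date"] + " " + records["Time"], format="%m-%d-%Y %H:%M", errors="coerce"
)
records["Event"] = records["Code"].map(CODE_LABELS)

print(records.head())
```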
Conclusion

On the basis of the tweet data, this project successfully analysed and evaluated user online activity using the Rapid Miner tool. Twitter was selected as the main social media resource for this project. The study and research of the Twitter data was used to understand and gain in-depth knowledge of user online activity through knowledge creation and representation, and to provide detailed information. The Twitter API, tweets, online pages and other sources were the information sets used in this project study. To analyse the tweet data and gain in-depth knowledge of user online activity, we used the entire gathered user Twitter data. We also discussed and evaluated in detail the three knowledge creation techniques used in this project, namely:
1. Classification as decision tree
2. Clustering as Self Organizing Map (SOM)
3. Association rule mining as FP Growth
by using the Rapid Miner tool. We have also used case studies from academic research to compare and contrast each of the knowledge categories that we discussed, studied and analysed in detail.
References

C, K. and Babu, V. (2016). A Survey on Issues of Decision Tree and Non-Decision Tree Algorithms. International Journal of Artificial Intelligence and Applications for Smart Devices, 4(1), pp. 9-32.

Challenges with Big Data Analytics. (2015). International Journal of Science and Research (IJSR), 4(12), pp. 778-780.

Comparative Study of K-NN, Naive Bayes and Decision Tree Classification Techniques. (2016). International Journal of Science and Research (IJSR), 5(1), pp. 1842-1845.

Repin, M., Pampou, S., Garty, G. and Brenner, D. (2019). RABiT-II: A Fully-Automated Micronucleus Assay System with Shortened Time to Result. Radiation Research, 191(3), p. 232.

Singh, A. (2018). Customized biomedical informatics. Big Data Analytics, 3(1).

Tofan, C. (2014). Optimization Techniques of Decision Making - Decision Tree. Advances in Social Sciences Research Journal, 1(5), pp. 142-148.