Relational Data Visualization and Analysis: A Gephi Case Study

Added on 2023/06/14

AI Summary

This report details the use of Gephi for relational data visualization, focusing on the analysis of a BigMart sales dataset. It covers the objective of visualizing data for business insights, introduces the conceptual model in data visualization, and explores the dataset using Gephi. The report discusses data cleaning techniques, model building with a random forest algorithm, and the advantages and disadvantages of data visualization. The analysis reveals key features like Item_MRP, Outlet_Type, and Outlet_Location_Type as important factors influencing sales. Critical thinking on the results highlights the importance of store location and product pricing in achieving higher sales. The document concludes with references to relevant literature, providing a comprehensive overview of relational data visualization using Gephi for practical data analysis.

“Relational Data Visualization”
Data Analysis using Gephi

Table of Contents
Objective
Introduction
Conceptual Model in Data Visualization
Data Exploration
Data Cleaning
Model Building
Advantages of Data Visualization
Disadvantages of Data Visualization
Analysis
Critical Thinking
References

Objective
Big data visualization plays a very important role in today's world. Data can be
visualized according to the scenarios that a business or organization needs to examine.
A picture of the data is worth a thousand words.
Data visualization helps turn big data into a pictorial view of the data and its
values. Relational data visualization can be integrated with applications so that this
work is done in real time.
Current data visualization approaches are expected to give good results across
applications and technologies.
The volume of data generated in a myriad of disciplines is rapidly increasing,
typically faster than the techniques available to manage and use the resultant data.
Computer-based visualization is changing rapidly over time. The tools and
systems that support it have typically evolved rather than being formally designed.

Introduction
This report surveys the current state of research on relational data visualization,
reviewing work in several related fields: data visualization, information visualization,
statistics, graphic design and human-computer interaction.
Data visualization is a rapidly growing industry in the current decade, presenting
data in the form of information and graphics.
Visualization is not confined to any one industry. It is used in many settings,
including education, social media, information technology, data science, artificial
intelligence, automobile, construction and communication (Agrawal, R. & Ailamaki, A.
2008). People visualize and predict data on dashboards and in the form of
reports.
Conceptual Model in Data Visualization
Data visualization techniques have changed over time. Visualization involves
the integration of graphics, images, data management
and human perception.
Relational Model : Structure is the most important ingredient in any data model. One of
the major contributions of Codd's relational model is its focus on the importance of
functional dependencies (Anthes, G. 2010). In fact, normalization is driven by a
modeler's desire for relations in which strict functional dependencies apply.
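To make the idea of a functional dependency concrete, the following is a minimal sketch of how one might test whether a dependency X -> Y holds in a small table. The function name holds_fd, the column names and the sample rows are all hypothetical and are not taken from the BigMart dataset.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if every value of the lhs columns maps to one value of the rhs columns."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        value = tuple(row[col] for col in rhs)
        # setdefault stores the first value seen for this key and returns it afterwards
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical outlet rows, for illustration only
outlets = [
    {"outlet_id": "OUT01", "outlet_type": "Supermarket Type1", "city_tier": "Tier 1"},
    {"outlet_id": "OUT02", "outlet_type": "Supermarket Type2", "city_tier": "Tier 3"},
    {"outlet_id": "OUT01", "outlet_type": "Supermarket Type1", "city_tier": "Tier 1"},
    {"outlet_id": "OUT03", "outlet_type": "Supermarket Type1", "city_tier": "Tier 2"},
]

# outlet_id determines outlet_type in these rows
fd_id_to_type = holds_fd(outlets, ["outlet_id"], ["outlet_type"])
# outlet_type does not determine city_tier: Type1 appears with two different tiers
fd_type_to_tier = holds_fd(outlets, ["outlet_type"], ["city_tier"])
```

A dependency that fails such a check, like outlet_type -> city_tier here, is the kind of column pairing that normalization would keep out of a single relation's key structure.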

Dataset :
After importing the dataset into Gephi, the graph contains 120 nodes and 978 edges. That
is, the 120 nodes are connected to one another by 978 edges.
Figure 1:
The dataset has multiple columns, which are in relational data format.
The figure gives an overview of the whole BigMart sales dataset in the Gephi data
visualization tool. In this view every relationship is drawn as a connection between nodes.
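The report does not say how the nodes and edges were derived from the sales table. As a hedged sketch, one possible preparation step is shown below: items and outlets become nodes, and each sales row becomes an edge. Gephi's spreadsheet import reads a node table with Id and Label columns and an edge table with Source and Target columns (Weight is optional). The sample rows, the helper name to_gephi_tables and the choice of sales as edge weight are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sales rows: (item identifier, outlet identifier, sales value)
sales_rows = [
    ("ITEM_A", "OUT_1", 120.0),
    ("ITEM_B", "OUT_2", 45.5),
    ("ITEM_C", "OUT_1", 80.0),
]


def to_gephi_tables(rows):
    """Build node and edge tables as CSV text that Gephi's spreadsheet import can read."""
    node_ids = sorted({item for item, _, _ in rows} | {outlet for _, outlet, _ in rows})

    node_buf = io.StringIO()
    node_writer = csv.writer(node_buf, lineterminator="\n")
    node_writer.writerow(["Id", "Label"])
    for node_id in node_ids:
        node_writer.writerow([node_id, node_id])

    edge_buf = io.StringIO()
    edge_writer = csv.writer(edge_buf, lineterminator="\n")
    edge_writer.writerow(["Source", "Target", "Weight"])
    for item, outlet, sales in rows:
        edge_writer.writerow([item, outlet, sales])

    return node_buf.getvalue(), edge_buf.getvalue()


nodes_csv, edges_csv = to_gephi_tables(sales_rows)
```

Writing the two strings to files gives a node table and an edge table that can be loaded one after the other in Gephi's Data Laboratory.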
Used Tools and Techniques :
The tool used in this report is Gephi.

Gephi is data visualization software for relational data, built on Java and the
NetBeans platform. It is an open-source application used for network analysis and
visualization.
The graph below shows the BigMart sales relational data as nodes and branches.
Figure 2:
Gephi is a specialized visualization tool that draws linked nodes close together
and places unlinked nodes far apart.
The black dots show the stores with the highest sales, which are directly connected to
each other. Stores are connected to one another through light branches, while the dark
black branches connect the highest-sales stores.
Gephi provides filter and scaling parameters that can be applied to the nodes by
connection and degree.
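To illustrate what a degree-based filter computes, here is a small sketch in plain Python. It is not Gephi's implementation; the edge list and the function names are hypothetical. The degree of a node is the number of edges that touch it, and the filter keeps only edges whose two endpoints both reach a minimum degree.

```python
from collections import Counter

# Hypothetical undirected edge list
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]


def node_degrees(edge_list):
    """Count how many edges touch each node."""
    degrees = Counter()
    for source, target in edge_list:
        degrees[source] += 1
        degrees[target] += 1
    return degrees


def filter_by_degree(edge_list, min_degree):
    """Keep edges whose endpoints both have at least min_degree in the original graph."""
    degrees = node_degrees(edge_list)
    keep = {node for node, degree in degrees.items() if degree >= min_degree}
    return [(s, t) for s, t in edge_list if s in keep and t in keep]


degrees = node_degrees(edges)
core_edges = filter_by_degree(edges, min_degree=2)
```

This is a single-pass filter: degrees are measured once on the original graph and are not recomputed after nodes are dropped.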

Sales View : The user model is the sub-part of computer communication that describes the
process of building relations and refining a basic understanding of the user. It acts as
the reference model for interacting with users' details and data activities.
Figure 1.3
The graph above shows two types of market: supermarket 1 and supermarket 2.
Sales are higher at supermarket 1 than at supermarket 2. The red dots represent
supermarket 2, whereas the blue dots represent supermarket 1.
Computation Sales View : It takes the form of an algorithm, that is, a precise description
of the steps that are carried out.
The algorithm takes a set of inputs and eventually turns them into an output. The
computation model can be implemented in Python, C, C++, Fortran and many other languages
(Thomsen, E. 2006).
We only need to write an algorithm that processes the job workflow in the system.

Figure 1.4
The graph shows the visibility of the stores based on location and sales. Here filters
and parameters are applied to the dataset to obtain the actual sales values for each
outlet or store.
Dataset for Visualization technique :
We use the "BigMart sales prediction" dataset. The following topics are discussed
below:
• Hypothesis generation
• Data Exploration
• Data Cleaning
• Feature Selection
• Model building
Hypothesis generation : This is a very important step in the process of analyzing data. It
involves understanding the problem and forming hypotheses about which factors could
have a good impact on the outcome.
We then test these hypotheses against the dataset we have and draw insights
from it.

We have collected sales data for the year 2013, covering 1600 products across about 10
stores in different cities.
The target is to build a predictive model that can estimate the sales of each product
at a particular store, using visualization techniques.
Dataset variables which we have to define -
A. Store level dataset variables :
City type : Where the store is located. Stores in urban or tier 1 cities are expected
to have higher sales.
Population Density : Stores located in densely populated areas should have higher
sales because demand is greater.
Store Capacity : Stores which are very big in size should have higher sales.
Competitors : Stores with more competitors in their market should have lower sales.
Product Marketing : Stores which have a good marketing division should have good
sales.
Location : Stores located in popular marketplaces should have higher sales
because of better connectivity with customers.
Customer behaviors : Stores should keep the right set of products based on their
customers' behavior.
Policy : Stores that are managed with clear rules and policies, and that treat people
politely, should have higher sales.
B. Product level dataset variables :
Brand : Good-quality branded products should have good sales.
Packaging : Products with good packaging can attract customers and sell more.
Utility : Routine, daily-use products should have higher sales.
Display area : Products given a prominent display area catch the customers'
attention and should sell more.
Visibility of Store : The location of the store should have an impact on sales.
Advertisement : Better advertisement of products should give higher sales.
Promotional Offer : Discounts on selected products should increase sales.

Data Exploration :
In this phase we explore the data to draw inferences about it.
Data exploration is the technique used to identify the predictor and target variables
in the data (Stolte, C. & Tang, D. 2009).
We then look for the features we hypothesized. We combine the training and testing
data into a single data frame so that feature extraction can be performed on both at
once (Mansmann, S. & Scholl, M.H. 2007).
The procedure below combines the data :
import pandas as pd
training['data'] = 'training'  # mark each row with its origin
testing['data'] = 'testing'
data_set = pd.concat([training, testing], ignore_index=True)
The main challenge in any data set is missing values, which can affect the predicted
sales and target customers.
Next we check for missing values by counting the nulls in each column.
data_set.isnull().sum()

In the BigMart sales data the target variable is Item_Outlet_Sales, and its missing
values are simply the rows from the testing set. We will therefore impute only the
missing values in Item_Weight and Outlet_Size, during the data cleaning process.
Data plays an important role in every relational database system when predicting future
results (Bhattacharya, I. & Getoor, L. 2006).
Next we check the basic statistics of the data. The variables in the dataset are the
ones used to predict the output.
The table below gives a clear picture of the sales and target variables.
data_set.describe()
From the summary we can observe that :
• The variable Item_Visibility has a minimum value of zero in the table above. This
makes no practical sense: when a product is being sold in a store, its visibility cannot be 0.
• The variable Outlet_Establishment_Year is not constant. It varies from 1985 to
2009.
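One possible way to act on these two observations is sketched below. It treats a zero Item_Visibility as a missing value and replaces it with the mean of the non-zero values, and it converts Outlet_Establishment_Year into an outlet age relative to 2013, the year of the sales data. Both choices, the derived column name Outlet_Years and the sample values are assumptions for illustration. The report does not state which treatment was actually applied.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows using the column names from the report
data_set = pd.DataFrame({
    "Item_Visibility": [0.0, 0.05, 0.10, 0.0],
    "Outlet_Establishment_Year": [1985, 1999, 2009, 2004],
})

# Treat zero visibility as missing, then fill with the mean of the non-zero values
visibility = data_set["Item_Visibility"].replace(0, np.nan)
data_set["Item_Visibility"] = visibility.fillna(visibility.mean())

# Express the establishment year as the outlet's age in 2013 (assumed derived feature)
data_set["Outlet_Years"] = 2013 - data_set["Outlet_Establishment_Year"]
```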
Data Cleaning :
In this phase missing values are imputed and outliers are treated. Outlier
removal is also important in visualization techniques.
Figure 1.5
Our dataset contains some missing values, so we first apply data cleaning techniques,
then use a machine learning model to predict insights from the data, and finally
visualize the data on a dashboard.
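The imputation step for Item_Weight and Outlet_Size can be sketched as follows. This is a minimal example that assumes mean imputation for the numeric column and mode imputation for the categorical column. The report names the two columns but not the imputation method, and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows with one missing value in each column
data_set = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 8.2],
    "Outlet_Size": ["Medium", None, "Small", "Medium"],
})

# Numeric column: fill gaps with the column mean (assumed strategy)
data_set["Item_Weight"] = data_set["Item_Weight"].fillna(data_set["Item_Weight"].mean())

# Categorical column: fill gaps with the most frequent value (assumed strategy)
most_common_size = data_set["Outlet_Size"].mode()[0]
data_set["Outlet_Size"] = data_set["Outlet_Size"].fillna(most_common_size)

# Confirm that no missing values remain
missing_after = data_set.isnull().sum()
```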

Model Building :
The data is now ready, so we can apply a machine learning algorithm to it and obtain
output based on the selected model. Here we use the machine learning model that fits
best, a random forest.
The model identifies the most informative features in the dataset. We can see that
Item_MRP is the most insightful feature for explaining the data and the sales.
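How a random forest ranks features can be sketched with scikit-learn's RandomForestRegressor and its feature_importances_ attribute. The data below is synthetic and is deliberately generated so that the target depends mostly on Item_MRP, so the ranking it produces only demonstrates the mechanism and does not reproduce the report's result. The feature set, sample size and model settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic features (assumed for the demo, not the real BigMart data)
features = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, n_rows),
    "Item_Visibility": rng.uniform(0.01, 0.2, n_rows),
    "Outlet_Years": rng.integers(4, 29, n_rows),
})

# Synthetic target built to depend mainly on Item_MRP, plus noise
sales = 15 * features["Item_MRP"] + rng.normal(0, 50, n_rows)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(features, sales)

# Importances sum to 1; larger values mean the feature drove more of the splits
importances = pd.Series(
    model.feature_importances_, index=features.columns
).sort_values(ascending=False)
```

On the real dataset the same two calls, fit and feature_importances_, would be applied to the cleaned data frame with Item_Outlet_Sales as the target.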
Advantages of Data Visualization :
a) Relevant Business Insights : Using data visualization improves a business
organization's ability to find the information it needs, and the quality of that information.
According to the study, managers in companies that use visualization techniques get 30%
more accurate and timely information (Schulz, H.-J. & Treevis 2011).