Web Crawling: Measuring Web Trends for Increased Business Productivity
Literature Review
AI Summary
This literature review investigates the use of web crawling techniques for analyzing web trends and improving business productivity. It examines various approaches, including machine learning algorithms for modeling web traffic, big data analytics for forecasting tourism destination arrivals, and the PolarHub web crawling engine for geospatial data discovery. The review discusses the advantages and limitations of each technique, highlighting their applicability in different contexts. Key themes include identifying bot traffic, analyzing social media data, and leveraging web search queries for business insights. The reviewed solutions aim to address challenges such as non-stationary web traffic, overcrowded tourist destinations, and the efficient retrieval of geospatial resources, ultimately contributing to increased business efficiency and informed decision-making.

Literature Review (Secondary Research)
Student Name & CSU
ID
Project Topic Title: Analysing websites and measuring web trends using a web crawler based on page tagging to increase the productivity of businesses

Version 1.0 _ Week 1 (5 Journal Papers from CSU Library)
1
Reference in APA format that will be in 'Reference List': Suchacka, G., & Wotzka, D. (2017). Modeling a non-stationary bots' arrival process at an e-commerce Web site. Journal of Computational Science, 22, 198-208.
Citation that will be in the content: (Suchacka & Wotzka, 2017)
URL of the Reference: https://www-sciencedirect-com.ezproxy.csu.edu.au/science/article/pii/S0165032717323510
Level of Journal (Q1, Q2, …Qn): Q1
Keywords in this Reference: Web traffic analysis and modeling, Web traffic characterization, Log file analysis, Web server, Internet robot, Web bot, Modeling and simulation, Regression analysis, Heavy-tailed distribution
The Name of the Current Solution (Technique/ Method/ Scheme/ Algorithm/ Model/ Tool/ Framework/ ... etc)
Technique/Algorithm name: Machine learning; Internet marketing; Online payments
Tools: Not specified
Applied Area: Web applications

The Goal (Objective) of this Solution & the Problem that needs to be solved
Problem: Modeling the real arrival process of bot requests at the web server of an e-commerce site.
Goal: To model non-stationary web traffic by splitting the observed traffic into chunks and fitting models to each chunk (a small illustrative sketch follows the components list below).

What are the components of it?
Machine learning
Web applications

Mathematical model
Simulation model
Distribution model
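To make the chunk-based idea concrete, the following is a minimal sketch (not the authors' code). It assumes request timestamps have already been extracted from the server log, splits them into fixed-length chunks, and fits a simple exponential inter-arrival distribution to each chunk; the paper itself works with heavier-tailed distributions and regression over the fitted parameters.

```python
# Illustrative sketch only (not from Suchacka & Wotzka, 2017): fit a simple
# inter-arrival distribution per time chunk of a non-stationary request stream.
import numpy as np
from scipy import stats

def fit_chunks(timestamps, chunk_seconds=3600):
    """Split request timestamps (seconds, sorted) into fixed-length chunks
    and fit an exponential inter-arrival distribution to each chunk."""
    timestamps = np.sort(np.asarray(timestamps, dtype=float))
    t0, t_end = timestamps[0], timestamps[-1]
    models, start = [], t0
    while start < t_end:
        chunk = timestamps[(timestamps >= start) & (timestamps < start + chunk_seconds)]
        if len(chunk) > 2:
            inter = np.diff(chunk)                          # inter-arrival times in this chunk
            loc, scale = stats.expon.fit(inter, floc=0.0)   # exponential rate = 1 / scale
            models.append({"start": start, "rate": 1.0 / scale, "n": len(chunk)})
        start += chunk_seconds
    return models

# Example: a synthetic, intentionally non-stationary arrival stream.
rng = np.random.default_rng(0)
ts = np.cumsum(rng.exponential(scale=2.0, size=500))                               # busy hour
ts = np.concatenate([ts, ts[-1] + np.cumsum(rng.exponential(10.0, size=200))])     # quiet hour
print(fit_chunks(ts))
```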
The Process (Mechanism) of this Work; The process steps of the Technique/system
Process Steps | Advantage (Purpose of this step) | Disadvantage (Limitation/Challenge)
1. E-commerce website: a website on which goods are purchased or sold online. Advantage: all processing follows the structure of the website and the users' requests. Disadvantage: with many simultaneous users, the server experiences heavy traffic.
2. Constructing session-based web traffic: sessions are built from each user's requests. Advantage: sessions become active as requests arrive and expire after a period of inactivity. Disadvantage: N/A
3. Identifying bot traffic: each session is examined, one request at a time, and a session is labeled a bot session as soon as any one detection condition is satisfied. Advantage: a single matching condition is enough to flag a bot session (an illustrative sketch of steps 2 and 3 follows the validation table below). Disadvantage: N/A
Validation Criteria (Measurement Criteria)
Dependent Variable | Independent Variable
Bot arrival process | Time-zone
Web traffic analysis | Web server log data
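The sketch below illustrates steps 2 and 3 under simplifying assumptions: log records are grouped into per-client sessions by an inactivity timeout, and a session is flagged as a bot session as soon as any one condition holds (here, a robots.txt request or a bot-like user agent string). The field names, the 30-minute timeout, and the two conditions are illustrative assumptions, not the detection rules used in the paper.

```python
# Illustrative sketch (assumed log fields, not the paper's exact rules):
# build per-client sessions from web-server log records and flag bot sessions.
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60          # 30-minute inactivity ends a session
BOT_AGENT_HINTS = ("bot", "crawler", "spider")

def build_sessions(records):
    """records: iterable of dicts with 'ip', 'time' (epoch seconds),
    'path' and 'user_agent' keys, sorted by time."""
    sessions = defaultdict(list)            # (ip, session_index) -> list of records
    last_seen, index = {}, defaultdict(int)
    for r in records:
        ip = r["ip"]
        if ip in last_seen and r["time"] - last_seen[ip] > SESSION_TIMEOUT:
            index[ip] += 1                  # timeout expired: start a new session
        last_seen[ip] = r["time"]
        sessions[(ip, index[ip])].append(r)
    return sessions

def is_bot_session(session):
    """A session is labeled a bot session if any single condition is met."""
    for r in session:
        if r["path"] == "/robots.txt":
            return True
        if any(h in r["user_agent"].lower() for h in BOT_AGENT_HINTS):
            return True
    return False

# Usage:
# sessions = build_sessions(parsed_log)
# bots = {key: s for key, s in sessions.items() if is_bot_session(s)}
```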

Input and Output | Critical Thinking: Feature of this work, and Why (Justify) | Critical Thinking: Limitations of the research/current solution, and Why (Justify)
Input (Data): Multiple user logins (web server log records).
Output (View): Web bot traffic is analyzed, showing the share of bot requests in the overall traffic on the web server.
Feature: This work mainly helps in applicability with respect to analytical modeling, benchmarking and the sending of data packets. The model can reproduce a stream of bot requests and generate synthetic bot traffic for use in simulation experiments.
Limitations: The results cannot be generalized to all e-commerce websites. Accurate modeling is not achieved at the standard time resolution, and delays of a few seconds can introduce errors. Fitting the real data to the regression function is also an issue.
(Describe the research/current solution) | Evaluation Criteria | How this research/current solution is valuable for your project
Description: Web bot traffic is analyzed and shown to account for a large share of the total traffic observed on an e-commerce web server. Bots access the site in ways that differ from human users.
Evaluation Criteria: It helps analyze the web traffic on e-commerce sites and reduce its impact.
Value for the project: The model produces bot requests whose features resemble real web traffic, and it can generate bursty synthetic bot traffic for simulation experiments. It is also useful for other online stores that run online promotion.
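As a rough illustration of what "bursty synthetic bot traffic" can look like, the sketch below draws inter-arrival gaps from a heavy-tailed Pareto distribution, which yields bursts of closely spaced requests separated by long pauses. The distribution choice and its parameters are assumptions for illustration only; the paper derives its distributions from measured log data.

```python
# Illustrative sketch: generate bursty synthetic bot arrival times by drawing
# inter-arrival gaps from a heavy-tailed Pareto distribution (assumed parameters).
import numpy as np

def synthetic_bot_arrivals(n_requests=1000, shape=1.5, scale=0.5, seed=42):
    rng = np.random.default_rng(seed)
    # Pareto-distributed gaps: many tiny gaps (bursts) plus occasional long pauses.
    gaps = scale * (rng.pareto(shape, size=n_requests) + 1.0)
    return np.cumsum(gaps)          # absolute arrival times in seconds

arrivals = synthetic_bot_arrivals()
gaps = np.diff(arrivals)
print(f"{len(arrivals)} requests over {arrivals[-1]:.0f} s, "
      f"median gap {np.median(gaps):.2f} s, max gap {gaps.max():.1f} s")
```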
Diagram/Flowchart

Figure: Flow diagram of design

2
Reference in APA format that will be in 'Reference List': Liu, Y. Y., Tseng, F. M., & Tseng, Y. H. (2018). Big Data analytics for forecasting tourism destination arrivals with the applied Vector Autoregression model. Technological Forecasting and Social Change, 130, 123-134.
Citation that will be in the content: (Liu et al., 2018)
URL of the Reference: https://www-sciencedirect-com.ezproxy.csu.edu.au/science/article/pii/S0040162518301045
Level of Journal (Q1, Q2, …Qn): Q1
Keywords in this Reference: Granger causality, Big Data analytics, Destination Management and Marketing, Vector Autoregression model

The Name of the Current Solution (Technique/ Method/ Scheme/ Algorithm/ Model/ Tool/ Framework/ ... etc)
Technique/Algorithm name: Forecast Pro; ARIMA
Tools: Vector Autoregression (VAR); Big data analytics

The Goal (Objective) of this Solution & the Problem that needs to be solved
Problem: Given the available data and methods, how to use web search queries to analyse performance and forecast tourism destination arrivals.
Goal: To clarify that weather and temperature have no direct relation to travellers visiting a destination, whereas web search queries do reflect travel intentions.

What are the components of it?
Big data
Vector Autoregression (VAR)

Applied Area: Website analysis

The Process (Mechanism) of this Work; The process steps of the Technique/system
Process Steps | Advantage (Purpose of this step) | Disadvantage (Limitation/Challenge)
1. Co-integration test and unit roots: the long-run equilibrium requirement and the need for stationary variables must be satisfied. Advantage: the tests are designed without time trends. Disadvantage: non-stationary variables do not pass the unit root test.
2. VAR(p) modeling: each variable is treated as endogenous and regressed on lagged values of the dependent variables. Advantage: this is an unrestricted model. Disadvantage: N/A
3. Granger causality test: checks the effect of a lagged variable on the current value of another variable. Advantage: six causality relations are formulated with this test. Disadvantage: N/A
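These three steps can be reproduced in outline with standard time-series tooling. The snippet below is a generic sketch using statsmodels on synthetic data (not the authors' code or dataset): an ADF unit-root test, a VAR(p) fit with the lag order chosen by AIC, and a Granger causality test of whether a search-query series helps predict an arrivals series. The column names and the generated series are assumptions.

```python
# Generic illustration of the unit-root -> VAR(p) -> Granger-causality pipeline
# (assumed column names and synthetic data, not the study's dataset).
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, grangercausalitytests
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(1)
n = 120                                                       # e.g. monthly observations
queries = np.cumsum(rng.normal(size=n))                       # web search query index
arrivals = 0.6 * np.roll(queries, 2) + rng.normal(size=n)     # arrivals lag the queries
df = pd.DataFrame({"arrivals": arrivals, "queries": queries}).iloc[5:]

# Step 1: unit-root (ADF) test; difference non-stationary series before the VAR.
for col in df.columns:
    stat, pvalue = adfuller(df[col])[:2]
    print(f"ADF {col}: stat={stat:.2f}, p={pvalue:.3f}")
d = df.diff().dropna()

# Step 2: VAR(p) with the lag order selected by AIC.
res = VAR(d).fit(maxlags=6, ic="aic")
print("Selected lag order:", res.k_ar)

# Step 3: does 'queries' Granger-cause 'arrivals'? (second column -> first column)
grangercausalitytests(d[["arrivals", "queries"]], maxlag=4)
```

In practice the choice between levels and differences depends on the unit-root and co-integration results from step 1.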
Validation Criteria (Measurement Criteria)

Dependent Variable | Independent Variable
Forecast modeling | Statistical data
Big data | Internet

Input and Output | Critical Thinking: Feature of this work, and Why (Justify) | Critical Thinking: Limitations of the research/current solution, and Why (Justify)
Input (Data): Big data (web search queries and arrival statistics).
Output (View): The analysis reveals that weather and temperature have no correlation with destination arrivals.
Feature: The study shows that weather has no impact on domestic tourist visitation and does not affect cultural destinations. Natural resources at cultural destinations are likewise unaffected; web search queries matter only because they reveal travellers' intentions.
Limitations: The major limitation is destination management: overcrowding arises and increases conflicts, and overcrowding is a particular challenge at public places because highways are toll-free on public holidays.
(Describe the research/current solution) | Evaluation Criteria | How this research/current solution is valuable for your project
Description: Information related to the destinations is revealed, such as destination data and destination assets. It helps in making sense of the content and improves competitiveness.
Evaluation Criteria: Web search queries are used to predict tourist arrivals at destinations, and big data analytics are studied for destination management and marketing.
Value for the project: Weather does not influence tourists' choice of domestic or cultural destinations; web search queries reflect travel intentions while the weather does not. Temperature, however, has a positive impact on domestic tourism involving overnight stays.

Diagram/Flowchart
Figure: potential sources with market share of cities

Figure: The destination's potential visitors from neighbouring cities and distant cities

3
Reference in APA format that will be in 'Reference List': Li, W., Wang, S., & Bhatia, V. (2016). PolarHub: A large-scale web crawling engine for OGC service discovery in cyberinfrastructure. Computers, Environment and Urban Systems, 59, 195-207.
Citation that will be in the content: (Li et al., 2016)
URL of the Reference: https://www-sciencedirect-com.ezproxy.csu.edu.au/science/article/pii/S0198971516301260
Level of Journal (Q1, Q2, …Qn): Q1
Keywords in this Reference: PolarHub, Big data access, Geospatial interoperability, Scalability, Cyberinfrastructure

The Name of the Current Solution (Technique/ Method/ Scheme/ Algorithm/ Model/ Tool/ Framework/ ... etc)
Technique/Algorithm name: PolarHub
Tools: Service-oriented architecture
Applied Area: Cyberinfrastructure and web crawling

The Goal (Objective) of this Solution & the Problem that needs to be solved
Problem: The growing availability of geospatial resources on the web makes it difficult to identify them by their web signatures among the voluminous resources available.
Goal: To provide a cyberinfrastructure solution through PolarHub, which organises large-scale web crawling to find scattered geospatial data and services and so completes the objective effectively and efficiently.

What are the components of it?
Geospatial data
Web signature
Cyberinfrastructure
PolarHub
Web crawling

The Process (Mechanism) of this Work; The process steps of the Technique/system
Process Steps | Advantage (Purpose of this step) | Disadvantage (Limitation/Challenge)
1. PolarHub class design: the internal structure of the crawler is modeled in UML. Advantage: the crawler does not allow duplicate entries. Disadvantage: a URL that has already been visited cannot be crawled again.
2. Developing the crawling algorithm: a general-purpose algorithm controls the depth and width of the crawl. Advantage: it helps determine the scope of the web to be explored. Disadvantage: N/A
3. Asynchronous processing: web pages directly extracted from the seed are visited. Advantage: the rules are predetermined, and only the administrator has the right to start a task. Disadvantage: deep searching can raise the cost of the crawl.
4. Monitoring: the client receives updates from the server whenever a new result is found. Advantage: a GUI gives the user responsive feedback. Disadvantage: N/A
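A compact, single-threaded sketch of the crawling loop is shown below. It is not PolarHub's implementation; it only demonstrates the ideas named in the steps above: a visited set so that duplicate URLs are never re-crawled, bounds on crawl depth and on the number of links followed per page, and a simple "web signature" check that flags URLs which look like OGC service endpoints (e.g. a GetCapabilities request). The seed URL and the signature pattern are illustrative assumptions.

```python
# Minimal, single-threaded crawler sketch (not PolarHub itself): duplicate
# avoidance via a visited set, bounded crawl depth and per-page width, and a
# simple "web signature" check for OGC-style service endpoints.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

OGC_SIGNATURE = re.compile(r"(service=(wms|wfs|wcs)|getcapabilities)", re.IGNORECASE)
LINK_RE = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)

def crawl(seed, max_depth=2, max_links_per_page=20):
    visited, found_services = set(), []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue                        # a visited URL is never crawled again
        visited.add(url)
        if OGC_SIGNATURE.search(url):
            found_services.append(url)      # URL carries an OGC-like web signature
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue
        for link in LINK_RE.findall(html)[:max_links_per_page]:
            queue.append((urljoin(url, link), depth + 1))
    return found_services

# Usage (hypothetical seed): print(crawl("https://example.org/geodata"))
```

A production crawler such as PolarHub would additionally handle asynchronous workers and result monitoring, as described in steps 3 and 4.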
Validation Criteria (Measurement Criteria)
Dependent Variable | Independent Variable
Crawler | Crawler server of PolarHub