Web Crawling: Measuring Web Trends for Increased Business Productivity


Added on  2024/06/28

Literature Review
AI Summary
This literature review investigates the use of web crawling techniques for analyzing web trends and improving business productivity. It examines various approaches, including machine learning algorithms for modeling web traffic, big data analytics for forecasting tourism destination arrivals, and the PolarHub web crawling engine for geospatial data discovery. The review discusses the advantages and limitations of each technique, highlighting their applicability in different contexts. Key themes include identifying bot traffic, analyzing social media data, and leveraging web search queries for business insights. The reviewed solutions aim to address challenges such as non-stationary web traffic, overcrowded tourist destinations, and the efficient retrieval of geospatial resources, ultimately contributing to increased business efficiency and informed decision-making.
Literature Review (Secondary Research)
Student Name & CSU ID

Project Topic Title
Analysing the website and measuring the web trends using web crawler based on page tagging for increasing the productivity of businesses

Version 1.0 _ Week 1 (5 Journal Papers from CSU Library)
1

Reference in APA format that will be in 'Reference List'
Suchacka, G., & Wotzka, D. (2017). Modeling a non-stationary bots’ arrival process at an e-commerce Web site. Journal of Computational Science, 22, 198-208.

Citation that will be in the content
Suchacka & Wotzka, 2017
URL of the Reference
https://www-sciencedirect-com.ezproxy.csu.edu.au/science/article/pii/S0165032717323510

Level of Journal (Q1, Q2, …Qn)
Q1

Keywords in this Reference
Web traffic analysis and modeling, Web traffic, characterization, Log file analysis, Web server, Internet robot, Web bot, Modeling and simulation, Regression analysis, Heavy-tailed distribution

The Name of the Current Solution (Technique/Method/Scheme/Algorithm/Model/Tool/Framework/... etc.)

Technique/Algorithm name:
Machine learning
Internet marketing
Online payments

Tools:
Mathematical model
Simulation model
Distribution model

Applied Area:
Web applications

The Goal (Objective) of this Solution & What is the Problem that needs to be solved

Problem:
The main problem is modeling the real arrival process of bot requests on the Web server of an e-commerce site.

Goal:
To solve the problem of non-stationary web traffic by modeling the traffic in chunks.

What are the components of it?
Machine learning
Web applications

The Process (Mechanism) of this Work; The process steps of the Technique/system

Process Steps, each with its Advantage (Purpose of this step) and Disadvantage (Limitation/Challenge)

1
E-commerce website: a website for buying or selling online.
Advantage: All processing is done according to the requirements of the website and its users.
Disadvantage: Because there are many users, the site carries heavy traffic.

2
Constructing session-based web traffic: sessions are constructed from user requests.
Advantage: Sessions are opened in response to requests and expire afterwards.
Disadvantage: N/A

3
Identifying bot traffic: the requests in each session are examined, one request at a time.
Advantage: A session is treated as a bot session as soon as any one of the conditions is satisfied.
Disadvantage: N/A
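
Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the procedure of Suchacka & Wotzka (2017): the 30-minute inactivity timeout, the record layout, and the two bot conditions (a request for robots.txt, or a user-agent string that names a bot) are assumptions chosen only to show how sessions might be built from log records and how an "any one condition" rule works.

```python
from dataclasses import dataclass, field

SESSION_TIMEOUT = 30 * 60  # assumed: seconds of inactivity that end a session
BOT_AGENT_HINTS = ("bot", "crawler", "spider")  # hypothetical user-agent markers


@dataclass
class Session:
    client: str
    user_agent: str
    timestamps: list = field(default_factory=list)
    paths: list = field(default_factory=list)


def build_sessions(requests):
    """Group (timestamp, client, user_agent, path) records into sessions.

    A new session starts when the same client and user agent have been
    idle for longer than SESSION_TIMEOUT.
    """
    sessions = []
    open_sessions = {}  # (client, user_agent) -> most recent Session
    for ts, client, agent, path in sorted(requests):
        key = (client, agent)
        current = open_sessions.get(key)
        if current is None or ts - current.timestamps[-1] > SESSION_TIMEOUT:
            current = Session(client, agent)
            open_sessions[key] = current
            sessions.append(current)
        current.timestamps.append(ts)
        current.paths.append(path)
    return sessions


def is_bot_session(session):
    """Flag the session as a bot session if any one condition holds."""
    agent = session.user_agent.lower()
    conditions = (
        "/robots.txt" in session.paths,                  # asked for robots.txt
        any(hint in agent for hint in BOT_AGENT_HINTS),  # self-declared bot
    )
    return any(conditions)
```

Counting the flagged sessions against all sessions gives the share of bot traffic that the review reports as the output of this work.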

Validation Criteria (Measurement Criteria)

Dependent Variable
Independent Variable
Bot arrival process
Time-zone
Web traffic analysis
Web server log data
Input and Output

Input (Data):
Multiple user logins

Output (View):
Web bot traffic is analyzed, showing its share of the overall traffic on the web server.

Critical Thinking: Feature of this work, and Why (Justify)
This work is mainly applicable to analytical modeling, benchmarking, and sending data packets. The model also helps in reproducing a stream of bot requests and in generating synthetic bot traffic for use in simulation experiments.

Critical Thinking: Limitations of the research current solution, and Why (Justify)
The results cannot be generalized to all e-commerce websites. Accurate modeling is not achieved in the standard time, and delays of seconds can cause errors. Fitting the real data to the regression function is also an issue.

Describe the research/current solution
Web bot traffic is analyzed, and the analysis shows that it makes up a large share of the traffic on an e-commerce Web server. Bots access the site in different ways and do not behave like human users.

Evaluation Criteria
It helps in analyzing the web traffic on e-commerce sites and in dealing with it.

How this research/current solution is valuable for your project
This model helps in producing bot requests with features similar to real Web traffic. It also helps in generating bursty synthetic bot traffic for simulation experiments, and it can help other online stores that run online promotions.
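
To make the idea of synthetic, non-stationary traffic concrete, the sketch below generates request times whose rate changes over the day. It does not reproduce the regression-based model of Suchacka & Wotzka (2017). As a stand-in it uses thinning of a Poisson process, a standard simulation technique, and it does not capture the burstiness or heavy tails mentioned in the paper's keywords. The rate function `hourly_rate` and all its values are invented for illustration.

```python
import math
import random


def hourly_rate(t_seconds):
    """Hypothetical arrival rate (requests per second) over a 24-hour cycle."""
    hour = (t_seconds / 3600.0) % 24
    # Assumed shape: a base rate plus a daily cycle that peaks at 06:00.
    return 0.5 + 0.4 * math.sin(2 * math.pi * hour / 24)


def simulate_arrivals(rate_fn, max_rate, horizon, seed=0):
    """Simulate arrival times on [0, horizon) for a time-varying rate.

    Candidate arrivals are drawn at the constant rate max_rate, and each
    candidate at time t is kept with probability rate_fn(t) / max_rate.
    max_rate must be an upper bound on rate_fn.
    """
    rng = random.Random(seed)
    arrivals = []
    t = 0.0
    while True:
        t += rng.expovariate(max_rate)  # gap to the next candidate arrival
        if t >= horizon:
            return arrivals
        if rng.random() * max_rate <= rate_fn(t):
            arrivals.append(t)
```

Calling `simulate_arrivals(hourly_rate, max_rate=0.9, horizon=24 * 3600)` yields one simulated day of request timestamps that could be replayed against a test server.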

Diagram/Flowchart

Figure: Flow diagram of design
2
Reference in APA format that will be in 'Reference List'
Liu, Y. Y., Tseng, F. M., & Tseng, Y. H. (2018). Big Data analytics for forecasting tourism destination arrivals with the applied Vector Autoregression model. Technological Forecasting and Social Change, 130, 123-134.

Citation that will be in the content
Liu et al., 2018
URL of the Reference
https://www-sciencedirect-com.ezproxy.csu.edu.au/science/article/pii/S0040162518301045

Level of Journal (Q1, Q2, …Qn)
Q1

Keywords in this Reference
Granger causality, Big Data analytics, Destination Management and Marketing, Vector Autoregression model

The Name of the Current Solution (Technique/Method/Scheme/Algorithm/Model/Tool/Framework/... etc.)

Technique/Algorithm name:
Forecast Pro
ARIMA

Tools:
Virtual reality
Big data

Applied Area:
Website analysis

The Goal (Objective) of this Solution & What is the Problem that needs to be solved

Problem:
Using the available data and methods, together with web search queries, to carry out the performance analysis and provide results.

Goal:
The goal is to establish that weather and temperature have no relation to travelers visiting the destination.

What are the components of it?
Big data
Virtual reality

The Process (Mechanism) of this Work; The process steps of the Technique/system

Process Steps, each with its Advantage (Purpose of this step) and Disadvantage (Limitation/Challenge)

1
Co-integration test and unit roots: the requirements of a long-term equilibrium and of stationary variables should be satisfied.
Advantage: They are designed without time trends.
Disadvantage: Non-stationary variables do not pass the unit root test.

2
VAR(p) modeling: each variable is treated as endogenous, and its value is regressed on lagged values of the dependent variables.
Advantage: This is an unrestricted model.
Disadvantage: N/A

3
Granger causality test: this checks the effect of a lagged variable on the current value.
Advantage: With this, six are formulated.
Disadvantage: N/A
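
The Granger causality test in step 3 can be sketched in plain Python for the simplest case of two series and one lag. This is an illustration under stated assumptions, not the analysis of Liu et al. (2018). The helper names `solve`, `rss`, and `granger_f` are hypothetical, only the F statistic is computed (no p-value), and any input series would have to pass the stationarity checks of step 1 first. The test compares a restricted regression (the effect on its own lag) with an unrestricted one (own lag plus the lag of the candidate cause).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(rows, target):
    """Residual sum of squares of an ordinary least squares fit."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, target))


def granger_f(cause, effect):
    """F statistic for 'cause Granger-causes effect' using one lag."""
    steps = range(1, len(effect))
    y = [effect[t] for t in steps]
    restricted = [[1.0, effect[t - 1]] for t in steps]
    unrestricted = [[1.0, effect[t - 1], cause[t - 1]] for t in steps]
    rss_r = rss(restricted, y)
    rss_u = rss(unrestricted, y)
    dof = len(y) - 3  # observations minus parameters of the unrestricted model
    return (rss_r - rss_u) / (rss_u / dof)  # one restriction is being tested
```

A large value of `granger_f(search_volume, arrivals)` relative to the critical value of the F distribution with (1, n - 3) degrees of freedom would suggest that lagged search volume helps predict arrivals. Both series names are placeholders for data the reader supplies.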
Validation Criteria (Measurement Criteria)

Dependent Variable Independent Variable
Forecast modeling
Statistical data
Big data
Internet
Input and Output

Input (Data):
Big data

Output (View):
The analysis finds and further reveals that weather and temperature have no correlation.

Critical Thinking: Feature of this work, and Why (Justify)
This study shows that weather has no impact on domestic tourist visits and does not affect cultural destinations. Cultural destinations have natural resources that are also not affected by web search queries, because such queries only reveal travelers' intentions.

Critical Thinking: Limitations of the research current solution, and Why (Justify)
The major limitation is destination management. Overcrowding occurs and increases conflicts. Overcrowding is a particular challenge at public places because there are no highway tolls on public holidays.

Describe the research/current solution
Information about the destinations, such as destination data and destination assets, is revealed. It helps in making sense of content and improves competitiveness.

Evaluation Criteria
Web search queries are used to predict the number of tourists at tourist destinations. Big Data analytics is studied for management and marketing.

How this research/current solution is valuable for your project
Weather does not affect tourists' choice of domestic destinations, cultural destinations, etc., because web search queries reflect travel intentions but the weather does not. Also, temperature has a positive impact on domestic tourism where visitors stay until night.

Diagram/Flowchart
Figure: potential sources with market share of cities

Figure: The destination's potential visitors from neighbouring cities and distant cities
3
Reference in APA format that will be in 'Reference List'
Li, W., Wang, S., & Bhatia, V. (2016). PolarHub: A large-scale web crawling engine for OGC service discovery in cyberinfrastructure. Computers, Environment and Urban Systems, 59, 195-207.

Citation that will be in the content
Li et al., 2016
URL of the Reference
https://www-sciencedirect-com.ezproxy.csu.edu.au/science/article/pii/S0198971516301260

Level of Journal (Q1, Q2, …Qn)
Q1

Keywords in this Reference
Polar Hub, Big data access, Geospatial interoperability, Scalability, Cyberinfrastructure

The Name of the Current Solution (Technique/Method/Scheme/Algorithm/Model/Tool/Framework/... etc.)

Technique/Algorithm name:
Polar Hub

Tools:
Service-oriented architecture

Applied Area:
Cyber infrastructure in web crawling

The Goal (Objective) of this Solution & What is the Problem that needs to be solved

Problem:
The increasing availability of geospatial data has left a mark on the web in the form of a web signature of voluminous resources.

Goal:
To provide a cyber infrastructure solution through Polar Hub, which organises web crawling to find scattered geospatial data and resources and to complete this objective effectively and efficiently.

What are the components of it?
Geospatial
Web signature
Cyber infrastructure
Polar Hub
Web crawling
The Process (Mechanism) of this Work; The process steps of the Technique/system

Process Steps, each with its Advantage (Purpose of this step) and Disadvantage (Limitation/Challenge)

1
Polar Hub class design: the internal structure of the crawler is built as a UML structure.
Advantage: The crawler does not allow duplicate entries.
Disadvantage: A URL that has already been visited cannot be crawled again.

2
Developing the crawling algorithm: an algorithm is developed for general-purpose crawling depth and width.
Advantage: Helps to determine the scope of the exploited web.
Disadvantage: N/A

3
Asynchronous processing: the web pages extracted directly from the seed are visited.
Advantage: Rules are predetermined, and only the administrator has the right to start the task.
Disadvantage: Deep searching can raise the cost of crawling.

4
Monitoring: the client receives updates from the server whenever a new result is found.
Advantage: It provides a GUI that gives the user an impressive response.
Disadvantage: N/A

5
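
The first two steps (no duplicate URLs, and a crawl bounded by depth and width) can be sketched as a breadth-first traversal. This is not the Polar Hub implementation described by Li et al. (2016). The function `crawl`, its parameters `max_depth` and `max_width`, and the `get_links` callback are hypothetical names, and the callback stands in for the part of a real crawler that downloads a page and extracts its links.

```python
from collections import deque


def crawl(seeds, get_links, max_depth=2, max_width=10):
    """Breadth-first crawl that never visits the same URL twice.

    get_links(url) returns the list of URLs found on a page. max_depth
    limits how far the crawl moves away from the seeds, and max_width
    limits how many links are followed from any single page.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    visited = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url)[:max_width]:
            if link not in visited:  # skip URLs that were already seen
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Raising `max_depth` widens the scope of the crawl but also increases the number of pages fetched, which matches the cost concern noted for deep searching in step 3.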

Validation Criteria (Measurement Criteria)

Dependent Variable
Independent Variable
crawler
Crawler server of Polar Hub