Data Science: Data Scraping and Twitter Data Extraction Analysis

DATA SCRAPING
Student Name
Affiliation
Date
Data Scraping
Data scraping is generally considered an ad hoc, inelegant technique, often used only as a last resort when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program may report nonsense, since it has been told to read data in a particular format or from a particular place, and has no knowledge of how to check its results for validity (Bakker, 2014).
Code Development and Results
Twitter data scraping
Twitter is a social networking site on which users from all over the world can sign up and create an account. During data scraping of Twitter, the following data can be extracted:
i. Twitter likes
ii. Twitter list
iii. Twitter search
iv. Twitter profile
v. Twitter conversation
vi. Twitter timeline
vii. Twitter connections
For the scraping process, the JavaScript language (running on Node.js) was used in this case.
The following code is used to invoke the scrapers and extract data from Twitter:
#!/usr/bin/env node
'use strict'
const execa = require('execa')
const args = process.argv.slice(2)
const scrapeTwitterCommand = `scrape-twitter-${args[0]}`
const scrapeTwitterFlags = args.slice(1)
const command = execa(scrapeTwitterCommand, scrapeTwitterFlags)
command.stdout.pipe(process.stdout)
command.stderr.pipe(process.stderr)
command.catch(() => {
  console.log(`
  Access Twitter data without an API key.

  Usage
    $ scrape-twitter <command>

  Commands
    profile       Get a user's profile.
    timeline      Get a user's timeline.
    likes         Get a user's likes.
    connections   Get a user's connections.
    conversation  Get a particular conversation.
    list          Get the timeline of a particular list.
    search        Query Twitter for matching tweets.
  `)
  return true
})
Twitter likes
Likes on Twitter arise when a person posts something on his or her account; the people who follow or can otherwise access the post are then able to like it. These likes can easily be extracted using Node.js, the scripting runtime used for the scraping. The following excerpt from the project's package.json lists the scripts and dependencies that support the likes scraper:
"scripts": {
  "test": "npm run -s lint && npm run -s unit",
  "lint": "eslint {bin,src,test} --fix",
  "unit": "jest --testEnvironment=node",
  "build": "rimraf dist && babel src --out-dir dist",
  "push": "git push --follow-tags origin master",
  "release": "npm test && npm run -s build && standard-version && npm run -s push && npm publish"
},
"dependencies": {
  "JSONStream": "^1.3.4",
  "cheerio": "^1.0.0-rc.2",
  "debug": "^3.1.0",
  "dotenv": "^6.0.0",
  "execa": "^1.0.0",
  "expand-home-dir": "^0.0.3",
  "fetch-cookie": "^0.7.2",
  "isomorphic-fetch": "^2.2.1",
  "meow": "^5.0.0",
  "pump": "^3.0.0",
  "query-string": "^6.1.0",
  "readable-stream": "^3.0.2",
  "touch": "^3.1.0",
  "url-regex": "^4.1.1"
},
"devDependencies": {
  "babel-cli": "^6.26.0",
  "babel-eslint": "^9.0.0",
  "babel-jest": "^23.4.2",
  "babel-plugin-transform-class-properties": "^6.24.1",
  "babel-plugin-transform-object-rest-spread": "^6.26.0",
  "babel-preset-env": "^1.7.0",
  "eslint": "^5.4.0",
  "eslint-config-standard": "^12.0.0",
  "eslint-plugin-import": "^2.14.0",
  "eslint-plugin-node": "^7.0.1",
  "eslint-plugin-promise": "^4.0.0",
  "eslint-plugin-standard": "^4.0.0",
  "jest": "^23.5.0",
  "prettier": "^1.14.2",
  "rimraf": "^2.6.2",
  "standard-version": "^4.4.0",
  "stream-to-promise": "^2.2.0",
  "validate-commit-msg": "^2.14.0"
}
For the likes extraction itself, the wrapper below is used; it simply loads the compiled implementation:
#!/usr/bin/env node
'use strict'
require('../dist/bin/scrape-twitter-likes')
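What the wrapper ultimately produces is machine-readable tweet data on standard output. Assuming the output is line-delimited JSON (a guess based on the JSONStream dependency listed above; the sample lines are fabricated), a consumer could parse the scraped likes like this:

```javascript
'use strict'
// Parse scraper output, assuming one JSON object per line (NDJSON).
// The sample lines below are fabricated for illustration.
function parseLikes (output) {
  return output
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line))
}

const sampleOutput = [
  '{"id":"1","text":"hello world"}',
  '{"id":"2","text":"scraping twitter"}'
].join('\n')

const likes = parseLikes(sampleOutput)
console.log(likes.length) // prints 2
```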
Twitter Search
Data scraping is normally performed against a program's user interface or display output when the program offers no better mechanism for data exchange, such as an externally exposed API. The operator of the external system will then often regard screen scraping as undesirable, since the system loses control over its original content once it has been copied into the scraping system.
Twitter's search engine helps in finding followers, trending topics, or posts made by a user. These results can easily be extracted using the wrapper script below:
#!/usr/bin/env node
'use strict'
require('../dist/bin/scrape-twitter-search')
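Under the hood, a search scraper has to URL-encode the query before fetching and parsing the results page. That step can be sketched with Node's built-in URLSearchParams; the endpoint path and the f=tweets flag are assumptions based on Twitter's historical public search URL, not details taken from this project.

```javascript
'use strict'
// Build a Twitter search URL, as a search scraper must before fetching
// the results page. The path and the `f=tweets` ("Latest") parameter are
// assumptions based on Twitter's historical public search interface.
function buildSearchUrl (query, latest) {
  const params = new URLSearchParams({ q: query })
  if (latest) params.set('f', 'tweets')
  return `https://twitter.com/search?${params.toString()}`
}

console.log(buildSearchUrl('data scraping', true))
// prints https://twitter.com/search?q=data+scraping&f=tweets
```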
Connections
Connections represent the follower and following relationships of an account. In the JavaScript code, every kind of scraped data is modelled as a stream, and all of these streams are collected in one module so that a single entry point exposes them to any consumer:
const TimelineStream = require('./lib/timeline-stream')
const MediaTimelineStream = require('./lib/media-stream')
const ConversationStream = require('./lib/conversation-stream')
const ThreadedConversationStream = require('./lib/threaded-conversation-stream')
const TweetStream = require('./lib/tweet-stream')
const ListStream = require('./lib/list-stream')
const LikeStream = require('./lib/like-stream')
const ConnectionStream = require('./lib/connection-stream')
const getUserProfile = require('./lib/twitter-query').getUserProfile
module.exports = {
  TimelineStream,
  MediaTimelineStream,
  ConversationStream,
  ThreadedConversationStream,
  TweetStream,
  ListStream,
  LikeStream,
  ConnectionStream,
  getUserProfile
}
Findings and Results
During data scraping, all of the target data was successfully extracted by the script developed in JavaScript. The likes, the timeline and the other main features of Twitter accounts were scraped and could be viewed from the script's output.
Ethical and Security Ramifications of Data Scraping
During data scraping it was not possible to modify the posts or the likes of any Twitter user, although it was possible to view a user's actual number of likes, responses and posts (Smith et al., 2013). This obviously interferes with the privacy of the information posted by the user, even though such information is normally public on Twitter. The security threat is limited, since during extraction the scraper cannot post or manipulate information that has already been published, but only gains read access to it (Park et al., 2016).
Suggested Mitigation
In order for Twitter to prevent such scraping of user data, it could introduce end-to-end encryption, meaning that even Twitter itself would not be able to view the information sent until the data is delivered to the intended recipient (Malik & Rizvi, 2011).
References
Bakker, P., 2014. Mr. Gates returns: Curation, community management and other new roles for journalists. Journalism Studies, vol. 15, no. 5, pp. 596-606.
Malik, S.K. and Rizvi, S.A.M., 2011. Information extraction using web usage mining, web scrapping and semantic annotation. In: 2011 International Conference on Computational Intelligence and Communication Systems, pp. 465-469. IEEE.
Park, S.J., Choi, K.H., Park, J. and Kim, J.B., 2016. A study on spatial analysis using R-based deep learning. International Journal of Software Engineering and Its Applications, vol. 10, no. 5, pp. 87-94.
Smith, T.W.P., O'Keeffe, E., Aldous, L. and Agnolucci, P., 2013. Assessment of Shipping's Efficiency Using Satellite AIS Data.