BUS5CA Assignment 1: SAS Text Miner Analysis of Social Media Data

Added on  2022/12/27

Report
AI Summary
This assignment report details the application of SAS Text Miner to social media data, as part of the BUS5CA Customer Analytics and Social Media course. The analysis extracts keywords from article titles in each data channel to identify the top 10 most used topics per category, using the SAS Result window to explain the findings. The report also explores topics that span data channels and relate to high and low numbers of shares: the whole dataset is split into high-share and low-share articles using the top 10% and the bottom 10% as thresholds, and the 'Title' column is used as the only 'Text' role for topic modelling. The report includes code and explanations covering word count analysis, identification of common and uncommon words, and the visualization of unigrams, bigrams, and trigrams. A TF-IDF vectorizer is then used to refine the word counts, highlighting words that are more important to the context.
Use the SAS Text Miner to extract the keywords from the title in each data channel.
What are the highly used (top 10) topics in each category? Use the SAS Result window
to explain your answers.
(Hint: ‘Topic’ column will need to be set as the only ‘Text’ role.)
Are there common topics which span across data channels and relate to a high
number of shares and a low number of shares? Use the whole dataset in the SAS Text
Miner to identify the relationship. You should provide the explanation to support
your argument.
(Hint: Use the whole dataset to identify the articles with the high number of shares
and the low number of shares – by using appropriate thresholds with the top 10%
and the bottom 10% in the dataset. Separate the dataset using Excel based on this
before the analysis and use these two datasets to analyse the common topics in each
of them. In this question, please use ‘Title’ column as the only ‘Text’ role for topic
modelling.)
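The hint asks for the split to be done in Excel. For readers who prefer to script it, the same top 10% / bottom 10% separation can be sketched as below. This is a minimal illustration, not the method used in the report: the field names `title` and `shares` and the toy data are assumptions.

```python
def split_by_shares(rows, fraction=0.10):
    """Return (high, low): the top and bottom `fraction` of rows by share count."""
    # Rank articles from most shared to least shared.
    ranked = sorted(rows, key=lambda r: r["shares"], reverse=True)
    # Number of rows in each tail; at least one row so small samples still work.
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Hypothetical toy data: 20 articles with 100, 200, ..., 2000 shares.
articles = [{"title": f"Article {i}", "shares": i * 100} for i in range(1, 21)]
high, low = split_by_shares(articles)
print([r["shares"] for r in high])
print([r["shares"] for r in low])
```

Each of the two resulting lists can then be saved as its own dataset and loaded into SAS Text Miner separately.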
CODE AND EXPLANATION
The average word count is about 8 words per title, ranging from a minimum of 1 to a
maximum of 20. The word count gives an indication of how much text we are handling
and of how much the title length varies across the rows.
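The summary statistics above can be computed along the following lines. This is a sketch only: the sample titles are made up, and splitting on whitespace is an assumed tokenization that may differ from the one used for the reported figures.

```python
from statistics import mean


def word_count_summary(titles):
    """Return the mean, minimum and maximum number of words per title."""
    # Whitespace tokenization is an assumption made for this sketch.
    counts = [len(title.split()) for title in titles]
    return {"mean": mean(counts), "min": min(counts), "max": max(counts)}


# Hypothetical sample titles, not taken from the assignment dataset.
sample_titles = [
    "Apple unveils new phone",
    "Ten ways to save money on travel this summer",
    "Markets",
]
summary = word_count_summary(sample_titles)
print(summary)
```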
Most common and uncommon words
A peek into the most common words gives insight not only into the frequently used
words but also into words that could be data-specific stop words. A
comparison of the most common words with the default English stop words gives us
a list of words that need to be added to a custom stop word list.
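The comparison described above can be sketched as follows. The stop word set here is a tiny stand-in for a real default English list, and the titles are invented for illustration. The output is a list of candidates to review by hand, not an automatic stop word list.

```python
from collections import Counter

# Tiny stand-in for a default English stop word list (assumption).
DEFAULT_STOP_WORDS = {"the", "a", "to", "of", "in", "for", "and", "is", "on"}


def stop_word_candidates(titles, default_stop_words, top_n=5):
    """Return the top_n most common words that are not already default stop words."""
    counts = Counter(word for title in titles for word in title.lower().split())
    most_common = [word for word, _ in counts.most_common(top_n)]
    return [word for word in most_common if word not in default_stop_words]


# Hypothetical titles in which "video" is frequent enough to act as a
# data-specific stop word.
sample_titles = [
    "The best video of the week",
    "Video of a cat in the snow",
    "New video shows the launch",
]
candidates = stop_word_candidates(sample_titles, DEFAULT_STOP_WORDS, top_n=3)
print(candidates)
```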
Visualize top N uni-grams, bi-grams & tri-grams
We can use the CountVectorizer to visualise the top 20 uni-grams, bi-grams and
tri-grams.
Converting to a matrix of integers
The next step in refining the word counts is the TF-IDF vectorizer. The weakness
of the raw counts obtained from the CountVectorizer is that large counts of certain
common words may dilute the impact of more context-specific words in the corpus. The
TF-IDF vectorizer overcomes this by penalizing words that appear in many documents
across the corpus. TF-IDF scores are weighted word frequencies that highlight words
that are important to a particular document rather than words that appear frequently
across all documents.
TF-IDF consists of 2 components:
TF (term frequency): how often a term occurs within a given document
IDF (inverse document frequency): a weight that shrinks as the term appears in more documents
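How the two components combine can be sketched as follows. This uses the textbook form, with TF as the term's share of the document and IDF as log(N / df), which is an assumption of this sketch. Library implementations such as scikit-learn's use a smoothed IDF and normalize the result, so their numbers will differ. The sample documents are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


# Hypothetical titles: "new" occurs in every title, "car" in only one.
docs = ["new phone launch", "new car launch", "new phone review"]
scores = tf_idf(docs)
print(scores[1])
```

A word that occurs in every document, such as "new" here, gets an IDF of log(1) = 0 and therefore drops out, while a word confined to one document gets the largest weight.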