Using SAS Text Miner for Keyword Extraction and Topic Modeling

   

Added on  2022-12-27

Use SAS Text Miner to extract the keywords from the title in each data channel. What are the 10 most frequently used topics in each category? Use the SAS Results window to explain your answers.
(Hint: The ‘Topic’ column must be set as the only ‘Text’ role.)
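As a rough cross-check outside SAS Text Miner, the most frequent title terms per data channel can be counted directly in Python. This is only a minimal sketch: the sample rows, the channel names, the tiny stop-word set, and the helper `top_terms_by_channel` are all hypothetical, and plain term frequency is a stand-in for, not a reproduction of, what the Text Miner nodes compute.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample rows: (data channel, article title).
# Real data would come from the assignment's spreadsheet.
rows = [
    ("Tech", "New phone launch shows large screen"),
    ("Tech", "Phone makers race to launch new phone"),
    ("Business", "Markets rally as phone sales rise"),
]

# A tiny illustrative stop-word set, not a complete list.
STOP_WORDS = {"new", "to", "as", "the", "a", "of"}

def top_terms_by_channel(rows, n=10):
    """Return {channel: [(term, count), ...]} with the n most frequent terms."""
    counts = defaultdict(Counter)
    for channel, title in rows:
        # Lowercase the title and keep alphabetic tokens only
        tokens = re.findall(r"[a-z]+", title.lower())
        counts[channel].update(t for t in tokens if t not in STOP_WORDS)
    return {channel: counter.most_common(n) for channel, counter in counts.items()}

top = top_terms_by_channel(rows, n=3)
for channel, terms in top.items():
    print(channel, terms)
```

Calling the helper with `n=10` gives the top 10 terms per channel, which can then be compared against the term table shown in the SAS Results window.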
Are there common topics that span data channels and relate to a high number of shares or a low number of shares? Use the whole dataset in SAS Text Miner to identify the relationship. Provide an explanation to support your argument.
(Hint: Use the whole dataset to identify the articles with a high number of shares and those with a low number of shares, using the top 10% and the bottom 10% of the dataset as thresholds. Split the dataset in Excel on this basis before the analysis, then use the two resulting datasets to analyse the common topics in each of them. For this question, use the ‘Title’ column as the only ‘Text’ role for topic modelling.)
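The top 10% / bottom 10% split described in the hint can also be sketched in pandas instead of Excel. The example below is illustrative only: the data are made up, and the column name `shares` is an assumption about how the share count is labelled in the spreadsheet.

```python
import pandas as pd

# Hypothetical data; the column names "Title" and "shares" are assumptions.
df = pd.DataFrame({
    "Title": [f"Article {i}" for i in range(1, 21)],
    "shares": list(range(100, 2100, 100)),  # 100, 200, ..., 2000
})

# Thresholds at the 10th and 90th percentiles of the share counts
low_cut = df["shares"].quantile(0.10)
high_cut = df["shares"].quantile(0.90)

# Bottom 10% and top 10% of articles by number of shares
low_shares = df[df["shares"] <= low_cut]
high_shares = df[df["shares"] >= high_cut]

print(len(low_shares), len(high_shares))
```

Each of the two resulting frames could then be written out with `DataFrame.to_excel` and imported into SAS Text Miner as a separate data source.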
CODE:
import pandas

# Load the dataset
dataset = pandas.read_excel(r'online_popularity_data.xlsx')
dataset.head()

# Compute the word count of each title
dataset['word_count'] = dataset['Title'].apply(lambda x: len(str(x).split()))
dataset[['Title', 'word_count']].head()

# Descriptive statistics of the word counts (the column must exist first)
dataset['word_count'].describe()

# Identify the 20 most common words
freq = pandas.Series(' '.join(dataset['Title'].astype(str)).split()).value_counts()[:20]
freq

# Identify the 20 least common words
freq1 = pandas.Series(' '.join(dataset['Title'].astype(str)).split()).value_counts()[-20:]
freq1
# Libraries for text preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
# Build the stop-word set from NLTK's English stop words
# (requires the corpus to be downloaded once: nltk.download('stopwords'))
stop_words = set(stopwords.words("english"))
# Add custom stop words
new_words = ["using", "show", "result", "large", "also", "iv", "one", "two", "new",
             "previously", "shown"]
stop_words = stop_words.union(new_words)
corpus = []
for i in range(len(dataset)):
    # Remove tags first, while the angle brackets are still present
    text = re.sub(r"</?.*?>", " ", str(dataset['Title'][i]))
    # Remove punctuation (keep letters only)
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Remove any remaining special characters and digits
    text = re.sub(r"(\d|\W)+", " ", text)
    # Convert the string to a list of tokens
    text = text.split()
    # Stemming
    ps = PorterStemmer()
