1. Data Acquisition and Pre-Processing
This study used the official API of the Sina Weibo data center. Searching with the keywords "coronavirus" and "pneumonia" over the period from January 9, 2020 (when the pathogen of the pneumonia was first identified as a novel coronavirus) to December 9, 2020, it retrieved attributes including user name, user ID, post text, geographic location, and posting time from Sina Weibo.
The original post text contained interfering information such as whitespace, HTTP links, and punctuation. To remove this noise and improve parsing efficiency, the source text had to be filtered first: Python regular expressions were applied to the original post text to strip the interfering information, stop words, low-quality text, and repeating text. In the end, 6,946,196 post texts were obtained, of which 328,241 contained geographic location information.
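As an illustration, the cleaning step might look like the following sketch. The exact regular expressions, stop-word list, and sample posts are assumptions for demonstration, not the study's actual patterns.

```python
import re

URL_RE = re.compile(r"https?://\S+")              # http/https links
NON_WORD_RE = re.compile(r"[^\w\u4e00-\u9fff]+")  # punctuation and symbols (keeps CJK)

def clean_post(text, stopwords=frozenset()):
    """Strip links, punctuation, extra whitespace, and stop words from one post."""
    text = URL_RE.sub("", text)
    text = NON_WORD_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in stopwords]
    return " ".join(tokens)

# Illustrative posts; the second is a verbatim repeat to show de-duplication.
raw_posts = [
    "新型冠状病毒肺炎最新通报 https://t.cn/abc123 ！！！",
    "新型冠状病毒肺炎最新通报 https://t.cn/abc123 ！！！",
]
cleaned = [clean_post(p) for p in raw_posts]
deduped = list(dict.fromkeys(cleaned))  # drop repeating text, keep original order
```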
2. Methods
In this study, a topic extraction and classification framework was constructed using the Latent Dirichlet Allocation (LDA) topic model and the Random Forest algorithm; public sentiment toward each topic was then obtained by stratifying the COVID-19-related social media text.
First, the Chinese text was segmented into individual words using Jieba (meaning "stutter"), a Python tool for Chinese word segmentation. Then, using the Gensim library for Python, topics were extracted with the LDA topic model, producing a topic probability distribution for each post and a word probability distribution within each topic. Finally, labelled topic samples served as the training set for a Random Forest classifier, built with Python's Scikit-Learn library, which was then used to classify the entire dataset.
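The segmentation and topic-extraction steps could be sketched as follows. The number of topics, training passes, and the toy posts are illustrative assumptions; the study's actual parameter choices are not restated here.

```python
import jieba
from gensim import corpora
from gensim.models import LdaModel

# Toy cleaned posts standing in for the filtered Weibo corpus.
cleaned_posts = ["新冠 疫情 防控 措施", "肺炎 患者 治疗 康复", "疫苗 研发 临床 试验"]

# Segment each post into words with Jieba, dropping whitespace tokens.
segmented = [[w for w in jieba.cut(post) if w.strip()] for post in cleaned_posts]

# Build the dictionary and bag-of-words corpus, then fit the LDA model.
dictionary = corpora.Dictionary(segmented)
corpus = [dictionary.doc2bow(doc) for doc in segmented]
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=3, passes=10, random_state=42)  # num_topics is assumed

# Topic probability distribution per post, and word distribution per topic.
doc_topics = [lda.get_document_topics(bow) for bow in corpus]
topic_words = lda.show_topics(num_topics=3, num_words=5)
```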
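Continuing the sketch above, the per-post LDA topic distributions can serve as feature vectors for a Random Forest classifier. The labels and parameter values below are placeholders, since the paper does not specify them in this section.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def topic_vector(bow, model, num_topics):
    """Turn one post's LDA topic distribution into a dense feature vector."""
    vec = np.zeros(num_topics)
    for topic_id, prob in model.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

# Features from the LDA sketch above; labels are illustrative manual annotations.
X = np.array([topic_vector(bow, lda, 3) for bow in corpus])
y = np.array(["prevention", "treatment", "vaccine"])

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)               # train on the labelled sample
predicted = clf.predict(X)  # the fitted model then classifies the full dataset
```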