Detecting Non‐personal and Spam Users on Geo‐tagged Twitter Network
With the rapid growth and popularity of mobile devices and location‐aware technologies, online social networks such as Twitter have become an important data source for scientists conducting geo‐social network research. Non‐personal accounts, spam users and junk tweets, however, pose severe problems for extracting meaningful information and for validating research findings on tweets or Twitter users. Detecting such users is therefore a critical and fundamental step for Twitter‐related geographic research. In this study, we develop a methodological framework to: (1) extract user characteristics based on geographic, graph‐based and content‐based features of tweets; (2) construct a training dataset by manually inspecting and labeling a large sample of Twitter users; and (3) derive reliable rules and knowledge for detecting non‐personal users with supervised classification methods. The extracted geographic characteristics of a user include maximum speed, mean speed, the number of different counties that the user has been to, and others. Content‐based characteristics of a user include the number of tweets per month, the percentage of tweets with URLs or hashtags, and the percentage of tweets with emotions, detected with sentiment analysis. The extracted rules are theoretically interesting and practically useful. Specifically, the results show that geographic features, such as the average speed and the frequency of county changes, can serve as important indicators of non‐personal users. Among non‐spatial characteristics, the percentage of tweets with a high human factor index, the percentage of tweets with URLs, and the percentage of tweets that mention or reply to other users are the top three features for detecting non‐personal users.
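As an illustration of the kind of per‐user features described above, the following is a minimal sketch rather than the study's actual implementation. The `Tweet` record, the `user_features` function, the regular expressions, and the choice to define mean speed as the average of speeds between consecutive geo‐tagged tweets are all assumptions made for this example; the abstract does not specify how these quantities were computed.

```python
import math
import re
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


@dataclass
class Tweet:
    """Hypothetical record for one geo-tagged tweet (not from the paper)."""
    timestamp: float  # seconds since epoch
    lat: float        # degrees
    lon: float        # degrees
    county: str       # county identifier, assumed already geocoded
    text: str


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def user_features(tweets):
    """Compute illustrative geographic and content features for one user."""
    ordered = sorted(tweets, key=lambda t: t.timestamp)
    n = len(ordered)

    # Speed (km/h) between each pair of consecutive tweets.
    speeds = []
    county_changes = 0
    for a, b in zip(ordered, ordered[1:]):
        hours = (b.timestamp - a.timestamp) / 3600.0
        if hours > 0:  # skip simultaneous tweets to avoid division by zero
            speeds.append(haversine_km(a.lat, a.lon, b.lat, b.lon) / hours)
        if a.county != b.county:
            county_changes += 1

    def share(pattern):
        # Fraction of tweets whose text matches the pattern.
        if n == 0:
            return 0.0
        return sum(1 for t in ordered if re.search(pattern, t.text)) / n

    return {
        "max_speed_kmh": max(speeds) if speeds else 0.0,
        "mean_speed_kmh": sum(speeds) / len(speeds) if speeds else 0.0,
        "distinct_counties": len({t.county for t in ordered}),
        "county_changes": county_changes,
        "share_with_url": share(r"https?://\S+"),
        "share_with_hashtag": share(r"#\w+"),
        "share_with_mention": share(r"@\w+"),
    }
```

A feature dictionary of this shape, computed for each manually labeled user, could then be passed to any supervised classifier; which classifier the study used is not stated in the abstract.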
Document Type: Research Article
Publication date: June 1, 2014