How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

Daniel Maier; Andreas Niekler; Gregor Wiedemann; Daniela Stoltenberg

doi:10.5117/CCR2020.2.001.MAIE

E-ISSN: 2665-9085

Publisher: Amsterdam University Press
Source: Computational Communication Research, Volume 2, Issue 2, Oct 2020, p. 139 - 152
DOI: https://doi.org/10.5117/CCR2020.2.001.MAIE
Language: English
Published online: 01 Oct 2020


Abstract

Topic modeling enables researchers to explore large document corpora. Large corpora, however, can be extremely costly to model in terms of time and computing resources. To circumvent this problem, two techniques have been suggested: (1) modeling random document samples and (2) pruning the vocabulary of the corpus. Although both techniques are frequently applied, there has been no systematic inquiry into how they affect the resulting models. Using three empirical corpora with different characteristics (news articles, websites, and Tweets), we systematically investigated how different sample sizes and pruning affect the resulting topic models in comparison to models of the full corpora. Our inquiry provides evidence that both techniques are viable tools that will likely not impair the resulting model. Sample-based topic models closely resemble corpus-based models if the sample size is large enough (> 10,000 documents). Moreover, extensive pruning does not compromise the quality of the resulting topics.
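
The two techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy corpus, and the pruning thresholds are assumptions chosen for illustration, and pruning is shown here as relative pruning by document frequency, which is only one possible pruning scheme.

```python
import random
from collections import Counter


def sample_documents(docs, sample_size, seed=0):
    """Draw a simple random sample of documents without replacement."""
    if sample_size >= len(docs):
        return list(docs)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(docs, sample_size)


def prune_vocabulary(tokenized_docs, min_doc_share=0.01, max_doc_share=0.99):
    """Drop terms whose share of documents lies outside the given bounds.

    The default thresholds are hypothetical, not values from the article.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    keep = {
        term
        for term, df in doc_freq.items()
        if min_doc_share <= df / n_docs <= max_doc_share
    }
    pruned = [[t for t in tokens if t in keep] for tokens in tokenized_docs]
    return pruned, keep


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    ["the", "election", "results"],
    ["the", "election", "campaign"],
    ["the", "football", "match"],
    ["the", "football", "results"],
]

sample = sample_documents(corpus, sample_size=3, seed=42)
pruned_docs, vocabulary = prune_vocabulary(
    corpus, min_doc_share=0.5, max_doc_share=0.75
)
```

In this toy run, pruning removes "the" (present in every document) as well as "campaign" and "match" (each present in only one document). A topic model would then be fitted on the sampled or pruned documents instead of the full corpus.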




Article Type: Research Article

Keyword(s): latent Dirichlet allocation; model selection; preprocessing; text analysis; topic model
