Library guides: Text and data mining: Home

What is text and data mining (TDM)?

Data mining is the process of using computational methods to retrieve and analyse large sets of data for the purposes of identifying patterns, trends and relationships. Data collected by others (secondary data) can be used to answer new research questions, validate findings and enable comparative or longitudinal trend analysis.

Text mining, also referred to as text analysis, is a kind of data mining that uses computational methods to search for, extract and analyse text data. The purpose of text mining is to identify patterns, trends, or connections in textual material.
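
As a small illustration of what identifying patterns in text can look like in practice, the following Python sketch counts word frequencies in a short passage. It is a minimal example only: the sample text, the stop-word list, and the function name word_frequencies are assumptions made for illustration, and real projects typically use fuller tokenisation and stop-word handling.

```python
import re
from collections import Counter

# Hypothetical sample text; in practice this would be loaded from your corpus.
SAMPLE_TEXT = (
    "Text mining finds patterns in text. "
    "Patterns in text can reveal trends, and trends can reveal connections."
)

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"in", "can", "and", "the", "a", "of"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lower-cased word tokens, skipping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


if __name__ == "__main__":
    # Print the three most frequent remaining words with their counts.
    for word, count in word_frequencies(SAMPLE_TEXT).most_common(3):
        print(word, count)
```

Word counts like these are usually only a starting point; the research examples listed in this guide go further, for instance with network analysis.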

Use this guide to find secondary data sources for your research, including text-based platforms, statistics portals, and research data repositories. See the Data visualisation guide for information about visually presenting your data.

Examples of TDM

Explore these examples of text mining in research:

The Catholic Church and the media: A text mining analysis of Vatican documents from 1967 to 2020 – Study utilised NVivo software to text mine official Vatican documents.
How loneliness is talked about in social media during COVID-19 pandemic: Text mining of 4,492 Twitter feeds – Researchers used R software to analyse Twitter feeds related to loneliness and COVID-19.
Automated text analysis in theology: an application - The use of automated text analysis methods to analyse a corpus of 37 homilies written by Josemaría Escrivá, a Spanish priest who founded the Catholic Church institution "Opus Dei".
Previously unrecognised adverse reactions associated with COVID-19 vaccine revealed by text mining of survey records - Word frequency and network analysis using R packages enhanced understanding of vaccine side effects in Korea.

Examples of the use of clinical trial and other health data for secondary research purposes are detailed in A Practical Guide produced by the Australian Research Data Commons, including:

Sharing secondary data to give babies born too early a better chance of survival - Combining secondary data in an individual participant data meta-analysis provided high-certainty evidence of the benefits of delayed umbilical cord clamping for premature babies.
Using existing trial datasets to determine the clinical accuracy of tumour marker blood test - Pooling of unpublished high-quality trial data informed national and international clinical guidelines to include periodic imaging as part of surveillance of ovarian cancer progression.
Sharing secondary data to develop, test and demonstrate new statistical methods - Existing datasets were utilised to develop a new statistical algorithm and illustrate applications to other researchers across a range of real-world scenarios.

Issues to consider

Copyright

There is no specific exception in the Copyright Act for text or data mining. Copying or reformatting material without permission may therefore infringe copyright. See Australian Government: Text and data mining for more information on current law.

Licensing

Licensing conditions specific to data and text mining vary from publisher to publisher. Always check for Creative Commons licences, repository terms and data use agreements. If data is not clearly licensed and made publicly available, you must seek permission from the data owner or custodian.

Ensure you understand any issues around copyright and licensing conditions for the content you wish to use before proceeding.

Ethics

Ensure you have considered any ethical issues that might arise from your use of the content, particularly when working with sensitive information. Original participant consent may not cover secondary use. Reuse must consider context, ownership and benefit sharing, and the CARE Principles for Indigenous Data Governance.

More information is available on our ethics and consent page.

Research data management

Text and data mining may involve working with and storing large data sets. To help ensure your data is stored securely, refer to the research data management toolkit.

Find content

Text and data mining may involve the extraction and analysis of text or data from a range of sources including licensed library databases, open-source data, and researcher-generated data.

Library/subscription

The following are some of our library resources that allow text and data mining by authorised users. Permission may be required.

Source	Description
Austlit (Australian Literature Gateway)	Database of Australian literature and criticism. Authorised users may engage in text mining/data mining activities for academic research and other educational purposes. Contact your librarian team for further information on access.
Brill	Brill’s publications focus on the humanities and social sciences, international law and selected areas in the sciences. Contact your librarian team for further information on access.
Cambridge Core	Multidisciplinary collection. Authorised users may download, extract, store and index the products for the purposes of TDM for non-commercial research purposes. Refer to Cambridge text and data mining for more information.
JSTOR	Academic journals, books and primary sources in the arts and humanities. JSTOR Constellate offers text mining and visualisation of JSTOR content, an analytics lab, and text analysis tutorials. Available via JSTOR > Tools > Constellate. Requests for more complex data mining research projects may be subject to more terms of use. See JSTOR text mining support for more information.
Wiley Online Library	Multidisciplinary collection. ACU researchers can perform text and data mining under licence (or in accordance with statutory rights under applicable legislation) on subscribed content for non-commercial purposes at no extra cost. See Wiley text and data mining for more information.

Open access sources

The Australian Research Data Commons is the national infrastructure hub facilitating secondary data access and includes Research Data Australia with an online portal for finding datasets from over 100 institutions, and Health Data Australia, which supports secondary data use of clinical trial and health datasets. Some of these datasets are readily open access and free to use. Sensitive data may require data access agreements and secure, controlled computing environments.

Australian Government portals that provide access to broad and structured open datasets include:

Data.gov.au - the central source of Australian government public data published by federal, state and local government agencies.
Australian Bureau of Statistics - national employment, health, education, and census data accessible to the public. By registering with an institutional email and affiliation, ACU researchers can also gain access to finer-grained ABS microdata.

A selection of other open access sources for text and data mining is shown in the table below. In some cases, an Application Programming Interface (API) may be available to capture the text or data in machine-readable form.

Disclaimer: It is the responsibility of individual researchers to organise API access and seek appropriate permissions where applicable.

Source	Description	Further information
arXiv	Open access e-prints in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.	Visit arXiv Bulk Data Access for more information.
Australian Data Archive (ADA)	The ADA collects and preserves digital research data. Access to data is mediated and requires a login and agreement to terms of use.	See ADA accessing data for more information.
Australian Text Analytics Platform (ATAP)	ATAP lists tools and training resources, and provides access to 2 datasets. Datasets include the Corpus of Oz Early English (CoOEE), which consists of approximately 2 million tokens of Australian material produced between 1788 and 1900.	See ATAP collections for access to the datasets.
BioMed Central (BMC)	Open access peer-reviewed research from over 300 journals.
CORE	Based in the U.K., CORE is a global aggregator of open-access repositories and journals.	Information on the CORE API is available.
Crossref	Crossref is an official digital object identifier (DOI) registration agency of the International DOI Foundation. The Crossref API allows researchers to harvest full text documents.	See Crossref text and data mining for researchers for more information.
Data.gov.au	Central source of Australian open government data.	See Data.gov.au Terms of Use for more information.
Digital Public Library of America (DPLA)	Public access to digital holdings of America's libraries, archives, and museums.	See DPLA API Codex for documentation and information about using the API.
GLAM Workbench	This Australian site lists datasets, data repositories and tools to explore datasets.
Google Books	One of the world’s largest collections of digital books. In Google Advanced Search, select Books and "Full view" under the "Any view" drop-down menu. Individual titles can be downloaded.
Google Books Ngram Viewer	Charts the frequency of words or phrases over time using Google Books content.	Access the Ngram Viewer datasets.
HathiTrust Digital Library	Text data of public domain works are available for bulk download by researchers.	See HathiTrust data availability and APIs to learn more about the API and approval processes.
Internet Archive	Contains over 20 million downloadable ebooks and texts.
PLOS	PLOS is a nonprofit open access publisher in the fields of science and medicine.	Information is provided on PLOS text and data mining.
Project Gutenberg	Downloadable ebooks, with a focus on older works for which U.S. copyright has expired.	Refer to Project Gutenberg help pages for information on permissions and licensing.
Trove, National Library of Australia	Trove collates content from Australian libraries, museums, archives and other research organisations.	See Trove: using the API and the Trove Data Guide to find out more about capturing large datasets.
X (formerly Twitter)	X's tweets, trends, lists, and other elements can be mined for research.	See X academic research for detailed information.
Wikidata	A central storage repository for data supporting the Wikimedia movement.	See Wikidata data access for more information.
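
To show what capturing data in machine-readable form through an API can involve, here is a hedged Python sketch based on the Crossref REST API listed in the table above. It builds a search URL and pulls DOIs and titles out of a JSON response. The helper names (build_query_url, extract_titles) and the sample response are hypothetical, and the response layout shown (a "message" object containing "items") is an assumption to verify against the Crossref documentation. The sketch makes no network request; fetching the URL, rate limits, and permissions remain the researcher's responsibility.

```python
import json
from urllib.parse import urlencode

# Public endpoint of the Crossref REST API for searching works.
CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_query_url(search_terms, rows=5, mailto=None):
    """Build a search URL; 'mailto' optionally identifies the requester."""
    params = {"query": search_terms, "rows": rows}
    if mailto:
        params["mailto"] = mailto
    return CROSSREF_WORKS_URL + "?" + urlencode(params)


def extract_titles(response_text):
    """Return (DOI, first title) pairs from a JSON response body."""
    payload = json.loads(response_text)
    items = payload.get("message", {}).get("items", [])
    return [(item.get("DOI"), (item.get("title") or [""])[0]) for item in items]


# Hypothetical, hand-written response so that the sketch runs offline.
SAMPLE_RESPONSE = json.dumps(
    {
        "status": "ok",
        "message": {
            "items": [
                {"DOI": "10.0000/example.1", "title": ["An example title"]}
            ]
        },
    }
)

if __name__ == "__main__":
    print(build_query_url("text mining", rows=2))
    print(extract_titles(SAMPLE_RESPONSE))
```

Keeping URL construction and response parsing in separate functions makes each step easy to check before any large-scale harvesting begins.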

Evaluating secondary data

Once relevant data sets have been found, the next step is to evaluate what is there - not just for technical fit, but for ethical and analytical integrity. Consider the following dimensions:

Provenance and purpose. Understanding the original intent helps you spot potential biases and assess whether the data framing aligns with your own research values.

Fitness for purpose. Does the data suit your research question, method and audience?

Ethical alignment. Are there any privacy concerns, consent limitations or cultural sensitivities? If the data involves Indigenous communities, for example, CARE Principles and the AIATSIS Code of Ethics may shape how it should be accessed, interpreted and shared.

Quality and integrity. Are there gaps, anomalies or inconsistencies? These affect not just your analysis, but the credibility of your findings.

Contextual richness. Are there metadata, documentation, and provenance notes? Without this sort of context, data can be misleading or even harmful.

Licensing and access. Consider terms of use. Can you share, adapt or publish findings?

Infrastructure fit. Is the data in a form you can work with? Will it integrate smoothly into your workflow, tools and analysis pipeline?
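
Some of these checks can be partly automated. For the quality and integrity dimension, the Python sketch below counts missing values per column and exact duplicate rows in a small CSV extract. The sample data, column names, and the function name quality_report are invented for illustration; a real assessment would also draw on the dataset's documentation and domain knowledge.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV extract with two kinds of problem:
# empty cells and one exactly repeated row.
SAMPLE_CSV = """id,year,region,value
1,2019,NSW,10.5
2,2020,,11.0
3,,VIC,12.1
3,,VIC,12.1
"""


def quality_report(csv_text):
    """Summarise row count, missing values per column, and duplicate rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Count empty or absent cells for each column.
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1

    # A row is a duplicate if an identical row has already been seen.
    seen = Counter(tuple(row.items()) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicate_rows": duplicates,
    }


if __name__ == "__main__":
    print(quality_report(SAMPLE_CSV))
```

A report like this only flags candidates for closer inspection; whether a gap or repeated row is an error depends on how the data was originally collected.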

TDM tools

The following tools can assist with analysing text and data. Many tools also support data visualisation.

ACU supplied software

Leximancer is text mining software used to analyse data and to visually display the extracted information in a browser. Information on installation of Leximancer is available on Service Central (Log in required).
NVivo is qualitative and mixed methods data analysis software available to ACU researchers. It also supports text analysis and data visualisation. Contact eresearch@acu.edu.au for more information. Staff can install NVivo on an ACU or personal computer via Service Central (Log in required). Students can access installation information through the Student Portal.
SPSS is a comprehensive statistical analysis software platform. Installation instructions can be found on Service Central. Off-site access can be obtained by requesting an SPSS Work from Home Licence.

Online or downloadable software

Disclaimer: Many of these tools are free or open source; however, some may require you to create an account or pay to access premium features or content. Permission may be needed before installing on an ACU device - check with IT or Service Central.

Gephi – an open-source network data analysis and visualisation tool.
Jupyter Notebook – a web application for creating and sharing computational documents.
Orange – a free open-source graphical user interface for data analysis and visualisation using Python.
Python – open-source programming language that enables data and network analysis.
R and RStudio – R is a programming language and environment for statistical computing and graphics. RStudio is a development environment for R. Note that the company that develops RStudio has changed its name from RStudio to Posit.
VOSviewer – tool for constructing and visualising bibliometric networks.
Voyant Tools – software tools for analysing digital texts and creating visualisations.

Tool directories

Find out more about data tools using these directories:

ARDC discipline-specific resources:
Australian Text Analytics Platform (ATAP) – directory of tools and training materials for analysing and exploring text.
Digital Humanities Toychest – guides, tools, and other resources for practical work in the digital humanities.
GLAM Workbench – this Australian site lists datasets, data repositories, and tools to explore datasets.
Text Analysis Portal for Research (TAPoR 3) – a curated tool list for studying texts.

Library resources on TDM

Doosti, H. (Ed.). (2024). Ethics in statistics: Opportunities and challenges (1st ed.). Ethics International Press Ltd.
Elnahla, N., & Neilson, L. C. (2022). A step-by-step guide on preparing online interview data for analysis. SAGE Publications Ltd.
Kakulapati, V. (Ed.). (2023). New trends and challenges in open data. IntechOpen.
Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2020). Introduction to data mining (2nd ed.). Pearson Education.
Törnberg, P. (2024). How to use large-language models for text analysis. SAGE Publications Ltd.
Wallace, M., Poulopoulos, V., & Antoniou, A. (Eds.). (2023). Big data analytics for cultural heritage. MDPI.
LinkedIn Learning lists several courses on text and data mining, including courses using tools such as Python and R.
Sage Research Methods Video includes a number of videos related to text and data mining.
