Text and data mining

What is text and data mining (TDM)?

Data mining is the process of using computational methods to retrieve and analyse large sets of data for the purposes of identifying patterns, trends and relationships.

Text mining, also referred to as text analysis, is a kind of data mining that uses computational methods to search for, extract and analyse text data. The purpose of text mining is to identify patterns, trends, or connections in textual material. Because the two overlap, text mining and data mining are often referred to together as text and data mining (TDM).
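
As a rough illustration of what "identifying patterns in textual material" can look like in practice, the following minimal Python sketch counts the most frequent terms across a small set of texts. The function name `top_terms`, the short stop-word list and the sample sentences are all made up for this example; a real project would normally use a fuller stop-word list and a dedicated text analysis library.

```python
import re
from collections import Counter

# Illustrative stop-word list only; real projects usually use a fuller one.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}


def top_terms(documents, n=3):
    """Return the n most frequent non-stop-word terms across the documents."""
    counts = Counter()
    for text in documents:
        # Lower-case the text and split it into simple word tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sample texts, used only to demonstrate the function.
sample = [
    "Loneliness in the pandemic",
    "The pandemic and loneliness online",
    "Online research in a pandemic",
]

print(top_terms(sample))
```

Term frequency is only one of the simplest text mining techniques, but the same pattern of tokenising, filtering and counting underlies many of the tools listed in this guide.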

Use this guide to find resources and tools for TDM. See the Data visualisation guide for information about visually presenting your data.

Examples of TDM

Explore these examples of TDM in research:

The Catholic Church and the media: A text mining analysis of Vatican documents from 1967 to 2020 – Study utilised NVivo software to text mine official Vatican documents.
How loneliness is talked about in social media during COVID-19 pandemic: Text mining of 4,492 Twitter feeds – Researchers used R software to analyse Twitter feeds related to loneliness and COVID-19.
Automated text analysis in theology: an application – The use of automated text analysis methods to analyse a corpus of 37 homilies written by Josemaría Escrivá, a Spanish priest who founded the Catholic Church institution "Opus Dei".
Text mining on green policies for integrating sustainability in higher education – A Natural Language Processing approach was used to measure the alignment between sustainability policy documents and to integrate and operationalise content into HE curriculum.

Issues to consider

Copyright

There is no specific exception in the Copyright Act for text or data mining. Copying or reformatting material without permission may therefore infringe copyright. See Australian Government: Text and data mining for more information on current law.

Licensing

Licensing conditions specific to text and data mining vary from publisher to publisher. Downloading or scraping large amounts of data from a database may be prohibited. Ensure you understand any copyright issues and licensing conditions for the content you wish to use before proceeding.

Ethics

Ensure you have considered any ethical issues that might arise from your use of the content, particularly when working with sensitive information. The Association of Internet Researchers has established Ethical Guidelines for Internet Research to assist researchers to make ethical decisions. You can also refer to our ethics and consent page for more information.

Research data management

Text and data mining may involve working with and storing large data sets. To help ensure your data is stored securely, refer to the research data management toolkit.

Find content

Text and data mining may involve the extraction and analysis of text or data from a range of sources including licensed library databases, open-source data, and researcher-generated data.

Library/subscription

The following are some of our library resources that allow text and data mining by authorised users. Permission may be required.

Source	Description
Austlit (Australian Literature Gateway)	Database of Australian literature and criticism. Authorised users may engage in text mining/data mining activities for academic research and other educational purposes. Contact your librarian team for further information on access.
Brill	Brill’s publications focus on the humanities and social sciences, international law and selected areas in the sciences. Contact your librarian team for further information on access.
Cambridge Core	Multidisciplinary collection. Authorised users may download, extract, store and index the products for the purposes of TDM for non-commercial research purposes. Refer to Cambridge text and data mining for more information.
JSTOR	Academic journals, books and primary sources in the arts and humanities. JSTOR Constellate offers text mining and visualisation of JSTOR content, an analytics lab, and text analysis tutorials. Available via JSTOR > Tools > Constellate. Requests for more complex data mining research projects may be subject to additional terms of use. See JSTOR text mining support for more information.
Wiley Online Library	Multidisciplinary collection. ACU researchers can perform text and data mining under license (or in accordance with statutory rights under applicable legislation) on subscribed content for non-commercial purposes at no extra cost. See Wiley text and data mining for more information.

Open access sources

Open access sources range from publicly available Twitter feeds to historic manuscripts in the public domain. An Application Programming Interface (API) may be available to capture the text or data in machine-readable form.

Disclaimer: It is the responsibility of individual researchers to organise API access and seek appropriate permissions where applicable.
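
To give a sense of what working with an API involves, here is a minimal Python sketch using only the standard library. It assumes the Crossref REST API's `works` endpoint with its `query` and `rows` parameters, and a JSON response that nests results under `message` and `items`; check the provider's current documentation and terms of use before relying on any of this. The helper names `build_query_url`, `fetch_json` and `extract_titles` are hypothetical and exist only for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the Crossref REST API "works" endpoint.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(query, rows=5):
    """Build a search URL for the given free-text query."""
    return CROSSREF_WORKS + "?" + urlencode({"query": query, "rows": rows})


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def extract_titles(payload):
    """Return (DOI, first title) pairs from a parsed response.

    Assumes the records sit under payload["message"]["items"].
    """
    results = []
    for item in payload["message"]["items"]:
        titles = item.get("title") or [""]  # some records have no title
        results.append((item.get("DOI"), titles[0]))
    return results


print(build_query_url("text mining"))
```

Nothing is downloaded until `fetch_json` is called, for example `extract_titles(fetch_json(build_query_url("text mining")))`. Keeping URL building, fetching and parsing in separate functions makes it easier to respect rate limits and to test the parsing step without network access.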

Source	Description	Further information
arXiv	Open access e-prints in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.	Visit arXiv Bulk Data Access for more information.
Australian Data Archive (ADA)	The ADA collects and preserves digital research data. Access to data is mediated and requires a login and agreement to terms of use.	See ADA accessing data for more information.
Australian Text Analytics Platform (ATAP)	ATAP lists tools and training resources, and provides access to two datasets. These include the Corpus of Oz Early English (CoOEE), which consists of approximately 2 million tokens of Australian material produced between 1788 and 1900.	See ATAP collections for access to the datasets.
BioMed Central (BMC)	Open access peer-reviewed research from over 300 journals.
CORE	Based in the U.K., CORE is a global aggregator of open-access repositories and journals.	Information on the CORE API is available.
Crossref	Crossref is an official digital object identifier (DOI) registration agency of the International DOI Foundation. The Crossref API allows researchers to harvest full text documents.	See Crossref text and data mining for researchers for more information.
Data.gov.au	Central source of Australian open government data.	See Data.gov.au Terms of Use for more information.
Digital Public Library of America (DPLA)	Public access to digital holdings of America's libraries, archives, and museums.	See DPLA API Codex for documentation and information about using the API.
GLAM Workbench	This Australian site lists datasets, data repositories and tools to explore datasets.
Google Books	One of the world’s largest collections of digital books. In Google Advanced Search, select Books and "Full view" under the "Any view" drop-down menu. Individual titles can be downloaded.
Google Books Ngram Viewer	Charts the frequency of words or phrases over time using Google Books content.	Access the Ngram Viewer datasets.
HathiTrust Digital Library	Text data of public domain works are available for bulk download by researchers.	See HathiTrust data availability and APIs to learn more about the API and approval processes.
Internet Archive	Contains over 20 million downloadable ebooks and texts.
PLOS	PLOS is a nonprofit open access publisher in the fields of science and medicine.	Information is provided on PLOS text and data mining.
Project Gutenberg	Downloadable ebooks, with a focus on older works for which U.S. copyright has expired.	Refer to Project Gutenberg help pages for information on permissions and licensing.
Trove, National Library of Australia	Trove collates content from Australian libraries, museums, archives and other research organisations.	See Trove: using the API to find out more about capturing large datasets.
X (formerly Twitter)	X's tweets, trends, lists, and other elements can be mined for research.	See X academic research for detailed information.
Wikidata	A central storage repository for data supporting the Wikimedia movement.	See Wikidata data access for more information.

TDM tools

The following tools can assist with analysing text and data. Many tools also support data visualisation.

ACU supplied software

Leximancer is text mining software used to analyse data and to visually display the extracted information in a browser. Information on installation of Leximancer is available on Service Central (Log in required).
NVivo is qualitative and mixed methods data analysis software available to ACU researchers. It also supports text analysis and data visualisation. Contact eresearch@acu.edu.au for more information. Staff can install NVivo on an ACU or personal computer via Service Central (Log in required). Students can access installation information through the Student Portal.
SPSS is a comprehensive statistical analysis software platform. Installation instructions can be found on Service Central. Off-site access can be obtained by requesting an SPSS Work from Home Licence.

Online or downloadable software

Disclaimer: Many of these tools are free or open source; however, some may require you to create an account or pay to access premium features or content. Permission may be needed before installing software on an ACU device. Check with IT or Service Central.

Gephi – an open-source network data analysis and visualisation tool.
Jupyter Notebook – a web application for creating and sharing computational documents.
Orange – a free open-source graphical user interface for data analysis and visualisation using Python.
Python – open-source programming language that enables data and network analysis.
R and RStudio (Posit) – R is a programming language and environment for statistical computing and graphics. RStudio is a development environment for R. Note that the company behind RStudio has officially changed its name from RStudio to Posit; the development environment itself is still called RStudio.
VOSviewer – tool for constructing and visualising bibliometric networks.
Voyant Tools – software tools for analysing digital texts and creating visualisations.

Tool directories

Find out more about TDM tools using these directories:

Australian Text Analytics Platform (ATAP) – directory of tools and training materials for analysing and exploring text.
Digital Humanities Toychest – guides, tools, and other resources for practical work in the digital humanities.
GLAM Workbench – this Australian site lists datasets, data repositories, and tools to explore datasets.
Text Analysis Portal for Research (TAPoR 3) – a curated tool list for studying texts.

Library resources on TDM

Doosti, H. (Ed.). (2024). Ethics in statistics: Opportunities and challenges (1st ed.). Ethics International Press Ltd.
Elnahla, N., & Neilson, L. C. (2022). A step-by-step guide on preparing online interview data for analysis. SAGE Publications Ltd.
Kakulapati, V. (Ed.). (2023). New trends and challenges in open data. IntechOpen.
Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2020). Introduction to data mining (2nd ed.). Pearson Education.
Törnberg, P. (2024). How to use large-language models for text analysis. SAGE Publications Ltd.
Wallace, M., Poulopoulos, V., & Antoniou, A. (Eds.). (2023). Big data analytics for cultural heritage. MDPI.
LinkedIn Learning lists several courses on text and data mining, including courses using tools such as Python and R.
Sage Research Methods Video includes a number of videos related to text and data mining.
