Text and data mining

What is text and data mining (TDM)?

Data mining is the process of using computational methods to retrieve and analyse large sets of data for the purposes of identifying patterns, trends and relationships.

Text mining, also referred to as text analysis, is a kind of data mining that uses computational methods to search for, extract and analyse text data. The purpose of text mining is to identify patterns, trends, or connections in textual material. Because the two overlap, text mining and data mining are often referred to together as text and data mining (TDM).
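
As a rough illustration of what "identifying patterns in textual material" can mean in practice, the sketch below counts the most frequent words across a few documents. It is a minimal example only: the stop-word list, the function names and the sample sentences are assumptions made for this illustration, and real projects usually rely on fuller word lists and dedicated text analysis libraries.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real projects use a fuller list.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}


def tokenise(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(documents, n=3):
    """Count non-stop-word tokens across all documents and return the n most common."""
    counts = Counter()
    for doc in documents:
        counts.update(token for token in tokenise(doc) if token not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sample sentences, for illustration only.
sample_docs = [
    "Text mining is a kind of data mining.",
    "Data mining identifies patterns in large sets of data.",
]
print(top_terms(sample_docs))
```

Word counts like these are often a first step before more involved analysis such as topic modelling or sentiment analysis.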

Use this guide to find resources and tools for TDM. See the Data visualisation guide for information about visually presenting your data.

Watch How does Text Mining Work? (YouTube video) to see how text mining can be used in research.

Examples of TDM

Explore these examples of TDM in research:

Issues to consider

Copyright

There is no specific exception in the Copyright Act for text or data mining. Copying or reformatting material without permission may therefore infringe copyright. See Australian Government: Text and data mining for more information on current law.

Licensing

Licensing conditions specific to data and text mining vary from publisher to publisher. Downloading or scraping large amounts of data from a database may be prohibited. Ensure you understand any issues around copyright and licensing conditions for the content you wish to use before proceeding.

Ethics

Ensure you have considered any ethical issues that might arise from your use of the content, particularly when working with sensitive information. The Association of Internet Researchers has established Ethical Guidelines for Internet Research to help researchers make ethical decisions. You can also refer to our Ethics and consent page for more information.

Research data management

Text and data mining may involve working with and storing large data sets. To help ensure your data is stored securely, refer to the Research data management toolkit.

Find content

Text and data mining may involve the extraction and analysis of text or data from a range of sources, including licensed library databases, open-source data, and researcher-generated data.

Library/subscription

The following are some of our library resources that allow text and data mining by authorised users. Permission may be required.

Source | Description
AustLit (Australian Literature Gateway) | Database of Australian literature and criticism. Authorised users may engage in text mining/data mining activities for academic research and other educational purposes. Contact your librarian team for further information on access.
Brill | Brill’s publications focus on the humanities and social sciences, international law and selected areas in the sciences. Contact your librarian team for further information on access.
Cambridge Core | Multidisciplinary collection. Authorised users may download, extract, store and index the products for the purposes of TDM for non-commercial research purposes. Refer to Cambridge text and data mining for more information.
JSTOR | Academic journals, books and primary sources in the arts and humanities. JSTOR Constellate offers text mining and visualisation of JSTOR content, an analytics lab, and text analysis tutorials. Available via JSTOR > Tools > Constellate. Requests for more complex data mining research projects may be subject to additional terms of use. See JSTOR text mining support for more information.
Wiley Online Library | Multidisciplinary collection. ACU researchers can perform text and data mining under licence (or in accordance with statutory rights under applicable legislation) on subscribed content for non-commercial purposes at no extra cost. See Wiley text and data mining for more information.

Open access sources

Open access sources range from publicly available Twitter feeds to historic manuscripts in the public domain. An Application Programming Interface (API) may be available to capture the text or data in machine-readable form.

Disclaimer: It is the responsibility of individual researchers to organise API access and seek appropriate permissions where applicable.
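
To give a sense of what capturing data through an API involves, the sketch below builds a search request and reads records out of a JSON response. It assumes the public Crossref REST API `works` endpoint and its `query` and `rows` parameters; the function names and the sample payload are made up for illustration, and each provider's own documentation and terms of use should be checked before sending real requests.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Crossref REST API works endpoint. Check the provider's
# documentation and terms of use before running real queries.
BASE_URL = "https://api.crossref.org/works"


def build_query_url(search_terms, rows=5):
    """Build a search URL with URL-encoded query parameters."""
    return f"{BASE_URL}?{urlencode({'query': search_terms, 'rows': rows})}"


def extract_records(payload):
    """Pull (DOI, title) pairs out of a decoded JSON response."""
    items = payload.get("message", {}).get("items", [])
    return [(item.get("DOI"), (item.get("title") or [""])[0]) for item in items]


def fetch_records(search_terms, rows=5):
    """Send the request over the network (defined here but not called in this sketch)."""
    with urlopen(build_query_url(search_terms, rows), timeout=30) as response:
        return extract_records(json.load(response))


# Made-up payload that mimics the assumed response shape, so the parsing
# step can be tried without network access. The DOI is a placeholder.
sample_payload = {
    "message": {"items": [{"DOI": "10.0000/example", "title": ["An example title"]}]}
}
print(build_query_url("text mining"))
print(extract_records(sample_payload))
```

Separating URL building and response parsing from the network call makes it easier to test the logic offline and to stay within any rate limits a provider sets.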

Source | Description | Further information
arXiv | Open access e-prints in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. | Visit arXiv Bulk Data Access for more information.
Australian Data Archive (ADA) | The ADA collects and preserves digital research data. Access to data is mediated and requires a login and agreement to terms of use. | See ADA accessing data for more information.
Australian Text Analytics Platform (ATAP) | ATAP lists tools and training resources, and provides access to two datasets. Datasets include the Corpus of Oz Early English (CoOEE), which consists of approximately 2 million tokens of Australian material produced between 1788 and 1900. | See ATAP collections for access to the datasets.
BioMed Central (BMC) | Open access peer-reviewed research from over 300 journals. |
CORE | Based in the U.K., CORE is a global aggregator of open-access repositories and journals. | Information on the CORE API is available.
Crossref | Crossref is an official digital object identifier (DOI) registration agency of the International DOI Foundation. The Crossref API allows researchers to harvest full-text documents. | See Crossref text and data mining for researchers for more information.
Data.gov.au | Central source of Australian open government data. | See Data.gov.au Terms of Use for more information.
Digital Public Library of America (DPLA) | Public access to digital holdings of America's libraries, archives, and museums. | See DPLA API Codex for documentation and information about using the API.
GLAM Workbench | This Australian site lists datasets, data repositories and tools to explore datasets. |
Google Books | One of the world’s largest collections of digital books. In Google Advanced Search, select Books and "Full view" under the "Any view" drop-down menu. Individual titles can be downloaded. |
Google Books Ngram Viewer | Charts the frequency of words or phrases over time using Google Books content. | Access the Ngram Viewer datasets.
HathiTrust Digital Library | Text data of public domain works are available for bulk download by researchers. | See HathiTrust data availability and APIs to learn more about the API and approval processes.
Internet Archive | Contains over 20 million downloadable ebooks and texts. |
PLOS | PLOS is a nonprofit open access publisher in the fields of science and medicine. | Information is provided on PLOS text and data mining.
Project Gutenberg | Downloadable ebooks, with a focus on older works for which U.S. copyright has expired. | Refer to Project Gutenberg help pages for information on permissions and licensing.
Trove, National Library of Australia | Trove collates content from Australian libraries, museums, archives and other research organisations. | See Trove: using the API to find out more about capturing large datasets.
Twitter | Twitter's tweets, trends, lists, and other elements can be mined for research. | See Twitter academic research for detailed information.
Wikidata | A central storage repository for data supporting the Wikimedia movement. | See Wikidata data access for more information.
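
Several of the sources above support tracking how often a word or phrase appears over time, as the Google Books Ngram Viewer does. The sketch below illustrates the general idea on a tiny scale by computing the share of tokens per year that match a term. It is not how any of the listed services is implemented; the function name and the mini corpus are made up for illustration.

```python
import re
from collections import defaultdict


def relative_frequency_by_year(dated_texts, term):
    """Return {year: occurrences of term / total tokens} for each year."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in dated_texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term.lower())
    # Skip years with no tokens to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Made-up mini corpus of (year, text) pairs, for illustration only.
corpus = [
    (1850, "the colony grew and the port grew"),
    (1900, "the federation of the colonies"),
]
print(relative_frequency_by_year(corpus, "grew"))
```

Dividing by the total number of tokens per year, rather than reporting raw counts, keeps years with more surviving text from dominating the trend.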

TDM tools

The following tools can assist with analysing text and data. Many tools also support data visualisation.

ACU supplied software

Online or downloadable software

Disclaimer: Many of these tools are free or open source; however, some may require you to create an account or pay to access premium features or content. Permission may be needed before installing on an ACU device. Check with IT or Service Central.

  • Gephi – an open-source network data analysis and visualisation tool.
  • Jupyter Notebook – a web application for creating and sharing computational documents.
  • Orange – a free open-source graphical user interface for data analysis and visualisation using Python.
  • Python – open-source programming language that enables data and network analysis.
  • R and RStudio – R is a programming language and environment for statistical computing and graphics. RStudio is a development environment for R. Note that the company behind RStudio has changed its name to Posit; the RStudio development environment keeps its name.
  • VOSviewer – tool for constructing and visualising bibliometric networks.
  • Voyant Tools – software tools for analysing digital texts and creating visualisations.
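
As a small example of how these tools can work together, the Python sketch below counts which words appear in the same document and writes the result as a CSV edge list of the kind a network tool such as Gephi can load. The `Source`, `Target` and `Weight` column names are an assumption about what the importing tool expects, so check its documentation; the function names and sample documents are made up for illustration.

```python
import csv
import io
import re
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents):
    """Count how often each pair of distinct words appears in the same document."""
    edges = Counter()
    for doc in documents:
        # Sort the unique words so each pair is always stored in the same order.
        words = sorted(set(re.findall(r"[a-z]+", doc.lower())))
        edges.update(combinations(words, 2))
    return edges


def edges_to_csv(edges):
    """Serialise edges as CSV text with Source, Target and Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


# Made-up sample documents, for illustration only.
docs = ["data mining tools", "text mining tools"]
print(edges_to_csv(cooccurrence_edges(docs)))
```

The CSV text can be saved to a file and imported into a network visualisation tool, where the weight column can drive edge thickness.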

Tool directories

Find out more about TDM tools using these directories:

Library resources on TDM