Data mining is the process of using computational methods to retrieve and analyse large sets of data for the purposes of identifying patterns, trends and relationships.
Text mining, also referred to as text analysis, is a kind of data mining that uses computational methods to search for, extract and analyse text data. The purpose of text mining is to identify patterns, trends, or connections in textual material. The two are often referred to together as text and data mining (TDM).
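
As a minimal sketch of the idea, the Python snippet below counts word frequencies across a tiny corpus, one of the simplest ways to surface patterns in text. The sample sentences, the stop-word list and the function names are illustrative assumptions, not part of any particular TDM tool.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; a real project would load documents from files or an API.
DOCUMENTS = [
    "Text mining finds patterns in text.",
    "Data mining finds patterns and trends in data.",
    "Patterns in text can reveal trends.",
]

# A deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"in", "and", "can", "the", "a", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents, stop_words=STOP_WORDS):
    """Count how often each non-stop-word appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in stop_words)
    return counts


if __name__ == "__main__":
    # Print the three most frequent terms in the sample corpus.
    for word, count in word_frequencies(DOCUMENTS).most_common(3):
        print(word, count)
```

Real text mining projects usually go further (for example, stemming, phrase detection or topic modelling), but most begin with a tokenise-and-count step along these lines.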
Use this guide to find resources and tools for TDM. See the Data visualisation guide for information about visually presenting your data.
Watch How does Text Mining Work? (YouTube video) to see how text mining can be used in research.
Explore these examples of TDM in research:
There is no specific exception in the Australian Copyright Act 1968 for text or data mining. Copying or reformatting material without permission may therefore infringe copyright. See Australian Government: Text and data mining for more information on current law.
Licensing conditions specific to data and text mining vary from publisher to publisher. Downloading or scraping large amounts of data from a database may be prohibited. Ensure you understand the copyright and licensing conditions for the content you wish to use before proceeding.
Ensure you have considered any ethical issues that might arise from your use of the content, particularly when working with sensitive information. The Association of Internet Researchers has established Ethical Guidelines for Internet Research to help researchers make ethical decisions. You can also refer to our ethics and consent page for more information.
Text and data mining may involve working with and storing large data sets. To help ensure your data is stored securely, refer to the research data management toolkit.
Text and data mining may involve the extraction and analysis of text or data from a range of sources, including licensed library databases, open-source data, and researcher-generated data.
The following are some of our library resources that allow text and data mining by authorised users. Permission may be required.
Source | Description |
---|---|
Austlit (Australian Literature Gateway) | Database of Australian literature and criticism. Authorised users may engage in text mining/data mining activities for academic research and other educational purposes. Contact your librarian team for further information on access. |
Brill | Brill’s publications focus on the humanities and social sciences, international law and selected areas in the sciences. Contact your librarian team for further information on access. |
Cambridge Core | Multidisciplinary collection. Authorised users may download, extract, store and index the products for the purposes of TDM for non-commercial research purposes. Refer to Cambridge text and data mining for more information. |
JSTOR | Academic journals, books and primary sources in the arts and humanities. JSTOR Constellate offers text mining and visualisation of JSTOR content, an analytics lab, and text analysis tutorials. Available via JSTOR > Tools > Constellate. Requests for more complex data mining research projects may be subject to additional terms of use. See JSTOR text mining support for more information. |
Wiley Online Library | Multidisciplinary collection. ACU researchers can perform text and data mining under licence (or in accordance with statutory rights under applicable legislation) on subscribed content for non-commercial purposes at no extra cost. See Wiley text and data mining for more information. |
Open access sources range from publicly available Twitter feeds to historic manuscripts in the public domain. An Application Programming Interface (API) may be available to capture the text or data in machine-readable form.
Disclaimer: It is the responsibility of individual researchers to organise API access and seek appropriate permissions where applicable.
Source | Description | Further information |
---|---|---|
arXiv | Open access e-prints in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. | Visit arXiv Bulk Data Access for more information. |
Australian Data Archive (ADA) | The ADA collects and preserves digital research data. Access to data is mediated and requires a login and agreement to terms of use. | See ADA accessing data for more information. |
Australian Text Analytics Platform (ATAP) | ATAP lists tools and training resources and provides access to two datasets. These include the Corpus of Oz Early English (CoOEE), which consists of approximately 2 million tokens of Australian material produced between 1788 and 1900. | See ATAP collections for access to the datasets. |
BioMed Central (BMC) | Open access peer-reviewed research from over 300 journals. | |
CORE | Based in the U.K., CORE is a global aggregator of open-access repositories and journals. | Information on the CORE API is available. |
Crossref | Crossref is an official digital object identifier (DOI) registration agency of the International DOI Foundation. The Crossref API allows researchers to harvest full text documents. | See Crossref text and data mining for researchers for more information. |
Data.gov.au | Central source of Australian open government data. | See Data.gov.au Terms of Use for more information. |
Digital Public Library of America (DPLA) | Public access to digital holdings of America's libraries, archives, and museums. | See DPLA API Codex for documentation and information about using the API. |
GLAM Workbench | This Australian site lists datasets, data repositories and tools to explore datasets. | |
Google Books | One of the world’s largest collections of digital books. In Google Advanced Search, select Books and "Full view" under the "Any view" drop-down menu. Individual titles can be downloaded. | |
Google Books Ngram Viewer | Charts the frequency of words or phrases over time using Google Books content. | Access the Ngram Viewer datasets. |
HathiTrust Digital Library | Text data of public domain works are available for bulk download by researchers. | See HathiTrust data availability and APIs to learn more about the API and approval processes. |
Internet Archive | Contains over 20 million downloadable ebooks and texts. | |
PLOS | PLOS is a nonprofit open access publisher in the fields of science and medicine. | Information is provided on PLOS text and data mining. |
Project Gutenberg | Downloadable ebooks, with a focus on older works for which U.S. copyright has expired. | Refer to Project Gutenberg help pages for information on permissions and licensing. |
Trove, National Library of Australia | Trove collates content from Australian libraries, museums, archives and other research organisations. | See Trove: using the API to find out more about capturing large datasets. |
X (formerly Twitter) | X's tweets, trends, lists, and other elements can be mined for research. | See X academic research for detailed information. |
Wikidata | A central storage repository for data supporting the Wikimedia movement. | See Wikidata data access for more information. |
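
To illustrate what capturing data in machine-readable form through an API can look like, here is a minimal Python sketch based on arXiv's public query endpoint, which returns results as an Atom XML feed. The function names and the sample feed are hypothetical, made up so the example runs offline. Check the provider's current API documentation and terms of use before sending real requests.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL of arXiv's public query API (verify against the current arXiv documentation).
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Invented sample response so the sketch runs without network access.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>A Made-Up Paper on Text Mining</title></entry>
  <entry><title>Another Invented
      Title</title></entry>
</feed>"""


def build_query_url(search_terms, start=0, max_results=10):
    """Build a query URL that searches all fields for the given terms."""
    params = {
        "search_query": f"all:{search_terms}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_titles(feed_xml):
    """Extract entry titles from an Atom feed, collapsing internal whitespace."""
    root = ET.fromstring(feed_xml)
    titles = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        titles.append(" ".join(title.split()))
    return titles


if __name__ == "__main__":
    # A real request could fetch this URL (for example with urllib.request.urlopen)
    # and pass the response body to parse_titles.
    print(build_query_url("text mining", max_results=5))
    print(parse_titles(SAMPLE_FEED))
```

Other providers listed above expose different endpoints, formats and rate limits, so treat this as a pattern rather than a template that works everywhere.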
The following tools can assist with analysing text and data. Many tools also support data visualisation.
Disclaimer: Many of these tools are free or open source; however, some may require you to create an account or pay to access premium features or content. Permission may be needed before installing software on an ACU device. Check with IT or Service Central.
Find out more about TDM tools using these directories: