Data mining is the process of using computational methods to retrieve and analyse large sets of data for the purposes of identifying patterns, trends and relationships. Data collected by others (secondary data) can be used to answer new research questions, validate findings and enable comparative or longitudinal trend analysis.
Text mining, also referred to as text analysis, is a kind of data mining that uses computational methods to search for, extract and analyse text data. The purpose of text mining is to identify patterns, trends, or connections in textual material.
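As a rough illustration of what "identifying patterns" can mean in practice, the sketch below counts the most frequent terms in a short passage using only the Python standard library. The sample text, the `term_frequencies` function name and the very small stop-word list are all invented for this example; real projects typically use fuller stop-word lists and a dedicated text analysis tool.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "can"}


def term_frequencies(text, top_n=5):
    """Return the top_n most common terms in text, ignoring stop words."""
    # Lower-case the text and keep runs of letters (and apostrophes) as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(top_n)


# Invented sample passage.
sample = (
    "Text mining identifies patterns in text. "
    "Patterns in text can reveal trends, and trends can reveal connections."
)

for term, count in term_frequencies(sample, top_n=3):
    print(f"{term}: {count}")
```

Counting terms is only a first step. It shows which words dominate a text, not how they relate to each other, which is where the more specialised tools listed later in this guide come in.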
Use this guide to find secondary data sources for your research, including text-based platforms, statistics portals, and research data repositories. See the Data visualisation guide for information about visually presenting your data.
Explore these examples of text mining in research:
Examples of the use of clinical trial and other health data for secondary research purposes are detailed in A Practical Guide produced by the Australian Research Data Commons, including:
There is no specific exception in the Copyright Act for text or data mining. Copying or reformatting material without permission may therefore infringe copyright. See Australian Government: Text and data mining for more information on current law.
Licensing conditions specific to data and text mining vary from publisher to publisher. Always check for Creative Commons licences, repository terms and data use agreements. If data is not clearly licensed and made publicly available, you must seek permission from the data owner or custodian.
Ensure you understand any issues around copyright and licensing conditions for the content you wish to use before proceeding.
Ensure you have considered any ethical issues that might arise from your use of the content, particularly when working with sensitive information. Original participant consent may not cover secondary use. Reuse must consider context, ownership and benefit sharing, and the CARE Principles for Indigenous Data Governance.
More information is available on our ethics and consent page.
Text and data mining may involve working with and storing large data sets. To help ensure your data is stored securely, refer to the research data management toolkit.
Text and data mining may involve the extraction and analysis of text or data from a range of sources including licensed library databases, open-source data, and researcher-generated data.
The following are some of our library resources that allow text and data mining by authorised users. Permission may be required.
Source | Description |
---|---|
Austlit (Australian Literature Gateway) | Database of Australian literature and criticism. Authorised users may engage in text mining/data mining activities for academic research and other educational purposes. Contact your librarian team for further information on access. |
Brill | Brill’s publications focus on the humanities and social sciences, international law and selected areas in the sciences. Contact your librarian team for further information on access. |
Cambridge Core | Multidisciplinary collection. Authorised users may download, extract, store and index the products for the purposes of TDM for non-commercial research purposes. Refer to Cambridge text and data mining for more information. |
JSTOR | Academic journals, books and primary sources in the arts and humanities. JSTOR Constellate offers text mining and visualisation of JSTOR content, an analytics lab, and text analysis tutorials. Available via JSTOR > Tools > Constellate. Requests for more complex data mining research projects may be subject to additional terms of use. See JSTOR text mining support for more information. |
Wiley Online Library | Multidisciplinary collection. ACU researchers can perform text and data mining under licence (or in accordance with statutory rights under applicable legislation) on subscribed content for non-commercial purposes at no extra cost. See Wiley text and data mining for more information. |
The Australian Research Data Commons is the national infrastructure hub facilitating secondary data access. It includes Research Data Australia, an online portal for finding datasets from over 100 institutions, and Health Data Australia, which supports secondary use of clinical trial and health datasets. Some of these datasets are open access and free to use. Sensitive data may require data access agreements and secure, controlled computing environments.
Australian Government portals that provide access to broad and structured open datasets include:
A selection of other open access sources for text and data mining is shown in the table below. In some cases, an Application Programming Interface (API) may be available to capture the text or data in machine-readable form.
Disclaimer: It is the responsibility of individual researchers to organise API access and seek appropriate permissions where applicable.
Source | Description | Further information |
---|---|---|
arXiv | Open access e-prints in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. | Visit arXiv Bulk Data Access for more information. |
Australian Data Archive (ADA) | The ADA collects and preserves digital research data. Access to data is mediated and requires a login and agreement to terms of use. | See ADA accessing data for more information. |
Australian Text Analytics Platform (ATAP) | ATAP lists tools and training resources, as well as access to 2 datasets. Datasets include the Corpus of Oz Early English (CoOEE) which consists of approximately 2 million tokens of Australian material produced between 1788 and 1900. | See ATAP collections for access to the datasets. |
BioMed Central (BMC) | Open access peer-reviewed research from over 300 journals. | |
CORE | Based in the U.K., CORE is a global aggregator of open-access repositories and journals. | Information on the CORE API is available. |
Crossref | Crossref is an official digital object identifier (DOI) registration agency of the International DOI Foundation. The Crossref API allows researchers to harvest full text documents. | See Crossref text and data mining for researchers for more information. |
Data.gov.au | Central source of Australian open government data. | See Data.gov.au Terms of Use for more information. |
Digital Public Library of America (DPLA) | Public access to digital holdings of America's libraries, archives, and museums. | See DPLA API Codex for documentation and information about using the API. |
GLAM Workbench | This Australian site lists datasets, data repositories and tools to explore datasets. | |
Google Books | One of the world’s largest collections of digital books. In Google Advanced Search, select Books and "Full view" under the "Any view" drop-down menu. Individual titles can be downloaded. | |
Google Books Ngram Viewer | Charts the frequency of words or phrases over time using Google Books content. | Access the Ngram Viewer datasets. |
HathiTrust Digital Library | Text data of public domain works are available for bulk download by researchers. | See HathiTrust data availability and APIs to learn more about the API and approval processes. |
Internet Archive | Contains over 20 million downloadable ebooks and texts. | |
PLOS | PLOS is a nonprofit open access publisher in the fields of science and medicine. | Information is provided on PLOS text and data mining. |
Project Gutenberg | Downloadable ebooks, with a focus on older works for which U.S. copyright has expired. | Refer to Project Gutenberg help pages for information on permissions and licensing. |
Trove, National Library of Australia | Trove collates content from Australian libraries, museums, archives and other research organisations. | See Trove: using the API and the Trove Data Guide to find out more about capturing large datasets. |
X (formerly Twitter) | X's tweets, trends, lists, and other elements can be mined for research. | See X academic research for detailed information. |
Wikidata | A central storage repository for data supporting the Wikimedia movement. | See Wikidata data access for more information. |
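
To give a sense of what capturing data in machine-readable form through an API can look like, the sketch below queries the public Crossref REST API for work metadata and pulls out DOIs and titles. It is a minimal illustration, not a recommended workflow: the `https://api.crossref.org/works` endpoint and the `query`, `rows` and `mailto` parameters reflect the Crossref documentation as understood at the time of writing and should be checked against the current documentation, while the function names and the `User-Agent` string are invented for this example. It retrieves metadata only, not full text, and any use remains subject to the permissions noted in the disclaimer above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; confirm against the current Crossref REST API documentation.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(query, rows=5, mailto=None):
    """Build a Crossref works search URL for a free-text query."""
    params = {"query": query, "rows": rows}
    if mailto:
        # Crossref asks API users to identify themselves with a contact email.
        params["mailto"] = mailto
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def extract_titles(payload):
    """Return (DOI, title) pairs from a decoded Crossref JSON response."""
    items = payload.get("message", {}).get("items", [])
    # Crossref returns each title as a list; fall back if it is missing or empty.
    return [
        (item.get("DOI"), (item.get("title") or ["(untitled)"])[0])
        for item in items
    ]


def fetch_titles(query, rows=5, mailto=None):
    """Call the API over the network and return (DOI, title) pairs."""
    request = Request(
        build_query_url(query, rows, mailto),
        headers={"User-Agent": "tdm-guide-example/0.1"},  # hypothetical name
    )
    with urlopen(request, timeout=30) as response:
        return extract_titles(json.load(response))
```

Calling `fetch_titles("text mining", rows=3, mailto="you@example.edu")` (with a real contact address in place of the placeholder) would return up to three pairs. Keeping URL building and response parsing in separate functions means both can be checked without a network connection.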
Once relevant data sets have been found, the next step is to evaluate what is there, not just for technical fit but also for ethical and analytical integrity. Consider the following dimensions:
Provenance and purpose. Understanding the original intent helps you spot potential biases and assess whether the data framing aligns with your own research values.
Fitness for purpose. Does the data suit your research question, method and audience?
Ethical alignment. Are there any privacy concerns, consent limitations or cultural sensitivities? If the data involves Indigenous communities, for example, CARE Principles and the AIATSIS Code of Ethics may shape how it should be accessed, interpreted and shared.
Quality and integrity. Are there gaps, anomalies or inconsistencies? These affect not just your analysis, but the credibility of your findings.
Contextual richness. Are there metadata, documentation, and provenance notes? Without this sort of context, data can be misleading or even harmful.
Licensing and access. Consider terms of use. Can you share, adapt or publish findings?
Infrastructure fit. Is the data in a form you can work with? Will it integrate smoothly into your workflow, tools and analysis pipeline?
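Some of the quality and integrity questions above, such as gaps and inconsistencies, can be given a quick first pass in code before any analysis begins. The sketch below counts missing values per column and exact duplicate rows in a CSV file using the Python standard library. The sample data, column names and the `profile_rows` function are invented for illustration. A check like this only flags obvious problems and does not replace reading the dataset's documentation.

```python
import csv
import io


def profile_rows(rows):
    """Summarise row count, missing values per column, and duplicate rows."""
    missing = {}
    seen = set()
    duplicates = 0
    total = 0
    for row in rows:
        total += 1
        for column, value in row.items():
            # Treat absent fields and blank strings as missing.
            if value is None or (isinstance(value, str) and value.strip() == ""):
                missing[column] = missing.get(column, 0) + 1
        # A row is a duplicate if an identical row has already been seen.
        key = tuple(str(value) for value in row.values())
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": total, "missing": missing, "duplicates": duplicates}


# Invented sample data standing in for a downloaded dataset.
sample_csv = """id,year,count
1,2019,10
2,,12
2,,12
3,2021,
"""

report = profile_rows(csv.DictReader(io.StringIO(sample_csv)))
print(report)
```

For a real file, replace the `io.StringIO(sample_csv)` wrapper with an opened file, for example `open("data.csv", newline="", encoding="utf-8")`, where the file name is a placeholder.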
The following tools can assist with analysing text and data. Many tools also support data visualisation.
Disclaimer: Many of these tools are free or open source; however, some may require you to create an account or pay to access premium features or content. Permission may be needed before installing on an ACU device. Check with IT or Service Central.
Find out more about data tools using these directories: