Research tools to analyze Chinese language texts

— by Jacky Li

Digital humanities has become increasingly prominent in recent years, with significant resources invested in digital preservation and in large-scale digital humanities projects. It does not necessarily bring a paradigm shift in humanities studies, since traditional research methods have proven value and have produced numerous academic achievements. It does, however, offer humanities researchers a direction to consider for future work: digital methods can gather and process data far faster than traditional methods can. They may also serve as a bridge between the humanities and other disciplines, which offers an exciting opportunity to broaden inquiry and invigorate interest in humanistic studies.

Chinese-language texts, with their vast body of material, are an important area of study within the humanities. However, analyzing them presents particular challenges. For example, written Chinese does not mark word boundaries with spaces, so texts usually need to be segmented into words before terms can be counted or compared. This blog post introduces some tools that make the analysis of Chinese-language texts easier.

1. CORPRO (庫博中文語料庫分析工具) 

http://nlp.cse.ntou.edu.tw/CORPRO/ 

CORPRO is a user-friendly, corpus-based text-mining tool for Chinese. It enables users, even those without programming skills, to carry out text mining independently. Users can load a corpus they have compiled themselves and customize the analysis conditions. They can also define their own dictionaries, stop words and word groupings. The software provides a range of corpus analysis functions, including term frequency, collocations, corpus keywords and concordance, each with related statistics.

Here is an example of using the Key Word in Context (KWIC) function in CORPRO with Eileen Chang’s translation of “The Old Man and the Sea”.

Figure 1: Keyword search with Key Word In Context function (CUHKLibraries, 2020, 15:53) 

Figure 2: Keyword search by part of speech with Key Word In Context function in CORPRO (CUHKLibraries, 2020, 29:48) 

The full video is available at: https://youtu.be/2A9wliiZJTY?si=MjjUiTEE6mJxZfjd
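For readers curious about what a Key Word in Context search does, here is a minimal Python sketch of the idea. It is not how CORPRO is implemented: the `kwic` function, its `width` parameter and the sample sentence are all assumptions made up for illustration (the sentence is not a quotation from Eileen Chang's translation). It matches plain character strings, whereas CORPRO works with segmented words and part-of-speech tags.

```python
def kwic(text, keyword, width=5):
    """Return (left, keyword, right) context tuples for each occurrence.

    Hypothetical helper: works on raw characters, without word segmentation.
    """
    if not keyword:
        return []
    rows = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]   # up to `width` characters before
        right = text[end:end + width]              # up to `width` characters after
        rows.append((left, keyword, right))
        start = text.find(keyword, start + 1)      # continue searching after this hit
    return rows


# Invented sample sentence, used only to show the output layout.
sample = "老人獨自在海上捕魚。老人已經八十四天沒有捕到魚了。"

for left, kw, right in kwic(sample, "老人", width=4):
    # Pad the left context with full-width spaces so the keywords line up.
    print(left.rjust(4, "\u3000") + "【" + kw + "】" + right)
```

Each output row places the keyword in the middle with a few characters of context on either side, which is the same layout a concordance view uses.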

2. DocuSky (DocuSky 數位人文學術研究平台) 

https://docusky.org.tw/DocuSky/home/

DocuSky is an online platform tailored to the needs of humanities scholars that lets them organize and analyze their own materials. Its features include tagging and editing, converting text formats, creating and reformatting databases, text mining and analysis, GIS and visualization, and access to external resources and tools.

This clip demonstrates how to use MARKUS to tag texts and run a word-frequency analysis within the DocuSky Collaboration Platform. For example, users can find out which demon in Journey to the West (西遊記) is mentioned most often.
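The counting step behind such a frequency analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `count_terms` is a hypothetical helper, the list of names stands in for the tags a user would create in MARKUS, and the sample text is invented rather than taken from the novel.

```python
from collections import Counter


def count_terms(text, terms):
    """Count non-overlapping occurrences of each term in the text."""
    return Counter({term: text.count(term) for term in terms})


# Stand-in for a user's tag list; the names are characters from the novel.
demons = ["白骨精", "牛魔王", "紅孩兒"]

# Invented sample text, not a quotation from Journey to the West.
sample_text = "牛魔王與紅孩兒是父子。白骨精三次變化。牛魔王手持混鐵棍。"

counts = count_terms(sample_text, demons)
for name, n in counts.most_common():
    print(name, n)
```

A real analysis would run over the full text and would also need to handle alternative names for the same character, which is part of what tagging is for.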

This clip shows how to plot locations in DocuGIS from the known coordinates of place names. For example, using the route-creating tools built into DocuGIS, users can mark the route of Monk Xuanzang’s (玄奘) journey to the Western Regions on a map.

3. Digital Analysis System for Humanities (DASH) (數位人文研究平台) 

https://dh.ascdc.sinica.edu.tw/member/

The Digital Analysis System for Humanities (DASH) was developed to meet the needs of humanities research and to help scholars improve the quality of their work. The tool supports data archive discovery, shared editing, content search, data analysis, and data visualization. For example, researchers can upload their own texts and authority files to DASH, or work directly with the open texts and authority files already available there. They can also carry out textual analysis easily, with functions such as similar-passage comparison, Boolean search, word proximity search, and statistical filtering.

Figure 3: Similar passage comparison between the Old and the New Book of Tang (舊唐書、新唐書) using DASH (Gomars0419, 2021, 4:22) 
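The general idea of comparing two similar passages can be sketched with Python's standard `difflib` module. This is an assumption-laden illustration, not DASH's actual algorithm: `shared_segments` and `similarity` are hypothetical helpers, and the two short passages are invented examples rather than quotations from the Old or New Book of Tang.

```python
from difflib import SequenceMatcher


def shared_segments(a, b, min_len=2):
    """Return the character runs that the two passages share, in order."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_len          # skip one-character matches and the final empty block
    ]


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Invented passages written for this example only.
passage_a = "太宗即位，改元貞觀，天下大治。"
passage_b = "太宗既即位，改元貞觀，而天下治。"

print(shared_segments(passage_a, passage_b))
print(round(similarity(passage_a, passage_b), 2))
```

The shared runs show what the two versions have in common, and whatever falls between them is where the wording differs.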

Figure 4: Social network analysis based on the results of co-occurrence statistics of authority terms in Compendium of Materia Medica (本草綱目) (Wang & Lee, 2021) 
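The co-occurrence statistics that feed a network like this can be sketched as a pair count: two terms are linked each time they appear in the same passage. The Python below is a minimal illustration, not the procedure used by Wang and Lee (2021). `cooccurrence` is a hypothetical helper, the term list stands in for an authority file, and the passages are invented rather than quoted from the Compendium of Materia Medica.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(passages, terms):
    """Count how often each pair of terms appears in the same passage."""
    pairs = Counter()
    for passage in passages:
        # Sort so that each pair is always recorded in the same order.
        present = sorted(t for t in terms if t in passage)
        pairs.update(combinations(present, 2))
    return pairs


# Stand-in for an authority file of herb names.
herbs = ["人參", "甘草", "黃耆"]

# Invented passages, used only for illustration.
passages = ["人參配甘草。", "甘草與黃耆同用。", "人參、甘草、黃耆合用。"]

edges = cooccurrence(passages, herbs)
for (a, b), weight in edges.most_common():
    print(a, b, weight)
```

Each counted pair can then be treated as a weighted edge between two nodes when drawing the network.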

All of the above tools are open access, and users can use them free of charge by creating a personal account. With the aid of such specialized research tools, researchers can deepen their understanding of Chinese-language texts and speed up their research.

Further reading

潘柏翰(2018):全球化時代愈來愈醒目的「數位人文學」:爭議、現況與未來. https://bigdata.nccu.edu.tw/t/topic/274 

中研院近史所郭廷以圖書館(2024):生成式AI如何輔助歷史研究(三)數位人文篇. https://asmhlibref.blogspot.com/2024/02/ai.html 

References 

Chan, Holly (2023): How To Open CORPRO In MacOS. https://digitalhumanities.hkust.edu.hk/tutorials/how-to-open-corpro-in-mac/

CUHKLibraries, “2019.11 庫博中文語料庫分析工具 (CORPRO) 應用工作坊錄影 (第四部份)”, YouTube, uploaded 17 March 2020. From https://www.youtube.com/watch?v=2A9wliiZJTY, retrieved 23 May 2024.

Gomars0419, “08_中研院數位人文研究平台:兩文本差異分析”, YouTube, uploaded 15 February 2021. From https://www.youtube.com/watch?v=J_oQDgVQ3FA&list=PLcCxAfkNZg8JgXi2bBdMNcPaO07lCs1b3&index=8, retrieved 23 May 2024. 

Wang, Hsiang-An, Lee, You-Sheng (2021): The Development and Applications of the Academia Sinica Digital Humanities Research Platform. https://doi.org/10.6853/DADH.202104_(7).0004 
