Introduction

The "Text Tools" plugin for the Chinese Text Project provides a collection of online tools for simple computational text analysis and visualization of Chinese texts. These tools are designed to be easily used with texts from the Chinese Text Project, but can also be used with texts from other sources provided they are in a suitable format.

To use this tool as a plugin, first log in to your ctext.org account, then install the plugin into your account.

Loading texts

The N-gram, Regex, Similarity, and Diff tools operate on one or more textual objects; to use any of these tools, you must first load the textual objects to which the tools are to be applied. The easiest way to do this is to provide a CTP URN, which loads the textual object directly from ctext.org using the ctext.org API.

Alternatively, you can import texts from another source. To do this, copy and paste your text into the large box beneath "Fetch text by URN", and enter a suitable title in the "Title" box. In order for the tool to understand the structure of the text:

Paragraphs should be separated by one or more line breaks.
Chapters (or sections) of your text should be marked up by placing a single asterisk (*) at the start of the line containing the chapter title.

You may find the Replace tool useful in converting texts into this type of format.
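
Texts held outside Text Tools can also be prepared with a short script before pasting. The following Python sketch is illustrative only: the function name and the input shape (a list of title and paragraph pairs) are assumptions, and the asterisk is written immediately before the title with no space, which is one reading of the rule above.

```python
def to_import_format(chapters):
    """Build paste-ready text from (title, paragraphs) pairs.

    Hypothetical helper: each chapter title is written on its own line
    with a leading asterisk, and each paragraph on its own line.
    """
    lines = []
    for title, paragraphs in chapters:
        lines.append("*" + title)
        for paragraph in paragraphs:
            # Remove line breaks inside a paragraph so that it is not
            # split into several paragraphs on import.
            lines.append(paragraph.replace("\r", "").replace("\n", ""))
    return "\n".join(lines)


sample = to_import_format([
    ("學而", ["子曰：學而時習之，\n不亦說乎？", "有子曰：其為人也孝弟"]),
])
print(sample)
```

The printed result can then be pasted into the large text box and given a title in the usual way.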

Compatibility

Text Tools works with most major browsers, including Chrome, Firefox, and Safari. Some functions may not be available or work correctly in Microsoft browsers such as Internet Explorer and Edge.

Tutorials

Exploring text reuse - overview with examples
Regular expressions with Text Tools - overview with examples
Text Tools for ctext.org - detailed step by step tutorial with exercises

Textual analysis and manipulation

N-gram

The N-gram tool calculates the frequency of character n-grams of a specified length occurring in one or more texts. To use this tool, first load one or more texts, select any desired options as described below, and click "Run".

Option	Effect
Value of n	Each time this tool is run, it will compute character n-grams for one specified value of n - i.e. n=1 for 1-grams, n=2 for 2-grams, etc.
Minimum count	Exclude n-grams which occur fewer than this many times throughout all selected texts.
Normalize by length	In order to facilitate a meaningful comparison of occurrences across texts of different lengths, divide the counts for each text by the length of the text in characters.
Exclude punctuation	Do not include punctuation characters in the results.
Stop at breaks	If set to 'None', computed n-grams can cross phrase boundaries and paragraph boundaries - e.g. "學而時習之，不亦說乎？" will generate 3-grams including "習之不" and "之不亦". 'Paragraph' prevents n-grams from being counted where they cross paragraph breaks, and 'All' prevents n-grams from being counted when they cross the punctuation marks "，", "。", "；", etc.
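
The effect of "Stop at breaks" can be illustrated with a small Python sketch. This is not the tool's own code: the function name, the option spellings, and the punctuation list are assumptions made for the example, and punctuation is always excluded from the counted n-grams here.

```python
import re
from collections import Counter

# Illustrative punctuation set only; the tool's actual list is not given here.
PUNCTUATION = "，。；：？！、「」『』"


def count_ngrams(text, n, stop_at="none"):
    """Count character n-grams; stop_at is "none", "paragraph" or "all"."""
    if stop_at == "all":
        # Break at paragraph breaks and at every punctuation mark.
        segments = re.split("[\n" + re.escape(PUNCTUATION) + "]+", text)
    elif stop_at == "paragraph":
        segments = re.split(r"\n+", text)
    else:
        segments = [text]
    counts = Counter()
    for segment in segments:
        # Drop any punctuation and line breaks left inside the segment.
        chars = [c for c in segment if c not in PUNCTUATION and c != "\n"]
        for i in range(len(chars) - n + 1):
            counts["".join(chars[i:i + n])] += 1
    return counts


line = "學而時習之，不亦說乎？"
crossing = count_ngrams(line, 3)                    # includes "習之不" and "之不亦"
not_crossing = count_ngrams(line, 3, stop_at="all") # excludes both
print(sorted(crossing), sorted(not_crossing))
```

With 'None', the 3-grams "習之不" and "之不亦" are counted because they span the comma; with 'All', the text is first cut at the comma, so neither is produced.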

Regex

The Regex tool provides flexible full-text search functions using regular expressions. To use this tool, first load one or more texts, then enter your search terms (or regular expressions) in the box provided, one on each line, select any desired options as described below, and click "Run".

Option	Effect
Group rows by	Determines the granularity of tabulated results in the "Summary" tab. "None" reports a single count for each matched string in each text; "Paragraph" reports per-paragraph counts; "Chapter" reports per-chapter counts.
Minimum distinct items in row	Only report results in which at least this many distinct matches occur in the same row of tabulated output.
Normalize by length	In order to facilitate a meaningful comparison of occurrences across texts of different lengths, divide the counts for each text by the length of the text in characters. (Only available when "Group rows by" is set to "None".)

Some commonly used regex syntax includes the following:

Syntax	Effect
xyz	Matches exactly the text xyz
.	Matches any one character exactly once
[abcdef]	Matches any one of the characters a,b,c,d,e,f exactly once
[^abcdef]	Matches any one character other than a,b,c,d,e,f
(xyz)	Matches xyz, and saves the result as a numbered group.
?	After a character/group, makes that character/group optional (i.e. match zero or 1 times)
?	After +, * or {...}, makes matching ungreedy (i.e. choose shortest match, not longest)
*	After a character/group, makes that character/group match zero or more times
+	After a character/group, makes that character/group match one or more times
{2,5}	After a character/group, makes that character/group match 2,3,4, or 5 times
{2,}	After a character/group, makes that character/group match 2 or more times
{2}	After a character/group, makes that character/group match exactly 2 times
\3	Matches whatever was matched into group number 3 (first group from left is numbered 1)
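
The syntax in the table above can be tried out in any environment with a compatible regex engine. The sketch below uses Python's re module, which shares all of the constructs listed; it is an assumption that Text Tools behaves identically in every detail, so treat the results as a guide rather than a specification.

```python
import re

text = "學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？"

# "." matches any single character.
any_char = re.findall("不亦.乎", text)

# A character class matches exactly one of the listed characters.
char_class = re.findall("不亦[說樂]乎", text)

# With a single group, findall returns only what the group captured.
captured = re.findall("不亦(.)乎", text)

# "+" is greedy by default; adding "?" chooses the shortest match instead.
greedy = re.findall("不.+乎", text)
ungreedy = re.findall("不.+?乎", text)

# "\1" matches whatever group 1 matched: here, a doubled character.
doubled = re.search(r"(.)\1", "關關雎鳩").group(0)

print(any_char, char_class, captured, greedy, ungreedy, doubled)
```

Note how the greedy pattern returns one long match running from the first "不" to the last "乎", while the ungreedy pattern returns the two short phrases separately.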

The output from this tool is displayed in two tabs: "Matched text" and "Summary". The "Matched text" tab lists all parts of the input texts which matched the regular expression(s) specified. Each match is highlighted in a color corresponding to the regular expression used. The "Summary" tab contains tabulated counts of matches to the regular expressions specified. Within the "Summary" tab, the following export and visualization options are shown:

Export CSV: Download the table in Comma-Separated Value format (this type of file can be opened directly in spreadsheet software such as Microsoft Excel).
Chart: (Only displayed when "Group rows by" is set to "None") Charts the counts of each match according to the text in which they were found.
Graph: (Only displayed when "Group rows by" is set to "Paragraph" or "Chapter") Creates a network graph representing relationships between matched items. The graph created contains edges representing co-occurrences of matched items within the same chapter (paragraph); edge weights are equal to the number of chapters (paragraphs) in which matches co-occurred.
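
The edge weights described for the Graph option can be sketched as follows. This Python example is illustrative: the function name and input shape (one list of matched strings per chapter or paragraph) are assumptions, not part of Text Tools.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(matches_by_unit):
    """Return {(item_a, item_b): weight} for items matched in the same unit.

    matches_by_unit holds one list of matched strings per chapter (or
    paragraph). The weight is the number of units in which both items occur.
    """
    weights = Counter()
    for matches in matches_by_unit:
        # Each distinct pair counts once per unit, however often it repeats.
        for a, b in combinations(sorted(set(matches)), 2):
            weights[(a, b)] += 1
    return weights


units = [
    ["孔子", "顏淵", "孔子"],
    ["孔子", "子路"],
    ["顏淵", "孔子"],
]
edges = cooccurrence_edges(units)
print(edges)
```

In this sample, "孔子" and "顏淵" co-occur in two units and so receive an edge of weight 2, while "孔子" and "子路" receive an edge of weight 1.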

Replace

The Replace tool provides a search-and-replace function supporting regular expressions. This tool transforms the content of the selected text according to the search and replacement expressions you provide. To use this tool, first load one or more texts, then select the text you wish to run the replacement operation on from the list provided, enter your search expression and replacement expression, and click "Run".

Syntax for search expressions is the same as for the Regex tool. Replacements can include ordinary strings, as well as references to groups defined in the search expression. To reference a matched group, use the syntax $n, where n is the number of the group starting from 1 for the first group - e.g. $1, $2, etc.
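
To preview what a replacement with group references will do, the same operation can be approximated outside the tool. The Python sketch below is an assumption-laden illustration: Python writes group references as \g<1> rather than $1, so the hypothetical helper translates between the two forms before calling re.sub. It does not handle backslashes in the replacement string.

```python
import re


def replace_with_dollar_groups(search, replacement, text):
    """Apply a search-and-replace where the replacement uses $1, $2, ..."""
    # Translate "$1" into Python's "\g<1>" group-reference form.
    python_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(search, python_replacement, text)


result = replace_with_dollar_groups(
    "(不亦)(.)(乎)",   # three numbered groups
    "$1[$2]$3",        # keep groups 1 and 3, bracket group 2
    "學而時習之，不亦說乎？",
)
print(result)
```

Here the character between "不亦" and "乎" is captured as group 2 and written back inside square brackets, leaving the rest of the text unchanged.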

You can save the changes back to the text you applied the replacement to by clicking "Save over original text".

Similarity

The Similarity tool uses n-gram shingling to provide quantification and visualization of text reuse relationships between textual materials. To use this tool, first load one or more texts, select any desired options as described below, and click "Run".

Option	Effect
Value of n	Each time this tool is run, it will compute similarity using character n-grams for one specified value of n - i.e. n=1 for 1-grams, n=2 for 2-grams, etc.
Only compare between texts	Only consider matches between distinct texts, ignoring matches occurring within the same text.
Normalize by length	In order to facilitate a meaningful comparison of similarity across texts of different lengths, divide the n-gram count by the sum of the lengths of the two texts in characters to give a similarity score.

The output from this tool is displayed in two tabs: "Matched text" and "Chapter summary". The "Matched text" tab lists all parts of the input texts which contained an n-gram match with other parts of the selected corpus. N-gram matches are shown in shades of red; successively brighter shades of red indicate larger numbers of overlapping n-gram matches occurring at that location in the text. This visualization is interactive: clicking on any piece of matched text applies a constraint to the displayed text, causing only those passages containing the selected n-gram to be displayed. Clicking a chapter title causes only matches including material from that chapter to be displayed. Constraints can be removed by clicking the symbol to the right of the constraint.

The "Chapter summary" tab contains tabulated information about similarity between chapters of the text(s) specified. In this table, the first two columns (labeled "Chapter 1" and "Chapter 2") specify which pair of chapters are being compared; "N-grams" reports the total number of shared n-grams between the two units; "Length 1" and "Length 2" are the lengths (in characters excluding punctuation) of the chapters referenced in the first two columns; "Similarity" is a similarity score calculated as the number of n-grams divided by the sum of "Length 1" and "Length 2".
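
The "Similarity" column can be approximated with a short Python sketch. How the tool counts shared n-grams is not spelled out above, so this example makes an explicit assumption: an n-gram occurring several times in both chapters is counted as many times as it occurs in the chapter where it is rarer. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(chars, n):
    """Count the character n-grams in a string."""
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chapter_similarity(chapter_1, chapter_2, n):
    """Return (shared n-gram count, similarity score) for two chapters.

    Both inputs are assumed to have had punctuation removed already.
    ASSUMPTION: shared n-grams are counted as the multiset intersection.
    """
    shared = sum((ngram_counts(chapter_1, n) & ngram_counts(chapter_2, n)).values())
    total_length = len(chapter_1) + len(chapter_2)
    score = shared / total_length if total_length else 0.0
    return shared, score


shared, score = chapter_similarity("學而時習之", "時習之不亦", 3)
print(shared, score)
```

The two sample strings share the single 3-gram "時習之", and their combined length is 10 characters, so the score is 1 divided by 10.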

Within the "Chapter summary" tab, the following export and visualization options are shown:

Export CSV: Download the table in Comma-Separated Value format (this type of file can be opened directly in spreadsheet software such as Microsoft Excel).
Graph: Creates a network graph representing relationships between matched items. The graph created contains edges representing similarity between chapters of text; edge weights are the "Similarity" values from the table as described above. Clicking this link will open a specification of the graph in the Network tool.

Vectors

The Vectors tool calculates document vectors on the basis of term frequency (TF) and (optionally) inverse document frequency (IDF), and uses these to compare similarity of documents using cosine similarity.
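
The general technique can be sketched in Python as below. Several details are assumptions made for illustration and are not confirmed for Text Tools: single characters are used as terms, TF is a raw count, and IDF is the natural logarithm of (number of documents / number of documents containing the term).

```python
import math
from collections import Counter


def document_vectors(docs, use_idf=False):
    """Return one {term: weight} vector per document (terms = characters)."""
    term_counts = [Counter(doc) for doc in docs]
    if not use_idf:
        return [dict(tf) for tf in term_counts]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tf in term_counts for term in tf)
    n_docs = len(docs)
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        for tf in term_counts
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = document_vectors(["學而時習之", "學而時習之", "有朋自遠方來"])
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Identical documents give a cosine similarity of (approximately) 1, and documents with no terms in common give 0.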

Diff

The Diff tool identifies and highlights differences between two similar texts, such as two copies of a chapter of the same work based on different editions. To use this tool, first load two texts, then click "Find differences".
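
For readers who want to reproduce a comparable result outside the tool, the Python sketch below uses the standard library's difflib module. This is a swapped-in technique for illustration; the algorithm used by the Diff tool itself is not described here, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def character_differences(text_a, text_b):
    """Return (operation, span_in_a, span_in_b) for each differing span."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [
        (tag, text_a[a_start:a_end], text_b[b_start:b_end])
        for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes()
        if tag != "equal"
    ]


differences = character_differences("學而時習之不亦說乎", "學而時習之不亦悅乎")
print(differences)
```

For the two sample strings, which differ in a single character, the only reported operation is the replacement of "說" by "悅".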

Transform

The Transform tool facilitates textual transformations by means of user-selectable services. A default service is provided, but it is also possible to connect to other services, which may in the future offer different functions (e.g. tokenization of other languages, or tokenization using different algorithms).

Visualization

Network

The Network tool visualizes relational data as a network graph. Graph data can be created directly from the Regex and Similarity tools, or it can be imported from other sources. Graph data consists of data describing edges and nodes, in a subset of the GraphViz format (currently undirected graphs, edge weights, and node colors are supported).

Option	Effect
Skip edges between same node	Omit edges joining any node to that same node.
Skip edges with weight less than	Omit edges where the edge weight is less than this value. This can be used to simplify a graph by excluding less significant data.
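
The two options above amount to a simple filter over the list of edges. The Python sketch below illustrates the idea; the function name and the representation of an edge as a (node, node, weight) tuple are assumptions made for the example.

```python
def filter_edges(edges, skip_same_node=True, minimum_weight=None):
    """Apply the two Network options to a list of (node_a, node_b, weight)."""
    kept = []
    for node_a, node_b, weight in edges:
        if skip_same_node and node_a == node_b:
            continue  # edge joins a node to itself
        if minimum_weight is not None and weight < minimum_weight:
            continue  # edge is below the chosen threshold
        kept.append((node_a, node_b, weight))
    return kept


edges = [("學而", "學而", 5), ("學而", "為政", 3), ("學而", "八佾", 1)]
print(filter_edges(edges, minimum_weight=2))
```

With a threshold of 2, only the edge between "學而" and "為政" remains: the self-edge is skipped and the edge of weight 1 falls below the threshold.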

Graph data can be input manually or copied and pasted into the text box provided, or can be created automatically from the output of the Regex and Similarity tools. With your desired graph data displayed in the box provided and desired options selected, clicking "Draw" will display a network graph of the data in the space below. The "Save data" link will save the graph data as a GraphViz format file, which can be opened in compatible software such as Gephi.

Within the displayed graph, modifications to the layout can be made by dragging nodes with the mouse. The graph will automatically respond by attempting to reorganize so as to provide a stable layout consistent with a force-based layout algorithm. Clicking and dragging any part of the background of the graph (i.e. anything other than a node or edge) will pan the current viewpoint; alternatively the keyboard arrow keys may be used. The graph can be zoomed in or out using the mouse wheel, or using the "[" and "]" keys on the keyboard.

Additionally, for graphs created using the Similarity tool, double-clicking on an edge will jump to the "Matched text" tab of the Similarity results view with the two chapters corresponding to the selected edge chosen as constraints, thus showing the n-gram matches which explain the thickness of the chosen edge. (As this information is not stored in the graph data file, you will need to re-run the similarity comparison to enable it in subsequent sessions.)

Word cloud

The Word cloud tool visualizes lists of words and associated weight values. Word cloud data can be created directly from the N-gram and Regex tools, or it can be imported from other sources. Word cloud data consists of data describing nodes, in a subset of the GraphViz format (currently node weights and colors are supported).

Option	Effect
Largest font size	Data is scaled so that the item with the largest weight is displayed in a font of this size.
Use log scale	If selected, text sizes are proportional to the logarithm of the weights; if not, they are directly proportional to the weights.
Limit maximum items	If selected, use only this many rows of the data to generate the word cloud.
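
The scaling described by these options can be sketched in Python as follows. The function and parameter names are hypothetical, weights are assumed to be at least 1 when the log scale is used, and "Limit maximum items" is interpreted as keeping the first rows of the data in the order given.

```python
import math


def font_sizes(rows, largest_font_size, use_log_scale=False, max_items=None):
    """Map (word, weight) rows to font sizes.

    The row with the largest (scaled) weight receives largest_font_size;
    other rows are sized in proportion to it.
    """
    if max_items is not None:
        rows = rows[:max_items]
    scaled = [
        (word, math.log(weight) if use_log_scale else weight)
        for word, weight in rows
    ]
    top = max((value for _, value in scaled), default=0)
    if top <= 0:
        # Nothing to scale against (e.g. every weight is 1 on a log scale).
        return {word: 0.0 for word, _ in scaled}
    return {word: largest_font_size * value / top for word, value in scaled}


rows = [("之", 100), ("不", 10), ("乎", 1)]
print(font_sizes(rows, 40))
print(font_sizes(rows, 40, use_log_scale=True))
```

On the linear scale a word with one tenth of the largest weight is drawn at one tenth of the largest size; on the log scale the same word is drawn at half the largest size, which keeps less frequent items legible.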

Word cloud data can be input manually or copied and pasted into the text box provided, or can be created automatically from the output of the N-gram and Regex tools. With your desired word cloud data displayed in the box provided and desired options selected, clicking "Draw" will display a word cloud of the data in the space below. The "Save data" link will save the data as a GraphViz format file, which can be opened in compatible software such as Gephi.

Chart

The Chart tool is a simple tool for quickly visualizing a set of values optionally grouped by a list of labels. Chart data can be created directly from the N-gram and Regex tools, or it can be imported from other sources. Chart data consists of a table in Comma-Separated Value format, in which the first column contains an item label, and each remaining column contains the data point for that label in one or more groups.

Option	Effect
Limit maximum rows	If selected, use only this many rows of the data to generate the chart.

With your desired chart data displayed in the box provided and desired options selected, clicking "Draw" will display a bar chart of the data in the space below. The "Save data" link will save the data as a CSV file, which can be opened in compatible software such as Microsoft Excel.

Research data management

Save/Load

All currently loaded texts can be exported using the "Download corpus" function - this will produce a .zip file containing each text as a UTF-8 encoded file.

Texts can be imported either as UTF-8 encoded text files, or as a .zip file containing UTF-8 encoded text files. As well as using the Save/Load tab, files can also be imported by dragging files directly onto the list of texts.

Scripting

Whenever actions to load data via API or perform data analysis or visualizations are performed using the user interface, a record of these actions is created and stored in the "Script" tab.

Application Programming Interface

Text Tools includes a simple API enabling access to its functionality from external tools. The API is accessed by an external application constructing links to the Text Tools plugin page and passing parameters using a fragment identifier (the final part of a URL occurring to the right of any "#" character). This can be used to provide data from an external source, e.g. data to visualize as a network graph, word cloud, or chart.

The invocation syntax is as follows:

https://ctext.org/plugins/texttools/#pagename|param1=value1|param2=value2|...

Within this link, pagename must be one of "ngram", "regex", "replace", "similarity", "diff", "network", "wordcloud", or "chart", corresponding to the desired page to open. The remaining parameter-value pairs can be chosen from the following list:

Parameter	Pages	Value
networkdata	network	Graph data in supported subset of GraphViz format.
wordclouddata	wordcloud	Word cloud data in supported subset of GraphViz format.
chartdata	chart	Chart data in CSV format.
urn	ngram, regex, similarity, diff	Load a text using its ctext.org URN.
text	ngram, regex, similarity, diff	Load specified text content.

The fragment identifier should be URL-encoded (e.g. using encodeURIComponent() in JavaScript).
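
An external application might construct such links along the lines of the Python sketch below. Two points are assumptions rather than documented behavior: each parameter value is encoded separately so that the "|" and "=" separators stay literal, and the set of characters left unescaped is chosen to mimic encodeURIComponent(). The function name and the sample URN are illustrative only.

```python
from urllib.parse import quote

BASE_URL = "https://ctext.org/plugins/texttools/"
PAGES = {"ngram", "regex", "replace", "similarity", "diff",
         "network", "wordcloud", "chart"}

# Characters that encodeURIComponent() leaves unescaped, beyond those
# that urllib.parse.quote already keeps.
UNESCAPED = "!~*'()"


def texttools_link(pagename, **params):
    """Build a Text Tools link with parameters in the fragment identifier.

    ASSUMPTION: values are encoded individually, separators are not.
    """
    if pagename not in PAGES:
        raise ValueError("unknown page: " + pagename)
    parts = [pagename]
    for name, value in params.items():
        parts.append(name + "=" + quote(value, safe=UNESCAPED))
    return BASE_URL + "#" + "|".join(parts)


link = texttools_link("ngram", urn="ctp:analects")
print(link)
```

The resulting link opens the N-gram page and asks it to load the text with the given URN; the same helper can pass networkdata, wordclouddata, or chartdata values to the corresponding visualization pages.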

Examples:

Sample graph
Sample word cloud
Sample chart