The "Text Tools" plugin for the Chinese Text Project provides a collection of online tools for simple computational text analysis and visualization of Chinese texts. These tools are designed to be easily used with texts from the Chinese Text Project, but can also be used with texts from other sources provided they are in a suitable format.
To use this tool as a plugin, first log in to your ctext.org account, then install the plugin into your account.
The N-gram, Regex, Similarity, and Diff tools operate on one or more textual objects; to use any of these tools, it is first necessary to load the textual objects to which the tools are to be applied. The easiest way to load textual objects is by providing the CTP URN to load a textual object directly from ctext.org using the ctext.org API.
Alternatively, you can import texts from another source. To do this, copy and paste your text into the large box beneath "Fetch text by URN", and enter a suitable title in the "Title" box. In order for the tool to understand the structure of the text:
Text Tools works with most major browsers, including Chrome, Firefox, and Safari. Some functions may not be available or work correctly in Microsoft browsers such as Internet Explorer and Edge.
The N-gram tool calculates the frequency of character n-grams of a specified length occurring in one or more texts. To use this tool, first load one or more texts, select any desired options as described below, and click "Run".
Option | Effect |
---|---|
Value of n | Each time this tool is run, it will compute character n-grams for one specified value of n - i.e. n=1 for 1-grams, n=2 for 2-grams, etc. |
Minimum count | Exclude n-grams which occur fewer than this many times throughout all selected texts. |
Normalize by length | In order to facilitate a meaningful comparison of occurrences across texts of different lengths, divide the counts for each text by the length of the text in characters. |
Exclude punctuation | Do not include punctuation characters in the results. |
Stop at breaks | If set to 'None', computed n-grams can cross phrase boundaries and paragraph boundaries - e.g. "學而時習之,不亦說乎?" will generate 3-grams including "習之不" and "之不亦". 'Paragraph' prevents n-grams from being counted where they cross paragraph breaks, and 'All' prevents n-grams from being counted where they cross paragraph breaks or the punctuation marks ",", "。", ";", etc. |
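The options above can be illustrated with a minimal Python sketch of character n-gram counting. This is not the tool's own implementation: the function names, the punctuation set, and the choice to normalize by the number of non-punctuation characters are all assumptions made for illustration.

```python
from collections import Counter

# Assumed punctuation set; the tool's actual list is not documented here.
PUNCTUATION = set(",。;:?!、「」『』()《》,.;:?!")

def split_segments(text, stop_at):
    """Split text into segments that n-grams are not allowed to cross."""
    if stop_at == "none":
        return [text.replace("\n", "")]
    paragraphs = text.split("\n")
    if stop_at == "paragraph":
        return paragraphs
    # stop_at == "all": also break at every punctuation mark.
    segments = []
    for paragraph in paragraphs:
        current = []
        for ch in paragraph:
            if ch in PUNCTUATION:
                segments.append("".join(current))
                current = []
            else:
                current.append(ch)
        segments.append("".join(current))
    return segments

def count_ngrams(text, n, stop_at="none", exclude_punctuation=True,
                 min_count=1, normalize=False):
    """Count character n-grams in one text, mirroring the options in the table."""
    counts = Counter()
    for segment in split_segments(text, stop_at):
        if exclude_punctuation:
            segment = "".join(ch for ch in segment if ch not in PUNCTUATION)
        for i in range(len(segment) - n + 1):
            counts[segment[i:i + n]] += 1
    result = {gram: c for gram, c in counts.items() if c >= min_count}
    if normalize:
        # Assumption: length is measured in non-punctuation characters.
        length = sum(1 for ch in text if ch not in PUNCTUATION and ch != "\n")
        if length:
            result = {gram: c / length for gram, c in result.items()}
    return result

grams = count_ngrams("學而時習之,不亦說乎?", 3)
print(sorted(grams))
```

With `stop_at="none"` the result includes "習之不" and "之不亦", as in the example in the table; with `stop_at="all"` those two n-grams are no longer counted because they cross the comma.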
The Regex tool provides flexible full-text search functions using regular expressions. To use this tool, first load one or more texts, then enter your search terms (or regular expressions) in the box provided, one on each line, select any desired options as described below, and click "Run".
Option | Effect |
---|---|
Group rows by | Determines the granularity of tabulated results in the "Summary" tab. "None" reports a single count for each matched string in each text; "Paragraph" reports per-paragraph counts; "Chapter" reports per-chapter counts. |
Minimum distinct items in row | Only report results in which at least this many distinct matches occur in the same row of tabulated output. |
Normalize by length | In order to facilitate a meaningful comparison of occurrences across texts of different lengths, divide the counts for each text by the length of the text in characters. (Only available when "Group rows by" is set to "None".) |
Some commonly used regex syntax includes the following:
Syntax | Effect |
---|---|
xyz | Matches exactly the text xyz |
. | Matches any one character exactly once |
[abcdef] | Matches any one of the characters a,b,c,d,e,f exactly once |
[^abcdef] | Matches any one character other than a,b,c,d,e,f |
(xyz) | Matches xyz, and saves the result as a numbered group. |
? | After a character/group, makes that character/group optional (i.e. match zero or 1 times) |
? | After +, * or {...}, makes matching ungreedy (i.e. choose shortest match, not longest) |
* | After a character/group, makes that character/group match zero or more times |
+ | After a character/group, makes that character/group match one or more times |
{2,5} | After a character/group, makes that character/group match 2,3,4, or 5 times |
{2,} | After a character/group, makes that character/group match 2 or more times |
{2} | After a character/group, makes that character/group match exactly 2 times |
\3 | Matches whatever was matched into group number 3 (first group from left is numbered 1) |
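The syntax listed above is shared by most regular expression engines. As a rough illustration, the following sketch tries several of these constructs using Python's `re` module; it assumes that the behavior of these particular constructs is the same in Python as in the tool, which is not confirmed here.

```python
import re

text = "學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?"

# "." matches any one character.
any_char = re.findall(r"不亦.乎", text)

# "[...]" matches one of the listed characters; "[^...]" one not listed.
listed = re.findall(r"不亦[說樂]乎", text)
negated = re.findall(r"不亦[^說]乎", text)

# "+" is greedy by default (longest match); "+?" chooses the shortest match.
greedy = re.findall(r"不.+乎", text)
ungreedy = re.findall(r"不.+?乎", text)

# "(...)" saves a numbered group; "\1" matches the same text again.
repeated = re.findall(r"(.)\1", "人人親其親")

print(any_char)   # ['不亦說乎', '不亦樂乎']
print(negated)    # ['不亦樂乎']
print(greedy)     # ['不亦說乎?有朋自遠方來,不亦樂乎']
print(ungreedy)   # ['不亦說乎', '不亦樂乎']
print(repeated)   # ['人']
```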
The output from this tool is displayed in two tabs: "Matched text" and "Summary". The "Matched text" tab lists all parts of the input texts which matched the regular expression(s) specified. Each match is highlighted in a color corresponding to the regular expression used. The "Summary" tab contains tabulated counts of matches to the regular expressions specified. Within the "Summary" tab, the following export and visualization options are shown:
The Replace tool provides a search-and-replace function supporting regular expressions. This tool transforms the content of the selected text according to the search and replacement expressions you provide. To use this tool, first load one or more texts, then select the text you wish to run the replacement operation on from the list provided, enter your search expression and replacement expression, and click "Run".
Syntax for search expressions is the same as for the Regex tool. Replacements can include ordinary strings, as well as references to groups defined in the search expression. To reference a matched group, use the syntax $n, where n is the number of the group, starting from 1 for the first group - e.g. $1, $2, etc.
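To experiment with the same kind of replacement outside the tool, the $n group syntax can be approximated in Python. Python's `re.sub` writes group references as `\g<n>` rather than `$n`, so this sketch converts between the two; the function name is hypothetical and the conversion is an assumption about equivalent behavior, not a description of how the tool works internally.

```python
import re

def replace_with_dollar_groups(text, search, replacement):
    """Search and replace, where the replacement uses $1, $2, ... for groups."""
    # Double any backslashes so that re.sub treats them as literal characters.
    template = replacement.replace("\\", "\\\\")
    # Convert each $n reference into Python's \g<n> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", template)
    return re.sub(search, template, text)

# Swap two adjacent groups.
print(replace_with_dollar_groups("子曰:學而", "(子)(曰)", "$2$1"))  # 曰子:學而

# Wrap every match in brackets.
print(replace_with_dollar_groups("仁者安仁", "(仁|義)", "【$1】"))   # 【仁】者安【仁】
```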
You can save the changes back to the text you applied the replacement to by clicking "Save over original text".
The Similarity tool uses n-gram shingling to provide quantification and visualization of text reuse relationships between textual materials. To use this tool, first load one or more texts, select any desired options as described below, and click "Run".
Option | Effect |
---|---|
Value of n | Each time this tool is run, it will compute similarity using character n-grams for one specified value of n - i.e. n=1 for 1-grams, n=2 for 2-grams, etc. |
Only compare between texts | Only consider matches between distinct texts, ignoring matches occurring within the same text. |
Normalize by length | In order to facilitate a meaningful comparison of similarity across texts of different lengths, divide the n-gram count by the sum of the lengths of the two texts in characters to give a similarity score. |
The output from this tool is displayed in two tabs: "Matched text" and "Chapter summary". The "Matched text" tab lists all parts of the input texts which contained an n-gram match with other parts of the selected corpus. N-gram matches are shown in shades of red; successively brighter shades of red indicate larger numbers of overlapping n-gram matches occurring at that location in the text. This visualization is interactive: clicking on any piece of matched text applies a constraint to the displayed text, causing only those passages containing the selected n-gram to be displayed. Clicking a chapter title causes only matches including material from that chapter to be displayed. Constraints can be removed by clicking the symbol to the right of the constraint.
The "Chapter summary" tab contains tabulated information about similarity between chapters of the text(s) specified. In this table, the first two columns (labeled "Chapter 1" and "Chapter 2") specify which pair of chapters are being compared; "N-grams" reports the total number of shared n-grams between the two units; "Length 1" and "Length 2" are the lengths (in characters excluding punctuation) of the chapters referenced in the first two columns; "Similarity" is a similarity score calculated as the number of n-grams divided by the sum of "Length 1" and "Length 2".
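The "Similarity" score described above can be sketched in a few lines of Python. The function names are hypothetical, the inputs are assumed to have had punctuation removed already, and "shared n-grams" is interpreted here as the number of distinct n-grams occurring in both chapters; the tool may count repeated matches differently.

```python
def char_ngrams(text, n):
    """Return all character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def similarity(chapter1, chapter2, n):
    """Return (shared n-gram count, similarity score) for two chapters."""
    # Assumption: count distinct n-grams that occur in both chapters.
    shared = set(char_ngrams(chapter1, n)) & set(char_ngrams(chapter2, n))
    total_length = len(chapter1) + len(chapter2)
    score = len(shared) / total_length if total_length else 0.0
    return len(shared), score

ngram_count, score = similarity("學而時習之不亦說乎", "子曰學而時習之", 3)
print(ngram_count, score)  # 3 0.1875
```

Here the two strings share the 3-grams "學而時", "而時習" and "時習之", and their lengths are 9 and 7, giving a score of 3 / (9 + 7).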
Within the "Chapter summary" tab, the following export and visualization options are shown:
The Vectors tool calculates document vectors on the basis of term frequency (TF) and (optionally) inverse document frequency (IDF), and uses these to compare similarity of documents using cosine similarity.
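As a minimal sketch of the general technique, the following Python code builds TF (and optionally TF-IDF) vectors and compares them using cosine similarity. The function names are hypothetical, and the IDF formula used, log(N / df), is one common choice assumed for illustration; the exact weighting used by the tool is not specified here.

```python
import math
from collections import Counter

def document_vectors(documents, use_idf=False):
    """documents: dict of name -> list of terms. Returns name -> {term: weight}."""
    tfs = {name: Counter(terms) for name, terms in documents.items()}
    if not use_idf:
        return {name: dict(tf) for name, tf in tfs.items()}
    n_docs = len(documents)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for tf in tfs.values() for term in tf)
    # Assumption: idf = log(N / df).
    return {name: {term: count * math.log(n_docs / doc_freq[term])
                   for term, count in tf.items()}
            for name, tf in tfs.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = {"a": list("仁義仁"), "b": list("仁義"), "c": list("禮樂")}
vectors = document_vectors(docs)
print(cosine(vectors["a"], vectors["b"]))
print(cosine(vectors["a"], vectors["c"]))
```

Documents "a" and "b" use the same terms in similar proportions and so score close to 1, while "a" and "c" share no terms and score 0.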
The Diff tool identifies and highlights differences between two similar texts, such as two copies of a chapter of the same work based on different editions. To use this tool, first load two texts, then click "Find differences".
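A comparable character-level comparison can be sketched using Python's standard `difflib` module. The algorithm used by the Diff tool itself is not described here, so this is only an illustration of the general idea, and the function name is hypothetical.

```python
import difflib

def find_differences(text1, text2):
    """Return (operation, segment of text1, segment of text2) for each change."""
    matcher = difflib.SequenceMatcher(None, text1, text2, autojunk=False)
    return [(tag, text1[i1:i2], text2[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

# Two versions of the same phrase that differ in one character.
print(find_differences("不亦說乎", "不亦悅乎"))  # [('replace', '說', '悅')]
```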
The Transform tool facilitates textual transformations by means of user-selectable services. A default service is provided, but it is also possible to connect to other services, which may in the future provide different functionality (e.g. tokenization of other languages, or tokenization using different algorithms).
The Network tool visualizes relational data as a network graph. Graph data can be created directly from the Regex and Similarity tools, or it can be imported from other sources. Graph data consists of data describing edges and nodes, in a subset of the GraphViz format (currently undirected graphs, edge weights, and node colors are supported).
Option | Effect |
---|---|
Skip edges between same node | Omit edges joining any node to that same node. |
Skip edges with weight less than | Omit edges where the edge weight is less than this value. This can be used to simplify a graph by excluding less significant data. |
Graph data can be input manually or copied and pasted into the text box provided, or can be created automatically from the output of the Regex and Similarity tools. With your desired graph data displayed in the box provided and desired options selected, clicking "Draw" will display a network graph of the data in the space below. The "Save data" link will save the graph data as a GraphViz format file, which can be opened in compatible software such as Gephi.
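When preparing graph data from another source, a short script can generate it. The following Python sketch writes weighted undirected edges in GraphViz syntax and applies the two filtering options from the table above. The function name is hypothetical, and the use of a `weight` attribute is an assumption based on standard GraphViz conventions; check the output against data saved by the tool itself to confirm which attributes it accepts.

```python
def to_graphviz(edges, skip_self_edges=True, min_weight=0):
    """edges: iterable of (node_a, node_b, weight). Returns GraphViz text."""
    lines = ["graph {"]
    for a, b, weight in edges:
        if skip_self_edges and a == b:
            continue  # "Skip edges between same node"
        if weight < min_weight:
            continue  # "Skip edges with weight less than"
        # Assumption: node names contain no double quotes needing escaping.
        lines.append(f'  "{a}" -- "{b}" [weight={weight}];')
    lines.append("}")
    return "\n".join(lines)

edges = [("論語", "孟子", 5), ("論語", "論語", 9), ("孟子", "荀子", 1)]
print(to_graphviz(edges, min_weight=2))
```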
Within the displayed graph, modifications to the layout can be made by dragging nodes with the mouse. The graph will automatically respond by attempting to reorganize so as to provide a stable layout consistent with a force-based layout algorithm. Clicking and dragging any part of the background of the graph (i.e. anything other than a node or edge) will pan the current viewpoint; alternatively the keyboard arrow keys may be used. The graph can be zoomed in or out using the mouse wheel, or using the "[" and "]" keys on the keyboard.
Additionally, for graphs created using the Similarity tool, double-clicking on an edge will jump to the "Matched text" tab of the Similarity results view with the two chapters corresponding to the selected edge chosen as constraints, thus showing the n-gram matches which explain the thickness of the chosen edge. (As this information is not stored in the graph data file, you will need to re-run the similarity comparison to enable it in subsequent sessions.)
The Word cloud tool visualizes lists of words and associated weight values. Word cloud data can be created directly from the N-gram and Regex tools, or it can be imported from other sources. Word cloud data consists of data describing nodes, in a subset of the GraphViz format (currently node weights and colors are supported).
Option | Effect |
---|---|
Largest font size | Data is scaled so that the item with the largest weight is displayed in a font of this size. |
Use log scale | If selected, text sizes are proportional to the logarithm of the weights; if not, they are directly proportional to the weights. |
Limit maximum items | If selected, use only this many rows of the data to generate the word cloud. |
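The effect of these options on text size can be sketched as follows. This is an illustration only: the function name is hypothetical, and the use of log(1 + weight) for the log scale is an assumption made so that an item of weight 1 does not receive a size of zero.

```python
import math

def font_sizes(rows, largest_font_size=72, use_log_scale=False, max_items=None):
    """rows: list of (word, weight) pairs with positive weights, in data order."""
    if max_items is not None:
        rows = rows[:max_items]  # use only the first max_items rows
    if not rows:
        return {}
    # Assumption: the log scale uses log(1 + weight).
    scale = math.log1p if use_log_scale else float
    top = max(scale(weight) for _, weight in rows)
    # The item with the largest weight is displayed at largest_font_size.
    return {word: largest_font_size * scale(weight) / top for word, weight in rows}

rows = [("仁", 10), ("義", 5), ("禮", 1)]
print(font_sizes(rows, largest_font_size=40))  # {'仁': 40.0, '義': 20.0, '禮': 4.0}
print(font_sizes(rows, largest_font_size=40, use_log_scale=True))
```

On the log scale, differences between large and small weights are compressed, so low-weight items remain legible alongside very frequent ones.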
Word cloud data can be input manually or copied and pasted into the text box provided, or can be created automatically from the output of the N-gram and Regex tools. With your desired word cloud data displayed in the box provided and desired options selected, clicking "Draw" will display a word cloud of the data in the space below. The "Save data" link will save the data as a GraphViz format file, which can be opened in compatible software such as Gephi.
The Chart tool is a simple tool for quickly visualizing a set of values optionally grouped by a list of labels. Chart data can be created directly from the N-gram and Regex tools, or it can be imported from other sources. Chart data consists of a table in Comma-Separated Value format, in which the first column contains an item label, and each remaining column contains the data point for that label in one or more groups.
Option | Effect |
---|---|
Limit maximum rows | If selected, use only this many rows of the data to generate the chart. |
With your desired chart data displayed in the box provided and desired options selected, clicking "Draw" will display a bar chart of the data in the space below. The "Save data" link will save the data as a CSV file, which can be opened in compatible software such as Microsoft Excel.
All currently loaded texts can be exported using the "Download corpus" function - this will produce a .zip file containing each text as a UTF-8 encoded file.
Texts can be imported either as UTF-8 encoded text files, or as a .zip file containing UTF-8 encoded text files. As well as using the Save/Load tab, files can also be imported by dragging files directly onto the list of texts.
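A .zip file suitable for import can be prepared with a short script. The following Python sketch packs a set of texts into an archive of UTF-8 encoded files using the standard `zipfile` module; the function name and the "title.txt" file naming scheme are assumptions made for illustration.

```python
import io
import zipfile

def build_corpus_zip(texts):
    """texts: dict of title -> text content. Returns the bytes of a .zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for title, content in texts.items():
            # Assumption: one "<title>.txt" file per text, encoded as UTF-8.
            archive.writestr(f"{title}.txt", content.encode("utf-8"))
    return buffer.getvalue()

data = build_corpus_zip({"analects-1": "學而時習之,不亦說乎?"})
with open("corpus.zip", "wb") as f:
    f.write(data)
```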
Whenever actions to load data via API or perform data analysis or visualizations are performed using the user interface, a record of these actions is created and stored in the "Script" tab.
Text Tools includes a simple API enabling access to its functionality from external tools. The API is accessed by an external application constructing links to the Text Tools plugin page and passing parameters using a fragment identifier (the final part of a URL occurring to the right of any "#" character). This can be used to provide data from an external source, e.g. data to visualize as a network graph, word cloud, or chart.
The invocation syntax is as follows:
https://ctext.org/plugins/texttools/#pagename|param1=value1|param2=value2|...
Within this link, pagename must be one of "ngram", "regex", "replace", "similarity", "diff", "network", "wordcloud", or "chart", corresponding to the desired page to open. The remaining parameter-value pairs can be chosen from the following list:
Parameter | Pages | Value |
---|---|---|
networkdata | network | Graph data in supported subset of GraphViz format. |
wordclouddata | wordcloud | Word cloud data in supported subset of GraphViz format. |
chartdata | chart | Chart data in CSV format. |
urn | ngram, regex, similarity, diff | Load a text using its ctext.org URN. |
text | ngram, regex, similarity, diff | Load specified text content. |
The fragment identifier should be URL-encoded (e.g. using encodeURIComponent() in JavaScript).
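An external application written in Python could construct such links along the following lines. This sketch is hedged in two respects: the function name is hypothetical, and it percent-encodes each parameter value individually while leaving the "|" and "=" separators literal, which is one interpretation of the encoding requirement rather than confirmed behavior. `urllib.parse.quote` with `safe=""` is used as an approximate equivalent of encodeURIComponent(), and the URN shown is only an illustrative value.

```python
from urllib.parse import quote

BASE_URL = "https://ctext.org/plugins/texttools/"
PAGES = {"ngram", "regex", "replace", "similarity", "diff",
         "network", "wordcloud", "chart"}

def build_texttools_url(pagename, **params):
    """Build a Text Tools link of the form #pagename|param1=value1|..."""
    if pagename not in PAGES:
        raise ValueError(f"unknown page: {pagename}")
    parts = [pagename]
    for name, value in params.items():
        # Assumption: encode each value separately, keeping separators literal.
        parts.append(f"{name}={quote(str(value), safe='')}")
    return BASE_URL + "#" + "|".join(parts)

# Open the N-gram page with a text loaded by URN (illustrative URN value).
print(build_texttools_url("ngram", urn="ctp:analects"))

# Open the Chart page with two rows of CSV data.
print(build_texttools_url("chart", chartdata="a,1\nb,2"))
```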
Examples: