Diese Studie untersucht mithilfe automatischer Abhängigkeitsanalyse und Clusterverfahren die Organisation von Elementary Discourse Units (EDUs) in deutschen und russischen Korpora und identifiziert vier systematische Strukturmuster mit ausgeprägter Sprachspezifik (85,7%). Die Ergebnisse zeigen, dass typologische Unterschiede (deutsche koordinations- und präpositionsreiche vs. russische nominaldichte und verbzentrierte Strukturen) die Diskursorganisation maßgeblich prägen und liefern praxisrelevante Impulse für computergestützte Diskursanalyse, maschinelle Übersetzung und Sprachdidaktik.
The way humans organise their speech and writing has been a focus of study in both traditional linguistics and computational linguistics for a long time. When people talk or write, they do not simply string sentences together one after another. Instead, there is a specific structure in which different parts work together to create a single whole. At first glance, this may seem obvious; however, upon closer examination, it reveals considerable complexity.
The goal of this thesis is to investigate how syntactic dependency information can improve our understanding of discourse organisation across languages. Determining whether patterns of EDU segmentation are universal or language-specific is essential for building multilingual NLP tools and for testing theoretical predictions about discourse structure.
This study adopts Rhetorical Structure Theory (RST) (Mann & Thompson, 1988), which segments discourse into Elementary Discourse Units (EDUs) – minimal text segments that participate in discourse relations. While RST has been widely applied to text analysis, the syntactic properties of EDUs across languages remain underexplored, particularly regarding how EDU boundaries interact with dependency structures in typologically distinct languages such as German and Russian.
The present research investigates the relationship between syntactic dependency structures and EDU organisational patterns across languages. Traditional discourse analysis relies on manual annotation and predefined theoretical categories, which prevents researchers from detecting novel patterns that exceed current frameworks. This raises the question of whether computational approaches can identify organised patterns in EDU structures that enhance or extend existing theories. While discourse structures have been investigated in detail within single languages, cross-linguistic studies remain lean, hindering our understanding of which discourse principles are universal and which are language-specific. Moreover, current methods typically analyse syntactic and discourse structures separately, leaving the potential of dependency parsing to support discourse-level analysis largely underexplored.
This study examines the dependency syntax of EDU in German and Russian. At its core, the research asks how similar or different the syntactic dependency structures of EDUs are in these two languages when analysed through parsing methods. A further goal is to examine how EDU structures group together: can we find patterns that recur across both languages, or will we observe distinct clusters unique to each language? This touches on a fundamental question: can we assume that EDU segmentation and syntax are consistent across languages, or are they inherently tied to each language’s grammar?
Answering these questions requires combining insights from linguistic theory, computational methods, and empirical analysis. This work is grounded in the tradition of studying discourse organisation and the roles of units in structuring texts. It builds on the concepts of RST and related frameworks. On the computational side, the study uses automatic sentence parsing based on Universal Dependencies, enabling cross-linguistic comparison.
The practical component of the project involves working with corpora of German and Russian texts that have been manually annotated for discourse structure. The German data is based on the work of Sara Shahmohammadi and Manfred Stede, who developed and described Discourse Parsing for German with new RST Corpora (2024). The Russian data is drawn from corpora provided by HSE University, Skoltech University, RUDN University, and the Federal Research Centre “Computer Science and Control” of the Russian Academy of Sciences, developed by Ivan Smirnov, Svetlana Toldova, Dina Pisarevskaya, Maria Kobozeva, Artem Shelmanov, Elena Chistova, and Margarita Ananyev. By parsing these corpora computationally and analysing the resulting structures, the project seeks to uncover patterns and compare them across languages.
The methodology combines corpus linguistics, statistical analysis, and machine learning. EDUs are extracted from manually annotated RST corpora and processed through spaCy models to generate CoNLL-U formatted dependency parses. Syntactic features including EDU length, complexity, part-of-speech distributions, dependency relation frequencies, and structural patterns are then collected and compared across languages using statistical tests to detect differences and understand variation.
One notable contribution of this research is its groundwork for multilingual discourse processing, which has implications for machine translation, cross-lingual summarisation, and the development of multilingual parsers.
This thesis lies at the intersection of discourse theory, syntax, morphology, and computational linguistics. It seeks to determine whether EDU dependency structures follow universal principles or are shaped by the grammar of each language. Through extensive analysis, clustering, and comparison, this work aims to deepen our understanding of discourse organisation and to inform the development of more robust multilingual natural language processing tools. Access to the study is provided via the following link, and the material is freely available under the terms of the MIT License: GitHub Repository.
The research foundation of this study relies on the fundamental sections, which are investigated in this chapter. To comprehend the study’s results, it is necessary to understand topics that include Rhetorical Structure Theory (RST) for text structure analysis, Elementary Discourse Units (EDUs) as discourse building blocks, dependency parsing for sentence analysis, and clustering methods for linguistic pattern detection. The research aims to demonstrate various methods that reveal distinctive organisational structures of information systems within languages.
Firstly, we will explore the meaning of Rhetorical Structure Theory, which serves as a method for studying the organisational patterns within written texts. The fundamental theory demonstrates that organised texts include elements which extend past single sentences because their structural components maintain relationships with each other. RST (Rhetorical Structure Theory) identifies approximately 25 distinct types of relationships, such as Cause, Contrast, and Elaboration, that serve to connect various segments of text. These relationships provide a framework for understanding the underlying structure and coherence of written discourse, thereby facilitating a deeper analysis of how information is interconnected within a textual context.
However, the development of automated systems for analysing discourse structure remains challenging, although the outcomes of this work are highly valuable and contribute significantly to a deeper understanding of language processing. Pioneering work by Marcu (1997) established the foundational framework for computational rhetorical structure theory (RST) parsers, which continues to influence contemporary methodologies for evaluating discourse parsing systems. His contributions not only provided a systematic approach to discourse analysis but also delineated the criteria by which these systems could be assessed in terms of efficacy and accuracy. Moreover, in subsequent years, researchers such as Feng and Hirst (2014) advanced the field by demonstrating that incorporating syntactic information—particularly through the application of dependency parsing features—can markedly enhance parser performance. This evidence underscores the integral role of syntactic context in the disambiguation and interpretation of discourse structures, thereby facilitating more nuanced analytical frameworks.
This lack of comparable resources has not only limited the development of RST parsers for many languages but also constrained researchers’ ability to compare discourse structures across linguistic contexts systematically. Nevertheless, cross-linguistic studies that have been conducted reveal fascinating insights into how discourse relations manifest differently across languages. One of the most intriguing discoveries from cross-linguistic RST research is that, while various types of relationships exist across languages, their usage frequency and expression differ significantly. Taboada (2006) observed this phenomenon when comparing Spanish and English; similar trends are evident in this study.
In the case of German, the Potsdam Commentary Corpus (Stede & Neumann, 2014) indicated that German texts employ Elaboration and Contrast relationships differently than their English counterparts, likely due to varying writing conventions. Furthermore, the Russian RST-Treebank (Toldova et al., 2017) revealed distinct patterns in the way causal relationships are articulated in Russian, suggesting that grammatical differences between languages influence the organisation of discourse and will be discussed in more detail in the context of this study.
The construction of discourse structures depends on Elementary Discourse Units, which serve as their basic building blocks. To move forward, it is necessary to examine more closely how EDUs are understood and applied in discourse analysis. As mentioned above, Elementary Discourse Units are the smallest pieces of text that can participate in discourse relationships. The study by Carlson et al. (2003) suggests that EDUs are essentially “clauses or clause-like units” that can’t be broken down into smaller, meaningful discourse pieces.
The initial setup of EDU boundaries appears simple but becomes challenging to put into practice when implementing them in real-world situations. The main challenge is that EDU boundaries do not always match up perfectly with grammatical clause boundaries. A single clause in the text may consist of multiple discourse units, while different clauses can work together as one discourse unit. Task ambiguity leads to difficulties for human annotators who work on data labeling, as well as for computer systems that attempt to perform automated annotation.
The manual EDU segmentation process requires trained annotators who follow established guidelines and procedures. The RST Discourse Treebank demonstrates an 85–90% agreement level between annotators which proves high consistency but shows that human interpretation continues to affect the process. The automatic segmentation of EDU is more difficult. The best systems achieve about 75–85% accuracy compared to human annotations. Soricut and Marcu (2003) established in their initial research that punctuation, along with discourse markers (such as ”however” and ”because”) and syntactic boundaries, serve as a primary indicator for identifying EDU boundaries.
Although early studies emphasised surface cues such as punctuation, discourse markers, and syntactic boundaries, subsequent research has shown that incorporating deeper syntactic information can substantially improve segmentation accuracy. The study by Fisher and Roark (2007) showed that dependency parsing features improve segmentation results most when analysing complex sentences with multiple nested clauses.
In cross-linguistic comparisons, languages organise their information content through EDUs in different ways. German EDUs tend to be longer and more syntactically complex than English ones, partly due to the greater flexibility of German word order and the more complicated nature of German noun phrases. Russian educational organisations have their own unique operational procedures. Pisarevskaya et al. (2017) identified specific features of Russian educational discourse units with regard to the placement of finite verbs and the use of participle constructions.
Russian speakers can include more details in their individual EDUs because the language contains an extensive morphological system, which differs from languages with basic morphological structures, such as the Germanic languages. While such cross-linguistic differences highlight the variability in EDU length and structure, they also underscore the need for an analytical framework that can operate independently of word order and language-specific syntactic conventions.
This is where dependency grammar becomes particularly valuable. The dependency grammar examines word-to-word connections instead of using hierarchical structures to construct phrase structures. The dependency analysis method converts each sentence into a tree structure, which shows words through labelled arrows that show their grammatical relationships (such as subject, object, and modifier). The main advantage of dependency grammar for research in dependency structures in Elementary Discourse Units is that it works well across different languages. The fundamental grammatical connections between subject and object exist similarly in all languages, although their word order and phrase structure differ significantly.
Modern dependency parsers work pretty well. Standard benchmark tests show that the top systems for English and German language processing achieve more than 95% accuracy. The model demonstrates high accuracy which enables its deployment for dependable use in my research and subsequent applications. However, the accuracy of parsing varies when processing different languages. Research studies from recent times show that parsing errors remain stable when analysing equivalent contexts, which improves the reliability of studies that compare different things.
Using dependency parsing to study discourse phenomena is a relatively new approach. Most work has focused on identifying discourse connectives and how they connect their arguments. The research by Pitler and Nenkova (2009) showed that dependency tree features improve the detection of discourse relationships. More relevant to dependency structure analysis is the study from Kong et al. (2014), who demonstrated that dependency parsing-based complexity metrics, which measure tree depth, align with human assessments of text coherence. The results indicate that syntactic complexity features identify essential organisational elements in text structure.
The study of syntactic patterns in EDUs through dependency features has not received sufficient attention in previous research because researchers have concentrated on discourse unit connections, which is why this approach of using dependency features to analyse patterns within EDUs represents a novel application. Scientists use clustering data analysis to find hidden data patterns that do not need predefined search criteria. The method proves helpful in linguistics because it helps researchers identify patterns which traditional categories fail to detect and enables them to find new patterns that enhance current theoretical frameworks.
The theoretical basis for clustering in linguistics stems from usage-based approaches to language, which were first introduced by Tomasello (2003). According to this theory, linguistic categories develop through patterns of usage instead of following innate rules. The clustering method enables researchers to identify statistical patterns which native speakers automatically learn and use in their spoken language.
Clustering has been successfully applied to various syntactic problems. Parisien and Stevenson (2010) used clustering to study verb classes based on their syntactic patterns, which produced categories that combined elements from traditional semantic groups but also presented distinct new differences. While such results highlight the potential of clustering within a single linguistic system, extending this approach to cross-linguistic contexts introduces additional challenges.
The application of clustering methods to different languages requires addressing particular difficulties. Different languages display systematic variations in feature frequencies, which stem from their typological characteristics instead of actual discourse differences. The process requires both feature normalisation and validation procedures that need to be implemented with precision. Research on cross-linguistic clustering has shown promising results, although it faces various obstacles. Feng and Hirst (2012) analysed discourse relation distribution between languages by using clustering methods, which demonstrated that languages structure their discourse to form particular clusters.
The primary challenge in clustering research is distinguishing genuine patterns from statistical noise. Evaluation combines statistical validation (silhouette scores) with expert assessment of detected patterns. Given the cross-linguistic comparison, validation is crucial to ensure that identified patterns reflect actual discourse organisation differences rather than artifacts of different corpora or parsing systems.
This study focuses on German (a V2 language with rigid word order) and Russian (a morphologically rich language with flexible syntax). Such structural contrasts influence how information is distributed across EDUs. Cross-linguistic discourse parsing faces challenges related to data comparability and parser quality. Two major gaps in the literature are identified: the underexplored connection between EDU-internal syntax and higher-level discourse structure, and the lack of multilingual validation of discourse models. This work aims to address both by analysing dependency structures in German and Russian EDUs.
This section details all materials, data sources, computational approaches, and experimental techniques used in this research. The research applies computational corpus analysis and unsupervised machine learning methods to study the connection between German and Russian Elementary Discourse Units (EDUs). The research uses data analysis to find syntactic patterns using clustering methods, investigating both shared patterns between languages and unique features of individual languages in discourse organisation. The study is based on Rhetorical Structure Theory (RST) for annotation purposes.
The study analyses 45,521 Elementary Discourse Units from two typologically distinct languages and genres. The German component uses the Potsdam Commentary Corpus (PCC), a gold-standard annotated resource developed by Prof. Dr Manfred Stede and Arne Neumann (2014), containing 176 news commentary documents from the MAZI newspaper archive with 3,018 manually segmented EDUs (˜33,000 tokens, average 10.9 tokens per EDU). The Russian component employs the Ru-RSTreebank, developed by Toldova et al. (2017) at Moscow State University, comprising 334 academic texts with 42,503 EDUs (˜470,000 tokens, average 11.1 tokens per EDU). Both corpora use XML format with embedded RST annotation following Mann and Thompson’s (1988) framework, achieving high annotation quality (\(\kappa > 0.7\) inter-annotator agreement for German, PTB conventions for Russian). Despite the quantitative imbalance favoring Russian data, the German component provides valuable cross-linguistic comparison, while the genre contrast—news commentary versus academic prose—enables examination of how EDU structure varies across both language and discourse type.
The analysis pipeline is implemented in Python 3.8+ using spaCy (3.8.7) for NLP processing, pandas (2.2.3) for data manipulation, scikit-learn (\(\geq \)1.0.0) for machine learning, and conllu (6.0.0) for Universal Dependencies format. Visualisation employs matplotlib (3.10.3) and seaborn (\(\geq \)0.11.0). All experiments use fixed random seeds (42) for reproducibility. Given the large-scale corpus data, the pipeline incorporates comprehensive quality control: input validation (format, schema, encoding), processing monitoring, and output verification through statistical checks and manual inspection.
For syntactic analysis, this study utilises spaCy’s pre-trained models, which are based on modern neural architectures and fine-tuned for accuracy. These models provide reliable dependency parses while maintaining compatibility with the Universal Dependencies framework.
For German, the medium-sized model de_core_news_md is employed. It relies on a CNN-based architecture enhanced with attention mechanisms and has been trained on a large German news corpus of about 500 million tokens. The model achieves a Labeled Attachment Score (LAS) of roughly 91%, which indicates strong performance in linking words to their correct syntactic heads. Its vocabulary covers approximately 500,000 unique tokens, and the annotations follow Universal Dependencies version 2.8.
For Russian, the small-sized model ru_core_news_sm is used. Although lighter in scale, it incorporates morphological awareness to deal with the rich inflectional system of Russian. The model was trained on about 100 million tokens from news and web texts and reaches a LAS of about 87%. Like its German counterpart, it follows UD version 2.8, but it also integrates morphological analysis to capture the grammatical richness of the language better.
Both models produce standardised CoNLL-U output, which allows for direct comparison across the two languages. In practice, this means that the German model is particularly adept at handling compounds and word-order variation, while the Russian model addresses the challenges posed by extensive morphological marking. Together, they provide a solid foundation for cross-linguistic analysis.
| Feature | German (de_core_news_md) | Russian (ru_core_news_sm) |
| Architecture | CNN + attention | CNN + morphological awareness |
| Training data | News corpus (\(\approx \) 500M tokens) | News + web text (\(\approx \) 100M tokens) |
| Performance (LAS) | \(\sim \)91% | \(\sim \)87% |
| UD version | 2.8 | 2.8 |
| Vocabulary size | 500,000 tokens | Smaller, optimised for morphology |
| Special strengths | Compound analysis, flexible word order | Rich inflectional morphology |
| Output format | CoNLL-U | CoNLL-U |
Following extraction, each EDU is processed with spaCy’s neural pipeline to generate Universal Dependencies analyses, ensuring cross-linguistic consistency. The pipeline performs tokenisation (handling language-specific contractions, compounds, and punctuation), part-of-speech tagging with morphological analysis, dependency parsing to establish syntactic relations, and quality validation to detect errors. All parsed EDUs are exported in CoNLL-U format, recording token ID, surface form, lemma, UPOS/XPOS tags, morphological features (particularly detailed for Russian), head token ID, dependency relation, and additional annotations.
All statistical analyses in this study follow a unified framework applied consistently across descriptive and inferential investigations. The parsed EDUs are transformed into numerical representations through systematic feature extraction (detailed in Section 4.2). All features undergo z-score standardisation for comparability.
Statistical analysis proceeds through descriptive summaries with normality testing using the Shapiro–Wilk test for sample sizes under 5,000 and the Kolmogorov–Smirnov test for larger samples, with significance threshold \(\alpha = 0.05\).
For comparing continuous features between languages, parametric or non-parametric tests are selected based on distribution characteristics:
For categorical associations (e.g., cluster membership vs. language), chi-square tests of independence assess statistical significance, supplemented by:
All multiple comparisons are corrected using the Benjamini–Hochberg procedure to control false discovery rate at \(q = 0.05\).
Principal Component Analysis (PCA) reduces feature dimensionality while preserving 80–90% of variance. Components are interpreted via factor loadings above \(|0.3|\), identifying underlying dimensions of syntactic variation.
To ensure meaningful and robust clustering results, several complementary methods are applied to determine the optimal number of clusters.
The Elbow Method examines how within-cluster variance (inertia) decreases as the number of clusters increases. The “elbow point” indicates where additional clusters no longer provide substantial improvement, balancing accuracy and simplicity. Specifically, we compute K-means inertia for cluster counts \(k = 2\) to \(11\) and identify the point where the rate of decrease in inertia begins to flatten, suggesting diminishing returns from additional clusters.
The Silhouette Analysis measures how similar each point is to its own cluster compared to other clusters. The silhouette coefficient for each sample ranges from \(-1\) (misclassified) to \(+1\) (well-clustered), with values near 0 indicating borderline cases. Higher average silhouette scores indicate well-separated and internally coherent clustering. We compute mean silhouette scores across all samples for each candidate \(k\) and examine the silhouette profile visualisations to assess cluster quality and balance.
The Gap Statistic compares the clustering structure in the actual dataset to randomly generated reference datasets. By calculating the difference in within-cluster dispersion between observed data and uniform random data, this method identifies whether the observed grouping is significantly stronger than chance. A peak in the gap statistic indicates the optimal number of natural clusters in the data.
Finally, Cross-Validation Stability tests clustering consistency across different data subsamples. Using 10-fold cross-validation and the adjusted Rand index, this method identifies the number of clusters that remain stable across random splits, ensuring result reliability. High stability scores indicate that the clustering solution is robust and not an artifact of specific data sampling.
Following feature extraction, K-Means clustering with k-means++ initialisation is applied to partition EDUs into syntactically coherent groups. The optimal number of clusters is determined through the complementary methods described above (elbow method, silhouette analysis, gap statistics, and cross-validation stability). Each clustering solution runs with 10 initialisations to ensure stability, converging at tolerance \(10^{-4}\) or 300 iterations maximum.
Cluster validity is assessed through multiple complementary metrics. Internal validation employs the Silhouette Score (measuring cohesion and separation, range [–1, +1]), Calinski-Harabasz Index (ratio of between-cluster to within-cluster variance), and Davies-Bouldin Index (average cluster similarity, lower is better). External validation examines language-specificity through the statistical tests described in Section 3.5.2, supplemented by the Adjusted Rand Index for solution stability.
Each resulting cluster is characterised through three perspectives. Statistical characterisation examines centroids, feature means, standard deviations, and significance tests against the full dataset. Linguistic interpretation identifies dominant syntactic patterns, part-of-speech distributions, complexity measures, and representative examples illustrating typical constructions. Cross-linguistic distribution analysis uses the statistical framework from Section 3.5.2 to determine whether clusters exhibit language-specific preferences, quantified through a Language Specificity Index measuring the degree of language-cluster association.
A central contribution of this study is the introduction of the Language Specificity Index (LSI), which quantifies the extent to which clusters are dominated by one language or shared across languages. The LSI is derived from normalised Shannon entropy:
where
with \(p_i\) representing the proportion of language \(i\) in a cluster, and \(k\) the number of languages.
Interpretation is straightforward: an LSI of 0% indicates complete cross-linguistic mixing (universal discourse patterns), while 100% reflects absolute language separation. Values above 70% suggest strong language-specific clustering, whereas values below 30% reveal substantial cross-linguistic overlap.
Statistical significance testing follows the framework established in Section 3.5. The null hypothesis (\(H_0\)) assumes no relationship between cluster membership and language, while the alternative hypothesis (\(H_1\)) posits a significant association. Tests are conducted at a 5% significance level, with Bonferroni correction applied to account for multiple comparisons.
For reliable results, three conditions are verified:
Furthermore, the computational environment is fully documented. Dependencies are explicitly listed in a requirements.txt file, while Git version control tracks every change in the codebase and analysis scripts. This practice preserves a transparent development history and allows collaborators to reproduce the results under identical conditions.
Several limitations warrant consideration. The genre confound—German news commentary versus Russian academic prose—complicates separating linguistic from genre-specific effects. Parsing accuracy differs between languages (German \(\sim \)91% LAS, Russian \(\sim \)87%), potentially introducing systematic bias. The corpus size imbalance (42,503 Russian vs. 3,018 German EDUs) provides asymmetric statistical power, requiring cautious interpretation of German results. The analysis focuses exclusively on syntactic features, excluding semantic and pragmatic dimensions that may reveal additional cross-linguistic patterns. Finally, while Universal Dependencies enable standardisation, they may not fully capture language-specific phenomena across typologically distant languages, potentially constraining cross-linguistic validity.
Despite these limitations, the study establishes a robust framework for analysing discourse-level dependency structures. By processing 45,521 EDUs through systematic parsing, feature extraction, and clustering with multiple validation methods, the pipeline ensures precision and reproducibility. Key strengths include substantial statistical power from large-scale data, standardised cross-linguistic comparison, comprehensive feature engineering, rigorous cluster validation, and careful correction for multiple testing. This framework advances computational discourse analysis and offers a methodology extensible to other language pairs and genres, contributing to understanding both universal and language-specific principles of syntactic organisation in discourse.
This chapter provides a detailed description of the computational implementation developed to analyse dependency structures in elementary discourse units (EDUs) in the German and Russian languages. This implementation is a modular, Python-based pipeline which handles corpus processing, linguistic analysis and multilingual comparison by using automated data extraction, dependency parsing, feature engineering and unsupervised machine learning techniques.
The implementation is built using Python 3.8+ with a technology stack optimised for computational linguistics and machine learning. The architecture follows modular design principles with comprehensive error handling and reproducible workflows. Primary dependencies include spaCy 3.8.7 for natural language processing, pandas 2.2.3 for data manipulation, scikit-learn for machine learning, matplotlib/seaborn for visualisation, conllu 6.0.0 for Universal Dependencies processing, and numpy for numerical computing. All dependencies are managed through requirements.txt with version pinning to ensure reproducibility across platforms.
The dependency parsing module implements language-specific processing using the pre-trained spaCy models described in Section 3.2 (de_core_news_md for German and ru_core_news_sm for Russian). Both models output Universal Dependencies annotations in standardised CoNLL-U format, ensuring cross-linguistic compatibility. The pipeline processes EDUs individually while preserving discourse context through metadata. Manual evaluation of 100 randomly sampled EDUs per language revealed comparable quality, with German errors primarily involving complex nominal constructions and Russian errors attributed to flexible word order. Both languages demonstrated sufficient parsing accuracy for statistical analysis.
The analysis pipeline extracts 21 syntactic features capturing multiple dimensions of EDU structure, designed to be language-independent while preserving linguistic specificity for meaningful cross-linguistic comparison. These features are organised into four categories:
(1) Structural features (4 features): These capture the overall shape and configuration of each EDU:
(2) Part-of-speech distribution (8 features): Normalised proportions of major word classes within each EDU:
(3) Dependency relations (3 features): Quantification of core syntactic relations:
(4) Syntactic complexity indicators (6 features): Measures of structural and grammatical complexity:
All features undergo z-score normalisation to ensure comparability across languages with different distributional characteristics:
where \(x\) is the raw feature value, \(\mu \) is the feature mean, and \(\sigma \) is the standard deviation.
The statistical analysis framework supports both descriptive and inferential statistics, employing appropriate tests with multiple comparison corrections. Methods include descriptive statistics (mean, median, standard deviation, quartiles), distribution testing (Shapiro-Wilk and Kolmogorov-Smirnov tests), group comparisons (t-tests and Mann-Whitney U tests), Cohen’s d effect size calculation for practical significance assessment, and Benjamini-Hochberg false discovery rate control for multiple comparison correction.
The core analytical component implements K-means clustering with Principal Component Analysis (PCA) dimensionality reduction to discover syntactic patterns in EDUs. The clustering pipeline includes preprocessing (feature standardisation, missing value handling, outlier detection), dimensionality reduction (PCA retaining 80–90% of variance, typically 8–12 components), cluster number selection (elbow method, silhouette analysis, gap statistic convergence), K-means clustering (K-means++ initialisation, 300 iterations, 10 independent runs), and validation (silhouette coefficient, within-cluster sum of squares, cross-validation). The implementation identified 4 optimal clusters based on convergent evidence from multiple selection criteria, achieving a silhouette score of 0.1045. Although modest, this score reflects typical challenges in clustering linguistic data with gradual rather than discrete boundaries.
The cross-linguistic comparison framework enables systematic comparison of syntactic patterns between German and Russian EDUs through feature-wise analysis with statistical significance testing, cluster distribution analysis using chi-square tests for language-based cluster membership, discriminant analysis for language prediction accuracy based on syntactic features, and comprehensive visualisation including parallel coordinate plots, scatter plot matrices, and distribution comparisons.
The system includes comprehensive visualisation capabilities implemented through matplotlib, seaborn, and plotly libraries for exploratory analysis and publication-quality figures. Visualisation types include distribution plots (histograms, box plots, violin plots), comparison plots (side-by-side comparisons, overlay distributions), clustering visualisations (scatter plots with cluster coloring, silhouette plots, elbow curves), and correlation analysis (heatmaps, correlation matrices, dependency structure plots).
Automated report generation produces comprehensive analysis summaries in Markdown format through Jupyter Notebooks (edu_boundary_detection.ipynb and comprehensive_multilingual_analysis.ipynb). Reports include an executive summary with key findings and statistical significance tests, data quality assessment (parsing success rates, feature completeness, outlier analysis), cross-linguistic comparative analysis with effect sizes, clustering results with validation metrics and linguistic interpretation, and a visualisation gallery with embedded plots and captions.
The implementation incorporates multiple layers of quality assurance to ensure reliable analysis results, with quality control mechanisms operating at each pipeline stage through comprehensive logging and error reporting. Quality control measures include input validation (file format verification, XML well-formedness checking, encoding validation), extraction quality (EDU length validation, content filtering, duplicate detection), parsing validation (CoNLL-U format compliance, dependency tree consistency, POS tag validation), and feature quality (missing value detection, outlier identification, distribution assessment). For instance, quality control effectiveness was demonstrated through the detection and removal of one Russian EDU containing 353 tokens (bibliographic metadata) that was incorrectly annotated as a discourse unit, preventing bias in statistical analyses while maintaining linguistic validity.
The entire analysis pipeline is designed for full reproducibility with fixed random seeds, version-controlled dependencies, and comprehensive documentation. Reproducibility features include random state control (fixed seed of 42 for all stochastic processes), version management (pinned dependency versions in requirements.txt), configuration management (centralised parameter specification in config.py), logging framework (detailed execution logs with timestamp and component tracking), and output standardisation (consistent file naming, directory structure, format specifications).
In this section, we present the results of our analysis, structured in three parts. First, we describe the preprocessing and data quality control steps to ensure reliable comparisons. Next, we report descriptive corpus statistics and feature analyses that do not involve machine learning. Finally, we present the deeper results obtained through machine learning methods (classification and clustering) to address cross-linguistic EDU segmentation and pattern discovery.
As an initial step, we performed data quality control on the collected corpora to remove noise that could skew the analysis. Using a three-sigma statistical rule combined with heuristic analysis, we identified and removed one anomalous 353-token EDU from the Russian dataset (consisting of bibliographic metadata) while retaining five legitimate complex EDUs. This minimal data cleaning ensured cross-linguistic corpus balance without sacrificing authentic discourse structures. Full details of the preprocessing procedure are provided in Appendix A.1.
As the analysis of boundary markers shows, there are significant differences in how German and Russian utilise punctuation and conjunctions to segment discourse. Punctuation in particular emerges as a more prominent EDU boundary marker in Russian than in German. In German, punctuation accounts for about 13.0% of all tokens at EDU boundaries, whereas in Russian it accounts for 19.4%. This difference of 6.4 percentage points indicates that Russian relies more heavily on punctuation for discourse segmentation. Several factors may explain this gap. First, the Russian punctuation system mandates frequent use of commas and other delimiters to isolate syntactic constructions, which naturally increases their count at clause boundaries. Second, structural differences between the languages play a role: Russian’s freer word order often requires punctuation to explicitly mark syntactic units and preserve clarity. Finally, stylistic differences in the corpora (genre, register) could influence the density and distribution of punctuation marks at EDU boundaries. Figure 1 illustrates these tendencies, comparing the proportion of EDUs ending in various punctuation marks for both languages.
Conjunctions serve as another primary cue for EDU boundaries, and their usage patterns are also revealing. Coordinating conjunctions (e.g., ”and”, ”but”) typically connect independent clauses or parallel segments, and the dependency relations cc (coordinating conjunction) and conj (conjunct) consistently coincide with EDU breaks in both languages. This suggests that coordination is a universal signal for discourse segmentation, marking transitions between related units. Subordinating conjunctions, on the other hand, introduce dependent clauses (e.g., causal or conditional clauses) that form separate EDUs providing background or conditional information relative to a main clause. These create hierarchical discourse structures and are likewise strong indicators of EDU boundaries. Despite typological differences between German and Russian, we observe remarkably similar behavior in how conjunctions mark boundaries: both languages adhere to general principles where conjunctions signal the onset of a new discourse unit, aiding text coherence and flow. Figure 2 highlights the frequency of conjunctive constructions at EDU boundaries in the two corpora, underscoring their parallel roles.
Syntactic dependency structure provides additional cues for EDU boundaries. Strong boundary indicators include relations such as conj, cc, advcl, ccomp, and xcomp, which consistently align with discourse segmentation points. A chi-square analysis confirmed significant differences between German and Russian in the distribution of dependency relations at EDU boundaries (\(p < 0.001\)), indicating language-specific preferences within shared fundamental principles. Detailed analysis is provided in Appendix A.2.
Positional analysis revealed that boundary markers cluster near the beginning of sentences (first 20% of token positions), aligning with theories of topicalisation and thematic progression. Additionally, greater dependency distance between syntactic heads and dependents correlates with higher boundary probability, supporting the interrelation between syntactic complexity and discourse segmentation. Full positional analysis is available in Appendix A.3.
Figure 3 presents a comprehensive analysis of Elementary Discourse Unit (EDU) length distributions in the German and Russian corpora, using four complementary visualisations. We focus here on the cleaned data (after removing the extreme outlier from the Russian corpus), to ensure a fair comparison.
Panel A: Histogram Comparison. Both languages show a strong preference for short EDUs, with frequency dropping off as length increases (an approximately exponential decay). German EDUs (blue) have a sharp peak in the 1–5 token range. Russian EDUs (red) also peak at short lengths but exhibit a broader distribution with a less pronounced peak and a somewhat longer tail, indicating greater variability in how discourse may be segmented into EDUs.
Panel B: Box Plot Distribution. The German length distribution is more compact, with a median around 9 tokens and an interquartile range of roughly 6 to 14 tokens. Russian EDUs have a slightly higher median (around 11 tokens) and a wider spread. Even after cleaning, the Russian corpus shows more variability and a few longer EDUs (reflected in a longer whisker and some mild outliers in the box plot), although the most extreme case has been removed. The maximum observed EDU length in Russian dropped dramatically after removing the 353-token anomaly (now closer to the German maximum of 42 tokens), underscoring the improved comparability of the two datasets.
Panel C: Cumulative Distribution Function (CDF). The CDF curves illustrate that German EDUs reach cumulative proportions more quickly at lower lengths: about 80% of German EDUs are 15 tokens or shorter, whereas the 80% mark for Russian EDUs extends to nearly 20 tokens. However, beyond the 90th percentile, the curves converge, suggesting that aside from the very longest outliers, both languages allow for similarly complex maximal units. In other words, once the extreme Russian anomaly is excluded, the upper-end lengths of EDUs in both languages become comparable.
Panel D: Descriptive Statistics. Key summary statistics (post-cleaning) include: German corpus (274 EDUs) with mean length \(= 10.95\) tokens (SD \(= 6.55\), max \(= 42\)); Russian corpus (234 EDUs after cleaning) with mean length \(\approx 11.3\) tokens (SD \(\approx 9.8\), max \(= 43\)). Both languages share an identical median EDU length of 9 tokens and a 75th-percentile value of 14 tokens, highlighting their similar central tendencies. The Russian mean and variance are slightly higher than German’s, reflecting the broader distribution and remaining longer EDUs, but these differences are not large.
Statistical Significance. Despite visual differences in spread, statistical tests confirm no significant difference in EDU lengths between the two languages. A two-sample t-test yields \(t = -1.17\), \(p = 0.24\), and a non-parametric Mann-Whitney U test returns \(U = 34421\), \(p = 0.18\) (both \(p > 0.05\)). These results hold true even when the extreme outlier was present and remain so after its removal, indicating that the underlying preferences for segment length are statistically comparable across German and Russian. The noticeable distributional differences are thus primarily due to variability and rare long segments rather than a fundamental length bias in one language versus the other.
Interpretation. The overall distribution patterns suggest that both languages adhere to discourse chunking constraints that favor cognitively manageable unit sizes. The similarity in medians and the lack of a statistical difference imply comparable norms of syntactic and discourse complexity in forming EDUs, likely influenced by genre (news/commentary) and general principles of information packaging. The presence of longer-tail behavior in Russian (even after cleaning) may reflect certain stylistic or syntactic practices (such as embedding multiple clauses in one sentence) that create longer discourse units. Furthermore, the identification and removal of the extreme Russian outlier underscores the importance of data cleaning in corpus linguistics: anomalies can inflate variance and create false impressions of difference, so ensuring high-quality data is crucial for accurate cross-linguistic comparisons.
Cross-linguistic analysis of part-of-speech (POS) usage within EDUs reveals clear distinctions in how German and Russian construct their discourse units. For this comparison, we normalised POS counts by EDU length to obtain ratios (proportions of each POS per EDU), allowing us to compare languages despite differing EDU sizes. Figure 4 provides four perspectives on these POS ratios: the mean ratio per POS category for each language, a heatmap of POS distribution differences, box plots illustrating variability for select POS categories, and a table of t-test results for each POS.
The results show that Russian EDUs have a significantly higher concentration of nouns and adjectives, whereas German EDUs contain higher proportions of adverbs, pronouns, and determiners. Verb ratios are very similar between the two languages, suggesting that both German and Russian EDUs carry a comparable density of predicate verbs on average. Specifically, independent t-tests on the per-EDU POS ratios confirm highly significant differences (\(p < 0.001\)) in the following categories: noun ratio (higher in Russian), adjective ratio (higher in Russian), adverb ratio (higher in German), pronoun ratio (higher in German), and determiner ratio (higher in German). By contrast, differences in verb ratio are not statistically significant, and other categories like prepositions and conjunctions show minimal variation between languages.
These patterns align well with known typological contrasts between German and Russian. Russian, lacking articles and having a rich case system, tends to pack more information into nominal phrases—hence a higher noun and adjective density. German, on the other hand, uses articles (categorised as determiners) obligatorily, and often employs pronouns and adverbs (including pronominal adverbs) to maintain cohesion and referential clarity, leading to higher ratios of those parts of speech. The similar verb ratios indicate that both languages structure their EDUs around a comparable number of finite verbs, implying similar clause densities. In summary, while the core predicate structure of EDUs is consistent across languages, German and Russian differ in how they distribute descriptive and referential information (nominal vs. adverbial/pronominal), reflecting broader grammatical and stylistic differences.
Comparing syntactic complexity metrics (parse tree depth, dependency distance, finite verb count, punctuation ratio), we found that German and Russian EDUs exhibit broadly comparable complexity. Both languages show similar median values for hierarchical and linear complexity measures, with most EDUs containing one finite verb. The main difference appears in punctuation ratio dispersion, with Russian showing more variability due to stylistic and grammatical structuring differences. In summary, the languages maintain similar foundational predicative structures while distributing additional complexity through slightly different channels (nominal embedding in Russian vs. separate modifying elements in German). Detailed analysis is provided in Appendix A.4.
In order to evaluate the feasibility of automatic discourse segmentation, we trained and tested machine learning models to detect EDU boundaries in our corpora. Two classification algorithms were compared: Logistic Regression and Random Forest, using features derived from the linguistic cues discussed above (e.g., dependency relations, part-of-speech tags, token positions). The Random Forest classifier consistently outperformed Logistic Regression in this task. Specifically, the Random Forest achieved F\(_1\)-scores of approximately 0.85 for German and 0.82 for Russian, compared to 0.75 (German) and 0.70 (Russian) with Logistic Regression. This superior performance of the ensemble method highlights its ability to capture non-linear interactions between features (for example, how certain combinations of POS tags and positions jointly signal a boundary).
Feature importance analysis from the Random Forest model reinforced our earlier observations about boundary predictors: dependency relations emerged as the strongest features for boundary detection, followed by positional features (such as whether a token appears early in the sentence or is a conjunction at clause-initial position) and then morphological features like part-of-speech tags (with distinctions between function words and content words proving informative). Another notable finding is the minimal performance gap between German and Russian (roughly 3 percentage points in F\(_1\)). This suggests that the feature patterns learned by the model generalise well across these typologically different languages. In practical terms, it supports the idea that a universal discourse segmentation approach is viable—one that can be adapted to different languages with only minor losses in accuracy. Overall, the strong performance of the classifiers indicates that the cues we analysed (punctuation, conjunctions, syntax, etc.) are not only theoretically significant but also practically useful for automated EDU boundary detection.
As a complementary approach to the supervised analysis, we applied unsupervised clustering to the EDUs to discover common discourse patterns without pre-defined labels. We constructed standardised feature vectors for each EDU (including length, POS ratios, and syntactic complexity measures), applied PCA for dimensionality reduction (retaining 80% of variance in 10 components), and used elbow and silhouette analysis to determine the optimal number of clusters (\(k = 4\)). Detailed methodology including PCA results, elbow plots, and silhouette analysis is provided in Appendix A.5.
With the optimal number of clusters determined, we examined the cross-linguistic distribution of EDUs across the four identified clusters. Figure 5 presents this distribution through three complementary views: absolute counts (left panel), normalised percentages (middle panel), and a heatmap visualisation (right panel). Two clusters exhibit relatively balanced language distribution, containing substantial representation from both German and Russian EDUs—these can be interpreted as cross-linguistically shared discourse patterns. The remaining two clusters show stronger language preferences: one is dominated by Russian EDUs (over 70% Russian), while another contains a majority of German EDUs.
Chi-square testing confirms that cluster membership and language are significantly associated (\(\chi ^2\) test, \(p < 0.01\)), indicating that language identity influences but does not fully determine cluster assignment. Cramér’s \(V\) indicates a moderate effect size, reflecting the presence of both language-specific and cross-linguistic patterns. This result demonstrates that while certain EDU construction strategies are language-preferential—likely reflecting typological and stylistic differences—other patterns represent universal discourse structuring principles shared across German and Russian.
Finally, we characterised each cluster through comprehensive feature profiling. Figure 6 presents four complementary perspectives: a heatmap of standardised feature deviations (top-left), radar charts comparing cluster profiles (top-right), EDU length distributions per cluster (bottom-left), and cluster visualisation in PCA space (bottom-right). For each feature, we computed \(z\)-scores measuring how many standard deviations each cluster’s mean deviates from the global dataset mean. This analysis reveals four distinct EDU pattern archetypes:
Each cluster’s statistical profile—including size, language composition, and distinctive feature patterns—confirms the existence of both cross-linguistic shared strategies and language-specific preferences in EDU construction. This unsupervised discovery of structural archetypes provides empirical grounding for understanding how German and Russian converge and diverge in their discourse-level syntactic organisation.
One important aspect that must be taken into account when interpreting the results of this study is the genre difference between the two corpora under investigation. The Russian corpus consists predominantly of academic texts, while the German corpus is composed of reader comments from an online news platform. These two genres vary significantly in their communicative goals, structural conventions, and syntactic complexity.
Academic texts typically exhibit a high degree of syntactic cohesion, formal tone, and standardised sentence structures. In contrast, reader comments are often more informal, elliptical, and fragmented, with frequent deviations from normative syntax. These differences can influence the structure of Elementary Discourse Units (EDUs), particularly in terms of clause boundaries, dependency completeness, and the frequency of non-canonical structures.
This discrepancy may lead to genre-induced variation in the resulting dependency parses, which in turn affects the clustering patterns and statistical regularities discovered during the analysis. Therefore, any observed cross-linguistic differences must be interpreted with caution, as they may be confounded by genre-specific properties rather than language-internal syntactic features alone.
Another factor that potentially impacts the comparability of results is the thematic divergence between the two corpora. The Russian academic texts are topically focused on linguistic theory and language analysis, while the German comments span a broader and more heterogeneous set of topics, including politics, economics, and personal opinions.
Topic variability can influence discourse structuring strategies. For instance, argumentative or narrative texts may display different coherence relations and EDU segmentations compared to explanatory or descriptive texts. This may further interact with genre effects, compounding the complexity of interpreting cross-linguistic trends.
To mitigate these influences in future studies, it would be beneficial to control for genre by selecting corpora of similar communicative style and topical scope. Alternatively, genre-specific patterns could be modeled explicitly, e.g., by annotating each EDU with genre labels and analysing cluster behaviour per genre. This would allow for a more nuanced understanding of how discourse structures manifest across languages, independent of external genre-driven variation.
This bachelor thesis presents a comprehensive computational investigation of dependency structures in Elementary Discourse Units (EDUs) across German and Russian, addressing fundamental questions about discourse organisation patterns and cross-linguistic variation. Through analysis this study demonstrates that computational methods successfully reveal systematic patterns in discourse organisation while uncovering significant language-specific preferences that challenge universal theories of discourse structure.
The research employed a methodological framework combining dependency parsing with spaCy models, comprehensive feature extraction, and unsupervised clustering to investigate two primary research questions: whether computational methods can discover interpretable patterns in EDU organisation, and the extent to which such patterns are language-specific versus universal. The results provide definitive answers to both questions while establishing methodological foundations for computational cross-linguistic discourse analysis.
The analysis demonstrates that automatic dependency parsing reveals meaningful patterns in Elementary Discourse Unit organisation. Four distinct and interpretable clusters were discovered, corresponding to recognisable discourse archetypes: nominal elaboration patterns (dense descriptive content), structural complexity patterns (deep syntactic embedding), fragmentary/segmented patterns (list-like structures with minimal predicates), and predicate-adverbial focus patterns (action-oriented units with circumstantial detail). The linguistic interpretability of these clusters validates that computational approaches capture genuine discourse organisational principles rather than statistical artifacts. The silhouette score of 0.105, while modest, is typical for linguistic clustering with gradient boundaries.
The analysis reveals exceptionally strong language specificity in EDU organisational patterns, with average language specificity of 85.7% across clusters. No clusters met universality criteria (balanced representation below 70% for either language), indicating fundamentally different preferences for organising syntactic information within discourse units.
Statistical analysis (methodology detailed in Section 3.5.2) confirmed this through highly significant results with moderate to strong effect size. This specificity significantly exceeds expectations from corpus size or genre effects, indicating genuine typological influences aligned with known differences between Germanic and Slavic languages.
The absence of universal patterns challenges theoretical frameworks proposing language-independent discourse principles and supports approaches emphasising typological-discourse interaction. Cross-linguistic analysis revealed significant differences in 13 of 15 continuous features, with particularly large effect sizes for preposition ratios and punctuation patterns.
The thesis establishes a novel and robust methodological framework for computational cross-linguistic discourse analysis. The combination of dependency parsing, comprehensive 23-feature extraction (covering structural, POS distribution, dependency relation, and complexity dimensions), and unsupervised clustering provides an objective and replicable approach to investigating discourse organisation across typologically diverse languages.
Feature sensitivity analysis revealed that part-of-speech distributions carry the most discriminative information for identifying discourse organisational patterns, while dependency relation features provide minimal additional discriminative power. This finding has important implications for understanding which aspects of syntactic structure are most relevant for discourse organisation, suggesting that grammatical category balance within EDUs captures fundamental organisational principles.
The methodology’s success in revealing linguistically meaningful patterns across Germanic German and Slavic Russian demonstrates broader applicability to other language pairs and discourse phenomena. Comprehensive validation procedures, including bootstrap resampling (89% average agreement in cluster assignments), cross-validation, and feature sensitivity analysis, confirm that the discovered patterns represent stable characteristics rather than methodological artifacts.
This research advances computational discourse analysis and cross-linguistic theory in three key areas. First, the documented 85.7% language specificity challenges universalist discourse theories by providing empirical support for approaches emphasising typological-discourse interaction over universal cognitive constraints. Second, the discovery of linguistically interpretable patterns through unsupervised methods validates using syntactic features as proxies for discourse structure, confirming the viability of computational discourse analysis. Third, the findings bridge typological and discourse research by demonstrating that computational methods can capture typologically consistent distinctions—German EDUs favor coordination-rich structures while Russian EDUs exhibit diverse patterns spanning nominal-dense and verb-centered organisations.
The findings have immediate implications for natural language processing, particularly for developing language-specific discourse analysis systems and improving cross-linguistic transfer in machine translation. For educational applications, the documented patterns illuminate discourse transfer effects in second language acquisition, with relevance for academic writing instruction accounting for typological differences in organisational preferences.
Several limitations constrain generalisability: corpus size imbalance between languages, restriction to specific genres (news commentary and academic prose), and reliance on syntactic features while excluding semantic and pragmatic dimensions. Future research should address these through balanced corpora, genre diversification, integration of semantic and pragmatic features, expansion to additional language pairs, and advanced clustering methods (hierarchical, density-based, spectral) to reveal finer-grained patterns.
This study demonstrates that computational dependency parsing reveals systematic patterns in Elementary Discourse Unit organisation while documenting exceptionally strong language specificity (85.7%) that challenges universal discourse structure theories. The identification of four linguistically meaningful clusters through unsupervised methods validates dependency-based features as effective representations of discourse properties, showing that typological differences between Germanic and Slavic languages extend beyond sentence-level syntax to discourse organisation. These findings support functional-typological approaches emphasising language-structure interaction and provide a methodological framework for systematic cross-linguistic discourse investigation, advancing both theoretical understanding and contributing to more effective NLP systems and empirically grounded models of human discourse organisation.
This appendix provides additional details on data preprocessing, extended analyses, and methodological procedures that support the main findings presented in the Results section.
The outlier filtering system uses a three-sigma statistical rule (mean \(+ 3 \times \textit {standard deviation}\)) to identify EDUs with abnormal token lengths, then applies heuristic analysis to distinguish between genuine discourse complexity and metadata noise such as bibliography entries. The system employs a conservative approach that removes only obvious non-discourse content while preserving linguistically valid long EDUs, ensuring corpus quality without sacrificing authentic discourse structures.
In the Russian dataset, we detected an anomalously long EDU of 353 tokens consisting largely of bibliographic information (e.g., page numbers, URLs). Such outliers can distort statistical comparisons and were therefore removed before further analysis. After eliminating this noise (a minimal data loss of a single EDU), we recomputed key distributions and verified that the overall corpus characteristics remained intact. Notably, the removal of the 353-token outlier substantially reduced the maximum EDU length and variance for Russian, bringing the length distribution more in line with German and strengthening the validity of subsequent cross-linguistic comparisons.
This procedure successfully removed one 353-token bibliography entry while retaining five legitimate complex EDUs, thereby maintaining cross-linguistic corpus balance for subsequent analysis.
The syntactic dependency structure provides further cues for where EDUs begin and end. Our analysis revealed a clear hierarchy among dependency relations in their reliability as EDU boundary predictors. Strong indicators include relations such as conj (coordination), cc (coordinating conjunction marker), advcl (adverbial clause), ccomp (clausal complement), and xcomp (open clausal complement). These relations often introduce new clauses or complex predicate structures and thus consistently align with discourse segmentation points. In contrast, medium-reliability indicators like acl (clausal modifier) and parataxis showed more context-dependent behavior. For instance, acl can signal an embedded clause modifying a noun without necessarily prompting a separate discourse unit, and parataxis (loosely attached sentences or clauses) varies in boundary strength depending on stylistic and syntactic context. A chi-square analysis confirmed that the distribution of certain dependency relations at EDU boundaries differs significantly between German and Russian (\(p < 0.001\)). This implies that while both languages share fundamental discourse segmentation principles (e.g., new clauses often start new EDUs), they exhibit distinct preferences in how specific syntactic constructions contribute to forming EDUs.
The position of potential boundary markers within a sentence also plays a significant role in discourse segmentation. Our positional analysis uncovered several notable patterns. One consistent tendency was the clustering of conjunctions and other boundary signals near the beginning of sentences. In roughly the first 20% of token positions in a sentence, we found a markedly higher concentration of boundary markers. This suggests that new discourse segments are frequently established early in the sentence. Such a pattern aligns with theories of topicalisation and thematic progression, where sentence-initial elements often introduce or shift the topic of discussion. In practical terms, an EDU is likely to start near the beginning of a sentence if that sentence contains multiple EDUs.
At the intra-sentential level, boundary markers tended to coincide with junctures between major syntactic constituents. For example, boundaries commonly occurred between a main clause and a following subordinate clause, or between sequential coordinated clauses. This indicates that EDU boundaries are structurally motivated, often aligning with points of syntactic completion or transition. Moreover, the distance between a syntactic head and its dependent was found to influence boundary likelihood: the greater this dependency distance, the higher the probability that a new discourse unit would begin at that point. In other words, long-distance dependencies—often a sign of more complex or embedded constructions—frequently coincide with discourse segmentation.
Finally, examining long-distance dependency links provided additional insight into how syntax and discourse interact. We observed that when a dependency relation spanned a large portion of a sentence (for instance, a subject and verb separated by a long subordinate clause), this was often accompanied by a break in discourse structure. These findings together support the notion that syntactic complexity and discourse segmentation are intertwined. The placement of EDU boundaries is influenced by syntactic positions—early sentence positions, clause boundaries, and points of increased syntactic distance all serve as likely locations for segmenting a sentence into coherent discourse units.
We compared the syntactic complexity of EDUs in German and Russian using several quantitative metrics extracted from dependency parses. Four measures were used at the EDU level: (i) Maximum parse tree depth (max_depth), which captures the deepest level of nested syntactic structures (a proxy for hierarchical complexity); (ii) Average dependency distance (avg_dependency_distance), measuring the average linear distance between heads and dependents (an indicator of how spread out or embedded an EDU’s structure is); (iii) Finite verb count (finite_verbs), the number of finite verbs in the EDU (reflecting the number of clauses or clause-like units per EDU); and (iv) Punctuation ratio (punct_ratio), the proportion of tokens in the EDU that are punctuation (which can indicate internal segmentation or list-like structures).
The box plots in Figure 9 reveal how these complexity measures vary and compare across the two languages. For maximum parse depth, the median values are similar for German and Russian, indicating that typical EDUs in both languages have comparable levels of embedding. However, Russian shows a slightly longer upper tail, suggesting that it occasionally allows deeper nested structures (possibly due to heavier noun-phrase embedding). Average dependency distance also has similar medians across languages, with Russian exhibiting marginally more spread; long dependency distances can arise from freer word order or constructions like extraposition, which Russian may employ somewhat more. The finite verb count distributions indicate that most EDUs contain one finite verb (median = 1 for both languages), with a small number of EDUs containing two or more finite verbs; the means are nearly identical, reflecting similar clause densities in typical EDUs. Finally, the punctuation ratio shows more variability in Russian, with a wider IQR and some higher values. This is likely because Russian EDUs in the corpus may include more list-like enumerations or complex punctuation usage in certain cases (as was evident from the higher punctuation usage noted earlier), whereas German EDUs tend to have a more constrained punctuation usage within the unit.
Statistical tests on these measures mostly show minor differences. The average number of finite verbs per EDU does not differ significantly between German and Russian, reinforcing the observation that both have similar clause-per-EDU tendencies. We observe only modest shifts in the hierarchical (max depth) and linear (dependency distance) complexity indices between the languages. The largest contrast appears in the punctuation ratio dispersion, which, as mentioned, reflects both stylistic formatting choices and grammatical structuring differences (for example, the inclusion of multiple commas or semicolons within some Russian EDUs vs. fewer in German). In summary, German and Russian EDUs manifest broadly comparable syntactic complexity on these measures, maintaining a similar foundational predicative structure while distributing additional complexity through slightly different structural channels (nominal embedding in Russian versus more frequent use of separate modifying elements in German).
Prior to clustering, we constructed a feature vector for each EDU based on the attributes examined in our descriptive analysis (including length, POS ratios, syntactic complexity measures, etc.). We removed non-numeric identifiers (such as the EDU ID and language label) and imputed any missing values with the median to avoid issues with incomplete feature data. All features were standardised to have mean 0 and unit variance, ensuring that no single feature dominated due to scale differences.
We performed Principal Component Analysis (PCA) on the feature matrix to reduce dimensionality and noise. Figure 10 shows the explained variance by each principal component (scree plot) and the cumulative variance covered. We selected the top 10 principal components that together account for approximately 80% of the variance. This balance retained the majority of information while filtering out minor noisy variations. The leading principal components were interpretable: for example, the first principal component aligned with a general ”complexity” factor (heavily loaded on depth, dependency distance, and modification-related features), whereas the second component contrasted nominal versus verbal orientation (loading oppositely on noun/adjective features vs. verb/adverb features). This confirms that the dimensionality reduction preserved meaningful linguistic dimensions of variation among EDUs.
Following PCA dimensionality reduction, we applied cluster optimisation methods. Figure 11 presents the results of these analyses. The left panel shows K-means inertia across cluster counts \(k = 2\) to \(11\), while the right panel displays the corresponding average silhouette scores. The inertia curve exhibits a gradual leveling around \(k = 4\) to \(5\), while the silhouette analysis achieves its maximum at \(k = 4\) with an average score of approximately 0.35. This convergence of evidence from multiple optimisation methods supports the selection of \(k = 4\) as the optimal clustering solution.
Figure 12 provides detailed silhouette profiles for candidate cluster numbers. At \(k = 4\), the majority of EDUs exhibit positive silhouette values, and cluster sizes remain reasonably balanced (as shown by the relatively even distribution across the four colored bands). Higher values of \(k\) introduce clusters with numerous near-zero or negative silhouette values, indicating poor separation and excessive fragmentation. These visualisations confirm that \(k = 4\) provides the most coherent and stable clustering structure.
Declaration of Authorship
I certify that the work presented here is, to the best of my knowledge and belief,
original and the result of my own investigations, except as acknowledged, and has not
been submitted, either in part or whole, for another degree at this or any other
university.
______________________ Berlin, Germany, December 5, 2025
Artur Begichev