Finding Marx When the Words Have Changed

How do you trace the influence of an idea when the words used to express it keep changing?

That question came up in a collaboration between the Center for Digital Humanities and Professor Ed Baring in Princeton’s Department of History. The historical setting is the “Revisionism Controversy,” a debate beginning around 1897 that is often understood as foreshadowing the split between socialism and communism. One way to study that intellectual shift is to ask a deceptively simple set of questions: Who cited Marx? When did they cite him? And what, exactly, did they cite?

The corpus we were interested in was Die Neue Zeit, a German socialist periodical with a large run of issues, and the source texts were Karl Marx’s Communist Manifesto and Das Kapital. If we could detect passages in Die Neue Zeit that reused language from Marx, we could begin to map how Marx’s texts were being invoked over time. But there was an immediate complication: quotations do not always look like quotations.

Sometimes a writer quotes exactly. Sometimes they abbreviate. Sometimes they paraphrase. Sometimes they translate. A method that only searches for identical strings of text will find some of the evidence, but it will miss many of the historically interesting cases.

That is the problem remarx is designed to explore: finding textual reuse across languages and large corpora, even when the reused passage is not an exact character-for-character match. The project is developed by the CDH RSE team, and the repository is available at https://github.com/Princeton-CDH/remarx.

Why Exact Matching Wasn’t Enough

The traditional approach to this kind of problem is string matching. In practical terms, that means looking for exact character sequences, repeated word patterns, character n-grams, or overlapping sets of words. These methods can work very well when the quotation is close to the original, or when the differences are small: punctuation changes, spelling variants, or a few extra words.
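To make the surface-form idea concrete, here is a minimal sketch of one such method: Jaccard overlap between sets of character trigrams. The function names and the choice of n = 3 are illustrative assumptions, not part of remarx. The example sentences are the opening words of the Communist Manifesto in German and in English.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "Ein Gespenst geht um in Europa"
near_quote = "Ein Gespenst geht um in Europa!"   # differs only in punctuation
translation = "A spectre is haunting Europe"     # same idea, different language

near_score = jaccard(char_ngrams(original), char_ngrams(near_quote))
translation_score = jaccard(char_ngrams(original), char_ngrams(translation))

print(f"near-exact quote: {near_score:.2f}")
print(f"translation:      {translation_score:.2f}")
```

The near-exact quotation scores close to 1, while the translation shares only a handful of trigrams with the original and scores far lower, even though a human reader would recognize it immediately.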

But for our use case, the failure modes mattered. String matching struggles when the words are reordered, when synonyms are substituted, when a passage is compressed into a shorter form, or when the reuse happens in another language. In other words, it struggles precisely when quotation becomes interpretation.

The question became: can we search for meaning rather than surface form?

This is where semantic embeddings enter the picture. An embedding is a numerical representation of text. Instead of treating a sentence as a sequence of characters, we use a pretrained neural language model to encode the sentence as a high-dimensional vector, in this case one with 768 dimensions. The useful property is that sentences with similar meanings should end up near each other in that vector space, even if they use different words.

For remarx, we used paraphrase-multilingual-mpnet-base-v2, a multilingual sentence embedding model available through Sentence Transformers. The “multilingual” part matters because the same general approach can support cross-lingual reuse detection: a German sentence and a translated or related sentence in another language can still be compared by their position in embedding space. The model documentation is here: https://docs.ionos.com/cloud/ai/ai-model-hub/models/embedding-models/paraphrase-multilingual-mpnet-v2.
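As a rough illustration of what "near each other in embedding space" means, the sketch below pairs a small wrapper around Sentence Transformers with a plain cosine similarity function. The wrapper name `embed` and the toy three-dimensional vectors are assumptions made for the example; real embeddings from this model have 768 dimensions, and remarx's own code may be organized differently.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means they point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def embed(sentences: list[str],
          model_name: str = "paraphrase-multilingual-mpnet-base-v2"):
    """Encode sentences as vectors (hypothetical helper, not remarx's API).

    The import is deferred so the rest of this sketch runs without the
    sentence-transformers package; the model is downloaded on first use.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return model.encode(sentences)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
marx_sentence = [0.9, 0.1, 0.2]
paraphrase = [0.8, 0.2, 0.3]
unrelated = [0.1, 0.9, 0.1]

print(cosine_similarity(marx_sentence, paraphrase))
print(cosine_similarity(marx_sentence, unrelated))
```

With real data, `embed([...])` would supply the vectors, and a German sentence and its English rendering would be expected to score higher than two unrelated sentences.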

One helpful way to think about the process is spatially. Every sentence becomes a point. The original Marx sentences form one set of points; the sentences from Die Neue Zeit form another. If a sentence from the periodical lands close to a sentence from Marx, it becomes a candidate for textual reuse. The slides showed this with a t-SNE projection, which is a visualization technique for compressing high-dimensional vectors into two dimensions so humans can inspect the structure. The actual search still happens in the full embedding space; the visualization is just a way to make the intuition visible.

From Sentences to Verifiable Matches

The remarx pipeline has two main components. The first is the Sentence Corpus Builder. Its job is to parse source texts into sentence-level CSV files. For the German texts, we used spaCy’s de_core_news_sm model for sentence segmentation. spaCy is a natural language processing library, and sentence segmentation is the step that decides where one sentence ends and the next begins. That may sound mundane, but it is foundational: if sentence boundaries are wrong, every downstream comparison becomes noisier.
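A sentence corpus of this kind could be built along the following lines. This is a sketch under stated assumptions: the column names (`sent_id`, `file`, `sent_index`, `text`) and both function names are hypothetical, not remarx's actual schema. Only `spacy.load("de_core_news_sm")` and iteration over `doc.sents` are standard spaCy usage.

```python
import csv
import io
from typing import Iterable, TextIO


def segment_german(text: str) -> list[str]:
    """Split German text into sentences with spaCy.

    The import is deferred so the CSV part of this sketch runs without spaCy;
    calling this function requires the de_core_news_sm model to be installed.
    """
    import spacy
    nlp = spacy.load("de_core_news_sm")
    return [sent.text.strip() for sent in nlp(text).sents]


def write_sentence_csv(sentences: Iterable[str], source_file: str, out: TextIO) -> int:
    """Write one row per sentence (hypothetical columns) and return the row count."""
    writer = csv.DictWriter(out, fieldnames=["sent_id", "file", "sent_index", "text"])
    writer.writeheader()
    count = 0
    for index, text in enumerate(sentences):
        writer.writerow({
            "sent_id": f"{source_file}:{index}",  # stable identifier for tracing back
            "file": source_file,
            "sent_index": index,
            "text": text,
        })
        count += 1
    return count


# Two already-segmented toy sentences, so the example runs without spaCy.
buffer = io.StringIO()
n_rows = write_sentence_csv(
    ["Ein Gespenst geht um in Europa.", "Es ist das Gespenst des Kommunismus."],
    "manifest.txt",
    buffer,
)
print(buffer.getvalue())
```

Keeping the sentence index in every row is what later allows a match to be traced back to its position in the source text.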

The second component is the Quote Finder. This is where the semantic search happens. The process has five main steps:

1. Embed: encode every sentence as a vector using the transformer model.
2. Index: build an approximate nearest-neighbor index using Voyager, a vector search library from Spotify.
3. Search: for each sentence in the reuse corpus, find the closest sentence from the original corpus.
4. Filter: keep only pairs above a chosen similarity threshold.
5. Consolidate: merge sequential sentence-level matches into longer multi-sentence passages.
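The Search and Filter steps can be sketched with an exhaustive comparison. Note the substitution: this example loops over every original sentence, whereas remarx uses an approximate nearest-neighbor index built with Voyager so that the search scales to a large corpus. The function names, the dictionary keys `reuse_index` and `original_index`, the 0.9 threshold, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def best_match(query: list[float], originals: list[list[float]]) -> tuple[int, float]:
    """Search: return (index, score) of the original vector closest to the query."""
    scores = [cosine(query, original) for original in originals]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


def find_candidates(reuse: list[list[float]],
                    originals: list[list[float]],
                    threshold: float) -> list[dict]:
    """Filter: keep only reuse sentences whose best match meets the threshold."""
    candidates = []
    for reuse_index, vector in enumerate(reuse):
        original_index, score = best_match(vector, originals)
        if score >= threshold:
            candidates.append({
                "reuse_index": reuse_index,
                "original_index": original_index,
                "match_score": score,
            })
    return candidates


# Toy 2-dimensional vectors standing in for sentence embeddings.
original_vectors = [[1.0, 0.0], [0.0, 1.0]]
reuse_vectors = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.95]]

candidates = find_candidates(reuse_vectors, original_vectors, threshold=0.9)
for c in candidates:
    print(c)
```

The middle reuse vector sits halfway between the two originals, so its best score falls below the threshold and it is dropped. Choosing that threshold is a research decision, not a purely technical one.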

The “nearest neighbor” idea is central. Once every sentence is represented as a vector, we can ask: which original sentence is closest to this candidate reuse sentence? Instead of a binary yes/no answer, the system produces a ranking based on similarity. That ranking matters because it lets a researcher inspect the strongest candidates first without hiding how uncertain the weaker ones are.

The output is designed to be checked by a human. Each detected pair includes a match_score, the reuse_text and original_text, metadata such as author, title, and page number for the reuse article, sentence indices on both sides, and a num_sentences field indicating whether the match is a single sentence or a consolidated multi-sentence passage. This was an important design decision: the tool should not behave like a black box. Every result should be traceable back to an exact position in the source text.
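The Consolidate step and the num_sentences field could work roughly as follows. The merge rule used here (two matches are joined when both the reuse index and the original index advance by exactly one) and the use of the mean as the passage score are assumptions for this sketch; remarx's actual rules may differ. The `*_start` and `*_end` keys are likewise hypothetical names.

```python
def consolidate(matches: list[dict]) -> list[dict]:
    """Merge runs of consecutive sentence-level matches into passages."""
    passages: list[dict] = []
    ordered = sorted(matches, key=lambda m: (m["reuse_index"], m["original_index"]))
    for match in ordered:
        last = passages[-1] if passages else None
        is_continuation = (
            last is not None
            and match["reuse_index"] == last["reuse_end"] + 1
            and match["original_index"] == last["original_end"] + 1
        )
        if is_continuation:
            # Extend the current passage by one sentence on both sides.
            last["reuse_end"] = match["reuse_index"]
            last["original_end"] = match["original_index"]
            last["scores"].append(match["match_score"])
        else:
            passages.append({
                "reuse_start": match["reuse_index"],
                "reuse_end": match["reuse_index"],
                "original_start": match["original_index"],
                "original_end": match["original_index"],
                "scores": [match["match_score"]],
            })
    for passage in passages:
        scores = passage.pop("scores")
        passage["num_sentences"] = len(scores)
        passage["match_score"] = sum(scores) / len(scores)  # assumed aggregation
    return passages


sentence_matches = [
    {"reuse_index": 10, "original_index": 3, "match_score": 0.95},
    {"reuse_index": 11, "original_index": 4, "match_score": 0.91},
    {"reuse_index": 20, "original_index": 7, "match_score": 0.93},
]

passages = consolidate(sentence_matches)
for p in passages:
    print(p)
```

Because each passage keeps its start and end indices on both sides, a reader can still go from a consolidated result back to the exact sentences it covers.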

For development and demonstration, I used Marimo, an interactive notebook environment. Notebooks are useful here because the workflow is exploratory: tune a threshold, inspect a few results, compare scores, look at false positives, adjust, and repeat. The goal is not only to produce a final dataset, but to create a workflow where historians and RSEs can reason together about what counts as meaningful textual reuse.

The next major step is evaluation. We need metrics that capture whether the system is finding the right kinds of reuse, not just whether it finds many matches. We also want to compare remarx with existing text reuse tools such as Passim and TRACER. That comparison should help clarify where semantic embedding methods are strongest, where traditional text reuse methods still perform better, and how much human validation is needed for different research questions.
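One conventional starting point for such an evaluation is precision and recall against a hand-annotated set of known reuse pairs. The sketch below is only that, a starting point: the pair format (reuse sentence index, original sentence index) and the toy data are invented for illustration, and the metrics the project eventually adopts may be different.

```python
def precision_recall(predicted: set[tuple[int, int]],
                     gold: set[tuple[int, int]]) -> tuple[float, float]:
    """Precision: share of predicted pairs that are correct.
    Recall: share of annotated pairs that were found."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example data: (reuse_index, original_index) pairs.
predicted_pairs = {(0, 0), (2, 1), (5, 3)}
annotated_pairs = {(0, 0), (2, 1), (7, 4), (9, 9)}

precision, recall = precision_recall(predicted_pairs, annotated_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact pair matching is strict. A paraphrase matched to a neighboring sentence would count as an error here, which is one reason human validation stays part of the picture.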

I’m also interested in extending the workflow to additional languages. The embedding model already supports more than 50 languages, and the sentence segmentation step is modular, so adding another language is largely a matter of plugging in the right segmentation model and testing the results carefully. That does not make multilingual reuse detection “solved,” but it does make the architecture flexible enough to ask broader questions.

What I learned from this project is that the hard part is not simply finding similarity. It is building a system where similarity can become evidence. For a historian, a high score is only the beginning of the conversation. The more useful result is a candidate passage with enough context, metadata, and traceability to decide whether it really matters.

That is where I think remarx can be most valuable: not as a machine that declares “this is a quotation,” but as a research tool that helps scholars find the places where ideas travel, change shape, and reappear.

More about the Center for Digital Humanities is available at https://cdh.princeton.edu.

The Princeton Research Software Engineering Group Blog

Princeton University
