
Proverb Compression Analysis: Cross-Cultural Evidence

Analysis of 500 proverbs across 12 languages shows that proverbial expressions achieve 4:1 to 10:1 compression ratios compared to their prose equivalents, supporting the claim that memetically successful cultural units are characterized by high compressibility.

This evidence supports: [[Meme Fitness Correlates with Compressibility]]

Evidence type: Empirical data

Corpus: 500 proverbs from Mieder’s International Proverb Collection spanning:

  • Indo-European (English, German, Spanish, Russian)
  • Semitic (Arabic, Hebrew)
  • East Asian (Chinese, Japanese, Korean)
  • African (Swahili, Yoruba, Akan)

Controls: For each proverb, three independent annotators wrote “prose equivalents”: full explanations of the proverb’s meaning that avoid the compressed form. Per-proverb measures were averaged across the three annotators.

Method:

  1. Tokenize proverbs and prose equivalents using byte-pair encoding (BPE)
  2. Compute raw compression via gzip/bzip2 on UTF-8 encoded strings
  3. Compute the semantic compression ratio: tokens in prose / tokens in proverb
  4. Estimate information content using GPT-2 perplexity as a proxy for surprisal (sketched below)
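Step 4 is the least standard ingredient, so a minimal sketch may help. It assumes the stock gpt2 checkpoint from Hugging Face transformers and uses the model’s built-in cross-entropy loss as the mean per-token negative log-likelihood; this is an illustration, not the repository’s compute_compression.py.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Stock GPT-2 as the probability model (an assumption; any fixed LM
# that assigns next-token probabilities would serve as the proxy).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """exp(mean per-token negative log-likelihood) under GPT-2."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean cross-entropy
        # over its next-token predictions.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("A stitch in time saves nine"))
```

Higher per-token perplexity means each token carries more surprisal, i.e. more information is packed into each token of the phrase.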

For each proverb $p$ with prose equivalent $e$:

$$\rho_{\text{semantic}} = \frac{|e|_{\text{tokens}}}{|p|_{\text{tokens}}} \qquad \rho_{\text{algorithmic}} = \frac{\text{gzip}(e)}{\text{gzip}(p)}$$
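Both ratios are cheap to compute. A sketch under stated assumptions: tiktoken’s gpt2 encoding stands in for the BPE tokenizer of step 1, and, following the Controls section, prose counts are averaged across the three annotators’ equivalents. The function names are illustrative, not the actual script’s.

```python
import gzip
import tiktoken

# tiktoken's GPT-2 vocabulary as the BPE tokenizer (an assumption).
enc = tiktoken.get_encoding("gpt2")

def n_tokens(text: str) -> int:
    """|text| in BPE tokens."""
    return len(enc.encode(text))

def gzip_bytes(text: str) -> int:
    """Compressed size in bytes of the UTF-8 encoded string."""
    return len(gzip.compress(text.encode("utf-8")))

def ratios(proverb: str, equivalents: list[str]) -> tuple[float, float]:
    """(rho_semantic, rho_algorithmic) for one proverb, with prose
    counts averaged across the annotators' equivalents."""
    mean_tok = sum(n_tokens(e) for e in equivalents) / len(equivalents)
    mean_zip = sum(gzip_bytes(e) for e in equivalents) / len(equivalents)
    return mean_tok / n_tokens(proverb), mean_zip / gzip_bytes(proverb)

rho_sem, rho_alg = ratios(
    "A stitch in time saves nine",
    ["If you address a small problem immediately when you first notice it, "
     "you will prevent it from becoming a larger problem that requires much "
     "more effort to fix later"],
)
print(f"semantic {rho_sem:.1f}:1, algorithmic {rho_alg:.1f}:1")
```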

Results: Mean semantic compression ratio of 6.3:1 (SD = 2.1).

Proverbs express in ~15 words what requires ~95 words in prose.

| Language Family | N | Mean $\rho_{\text{semantic}}$ | SD |
| --- | --- | --- | --- |
| Indo-European | 200 | 5.8 | 1.9 |
| Semitic | 80 | 6.9 | 2.3 |
| East Asian | 120 | 7.2 | 2.4 |
| African | 100 | 5.6 | 1.8 |
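The per-family rows can be regenerated from the pipeline’s output file. A sketch assuming outputs/compression_ratios.csv carries columns named language_family and rho_semantic; the repository’s actual schema may differ.

```python
import pandas as pd

# Column names are assumptions about the output CSV's schema.
df = pd.read_csv("outputs/compression_ratios.csv")

summary = (
    df.groupby("language_family")["rho_semantic"]
      .agg(N="count", mean="mean", SD="std")
      .round(1)
)
print(summary)
```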
Key findings:

  1. Meter correlates with compression: Proverbs with regular meter show higher compression ratios ($r = 0.34$, $p < 0.001$; see the sketch after this list)

  2. Rhyme adds redundancy: Rhyming proverbs have ~15% more characters but achieve equal semantic compression—the rhyme adds error-correction without expanding meaning.

  3. Age correlates with compression: Older proverbs (dated pre-1500) show higher compression than modern coinages ($\rho = 7.1$ vs. $\rho = 4.8$), suggesting evolutionary selection pressure.
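The meter correlation in finding 1 is a plain Pearson test. A sketch assuming a hypothetical meter_regularity column (a per-proverb metrical-regularity score) sits alongside rho_semantic in the output CSV; both column names are assumptions.

```python
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("outputs/compression_ratios.csv")

# 'meter_regularity' is a hypothetical per-proverb score of metrical
# regularity; the repository's actual feature name may differ.
r, p = pearsonr(df["meter_regularity"], df["rho_semantic"])
print(f"r = {r:.2f}, p = {p:.3g}")  # reported above: r = 0.34, p < 0.001
```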

Example analyses:

| Proverb | Length | Prose Equivalent | Words | $\rho$ |
| --- | --- | --- | --- | --- |
| “A stitch in time saves nine” | 6 words | “If you address a small problem immediately when you first notice it, you will prevent it from becoming a larger problem that requires much more effort to fix later” | 32 | 5.3 |
| “知己知彼，百战不殆” (“Know yourself and know your enemy, and in a hundred battles you will never be in peril”) | 4 two-character words | “If you have thorough knowledge of both your own capabilities and limitations as well as those of your opponent, you can engage in many conflicts with confidence and without excessive risk of failure” | 35 | 8.75 |

The data strongly support the compression hypothesis:

  1. Proverbs are compressed representations—not metaphorically, but measurably. They achieve ratios comparable to good text compression algorithms.

  2. The compression is semantic, not just syntactic—removing words doesn’t capture it; the proverb encodes a complex conditional rule in a memorable phrase.

  3. Cross-cultural consistency suggests this is a universal property of successful memes, not a quirk of particular languages.

  4. Temporal selection pressure (older = more compressed) is exactly what the fitness-compressibility thesis predicts.

Limitations:

  • Prose equivalents vary by annotator; we mitigated this by averaging across three annotators
  • “Meaning” is not perfectly defined; the compression ratio depends on how expansively we unpack it
  • Proverbs are a selected class of memes; we can’t assume all memes show this pattern
  • Modern memes (tweets, TikToks) may follow different dynamics
  • Memory constraint, not transmission efficiency, may drive compression. (Counter: the two are linked; memorable means transmissible)
  • Prestige bias may spread proverbs regardless of compression. (Counter: prestige may derive from perceived wisdom, which compression enables)
Repository layout:

```
dissertation/evidence/proverb-analysis/
├── data/
│   ├── proverbs.csv
│   └── prose_equivalents.csv
├── scripts/
│   ├── compute_compression.py
│   └── analyze_results.R
└── outputs/
    └── compression_ratios.csv
```
To reproduce:

  1. Install dependencies: `pip install transformers tiktoken`
  2. Run the compression analysis: `python compute_compression.py`
  3. Generate statistics: `Rscript analyze_results.R`
| Factor | Rating | Notes |
| --- | --- | --- |
| Methodology soundness | 4/5 | Standard corpus linguistics + compression |
| Sample size/coverage | 4/5 | 500 proverbs, 12 languages |
| Reproducibility | 5/5 | Fully scripted, data available |
Related:

  • [[Viral Tweet Entropy Study]]
  • [[Oral Tradition Meter Analysis]]

References:

  • Mieder, W. (2004). *Proverbs: A Handbook*.
  • Shannon, C. E. (1951). “Prediction and Entropy of Printed English.” *Bell System Technical Journal*, 30(1), 50–64.