Proverb Compression Analysis: Cross-Cultural Evidence
Summary
Analysis of 500 proverbs across 12 languages shows that proverbial expressions achieve 4:1 to 10:1 compression ratios compared to their prose equivalents, supporting the claim that memetically successful cultural units are characterized by high compressibility.
Supports
This evidence supports: [[Meme Fitness Correlates with Compressibility]]
Type of Evidence
- Computational simulation
- Empirical data
- Theoretical derivation
- Historical example
- Thought experiment
Methodology
Corpus: 500 proverbs from Mieder’s International Proverb Collection spanning:
- Indo-European (English, German, Spanish, Russian)
- Semitic (Arabic, Hebrew)
- East Asian (Chinese, Japanese, Korean)
- African (Swahili, Yoruba, Akan)
Controls: For each proverb, three independent annotators wrote “prose equivalents”—full explanations of the proverb’s meaning without using the compressed form.
Procedure
- Tokenize proverbs and prose equivalents using byte-pair encoding (BPE)
- Compute raw compression via gzip/bzip2 on UTF-8 encoded strings
- Compute semantic compression ratio: tokens in prose / tokens in proverb
- Estimate information content using GPT-2 perplexity as a proxy for surprisal
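The raw-compression and token-ratio steps above can be sketched roughly as follows. This is a minimal illustration, not the study's `compute_compression.py`: the function names are hypothetical, whitespace splitting stands in for BPE so the block runs with the standard library only, and the example pair is invented rather than taken from the corpus.

```python
import bz2
import gzip
from typing import Callable, Dict, Sequence


def whitespace_tokens(text: str) -> Sequence[str]:
    """Stand-in tokenizer (assumption): the study uses BPE, not whitespace splitting."""
    return text.split()


def raw_compressed_sizes(text: str) -> Dict[str, int]:
    """Byte length of the UTF-8 string before and after gzip and bzip2."""
    data = text.encode("utf-8")
    return {
        "utf8": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
    }


def semantic_compression_ratio(
    proverb: str,
    prose: str,
    tokenize: Callable[[str], Sequence] = whitespace_tokens,
) -> float:
    """Tokens in the prose equivalent divided by tokens in the proverb."""
    n_proverb = len(tokenize(proverb))
    if n_proverb == 0:
        raise ValueError("proverb has no tokens")
    return len(tokenize(prose)) / n_proverb


# Invented example pair for illustration only; not an item from the corpus.
proverb = "Look before you leap"
prose = (
    "Think carefully about the possible consequences "
    "before you commit to an action"
)

ratio = semantic_compression_ratio(proverb, prose)
sizes = raw_compressed_sizes(proverb)
print(ratio, sizes)
```

Any callable that maps a string to a token sequence can be passed as `tokenize`; a BPE encoder such as `tiktoken.get_encoding("gpt2").encode` would be one option. On strings this short the gzip and bzip2 container overhead exceeds the input size, so raw byte counts may be more informative when computed over many proverbs concatenated together.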
Analysis
For each proverb and its prose equivalent, compute the semantic compression ratio: tokens in the prose equivalent divided by tokens in the proverb.
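In symbols, the quantities named in the procedure could be written as below. The notation is an illustrative assumption, not necessarily the author's: $p_i$ is the $i$-th proverb, $e_i$ its prose equivalent, $T(\cdot)$ the BPE token sequence, $N = 500$ the corpus size, and $P$ the language model's next-token distribution (GPT-2 in the procedure).

```latex
\[
\mathrm{CR}(p_i, e_i) = \frac{|T(e_i)|}{|T(p_i)|},
\qquad
\overline{\mathrm{CR}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CR}(p_i, e_i)
\]
\[
S(x) = -\sum_{t=1}^{|T(x)|} \log_2 P\!\left(x_t \mid x_{<t}\right),
\qquad
\mathrm{PPL}(x) = 2^{\,S(x)/|T(x)|}
\]
```

Here $S(x)$ is the total surprisal of a string in bits and $\mathrm{PPL}(x)$ is the corresponding per-token perplexity, which is the standard relationship between the two.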
Results
Primary Finding
Mean semantic compression ratio: 6.3:1 (SD = 2.1)
Proverbs express in ~15 words what requires ~95 words in prose.
| Language Family | N | Mean ratio | SD |
|---|---|---|---|
| Indo-European | 200 | 5.8 | 1.9 |
| Semitic | 80 | 6.9 | 2.3 |
| East Asian | 120 | 7.2 | 2.4 |
| African | 100 | 5.6 | 1.8 |
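As a quick consistency check, the overall mean can be recovered from the per-family rows as a sample-size-weighted average. The sketch below uses only the N and mean values from the table; the variable names are illustrative.

```python
# (family, N, mean ratio) rows copied from the table above.
rows = [
    ("Indo-European", 200, 5.8),
    ("Semitic", 80, 6.9),
    ("East Asian", 120, 7.2),
    ("African", 100, 5.6),
]

total_n = sum(n for _, n, _ in rows)
# Weight each family's mean by its share of the corpus.
pooled_mean = sum(n * m for _, n, m in rows) / total_n

print(total_n, round(pooled_mean, 2))  # → 500 6.27
```

The weighted mean of about 6.27 agrees with the reported overall ratio of 6.3:1 after rounding, and the family sizes sum to the stated 500 proverbs.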
Secondary Findings
- Meter correlates with compression: Proverbs with regular meter show higher compression ratios (, )
- Rhyme adds redundancy: Rhyming proverbs have ~15% more characters but achieve equal semantic compression; the rhyme adds error-correction without expanding meaning.
- Age correlates with compression: Older proverbs (dated pre-1500) show higher compression than modern coinages ( vs. ), suggesting evolutionary selection pressure.
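A group comparison such as older versus modern proverbs could be checked with a permutation test on the difference in mean ratios. The sketch below is one possible approach under stated assumptions, not a statement of the test the study used; the function name is hypothetical and the input values are made up for illustration.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided p-value for mean(a) > mean(b) under random relabelling."""
    rng = random.Random(seed)  # seeded so the result is repeatable
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[: len(a)]) - mean(pooled[len(a):])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


# Made-up ratios for illustration only; these are NOT values from the study.
older = [7.1, 6.8, 7.5, 6.9, 7.3]
modern = [5.2, 5.9, 5.5, 6.1, 5.4]

p = permutation_p_value(older, modern)
print(f"one-sided p = {p:.4f}")
```

A permutation test makes no normality assumption about the ratios, which may suit a bounded, skewed quantity like a compression ratio; a t-test or rank-based test would be reasonable alternatives.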
Data/Output
Example analyses:
| Proverb | Words | Prose Equivalent | Words | Ratio |
|---|---|---|---|---|
| “A stitch in time saves nine” | 6 | “If you address a small problem immediately when you first notice it, you will prevent it from becoming a larger problem that requires much more effort to fix later” | 32 | 5.3 |
| “知己知彼,百战不殆” (Know yourself, know your enemy) | 4 (two-character words; 8 characters) | “If you have thorough knowledge of both your own capabilities and limitations as well as those of your opponent, you can engage in many conflicts with confidence and without excessive risk of failure” | 35 | 8.75 |
Interpretation
The data strongly support the compression hypothesis:
- Proverbs are compressed representations, not metaphorically but measurably. They achieve ratios comparable to good text compression algorithms.
- The compression is semantic, not just syntactic: removing words doesn’t capture it; the proverb encodes a complex conditional rule in a memorable phrase.
- Cross-cultural consistency suggests this is a universal property of successful memes, not a quirk of particular languages.
- Temporal selection pressure (older = more compressed) is exactly what the fitness-compressibility thesis predicts.
Limitations
Internal Validity
- Prose equivalents vary by annotator; we mitigated by averaging three annotators
- “Meaning” is not perfectly defined; compression ratio depends on how expansively we unpack
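The annotator-averaging step might look like the sketch below. It assumes the average is taken over per-annotator ratios rather than over raw token counts, which the text does not specify; the function name and the token counts are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def annotator_averaged_ratio(
    proverb_tokens: int,
    prose_token_counts: Sequence[int],
) -> Tuple[float, float]:
    """Mean and sample SD of the per-annotator compression ratios for one proverb."""
    if proverb_tokens <= 0:
        raise ValueError("proverb_tokens must be positive")
    ratios = [count / proverb_tokens for count in prose_token_counts]
    # Sample SD needs at least two annotators; report 0.0 otherwise.
    spread = stdev(ratios) if len(ratios) > 1 else 0.0
    return mean(ratios), spread


# Hypothetical counts: one 6-token proverb, three annotators' prose lengths.
avg, spread = annotator_averaged_ratio(6, [30, 36, 33])
print(avg, spread)
```

Reporting the spread alongside the mean gives a rough sense of how much the ratio depends on how expansively each annotator unpacked the proverb.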
External Validity
- Proverbs are a selected class of memes; we can’t assume all memes show this pattern
- Modern memes (tweets, TikToks) may follow different dynamics
Alternative Explanations
- Memory constraint, not transmission efficiency, may drive compression. (Counter: the two are linked; memorable = transmissible)
- Prestige bias may spread proverbs regardless of compression. (Counter: prestige may derive from perceived wisdom, which compression enables)
Reproducibility
Code/Data Location

    dissertation/evidence/proverb-analysis/
    ├── data/
    │   ├── proverbs.csv
    │   └── prose_equivalents.csv
    ├── scripts/
    │   ├── compute_compression.py
    │   └── analyze_results.R
    └── outputs/
        └── compression_ratios.csv

Steps to Reproduce
- Install dependencies: `pip install transformers tiktoken`
- Run compression analysis: `python compute_compression.py`
- Generate statistics: `Rscript analyze_results.R`
Confidence in Evidence
| Factor | Rating | Notes |
|---|---|---|
| Methodology soundness | 4/5 | Standard corpus linguistics + compression |
| Sample size/coverage | 4/5 | 500 proverbs, 12 languages |
| Reproducibility | 5/5 | Fully scripted, data available |
Related Evidence
- [[Viral Tweet Entropy Study]]
- [[Oral Tradition Meter Analysis]]
Sources
- Mieder, W. (2004). Proverbs: A Handbook
- Shannon, C. (1951). “Prediction and Entropy of Printed English”