The Zipf Mystery - Summary

Summary

The video discusses Zipf's Law, which describes the distribution of word usage in natural languages. The speaker explains that the most frequently used word in the English language is "the," followed by "of," "and," "to," "a," "in," "is," "I," "that," "it," "for," "you," "was," "with," "on," "as," "have," "but," "be," "they," and so on. The speaker also notes that the frequency of a word is roughly proportional to one over its rank: the second most used word appears about half as often as the most used word, the third about a third as often, and so on. This pattern is observed across many languages, including ancient languages that have not yet been translated.
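The rank-frequency relationship described above can be checked on any text. The following is a minimal sketch, not the method used in the video: the function names and the tiny sample sentence are assumptions made for illustration, and a real test would need a large corpus.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def zipf_prediction(top_count, rank):
    """Count predicted by an idealized Zipf's law: top_count / rank."""
    return top_count / rank


# Hypothetical sample text, far too small to be a serious test.
sample = "the cat and the dog and the bird saw the fish"
table = rank_frequency(sample)
top_count = table[0][2]

# Print observed counts next to the idealized 1/rank prediction.
for rank, word, count in table[:3]:
    print(rank, word, count, zipf_prediction(top_count, rank))
```

On a large corpus, plotting the logarithm of the count against the logarithm of the rank should give a roughly straight line if the text follows Zipf's Law.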

Zipf's Law is also found in other areas such as city populations, solar flare intensities, protein sequences, immune receptors, website traffic, earthquake magnitudes, the number of times academic papers are cited, the diameter of Moon craters, and more. Although there are many theories about why language is 'zipf-y,' no definitive conclusions have been reached.

The video also discusses the Pareto Principle, which states that roughly 20% of the causes are responsible for roughly 80% of the outcome. A similar imbalance appears in language, where the most frequently used 18% of words account for over 80% of word occurrences.
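The link between a 1/rank distribution and this kind of imbalance can be illustrated with a short calculation. This is a sketch under stated assumptions: it uses an idealized Zipf distribution, and the vocabulary size of 100,000 words is a hypothetical value chosen for illustration, not a figure from the video.

```python
def zipf_share(vocabulary_size, top_fraction):
    """Share of all word occurrences covered by the most frequent
    `top_fraction` of words, assuming frequency is exactly 1/rank."""
    weights = [1.0 / rank for rank in range(1, vocabulary_size + 1)]
    top_k = round(vocabulary_size * top_fraction)
    return sum(weights[:top_k]) / sum(weights)


# Assumed vocabulary size; the result shifts slowly as this changes.
share = zipf_share(100_000, 0.18)
print(f"Top 18% of words cover {share:.1%} of occurrences")
```

Under these assumptions the top 18% of words cover somewhat more than 80% of occurrences, in line with the figure quoted above.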

The speaker then discusses the Principle of Least Effort, which George Zipf proposed as the explanation: as language developed in our species, a few words came to be used often and many words came to be used rarely. Recent research suggests that having a few short, frequently used, predictable words helps spread the information load on listeners more evenly.

The speaker also mentions that there are exponentially more possible long words than short words, so even randomly typed text produces words whose frequencies follow Zipf's Law. This could help explain why the law is observed in language. The speaker also discusses preferential attachment, a process in which something is given out according to how much is already possessed.
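The claim that there are exponentially more long words than short ones can be explored with a random-typing simulation, along the lines of the argument the video attributes to Benoit Mandelbrot. The sketch below is only an illustration: the four-letter alphabet, the number of keystrokes, and the function name are all assumptions chosen to keep the example small.

```python
import random
from collections import Counter


def random_typing_counts(num_keystrokes, alphabet="abcd", seed=0):
    """Press keys uniformly at random (letters plus the space bar)
    and count how often each resulting 'word' appears."""
    rng = random.Random(seed)
    keys = alphabet + " "
    text = "".join(rng.choice(keys) for _ in range(num_keystrokes))
    # split() treats runs of spaces as one separator, so no empty words.
    return Counter(text.split())


counts = random_typing_counts(200_000)
ranked = counts.most_common()

# Show the word and its count at a few ranks.
for rank in (1, 5, 21):
    word, count = ranked[rank - 1]
    print(rank, repr(word), count)
```

Short words are individually more likely to be typed, but there are far more distinct long words, so the counts fall off with rank in steps that approximate a power law.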

In the end, the speaker suggests that the mysterious Zipfian distribution of words in language might result from a combination of random naming, preferential attachment, and the principle of least effort when speaking and listening.
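The preferential attachment process mentioned above can be sketched as a simulation in which items are repeatedly picked at random and their chains linked together. This is an assumed, simplified model for illustration only: the function name, the number of items, and the number of linking steps are hypothetical, and how closely the result resembles a Zipfian distribution depends on how long the process runs.

```python
import random


def link_chains(num_items, num_links, seed=0):
    """Repeatedly pick two items at random and join their chains.
    Returns the chain lengths, longest first."""
    rng = random.Random(seed)
    chain_of = list(range(num_items))            # chain id of each item
    members = {i: [i] for i in range(num_items)}  # chain id -> its items
    for _ in range(num_links):
        # Picking a random item favors long chains: they hold more items.
        a = chain_of[rng.randrange(num_items)]
        b = chain_of[rng.randrange(num_items)]
        if a == b:
            continue  # both items are already in the same chain
        if len(members[a]) < len(members[b]):
            a, b = b, a  # merge the shorter chain into the longer one
        for item in members[b]:
            chain_of[item] = a
        members[a].extend(members.pop(b))
    return sorted((len(m) for m in members.values()), reverse=True)


sizes = link_chains(10_000, 5_000)
print("longest chains:", sizes[:5])
print("chains of length 1:", sizes.count(1))
```

Because an item is picked uniformly at random, a chain is chosen with probability proportional to its length, which is the "rich get richer" mechanism the summary describes.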

Facts

1. The word "the" is the most used word in the English language, accounting for about 1 out of every 16 words encountered daily.
2. The top 20 most common English words are "the," "of," "and," "to," "a," "in," "is," "I," "that," "it," "for," "you," "was," "with," "on," "as," "have," "but," "be," "they."
3. Zipf's Law, which states that the frequency of any word in a language is inversely proportional to its rank in the frequency table, applies not only to English but also to other languages, including ancient ones.
4. According to WordCount.org, which ranks words as found in the British National Corpus, "sauce" is the 5,555th most common English word.
5. The word "the" appears about 181 million times in the entire Gutenberg Corpus of public domain books.
6. Zipf's law is also found in various other real-world processes, such as city populations, solar flare intensities, protein sequences, immune receptors, the amount of traffic websites get, earthquake magnitudes, the number of times academic papers are cited, last names, the firing patterns of neural networks, ingredients used in cookbooks, the number of phone calls people receive, the diameter of Moon craters, the number of people who die in wars, the popularity of opening chess moves, and the rate at which we forget.
7. Zipf's law was popularized by George Zipf, a linguist at Harvard University. It is a discrete form of the continuous Pareto distribution from which we get the Pareto Principle.
8. The Pareto Principle states that, as a rule of thumb, 20% of the causes are responsible for 80% of the outcome. A similar pattern holds in language, where the most frequently used 18% of words account for over 80% of word occurrences.
9. George Zipf himself thought languages' interesting rank frequency distribution was a consequence of the Principle of Least Effort, the tendency for life and things to follow the path of least resistance.
10. A few years after Zipf's seminal paper, Benoit Mandelbrot showed that there may be nothing mysterious about Zipf's law at all, because even if you just randomly type on a keyboard you will produce words distributed according to Zipf's law.
11. There are exponentially more different long words than short words. For instance, the English alphabet can be used to make 26 one-letter words, but 26 squared (676) two-letter words.
12. One suggestion in the video is that Zipf's law describes what naturally happens when you segment the observable world and the mental world into labels.
13. Preferential attachment can produce a similar pattern. When items are repeatedly picked at random and linked into chains, the longer a chain gets, the greater the proportion of the whole it contains, which gives it a better chance of being picked up in the future and consequently made even longer.
14. More often than not, after a while this process yields a distribution that looks 'Zipf-ian': a small number of chains contain a disproportionate amount of the total count.
15. A 'hapax legomenon' is a word that is used only once in a given selection of words; the word "quizzaciously" is given as an example.
16. Most of what you experience on a day-to-day basis is forgotten at a rate quite similar to Zipf's law.
17. Ralph Waldo Emerson once said, "I cannot remember the books I've read any more than the meals I have eaten. Even so, they have made me."