Concept: Zipf's law
The largest cities, the most frequently used words, the incomes of the richest countries, and the wealthiest billionaires can all be described in terms of Zipf’s Law, a rank-size rule capturing the relation between the frequency of a set of objects or events and their size. It is assumed to be one of many manifestations of an underlying power law like Pareto’s or Benford’s, but contrary to popular belief, from a distribution of, say, city sizes and a simple random sampling, one does not obtain Zipf’s law for the largest cities. This pathology is reflected in the fact that Zipf’s Law has a functional form depending on the number of events N. Zipf’s Law requires a fundamental property of the sample distribution, which we call ‘coherence’, and which corresponds to a ‘screening’ between various elements of the set. We show how it should be accounted for when fitting Zipf’s Law.
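As a minimal illustration of the rank-size rule described above, the following Python sketch ranks items by frequency and estimates the slope of log(frequency) against log(rank) by ordinary least squares. The function names and the synthetic sample are assumptions made for illustration. This naive regression is not the fitting procedure of the abstract, which cautions that the fit depends on how the sample was obtained.

```python
from collections import Counter
import math

def rank_frequency(tokens):
    """Return item frequencies in descending order (index 0 is rank 1)."""
    return sorted(Counter(tokens).values(), reverse=True)

def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic, idealised sample (an assumption for illustration):
# item k occurs round(1000 / k) times, so frequency is roughly 1000 / rank.
tokens = [k for k in range(1, 51) for _ in range(round(1000 / k))]
freqs = rank_frequency(tokens)
slope = zipf_slope(freqs)  # close to -1 for this idealised sample
```

A slope near -1 on such a plot is the classic Zipf signature. On real data the estimate is sensitive to the range of ranks included and to the sampling procedure.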
Zipf’s law on word frequency and Heaps' law on the growth of distinct words are observed in the Indo-European language family, but they do not hold for languages like Chinese, Japanese and Korean. These languages consist of characters and have very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf’s exponent diverges. Indeed, the character frequency decays exponentially in the Zipf’s plot. (ii) The number of distinct characters grows with the text length in three stages: it grows linearly in the beginning, then turns to a logarithmic form, and eventually saturates. A theoretical model for the writing process is proposed, which embodies the rich-get-richer mechanism and the effects of limited dictionary size. Experiments, simulations and analytical solutions agree well with each other. This work refines the understanding of Zipf’s and Heaps' laws in human language systems.
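The growth of distinct symbols with text length (the quantity behind Heaps' law) can be measured as in the Python sketch below. The toy text is drawn independently from a finite dictionary with weights proportional to 1/k. That is an assumption chosen only to show saturation under a limited dictionary, not the rich-get-richer writing model proposed in the abstract. The dictionary size, text length and seed are likewise illustrative.

```python
import random

def vocabulary_growth(tokens):
    """Return the number of distinct tokens seen after each position."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

# Toy text (assumption): independent draws from a finite dictionary
# of 200 symbols with weights proportional to 1 / k.
rng = random.Random(0)
dictionary_size = 200
weights = [1.0 / k for k in range(1, dictionary_size + 1)]
tokens = rng.choices(range(dictionary_size), weights=weights, k=20000)

growth = vocabulary_growth(tokens)
# growth can never exceed dictionary_size, so the curve must flatten
# once most symbols have appeared.
```

Plotting `growth` against position on logarithmic axes shows where the curve departs from a straight line as the dictionary is exhausted.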
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity, which, unlike the Zipf and Heaps laws, is dynamical in nature.
Novelties are a familiar part of daily life. They are also fundamental to the evolution of biological systems, human society, and technology. By opening new possibilities, one novelty can pave the way for others in a process that Kauffman has called “expanding the adjacent possible”. The dynamics of correlated novelties, however, have yet to be quantified empirically or modeled mathematically. Here we propose a simple mathematical model that mimics the process of exploring a physical, biological, or conceptual space that enlarges whenever a novelty occurs. The model, a generalization of Polya’s urn, predicts statistical laws for the rate at which novelties happen (Heaps' law) and for the probability distribution on the space explored (Zipf’s law), as well as signatures of the process by which one novelty sets the stage for another. We test these predictions on four data sets of human activity: the edit events of Wikipedia pages, the emergence of tags in annotation systems, the sequence of words in texts, and listening to new songs in online music catalogues. By quantifying the dynamics of correlated novelties, our results provide a starting point for a deeper understanding of the adjacent possible and its role in biological, cultural, and technological evolution.
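One way to picture an urn that enlarges whenever a novelty occurs is the Python sketch below. The abstract does not give the update rule, so everything here is an assumption made for illustration: the parameter names `rho` (copies added to reinforce a drawn colour), `nu` (so that `nu + 1` brand-new colours enter when a colour is drawn for the first time) and `n0` (initial colours) are hypothetical. The sketch should not be read as the authors' exact model.

```python
import random

def urn_with_triggering(steps, rho=4, nu=2, n0=3, seed=0):
    """Simulate a Polya-type urn whose colour space grows on each novelty.

    Returns the sequence of drawn colours and, for each step, the number
    of distinct colours drawn so far.
    """
    rng = random.Random(seed)
    urn = list(range(n0))       # one ball of each initial colour
    next_colour = n0            # label for the next unseen colour
    seen = set()
    sequence = []
    distinct = []
    for _ in range(steps):
        ball = rng.choice(urn)
        sequence.append(ball)
        urn.extend([ball] * rho)            # reinforcement of the drawn colour
        if ball not in seen:                # a novelty occurred
            seen.add(ball)
            urn.extend(range(next_colour, next_colour + nu + 1))
            next_colour += nu + 1           # the space of possibilities expands
        distinct.append(len(seen))
    return sequence, distinct

sequence, distinct = urn_with_triggering(5000)
```

The `distinct` series plays the role of a Heaps-type growth curve, and the colour counts in `sequence` give a rank-frequency distribution that can be compared with Zipf's law.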
Despite being a paradigm of quantitative linguistics, Zipf’s law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. So, we can summarize the current support of Zipf’s law in texts as anecdotal. We try to solve these issues by studying three different versions of Zipf’s law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30 000 texts). To do so we use state-of-the-art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf’s law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value), and with only one free parameter (the exponent).
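The complementary cumulative distribution function of word frequencies mentioned above can be computed empirically as in this Python sketch. The function name and the toy sentence are assumptions for illustration. The sketch only builds the empirical curve and does not perform the fitting or goodness-of-fit tests used in the study.

```python
from collections import Counter

def frequency_ccdf(tokens):
    """Map each observed frequency n to the fraction of word types
    that occur at least n times."""
    type_freqs = Counter(tokens).values()   # frequency of each word type
    counts = Counter(type_freqs)            # n -> number of types with frequency n
    total = sum(counts.values())
    ccdf = {}
    remaining = total
    for n in sorted(counts):
        ccdf[n] = remaining / total
        remaining -= counts[n]
    return ccdf

tokens = "the cat and the dog and the bird".split()
ccdf = frequency_ccdf(tokens)
# "the" occurs 3 times, "and" twice, the other three words once each,
# so ccdf == {1: 1.0, 2: 0.4, 3: 0.2}
```

A pure power law in this function appears as a straight line when `ccdf` is plotted on doubly logarithmic axes.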
In spite of decades of theorizing, the origins of Zipf’s law remain elusive. I propose that a Zipfian distribution straightforwardly follows from the interaction of syntax (word classes differing in class size) and semantics (words having to be sufficiently specific to be distinctive and sufficiently general to be reusable). These factors are independently motivated and well-established ingredients of a natural-language system. Using a computational model, it is shown that neither of these ingredients suffices to produce a Zipfian distribution on its own and that the results deviate from the Zipfian ideal only in the same way as natural language itself does.
- Journal of the Royal Society, Interface / the Royal Society
- Published over 4 years ago
The distribution of firms' growth and firms' sizes is a topic under intense scrutiny. In this paper, we show that a thermodynamic model based on the maximum entropy principle, with dynamical prior information, can be constructed that adequately describes the dynamics and distribution of firms' growth. Our theoretical framework is tested against a comprehensive database of Spanish firms, which covers, to a very large extent, Spain’s economic activity, with a total of 1 155 142 firms evolving over a full decade. We show that the empirical exponent of Pareto’s law, a rule often observed in the rank distribution of large-size firms, is explained by the capacity of the economic system for creating/destroying firms, and that it can be used to measure the health of a capitalist-based economy. Indeed, our model predicts that when the exponent is larger than 1, creation of firms is favoured; when it is smaller than 1, destruction of firms is favoured instead; and when it equals 1 (matching Zipf’s law), the system is in a full macroeconomic equilibrium, entailing ‘free’ creation and/or destruction of firms. For medium and smaller firm sizes, the dynamical regime changes, the whole distribution can no longer be fitted to a single simple analytical form, and numerical prediction is required. Our model constitutes the basis for a full predictive framework regarding the economic evolution of an ensemble of firms. Such a structure can potentially be used to develop simulations and test hypothetical scenarios, such as economic crises or the response to specific policy measures.
The results from urban scaling in recent years have held the promise of increased efficiency to the societies who could actively control the distribution of their cities' size. However, little evidence exists as to the factors which influence the level of urban unevenness, as expressed by the slope of the rank-size distribution, partly because the diversity of results found in the literature follows the heterogeneity of analysis specifications. In this study, I set up a meta-analysis of Zipf’s law which accounts for technical as well as topical factors of variations of Zipf’s coefficient. I found 86 studies publishing at least one empirical estimation of this coefficient and recorded their metadata into an open database. I regressed the 1962 corresponding estimates with variables describing the study and the estimation process as well as socio-demographic variables describing the territory under enquiry. A dynamic meta-analysis was also performed to look for factors of evolution of city size unevenness. The results of the most interesting models are presented in the article, whereas all analyses can be reproduced on a dedicated online platform. The results show that on average, 40% of the variation of Zipf’s coefficients is due to the technical choices. The main other variables associated with distinct evolutions are linked to the urbanisation process rather than the process of economic development and population growth. Finally, no evidence was found to support the effectiveness of past planning actions in modifying this urban feature.
Using public data (Forbes Global 2000) we show that the asset sizes for the largest global firms follow a Pareto distribution in an intermediate range, that is “interrupted” by a sharp cut-off in its upper tail, where it is totally dominated by financial firms. This flattening of the distribution contrasts with a large body of empirical literature which finds a Pareto distribution for firm sizes both across countries and over time. Pareto distributions are generally traced back to a mechanism of proportional random growth, based on a regime of constant returns to scale. This makes our findings of an “interrupted” Pareto distribution all the more puzzling, because we provide evidence that financial firms in our sample should operate in such a regime. We claim that the missing mass from the upper tail of the asset size distribution is a consequence of shadow banking activity and that it provides an (upper) estimate of the size of the shadow banking system. This estimate, which we propose as a shadow banking index, compares well with estimates of the Financial Stability Board until 2009, but it shows a sharper rise in shadow banking activity after 2010. Finally, we propose a proportional random growth model that reproduces the observed distribution, thereby providing a quantitative estimate of the intensity of shadow banking activity.
Motivated by the method of interpolating inequalities that makes use of the improved Jensen-type inequalities, in this paper we integrate this approach with the well-known Zipf-Mandelbrot law applied to various types of f-divergences and distances, such as the Kullback-Leibler divergence, Hellinger distance, Bhattacharyya distance (via coefficient), [Formula: see text]-divergence, total variation distance and triangular discrimination. Addressing these applications, we first deduce general results of the type for the Csiszár divergence functional from which the listed divergences originate. When presenting the analyzed inequalities for the Zipf-Mandelbrot law, we accentuate its special form, the Zipf law, with its specific role in linguistics. We introduce this aspect through the Zipfian word distribution associated with the English and Russian languages, using the obtained bounds for the Kullback-Leibler divergence.
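For concreteness, the Python sketch below evaluates the Kullback-Leibler divergence between two Zipf-Mandelbrot laws numerically, using the standard form in which the probability of rank k among N ranks is proportional to (k + q)^(-s). The parameter values are illustrative assumptions and are not taken from the paper. The sketch computes the divergence directly rather than the bounds the paper derives.

```python
import math

def zipf_mandelbrot_pmf(n, q, s):
    """Probabilities proportional to (k + q) ** (-s) for ranks k = 1..n."""
    weights = [(k + q) ** (-s) for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, r):
    """Kullback-Leibler divergence D(p || r) in nats."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

# Illustrative parameters (assumptions): q = 0 gives the plain Zipf law.
p = zipf_mandelbrot_pmf(1000, q=0.0, s=1.0)
r = zipf_mandelbrot_pmf(1000, q=2.7, s=1.2)
d = kl_divergence(p, r)
```

Setting `q=0.0` recovers the Zipf law as the special case of the Zipf-Mandelbrot law that the abstract singles out.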