Rebuttal of Sproat, Farmer, et al.’s supposed “refutation”
[Updated: July, 2010]
This article is reproduced here, with due acknowledgements, because it bears on the Dravidian research under way here in Tamil Nadu.
In particular, Asko Parpola delivered his lecture at Coimbatore and Chennai, but full details have not been provided to general readers, whom these issues affect socially and politically.
In 2004, Steve Farmer, Richard Sproat, and Michael Witzel published a paper in the Electronic Journal of Vedic Studies, entitled “The Collapse of the Indus-Script Thesis: The Myth of a Literate Harappan Civilization,” claiming that the Indus Valley civilization was illiterate and that Indus writing was a collection of political or religious symbols.
The publication of our paper in Science elicited hostile reactions from them, ranging from off-the-cuff dismissive remarks such as “garbage in, garbage out” (Witzel) to ad hominem attacks (labeling us “Dravidian nationalists”) and a vicious campaign on Internet discussion groups and blogs to discredit our work. Their first knee-jerk reaction was to call the two artificial control datasets in our study “invented data sets” (Farmer). Sproat and others then claimed on a blog to have constructed “counterexamples” to our result. Sproat has even attempted to publicize his claims through an article in Computational Linguistics and a web page entitled “Why Rao et al.’s work proves nothing”(!), despite the fact that our work has now been published in journals such as Science, PNAS, PLoS One, and IEEE Computer.
Here, we respond to their arguments in a point-by-point fashion. First, their arguments:
(1) Two datasets, used as controls in our work, are artificial.
(2) Counterexamples can be given, of non-linguistic systems, which produce conditional entropy plots like those presented in our Science paper.
(3) Conditional entropy cannot even differentiate between language families.
(4) The absence of writing material and long texts is “proof” that the Indus people were illiterate.
We view arguments (1)-(3) as arising from a misunderstanding of our approach and an overinterpretation of the conditional entropy result. Some of these arguments are made from a narrow computational-linguistics point of view, without considering other properties of the Indus script and the Indus civilization (see below). The last argument has been controverted by several other researchers, as discussed below.
Here is the point-by-point rebuttal:
(1) As stated in our Science paper, the two artificial data sets (which Farmer et al. call “invented data sets”) simply represent controls, necessary in any scientific investigation, to delineate the limits of what is possible. The two controls in our work represent sequences with maximum and minimum flexibility for a given number of tokens. Although the entropies of these controls can be computed analytically, the data sets were generated so that they could be subjected to the same parameter estimation process as the other data sets. Our conclusions do not depend on the controls, but are based on comparisons with real-world data: DNA and protein sequences, various natural languages, and FORTRAN computer code. All our real-world examples are bounded by the maximum and the minimum provided by the controls, which thus serve as a check on the computation.
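To illustrate what such controls do, here is a minimal Python sketch. It is not the code used in the Science paper: the function names, the alphabet size of 20, the sequence length, and the specific constructions (uniformly random symbols for maximum flexibility, a fixed repeating cycle for minimum flexibility) are all assumptions made for illustration, and the estimator is a simple plug-in estimate from pair counts.

```python
import math
import random
from collections import Counter

def conditional_entropy(seq):
    """Plug-in estimate, in bits, of H(next symbol | previous symbol)."""
    pair_counts = Counter(zip(seq, seq[1:]))   # counts of adjacent pairs (a, b)
    first_counts = Counter(seq[:-1])           # counts of a as the first element of a pair
    n_pairs = len(seq) - 1
    h = 0.0
    for (a, _b), c in pair_counts.items():
        p_ab = c / n_pairs                     # empirical joint probability P(a, b)
        p_b_given_a = c / first_counts[a]      # empirical conditional probability P(b | a)
        h -= p_ab * math.log2(p_b_given_a)
    return h

def max_flexibility_control(n_tokens, length, rng):
    """Assumed construction: every symbol drawn uniformly and independently."""
    return [rng.randrange(n_tokens) for _ in range(length)]

def min_flexibility_control(n_tokens, length):
    """Assumed construction: a rigid cycle, so each symbol fixes its successor."""
    return [i % n_tokens for i in range(length)]

rng = random.Random(0)                         # fixed seed for reproducibility
n_tokens, length = 20, 50_000                  # hypothetical sizes
h_max = conditional_entropy(max_flexibility_control(n_tokens, length, rng))
h_min = conditional_entropy(min_flexibility_control(n_tokens, length))

print(f"analytic upper bound log2(N): {math.log2(n_tokens):.3f} bits")
print(f"max-flexibility control:      {h_max:.3f} bits")
print(f"min-flexibility control:      {h_min:.3f} bits")
```

Under these assumptions the two estimates should land near the analytic limits, log2(N) bits and 0 bits. Any real sequence over the same number of tokens, run through the same estimator, should fall between them, which is the sense in which the controls act as a check on the computation.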
(2) Counterexamples matter only if we claim that conditional entropy by itself is a sufficient criterion to distinguish between language and non-language. We do not make this claim in our Science paper. As clearly stated in the last sentence of the paper, our results provide evidence which, given the rich syntactic structure in the script (and other evidence as listed below), increases the probability that the script represents language.
The methodology, which is Bayesian in nature, can be summarized as follows. We begin with the fact that the Indus script exhibits the following properties:
- The Indus texts are linearly written, like the vast majority of linguistic scripts (and unlike nonlinguistic systems such as medieval heraldry or traffic signs);
- Indus symbols are often modified by the addition of specific sets of marks over, around, or inside a symbol. Multiple symbols are sometimes combined (“ligatured”) to form a single glyph. This is similar to later Indian scripts which use such ligatures and marks above, below, or around a symbol to modify the sound of a root consonant or vowel symbol;
- The script obeys the Zipf-Mandelbrot law, a power-law distribution on ranked data, which is often considered a necessary (though not sufficient) condition for language (see our PLoS One paper);
- The script exhibits rich syntactic structure, such as the clear presence of beginners and enders and preferences of symbol clusters for particular positions within texts (see References), not unlike linguistic sequences;
- Indus texts that have been discovered in Mesopotamia and the Persian Gulf use the same signs as texts found in the Indus region but alter their ordering. These “foreign” texts have low likelihood values compared to Indus region texts (see our PNAS paper), suggesting that the script was versatile enough to represent different subject matter or a different language in foreign regions.
Given that the Indus script shares the above properties with linguistic scripts, we claim that the similarity between the conditional entropy of the Indus script and that of natural languages provides additional evidence in favor of the linguistic hypothesis.
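One way to read this Bayesian argument is in odds form. The notation below is introduced here purely for illustration and does not come from the papers: L stands for the linguistic hypothesis, N for the nonlinguistic hypothesis, and E_1, ..., E_k for the observed properties of the script. The product form additionally assumes that the properties are conditionally independent given each hypothesis, which is a strong simplification.

```latex
\frac{P(L \mid E_1, \dots, E_k)}{P(N \mid E_1, \dots, E_k)}
  = \frac{P(L)}{P(N)} \prod_{i=1}^{k} \frac{P(E_i \mid L)}{P(E_i \mid N)}
```

Each property that is more probable under L than under N contributes a factor greater than 1 and so raises the odds in favor of L. No single factor has to be decisive, which is why a counterexample to one property taken alone does not undo the combined argument.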
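We have recently extended the result in our Science paper to block entropies for sequences of up to 6 symbols (see IEEE Computer paper for details):
[Figure: scaling of block entropies for the Indus script and comparison sequences, from the IEEE Computer paper]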
The language-like scaling behavior of block entropies in the above figure, in combination with the other properties of language enumerated above, could be viewed in a Bayesian framework as further evidence for the linguistic nature of the Indus script.
The above figure also addresses objections raised by some (e.g., Fernando Pereira) who felt conditional entropy (which considers only pairwise dependencies) was not a sufficiently rich measure.
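For readers unfamiliar with block entropy, the following Python sketch shows the quantity being computed. It is an illustration only: the function name, the alphabet size of 4, and the two toy sequences are assumptions, and it uses a plain plug-in estimator, which is known to underestimate entropy when blocks are long relative to the amount of data. The published analysis may use a different estimator.

```python
import math
import random
from collections import Counter

def block_entropy(seq, n):
    """Plug-in estimate, in bits, of the entropy of length-n blocks in seq."""
    blocks = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(blocks.values())
    return -sum((c / total) * math.log2(c / total) for c in blocks.values())

rng = random.Random(1)                          # fixed seed for reproducibility
k, length = 4, 100_000                          # hypothetical alphabet size and length
random_seq = [rng.randrange(k) for _ in range(length)]   # no ordering constraints
rigid_seq = [i % k for i in range(length)]               # completely fixed ordering

print("n  random  rigid")
for n in range(1, 7):
    h_random = block_entropy(random_seq, n)
    h_rigid = block_entropy(rigid_seq, n)
    print(f"{n}  {h_random:6.3f}  {h_rigid:5.3f}")
```

In this toy setting the unconstrained sequence should gain roughly log2(k) bits with every extra symbol in the block, while the rigid sequence should stay flat because the first symbol determines the rest. Sequences with partial structure, such as natural-language text, would be expected to grow at a rate between these two extremes.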
Let us now consider the nonlinguistic systems that have been suggested:
- Mark Liberman, Sproat, and Cosmo Shalizi constructed, in a blog, artificial examples of nonlinguistic systems whose conditional entropy is similar to that of the Indus script. However, their examples have no correlations between symbols, so they do not exhibit the entropy scaling property exhibited by the Indus script and languages in the above figure, let alone the other language-like properties of the Indus script.
- Two natural nonlinguistic systems that have been suggested, medieval heraldry and traffic signs, are not even linear, nor do they exhibit other script-like properties such as those listed above.
- The Vinča markings on pottery are linear, but scholars have established that the symbols do not appear to follow any order. The system can thus be expected to fall in the maximum entropy (MaxEnt) range in the above figure.
- The carvings of deities on Mesopotamian boundary stones are also linear but the ordering of symbols appears to be more rigid than in natural languages, following for example the hierarchical ordering of the deities. This system can thus be expected to fall closer to the minimum entropy (MinEnt) range in the above entropy scaling figure than to natural languages.
We therefore believe that the new result above from our IEEE Computer paper, showing that the block entropies of the Indus script scale in a manner similar to natural languages, when viewed in conjunction with the other language-like properties of the script as described above, adds further support to the linguistic hypothesis.
(3) Sproat has endeavored to produce a plot in which languages belonging to different language families have similar conditional entropies, thereby claiming that the conditional entropy result “proves nothing.” This claim is once again based on an overinterpretation of the result in our Science paper. We specifically note on page 10 of the supplementary information that “answering the question of linguistic affinity of the Indus texts requires a more sophisticated approach, such as statistically inferring an underlying grammar for the Indus texts from available data and comparing the inferred rules with those of various known language families.” Conditional entropy is a quantitative measure of the amount of flexibility allowed in choosing the next symbol given the previous symbol, and it is useful for characterizing the average amount of flexibility in sequences of different kinds. We do not claim that it can be used to distinguish between language families; that requires a more sophisticated measure.
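For reference, the standard textbook definition of conditional entropy can be written as follows, with X denoting the previous symbol and Y the next symbol (the symbols X, Y, x, y are generic notation chosen here, not taken from the paper):

```latex
H(Y \mid X) = -\sum_{x} \sum_{y} P(x, y) \, \log_2 P(y \mid x)
```

The value is 0 bits when the previous symbol fully determines the next one, and it reaches log_2 N bits when the next symbol is chosen uniformly and independently from N symbols. Because the quantity is an average over all symbol pairs, two sequences with very different grammars can share a similar value, which is consistent with the point that it is not meant to separate language families.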
(4) With regard to the length of texts, several West Asian writing systems such as Proto-Cuneiform, Proto-Sumerian, and the Uruk script have statistical regularities in sign frequencies and text lengths which are remarkably similar to those of the Indus script (details can be found at http://indusresearch.wikidot.com/script). These writing systems are by all accounts linguistic. Furthermore, the lack of archaeological evidence for long texts in the Indus civilization does not automatically imply that they did not exist (“absence of evidence is not evidence of absence”). There is a long history of writing on perishable materials like cotton, palm leaves, and bark in the Indian subcontinent using equally perishable writing implements (see Parpola’s paper below). Writing on such material is unlikely to have survived the hostile environment of the Indus Valley. Thus, long texts may well have been written without leaving any archaeological remains.
As regards the argument for literacy from the point of view of cultural sophistication of the Indus people, we believe Iravatham Mahadevan has addressed this adequately in his op-ed piece below (see also Massimo Vidale’s entertaining article).
References:
- Final version of the Science paper (including Supplementary Information), 2009:
- IEEE Computer review article with new block entropy result: Probabilistic analysis of an ancient undeciphered script, 2010:
- PLoS One paper: Statistical Analysis of the Indus script using n-grams, 2010:
- PNAS paper: A Markov model of the Indus script, 2009:
- Asko Parpola’s point-by-point rebuttal of Farmer, Sproat, and Witzel:
o Parpola A (2008) Is the Indus script indeed not a writing system? in Airavati: Felicitation volume in honor of Iravatham Mahadevan (Varalaaru.com publishers, Chennai, India) pp. 111-131.
- Massimo Vidale’s “The collapse melts down: a reply to Farmer, Sproat and Witzel”:
- Iravatham Mahadevan’s “The Indus non-script is a non-issue”:
- Syntactic structure in the Indus script:
o Koskenniemi K (1981) Syntactic methods in the study of the Indus script. Studia Orientalia 50:125-136.
o Parpola A (1994) Deciphering the Indus script. (Cambridge University Press), Chaps. 5 & 6.
o Yadav N, Vahia MN, Mahadevan I, Joglekar H (2008) A statistical approach for pattern search in Indus writing. International Journal of Dravidian Linguistics 37(1):39-52.
o Yadav N, Vahia MN, Mahadevan I, Joglekar H (2008) Segmentation of Indus texts. International Journal of Dravidian Linguistics 37(1):53-72.