Rebuttal of Sproat, Farmer, et al.'s supposed "refutation" by Rajesh Rao

July 13, 2010

[Updated: July, 2010]

This article is reproduced here, with due acknowledgements, as it has a bearing on the Dravidian research under way here in Tamil Nadu.

In particular, Asko Parpola delivered lectures in Coimbatore and Chennai, but full details have not been made available to general readers, even though these issues affect them socially and politically.

In 2004, Steve Farmer, Richard Sproat, and Michael Witzel published a paper in the Electronic Journal of Vedic Studies (entitled “The Collapse of the Indus-Script Thesis: The Myth of a Literate Harappan Civilization”) claiming that the Indus valley civilization was illiterate and that Indus writing was a collection of political or religious symbols.

The publication of our paper in Science elicited hostile reactions from them, ranging from off-the-cuff dismissive remarks such as “garbage in, garbage out” (Witzel) to ad-hominem attacks (labeling us “Dravidian nationalists”) and a vicious campaign on internet discussion groups and blogs to discredit our work. Their first knee-jerk reaction was to call the two artificial control datasets in our study “invented data sets” (Farmer). This was followed by Sproat and others on a blog claiming to have constructed “counterexamples” to our result. Sproat has even attempted to publicize his claims using an article in Computational Linguistics and a web page entitled “Why Rao et al.’s work proves nothing”(!), despite the fact that our work has now been published in journals like Science, PNAS, PLOS One, and IEEE Computer.

Here, we respond to their arguments in a point-by-point fashion. First, their arguments:

(1) Two datasets, used as controls in our work, are artificial.

(2) Counterexamples can be given, of non-linguistic systems, which produce conditional entropy plots like those presented in our Science paper.

(3) Conditional entropy cannot even differentiate between language families.

(4) The absence of writing material and long texts is “proof” that the Indus people were illiterate.

We view arguments (1)-(3) as arising from a misunderstanding of our approach and an overinterpretation of the conditional entropy result. Some of these arguments are made with a narrow computational linguistics point of view without considering other properties of the Indus script and the Indus civilization (see below). The last argument has been controverted by several other researchers as discussed below.

Here is the point-by-point rebuttal:

(1) As stated in our Science paper, the two artificial data sets (which Farmer et al. call “invented data sets”) simply represent controls, necessary in any scientific investigation, to delineate the limits of what is possible. The two controls in our work represent sequences with maximum and minimum flexibility, for a given number of tokens. Though this can be computed analytically, the data sets were generated to subject them to the same parameter estimation process as the other data sets. Our conclusions do not depend on the controls, but are based on comparisons with real world data: DNA and protein sequences, various natural languages, and FORTRAN computer code. All our real world examples are bounded by the maximum and the minimum provided by the controls, which thus serve as a check on the computation.
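To make the role of the two controls concrete, here is a minimal Python sketch. It is not the code used in the Science paper: the function names, the 20-token alphabet, the sequence length, and the simple plug-in estimator are all assumptions chosen for illustration. The maximum-flexibility control draws every token uniformly at random, and the minimum-flexibility control always follows a token with the same successor.

```python
import math
import random
from collections import Counter


def conditional_entropy(seq):
    """Plug-in estimate, in bits, of H(next symbol | previous symbol)."""
    pairs = Counter(zip(seq, seq[1:]))      # counts of adjacent pairs
    prev_counts = Counter(seq[:-1])         # counts of the first element of each pair
    total = sum(pairs.values())
    h = 0.0
    for (prev, _nxt), n in pairs.items():
        # (n / total) is P(prev, next); (n / prev_counts[prev]) is P(next | prev)
        h -= (n / total) * math.log2(n / prev_counts[prev])
    return h


def max_flexibility(num_tokens, length, rng):
    """Assumed control: any token may follow any other with equal probability."""
    return [rng.randrange(num_tokens) for _ in range(length)]


def min_flexibility(num_tokens, length):
    """Assumed control: each token is always followed by the same token."""
    return [i % num_tokens for i in range(length)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
k = 20                   # hypothetical number of distinct tokens
h_max = conditional_entropy(max_flexibility(k, 100_000, rng))
h_min = conditional_entropy(min_flexibility(k, 100_000))
print(h_max, h_min)
```

With these settings the first value should sit just below log2(20), about 4.32 bits, and the second at 0 bits. Any real sequence over the same 20 tokens, run through the same estimator, falls between these two values, which is why the controls serve as a check on the computation.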

(2) Counterexamples matter only if we claim that conditional entropy by itself is a sufficient criterion to distinguish between language and non-language. We do not make this claim in our Science paper. As clearly stated in the last sentence of the paper, our results provide evidence which, given the rich syntactic structure in the script (and other evidence as listed below), increases the probability that the script represents language.

The methodology, which is Bayesian in nature, can be summarized as follows. We begin with the fact that the Indus script exhibits the following properties:

  • The Indus texts are linearly written, like the vast majority of linguistic scripts (and unlike nonlinguistic systems such as medieval heraldry or traffic signs);
  • Indus symbols are often modified by the addition of specific sets of marks over, around, or inside a symbol. Multiple symbols are sometimes combined (“ligatured”) to form a single glyph. This is similar to later Indian scripts which use such ligatures and marks above, below, or around a symbol to modify the sound of a root consonant or vowel symbol;
  • The script obeys the Zipf-Mandelbrot law, a power-law distribution on ranked data, which is often considered a necessary (though not sufficient) condition for language (see our PLOS One paper);
  • The script exhibits rich syntactic structure such as the clear presence of beginners and enders, preferences of symbol clusters for particular positions within texts etc. (see References), not unlike linguistic sequences;
  • Indus texts that have been discovered in Mesopotamia and the Persian Gulf use the same signs as texts found in the Indus region but alter their ordering. These “foreign” texts have low likelihood values compared to Indus region texts (see our PNAS paper), suggesting that the script was versatile enough to represent different subject matter or a different language in foreign regions.
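
For readers unfamiliar with the Zipf-Mandelbrot law mentioned in the list above: it says that the frequency f of the sign with rank r falls off roughly as f(r) = c / (r + b)^a. The sketch below is a hypothetical illustration, not the procedure of the PLOS One paper. It fits a and b to a list of sign counts by a grid search over b combined with least squares in log-log space. The function name, the synthetic counts, and the grid of offsets are all assumptions.

```python
import math


def fit_zipf_mandelbrot(freqs, offsets):
    """Fit log f(r) = log_c - a * log(r + b); return the best (a, b, log_c).

    freqs must be positive counts; offsets is the grid of candidate b values.
    """
    ranked = sorted(freqs, reverse=True)
    ys = [math.log(f) for f in ranked]
    n = len(ranked)
    mean_y = sum(ys) / n
    best = None
    for b in offsets:
        xs = [math.log(rank + b) for rank in range(1, n + 1)]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, -slope, b, intercept)
    return best[1], best[2], best[3]


# Synthetic counts generated from the law itself (a = 1.5, b = 2.0, c = 1000).
synthetic = [1000.0 / (rank + 2.0) ** 1.5 for rank in range(1, 201)]
a, b, log_c = fit_zipf_mandelbrot(synthetic, [i * 0.5 for i in range(11)])
print(a, b, math.exp(log_c))
```

A good fit of this form shows only that the sign frequencies are compatible with language. As the list notes, the law is regarded as necessary but not sufficient.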

Given that the Indus script shares the above properties with linguistic scripts, we claim that the similarity in conditional entropy of the Indus script to other natural languages provides additional evidence in favor of the linguistic hypothesis.

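We have recently extended the result in our Science paper to block entropies for sequences of up to 6 symbols (see IEEE Computer paper for details):

[Figure not reproduced here: block entropy as a function of block length (up to 6 symbols) for the Indus script and the comparison sequences, bounded by the maximum entropy (MaxEnt) and minimum entropy (MinEnt) ranges.]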

The language-like scaling behavior of block entropies in the above figure, in combination with the other properties of language enumerated above, could be viewed in a Bayesian framework as further evidence for the linguistic nature of the Indus script.
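One way to spell out this Bayesian reading, offered as a sketch rather than as the formalism of the papers, is in terms of posterior odds. Let L be the hypothesis that the script encodes language and E_1, ..., E_n the observed properties (linearity, ligatures and diacritic-like marks, the Zipf-Mandelbrot fit, positional syntax, entropy scaling). Assuming, purely for illustration, that the properties are conditionally independent given L and given its negation:

```latex
\frac{P(L \mid E_1, \dots, E_n)}{P(\neg L \mid E_1, \dots, E_n)}
  = \frac{P(L)}{P(\neg L)} \prod_{i=1}^{n} \frac{P(E_i \mid L)}{P(E_i \mid \neg L)}
```

Each property that is more probable under L than under its negation contributes a factor greater than 1 and so raises the odds, even though no single factor settles the question.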

The above figure also addresses objections raised by some (e.g., Fernando Pereira) who felt conditional entropy (which considers only pairwise dependencies) was not a sufficiently rich measure.
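The step from pairwise statistics to longer blocks can be sketched as follows. This is a naive plug-in estimator written for illustration, with a hypothetical function name and toy input. With a corpus as small as the Indus texts, a bias-corrected estimator would be needed for the longer blocks, so this should not be read as the method of the IEEE Computer paper.

```python
import math
from collections import Counter


def block_entropy(seq, n):
    """Plug-in estimate, in bits, of the entropy of length-n blocks in seq."""
    blocks = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(blocks.values())
    return -sum((c / total) * math.log2(c / total) for c in blocks.values())


# Toy input: a strictly periodic sequence over four symbols.
periodic = list("abcd" * 500)
for n in range(1, 7):
    print(n, round(block_entropy(periodic, n), 3))
```

On this strictly periodic toy sequence the estimate stays near 2 bits for every block length, since knowing one symbol fixes all the rest. For a sequence with no correlations at all it would instead grow in proportion to n. How the curve bends between those two extremes is the scaling behavior at issue.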

Let us now consider the nonlinguistic systems that have been suggested:

  • Mark Liberman, Sproat, and Cosmo Shalizi constructed, in a blog, artificial examples of nonlinguistic systems whose conditional entropy was similar to that of the Indus script. However, their examples have no correlations between symbols, so they do not exhibit the entropy scaling property exhibited by the Indus script and languages in the above figure, let alone the other language-like properties of the Indus script.
  • Two natural nonlinguistic systems that have been suggested, medieval heraldry and traffic signs, are not even linear, nor do they exhibit other script-like properties such as those listed above.
  • The Vinca markings on pottery are linear but scholars have established that the symbols do not appear to follow any order – the system thus can be expected to fall in the maximum entropy range (MaxEnt) in the above figure.
  • The carvings of deities on Mesopotamian boundary stones are also linear but the ordering of symbols appears to be more rigid than in natural languages, following for example the hierarchical ordering of the deities. This system can thus be expected to fall closer to the minimum entropy (MinEnt) range in the above entropy scaling figure than to natural languages.
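
The phrase "no correlations between symbols" in the first item above can be made concrete with the mutual information between adjacent symbols, which is zero when each symbol is drawn independently of its predecessor. The sketch below uses hypothetical sequences, not the ones posted on the blog: one is drawn independently from a skewed distribution, the other from a simple chain in which each symbol usually determines the next. All names and parameters are assumptions.

```python
import math
import random
from collections import Counter


def adjacent_mutual_information(seq):
    """Plug-in estimate, in bits, of I(previous symbol; next symbol)."""
    pairs = Counter(zip(seq, seq[1:]))
    left = Counter(seq[:-1])
    right = Counter(seq[1:])
    total = sum(pairs.values())
    return sum(
        (n / total) * math.log2(n * total / (left[a] * right[b]))
        for (a, b), n in pairs.items()
    )


def correlated_sequence(length, rng, num_symbols=8, stickiness=0.9):
    """Hypothetical chain: usually step to the next symbol, otherwise jump at random."""
    seq = [0]
    for _ in range(length - 1):
        if rng.random() < stickiness:
            seq.append((seq[-1] + 1) % num_symbols)
        else:
            seq.append(rng.randrange(num_symbols))
    return seq


rng = random.Random(1)
weights = [2.0 ** -i for i in range(8)]                 # skewed symbol frequencies
independent = rng.choices(range(8), weights=weights, k=100_000)
correlated = correlated_sequence(100_000, rng)

mi_independent = adjacent_mutual_information(independent)
mi_correlated = adjacent_mutual_information(correlated)
print(mi_independent, mi_correlated)
```

The independent sequence gives a value close to zero however its symbol frequencies are tuned, while the correlated one gives a value well above one bit. A sequence of the first kind can be made to match a single conditional entropy figure, but its block entropies simply add up with block length.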

We therefore believe that the new result above from our IEEE Computer paper, showing that the block entropies of the Indus script scale in a manner similar to natural languages, when viewed in conjunction with the other language-like properties of the script as described above, adds further support to the linguistic hypothesis.

(3) Sproat has endeavored to produce a plot where languages belonging to different language families have similar conditional entropies, thereby claiming that the conditional entropy result “proves nothing.” This claim is once again based on an overinterpretation of the result in our Science paper. We specifically note on page 10 in the supplementary information that “answering the question of linguistic affinity of the Indus texts requires a more sophisticated approach, such as statistically inferring an underlying grammar for the Indus texts from available data and comparing the inferred rules with those of various known language families.” In other words, conditional entropy provides a quantitative measure of the amount of flexibility allowed in choosing the next symbol given a previous symbol. It is useful for characterizing the average amount of flexibility in sequences of different kinds. We do not make the claim that it can be used to distinguish between language families – this requires a more sophisticated measure.
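For reference, the textbook definition of conditional entropy, with X the previous symbol, Y the next symbol, and k the number of distinct symbols, is given below. The estimator and any normalization used in the Science paper may differ in detail, so this is the standard definition, not a transcription from the paper.

```latex
H(Y \mid X) = -\sum_{x} P(x) \sum_{y} P(y \mid x) \log_2 P(y \mid x),
\qquad 0 \le H(Y \mid X) \le \log_2 k
```

The lower bound is reached when each symbol fully determines its successor, and the upper bound when every symbol is equally likely to follow any other. The quantity says nothing about which grammar produced the sequence, which is why it cannot separate language families.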

(4) With regard to the length of texts, several West Asian writing systems such as Proto-Cuneiform, Proto-Sumerian, and the Uruk script have statistical regularities in sign frequencies and text lengths which are remarkably similar to those of the Indus script (details can be found in [link missing in source]). These writing systems are by all accounts linguistic. Furthermore, the lack of archaeological evidence for long texts in the Indus civilization does not automatically imply that they did not exist (“absence of evidence is not evidence of absence”). There is a long history of writing on perishable materials like cotton, palm leaves, and bark in the Indian subcontinent using equally perishable writing implements (see Parpola’s paper below). Writing on such material is unlikely to have survived the hostile environment of the Indus valley. Thus, long texts may have been written even though no archaeological remains of them have been found.

As regards the argument for literacy from the point of view of cultural sophistication of the Indus people, we believe Iravatham Mahadevan has addressed this adequately in his op-ed piece below (see also Massimo Vidale’s entertaining article).


War of words in the cradle of south Asian civilisation!

May 14, 2010

In the heart of Pakistan, the ruins of a 4,000-year-old city have spawned a cross-continental row about language, culture – and racism in academia. By Andrew Buncombe, Thursday, 25 March 2010

[Photo: Curator Mohammed Hassan. Credit: Andrew Buncombe]

At the quiet ruins of Harappa, one of the two main centres of an ancient civilisation that once spread from the Himalayas to Mumbai, Naveed Ahmed took in the arid hills dotted with thorn-bush. “I think the people who lived here were very different from us,” said the part-time guide. “The stones and the beads [they made]; it was as if they were more sophisticated.”

However peaceful this ruined city of the Indus civilisation may appear, the former residents of Harappa and the remnants of their society are today at the centre of one of the most acrimonious disputes in academia, a controversy that has allegedly led to death threats and claims of racism and cultural chauvinism.

Many experts in south Asia and elsewhere believe that symbols and marks inscribed on seals and other artefacts found here represent an as yet undeciphered language. Arguing it may be the predecessor of one of several contemporary south Asian argots, these experts say it is proof of a literate Indian society that existed more than 4,000 years ago.

But other experts based in the West say although the symbols may contain information, they are not a true language. They claim the judgement of their counterparts in south Asia may be swayed by regional nationalism.

Mohammed Hassan is curator of the museum beside the dust-blown ruins. Before leading a tour, the government official served tea and biscuits in his office and insisted the people of Harappa must have possessed a written language to store information. “If they were not literate, then how could they do so many things?” he said. “They had well-made pottery, big cities that were well-planned. They had a lot of knowledge about these things. They grew cotton, wheat, rice and barley. They traded with other cities.”

The Indus civilisation covered more than 500,000 square miles and lasted, during what experts term its “mature phase”, from 2600 to 1900 BCE. The ruins, 100 miles south-west of the Pakistani city of Lahore, were rediscovered in the early part of the 19th century.

The skills of its residents – at least in terms of making bricks that could endure centuries – were revealed by two British engineers, John and William Brunton, who were building the East Indian Railway Company line to connect Lahore and Karachi and needed ballast for their track. The engineers later wrote that locals told them of well-made bricks from an ancient ruined city that the villagers had made use of. With little concern for preserving the ruins, huge numbers of the Indus-era bricks were reduced to rubble and used to support the tracks heading west.

In the early 20th century, excavation of Harappa proceeded along with that of the other Indus city at Mohenjo-daro, in the south of Pakistan, and it was at that time that many of the seals now on display in Mr Hassan’s museum, containing symbols and images of animals, were discovered. They have continued to beguile, fascinate and frustrate scientists, causing a running controversy that has played out on internet message boards, in scientific papers and at academic conferences.

Like Mr Hassan, Iravatham Mahadevan, an expert in epigraphy from southern India who has been awarded the country’s highest civilian award for his work, has no doubts the symbols on the Indus seals represent a genuine language. “Archaeological evidence makes it inconceivable that such a large, well-administered, and sophisticated trading society could have functioned without effective long-distance communication, which could have been provided only by writing,” he wrote last year in a magazine.

“And there is absolutely no reason to presume otherwise, considering that thousands of objects, including seals, copper tablets, and pottery bear inscriptions in the same script throughout the Indus region. The script may not have been deciphered but that is no valid reason to deny its very existence.”

Mr Mahadevan believes the Indus script may have been a forerunner of so-called Dravidian languages, such as Tamil, spoken today in southern India and Sri Lanka. In addition to technical clues, he says the continued existence of a Dravidian language in modern Pakistan – Brahvi, which is spoken by people in parts of Balochistan – supports his idea.

Over the years, there have been plenty of other theories both from established experts and enthusiastic amateurs. Some, with the backing of Hindu nationalists, have claimed the script may be an early Indo-European language and that remnants of it may even exist in Sanskrit, an ancient language that is the root of many present languages in north India, including Hindi. It has even been claimed the Indus script belonged to metalsmiths, and others believe it died out with the city of Harappa itself and gave rise to no successor.

Part of the problem for the experts is that, unlike for those who cracked the hieroglyphics of Egypt, there is no equivalent of the Rosetta stone, the slab of granite-like rock discovered in 1799 that contained Egyptian and Greek text. In the 1950s, academic interest in Mayan hieroglyphics intensified when experts began to study modern spoken Mayan, but for the Indus scholars there is no agreement on which, if any, modern language is the successor to their script.

In 2004, the debate was jolted into a war of words after three American scholars claimed the Indus symbols were not a language at all. In a paper provocatively subtitled The Myth of a Literate Harappan Civilisation, they said there was insufficient evidence that the symbols constituted a proper language. They pointed to various factors: that there was no single long piece of text; that there was disagreement over the number of actual symbols and that other well-organised societies had been illiterate. The symbols, they argued, may well contain information in the same way that an image of a knife and fork together might represent a roadside eatery but they were not a language that could record speech.

The ensuing uproar came mainly from south Asians. One of the American scholars, Steve Farmer, claimed people would approach him in tears after he gave talks and that he had even had death threats. Comments on internet discussion boards accuse him and his colleagues of trying to prove that “non-Western cultures were less advanced”. Mr Farmer, who lives in California, said he believed much of the anger was driven by those wishing to promote pet theories about Dravidians, indigenous Aryan Hindus or “the general man in the street who wants to think ancient India was of the same order as Egypt or Mesopotamia. It’s total rubbish”. He added: “I have never seen anything like the passion that there is in India. There is not that sort of passion in the Middle East about ancient things.”

More recently, the Indus controversy has been joined by a team of Indian scientists who ran computer programmes which led them to conclude the symbols almost certainly constitute a language. Central to their claims, published last year in Science, was the theory of “conditional entropy”, or the measure of randomness in any sequence. Because of linguistic rules – such as in English the letter Q is almost always followed by a U – in natural languages the degree of randomness is less than in artificial languages.

One of the authors, Rajesh Rao, who was born in Hyderabad but is now based at the University of Washington, became fascinated by the Indus culture after studying it at school. His team measured the randomness with which the individual Indus symbols appeared on seals and compared that to the randomness of several natural and artificial languages. Mr Rao said it was closest to a natural language. “The Indus civilisation was larger than the ancient Egyptian, Chinese, and Mesopotamian civilisations and the most advanced in terms of urban planning and trade,” he said. “Yet we know little about their leaders, their beliefs, their way of life, and the way their society was organised. Many of us hope that decoding the script will provide a new voice to the Indus people.”

Yet as soon as Mr Rao’s team published its findings, Mr Farmer and his colleagues hit back, denouncing their conclusions and methodology. Mr Rao, whose team has since issued a detailed defence of their theory, said he was surprised at the level of contention, within south Asia and beyond, but also at some of the comments he claims Mr Farmer’s group levelled at him.

So the question of whether the Indus symbols form a script with hidden meanings may never be settled. Naveed Ahmed, a 24-year-old part-time guide whose family has lived “forever” in a village on the edge of the ruins, speaks Punjabi, an Indo-Aryan language from the same family as Sanskrit. Was it possible a linguistic thread connected the language he used with what had been spoken – and possibly written – by the people who once occupied the ruined city? “I don’t know if it is the same,” he said. “But it’s a possibility that our language came from them. It is always a possibility.”