A few months ago my favorite newsweekly, The Economist, ran a cover story titled “The Data Deluge” that expressed misgivings about the quantity of data being generated. The Economist’s title went with the flow of water-based metaphors for data overload: deluged, drinking from a fire hydrant, swamped, drowning, sinking under, treading water, and so on. As the authors note, torture data enough and it will confess to anything, so it’s only a matter of time before someone coins the term “data-boarding” for the condition. This note begins to explain why the whole notion of systemic data overload strikes me as one of those bogus alarms, like “culture shock,” that are predicated on a contrived imbalance.
“The Proliferation of Data is Making Them Increasingly Inaccessible”
The Economist notes that we have moved from an era of scarce data to one of superabundance. It bemoans the shortage of storage capacity relative to the data being generated, invokes security and privacy issues, and reports the lament, quoted above, of astrophysicist Alex Szalay that the proliferation of data is making them less accessible.
Economists have long realized that relative volume is a lousy measure of relative value, as the diamond-water paradox reveals. Having more of something than we “need” or can currently use is not obviously a problem: we use only a portion of the water on Earth but don’t worry overmuch about the “surplus” that goes untapped. This is a very different picture from the one Szalay implies, in which a lot of very valuable data are being wasted for want of adequate means to store and analyze them.
It is unclear to me how previously unknown or unsuspected data, however difficult or costly to access today, can possibly be less accessible than they were yesterday, when their existence wasn’t even suspected. Maybe it’s the data on the dark matter that apparently makes up a great deal of the universe even though we can’t seem to find it. How are we worse off for not being able to access information we didn’t even know existed before now? If the data went past a black hole’s event horizon, they are lost forever; but if they are just out there, waiting for the right theory and tools to guide us to them, we can relax.
Perhaps Szalay believes that evidence of intelligent life, other habitable planets, or new insights into the cosmos’ origin and fate is contained in the gazillions of electromagnetic signals captured each day but we can’t see it because of the clutter. That may well be, but without appropriate models to identify and make sense of it, the data is neither valuable nor useful. This point is illustrated by Arno Penzias and Robert Wilson’s accidental discovery in 1964 of the cosmic background radiation left over from the “big bang.” The event might look at first like finding an unanticipated needle in a very, very large haystack, but in fact Penzias and Wilson, after removing the pigeon poop from their antenna and thus eliminating their initial hypothesis, had no idea what they had found. Fortunately Princeton astrophysicists Robert H. Dicke, Jim Peebles, and David Wilkinson had already anticipated the leftover-radiation phenomenon and were preparing to search for it when word of Penzias and Wilson’s “discovery” reached them. The Nobel committee awarded the physics prize to the clueless and ignored the scientists who predicted, explained, and verified the finding. At best Penzias and Wilson were guilty of pattern recognition, while Dicke, Peebles, and Wilkinson were committing science.
For expository convenience The Economist conflates data and information, but in doing so it obscures a very important distinction: data has a high degree of entropy, whereas information is data passed through a local entropy-reducing mechanism such as a model. The event horizon of interest is the area illuminated by the model or hypothesis, not by the collection of data. The location of that horizon is largely a function of search costs and expected returns. Models reduce search costs and often also increase returns to search.
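The data/information distinction can be made concrete with a toy sketch. This is entirely my own illustration, not anything from the article: the trend, the noise level, and the bin width are invented. The idea is simply to measure the entropy of raw observations before and after a model has absorbed what it can explain.

```python
import math
import random
from collections import Counter

def shannon_entropy(values, width=1.0):
    """Estimate Shannon entropy (in bits) of a sequence by bucketing
    values into fixed-width bins and treating bin frequencies as
    probabilities."""
    counts = Counter(int(v // width) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

rng = random.Random(0)

# Toy "raw data": a deterministic trend buried in a little noise.
raw = [2.0 * t + rng.uniform(-1, 1) for t in range(200)]

# A model (here, the known trend 2t) acts as the local entropy-reducing
# mechanism: what the model explains is removed, and only the residual
# uncertainty remains.
residuals = [v - 2.0 * t for t, v in enumerate(raw)]

print(shannon_entropy(raw))        # high: values spread over ~400 bins
print(shannon_entropy(residuals))  # low: values concentrated near zero
```

The data are the same in both cases; what changes is that a model has turned most of them into something already understood, leaving a small, low-entropy remainder.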
Over 50 years ago, the great economist George Stigler set out the economics of information. His model still applies in today’s era of “big data,” although The Economist’s article makes no use of it. Stigler explained searching for and advertising prices in a world where prices were dispersed and search (a form of data creation or collection) and information were not free. He showed that buyers would engage in search up to the point where, at the margin, the expected value in the form of a lower price was just equal to the incremental cost of search. On the other side of the market, sellers engage in advertising (data provision) in order to reduce buyers’ search costs and thereby increase the likelihood that buyers will find them and purchase from them. Whenever buyers (users) and sellers (providers) engage over time in trade based on their respective costs and returns, an evolutionary process or game ensues in which changes in one party’s costs or returns stimulate a reaction on the part of the others.
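Stigler’s stopping rule can be sketched in a few lines of Python. The uniform price distribution, the numbers, and the function names are my own illustrative assumptions; the rule itself – search while the expected saving from one more quote exceeds its marginal cost – is his.

```python
import random

def expected_gain(best, lo, hi):
    """Expected price improvement from one more quote when prices are
    drawn Uniform(lo, hi) and `best` is the lowest quote so far:
    E[max(0, best - X)] = (best - lo)^2 / (2 * (hi - lo))."""
    if best <= lo:
        return 0.0
    return (best - lo) ** 2 / (2 * (hi - lo))

def search(cost, lo=10.0, hi=20.0, seed=1):
    """Follow Stigler's stopping rule: keep paying for quotes while the
    expected saving from one more quote exceeds its marginal cost."""
    rng = random.Random(seed)
    best = rng.uniform(lo, hi)
    quotes = 1
    while expected_gain(best, lo, hi) > cost:
        best = min(best, rng.uniform(lo, hi))
        quotes += 1
    return best, quotes

# Cheap search justifies gathering many quotes and yields a lower
# expected price; expensive search stops early.
print(search(cost=0.05))
print(search(cost=1.00))
```

The evolutionary interplay in the text is just this loop run from both sides: when sellers’ advertising lowers `cost`, buyers rationally search more, which in turn changes what it pays sellers to advertise.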
If we substitute the actions of the producers of data (or the means to create it) and those of the users of data for the buyers and sellers in Stigler’s example we can understand the dynamic interplay and evolutionary growth of the market for information. Today’s providers of data or the means of creating it have been increasing its availability and reducing the costs of acquiring it, much like the appearance of the telescope in Galileo’s day increased the available data about our solar system. But without a fairly good understanding of planetary movements, Galileo would have been hard pressed to locate Jupiter much less discover its four largest moons. Once the increased returns (information) to searching the night skies became evident, scientists spent much more on search and began working off the huge data “deluge” created by the introduction of the telescope.
“Understanding Turns Out to be Overrated, and Statistical Analysis Goes a Lot of the Way”
Princeton’s Edward Felten’s statement, quoted above, goes to the heart of my discomfort with the way we manage many large databases and squeeze information from them. When I was a young analyst, my mentors instructed me to focus my searches and avoid “boiling the ocean.” They taught me to economize on time and data by first establishing a hypothesis using an appropriate economic model and then testing that hypothesis against the available data. The results of this exercise sometimes produced useful information – the pattern or relationship among the data. This was a practical application of the scientific method: observation, hypothesis generation, prediction, and experimental validation.
But today supercrunching and data mining (aka pattern recognition) seem to reverse, or depart from, the scientific method. The ocean is boiled first: data are examined every which way, and inferences are drawn from the patterns that emerge. Connections and correlations are identified rather than confirmed or tested against models. In some cases, of course, new patterns are discovered that lead to new models. But in practice there typically seems to be no hypothesis-generation stage; we move directly from observation to prediction. I can’t argue that the patterns that emerge from data-mining statistical analyses aren’t valid, but it is hard to grant them the status of explanatory models. At best, it seems to me, the crunchers have greatly increased the efficiency of pattern recognition in the observation stage. This is extraordinarily valuable in some cases and, I believe, dangerous in others.
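A toy simulation (my own, with invented parameters) shows why the observation-to-prediction shortcut worries me: mine enough pure noise for correlations and “significant” patterns appear on schedule.

```python
import itertools
import math
import random

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rng = random.Random(42)

# 40 independent pure-noise "variables", 30 observations each: by
# construction there is nothing real to find.
series = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(40)]

# "Boil the ocean": test all 780 pairs and keep every strong correlation.
# |r| > 0.36 is roughly the 5% significance threshold at n = 30.
hits = []
for (i, s), (j, t) in itertools.combinations(enumerate(series), 2):
    r = pearson(s, t)
    if abs(r) > 0.36:
        hits.append((i, j, r))

print(len(hits))  # typically dozens of "discoveries", every one noise
```

With a hypothesis stated in advance there is one test and a 5% false-alarm rate; boiling the ocean runs hundreds of tests and guarantees a crop of patterns that look like information but aren’t.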
Does finding all the statistical correlations in a pile of data really reverse the entropy of the data pile? Without determining causation as well as correlation, are we really increasing the degree of organization? I suppose we are, in the same way that removing the sedimentary rock around a fossil reveals more information, but I am unclear whether this rises to reversing the local entropy and welcome comments and guidance from readers.
Evolutionary economists use the phrase “routines as genes” to express the tendency toward behavioral continuity in firms, and it can be extended logically to individuals. It seems to me that without the modeling or hypothesis-generation stage, our predictions are limited by an unspoken assumption of complete or nearly complete behavioral continuity. All evolutionary systems exhibit some form of behavioral continuity; selection wouldn’t make a lot of sense if the next generation operated substantially differently from its parents. But that continuity also makes a species vulnerable to sudden environmental change – think of a very large meteor at the boundary of the Cretaceous and Tertiary periods.
This assumption of behavioral continuity, or stable correlative relationships, seems to work well in many cases, but it may leave us open to disaster when those correlations are snapshots of the temporary outcomes of systems we don’t understand very well – such as the interrelationships of risks in a complex economy. How many derivatives were designed on a faulty understanding of the correlations among underlying assets, or were subject to more multicollinearity than suspected, or were vulnerable to unsuspected network effects and acceleration? My guess is a lot.
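One last toy sketch (invented numbers, a hypothetical factor structure) of how a correlation snapshot can mislead: two assets share a common factor whose weight jumps in a stress regime, so a correlation estimated in calm times badly understates how they move together exactly when it matters.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rng = random.Random(7)

def returns(n, common_weight):
    """Two asset-return series sharing one common factor; the weight on
    that factor controls how correlated the assets are."""
    a, b = [], []
    for _ in range(n):
        f = rng.gauss(0, 1)  # common (market/liquidity) factor
        a.append(common_weight * f + rng.gauss(0, 1))
        b.append(common_weight * f + rng.gauss(0, 1))
    return a, b

# Calm regime: weak common factor. Stress regime: everything moves together.
calm_a, calm_b = returns(500, common_weight=0.3)
stress_a, stress_b = returns(500, common_weight=3.0)

print(pearson(calm_a, calm_b))      # modest correlation
print(pearson(stress_a, stress_b))  # high: the calm snapshot understated it
```

A derivative priced off the calm-period number is priced off a temporary outcome of a system whose mechanism was never modeled – precisely the danger of treating mined correlations as if they were explanations.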
Posted by Bob