The Data Delusion

A few months ago my favorite newsweekly, The Economist, ran a cover story titled “The Data Deluge” that expressed misgivings about the quantity of data being generated. The Economist’s title went with the flow of water-based metaphors for data overload, including deluged, drinking from a fire hydrant, swamped, drowning, sinking under, treading, etc. As the authors note, torture data enough and it will confess to anything, so it’s only a matter of time before someone coins the term data-boarding for the condition. This note begins to explain why the whole notion of systemic data overload strikes me as one of those bogus alarms, such as culture shock, that are predicated on a contrived imbalance.

“The Proliferation of Data is Making Them Increasingly Inaccessible”

The Economist notes that we have moved from an era of scarce data to one of superabundance and bemoans the shortage of storage capacity relative to data being generated, invokes security and privacy issues, and reports the lament, quoted above, of astrophysicist Alex Szalay, that the proliferation of data is making them less accessible.

Economists have long realized that relative volume is a lousy measure of relative value, as the diamond-water paradox reveals. Having more of something than we “need” or can currently use is not obviously a problem: we use only a portion of the water on Earth but don’t worry overmuch about the “surplus” that goes untapped. This is a very different picture from Szalay’s implication that a lot of very valuable data are being wasted for want of adequate bases to store and analyze them.

It is unclear to me how previously unknown or unsuspected data, however difficult or costly to access today, can possibly be less accessible than it was when its existence wasn’t even suspected yesterday. Maybe it’s the data on the dark matter that apparently makes up a great deal of the universe even though we can’t seem to find it. How are we worse off for not being able to access information we didn’t even know existed before now? If the data went past a black hole’s event horizon it’s lost forever but if it is just out there waiting for the right theory and tools to guide us to it, we can relax.

Perhaps Szalay believes that evidence of intelligent life, other habitable planets, or new insights into the cosmos’ origin and fate is contained in the gazillions of electromagnetic signals captured each day but we can’t see it because of the clutter. That may well be, but without appropriate models to identify and make sense of it, the data is neither valuable nor useful. This point is illustrated by the accidental discovery in 1964 of the cosmic background radiation left over from the “big bang” by Arno Penzias and Robert Wilson. This event might look at first like finding an unanticipated needle in a very, very large haystack, but in fact Penzias and Wilson, after removing the pigeon poop from their antenna and thus eliminating their initial hypothesis, had no idea what they had found. Fortunately, Princeton astrophysicists Robert H. Dicke, Jim Peebles, and David Wilkinson had already anticipated the leftover radiation phenomenon and were preparing to search for it when word of Penzias and Wilson’s “discovery” reached them. The Nobel committee awarded the physics prize to the clueless and ignored the scientists who predicted, explained, and verified the finding. At best Penzias and Wilson were guilty of pattern recognition while Dicke, Peebles, and Wilkinson were committing science.

For expository convenience The Economist conflates data and information, but in doing so it obscures a very important distinction: data has a high degree of entropy, whereas information is data passed through a local entropy-reducing mechanism such as a model. The event horizon of interest is the area illuminated by the model or hypothesis, not by the collection of data. The location of that horizon is largely a function of search costs and expected returns. Models reduce search costs and often also increase returns to search.
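One loose way to make the entropy claim concrete is a toy sketch in Python. Everything in it is an assumption made for illustration: the made-up observations, the "model" y = 3x, and the choice of Shannon entropy over observed values as the measure of disorder. It is not a claim about how The Economist or anyone else measures information.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical raw observations: y is roughly 3x plus a small disturbance.
xs = list(range(30))
ys = [3 * x + (-1, 0, 1)[x % 3] for x in xs]

# Without a model, every observation looks like a new, unrelated value.
raw_entropy = shannon_entropy(ys)

# With the (assumed) model y = 3x, only the residuals remain unexplained.
residuals = [y - 3 * x for x, y in zip(xs, ys)]
residual_entropy = shannon_entropy(residuals)

print(f"entropy of raw data:        {raw_entropy:.3f} bits")
print(f"entropy of model residuals: {residual_entropy:.3f} bits")
```

In this contrived case the thirty raw values are all distinct, while the residuals take only three values, so what is left to explain after applying the model carries far fewer bits than the raw pile did.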

Over 50 years ago, the great economist George Stigler set out the economics of information. His model still applies in today’s era of “big data,” although The Economist’s article makes no use of it. Stigler explained searching for and advertising prices in a world where prices were dispersed and search (a form of data creation or collection) and information were not free. He showed that buyers would engage in search up to the point where, at the margin, the expected value in the form of a lower price was just equal to the incremental cost of search. On the other side of the market, sellers engage in advertising (data provision) in order to reduce buyers’ search costs and therefore increase the likelihood that buyers will find them and purchase from them. Whenever buyers (users) and sellers (providers) engage over time in trade based on their respective costs and returns, an evolutionary process or game ensues in which changes in one party’s costs or returns stimulate a reaction on the part of the others.
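Stigler's stopping rule can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that quoted prices are independent and uniformly distributed between a low and a high value, in which case the expected lowest of n quotes is low + (high - low) / (n + 1). The function names and the cost figures are hypothetical and are not taken from Stigler's paper.

```python
def expected_min_price(n, low=0.0, high=1.0):
    """Expected lowest quote among n independent uniform(low, high) prices."""
    return low + (high - low) / (n + 1)

def optimal_searches(cost_per_search, low=0.0, high=1.0, max_n=10_000):
    """Number of quotes to collect before one more search stops paying for itself.

    The buyer keeps canvassing sellers while the expected drop in the best
    price from one more quote is at least the cost of obtaining that quote.
    """
    n = 1
    while n < max_n:
        marginal_gain = (expected_min_price(n, low, high)
                         - expected_min_price(n + 1, low, high))
        if marginal_gain < cost_per_search:
            break
        n += 1
    return n

if __name__ == "__main__":
    for cost in (0.1, 0.01, 0.001):
        print(f"search cost {cost}: collect {optimal_searches(cost)} quotes")
```

Lowering the cost of a search raises the number of quotes worth collecting, which is the mechanism the next paragraph relies on: cheaper data acquisition pushes the search horizon outward.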

If we substitute the actions of the producers of data (or the means to create it) and those of the users of data for the buyers and sellers in Stigler’s example, we can understand the dynamic interplay and evolutionary growth of the market for information. Today’s providers of data or the means of creating it have been increasing its availability and reducing the costs of acquiring it, much as the appearance of the telescope in Galileo’s day increased the available data about our solar system. But without a fairly good understanding of planetary movements, Galileo would have been hard-pressed to locate Jupiter, much less discover its four largest moons. Once the increased returns (information) to searching the night skies became evident, scientists spent much more on search and began working off the huge data “deluge” created by the introduction of the telescope.

“Understanding Turns Out to be Overrated, and Statistical Analysis Goes a Lot of the Way”

The statement quoted above, from Princeton’s Edward Felten, goes to the heart of my discomfort about the way we manage many large databases and squeeze information from them. When I was a young analyst, my mentors instructed me to focus my searches and avoid “boiling the ocean.” They instructed me to economize on time and data by first establishing a hypothesis using an appropriate economic model and then testing that model using the available data. This exercise sometimes produced useful information – the pattern or relationship among the data. This was a practical application of the scientific method, which comprises observation, hypothesis generation, prediction, and experimental validation.

But today supercrunching and data mining (aka pattern recognition) seem to reverse or depart from the scientific method. The ocean is first boiled, the data is looked at every which way, and inferences are made from the patterns that emerge. Connections and correlations are identified rather than confirmed or tested against models. In some cases, of course, new patterns are discovered that lead to new models. But in practice there typically seems to be no hypothesis generation stage; we move directly from observation to prediction. I can’t argue that the patterns that emerge from data-mining statistical analyses aren’t valid, but it is hard to grant them the status of explanatory models. At best, it seems to me, the crunchers have greatly increased the efficiency of pattern recognition in the observation stage. This is extraordinarily valuable in some cases and, I believe, dangerous in others.
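A small Python sketch suggests why boiling the ocean can be dangerous. It generates series of pure random noise, which by construction have no relationship to one another, and then counts how many pairs nonetheless show a sizeable sample correlation. The series count, sample length, threshold, and seed are arbitrary illustrative choices, not figures from any real data-mining exercise.

```python
import random
from itertools import combinations

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def mine_noise(n_series=40, n_obs=20, threshold=0.5, seed=0):
    """Count pairs of unrelated noise series whose |correlation| >= threshold."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_series)]
    pairs = list(combinations(range(n_series), 2))
    hits = [(i, j) for i, j in pairs
            if abs(pearson(series[i], series[j])) >= threshold]
    return len(hits), len(pairs)

if __name__ == "__main__":
    hits, total = mine_noise()
    print(f"{hits} of {total} pairs of pure noise look 'correlated'")
```

Every one of those hits is a pattern with no model behind it. A hypothesis stated in advance would have tested one pair, not all 780.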

Does finding all the statistical correlations in a pile of data really reverse the entropy of the data pile? Without determining causation as well as correlation, are we really increasing the degree of organization? I suppose that we do, in the same way that removing the sedimentary rock around a fossil reveals more information, but I am unclear whether this rises to reversing the local entropy, and I welcome comments and guidance from readers.

Evolutionary economists use the phrase “routines as genes” to express the tendency toward behavioral continuity of firms, and it can be extended logically to individuals. It seems to me that without the modeling or hypothesis generation stage, our predictions are limited by an unspoken assumption of complete or nearly complete behavioral continuity. All evolutionary systems exhibit some form of behavioral continuity – selection wouldn’t make a lot of sense if the next generation operated substantially differently from its parents. But that continuity also makes the species vulnerable to sudden environmental change – think very large meteor at the boundary of the Cretaceous and Tertiary periods.

This assumption of behavioral continuity or stable correlative relationships seems to work well in many cases, but it may leave us open to disaster when those correlations are snapshots of the temporary outcomes of systems we don’t understand very well – such as the interrelationships of risks in a complex economy. How many derivatives were designed based on a faulty understanding of correlations among underlying assets, or were subject to more multicollinearity than suspected, or were vulnerable to unsuspected network effects and acceleration? My guess is a lot.

Posted by Bob

One comment on “The Data Delusion”

  1. Brent Harbin says:

    I am a semi-fledgling student of Economics, currently studying under Bob. I have worked at quite a few occupations in my life (too many, really), and during that time I have always taken note of how companies and their employees interact, disseminate information, learn, and what (and whom) they settle on as leadership. My divergent background has allowed me to see many forms and approaches, some remarkably successful but most very ad hoc, which led to an inability to respond to almost any type of change, i.e. anything that did not fit perfectly into their very specialized system of “how we do things”.

    The following is a small example of information misuse, but it leads to the larger question of how to appropriately disseminate information among workers and management, as well as how to get the actual end user of that information to apply it correctly so that the company realizes its benefit.

    I always wondered, and often inquired, why it was decided to perform a certain task in “this” fashion. Most often the answer really was “well, that’s how we’ve always done it”. What bothered me about such an answer was the results achieved by the means. The most remarkable example I can give comes from an extrusion plant. This was no small outfit and it had dozens of very large customers. A guy named Gary handled most of the maintenance around the plant. I watched Gary over the course of a couple of months and during that time he replaced, on several occasions, a particular gasket in the hydraulic system that was used to push preheated ingots through a die. In order to make this replacement the line had to be stopped and he had to isolate the line, remove and replace the gasket, etc., all while enduring the heat from a poorly insulated oven that operated at several thousand degrees. Come to find out, Gary had performed this operation 2-3 times a week for over 15 years. I was fascinated by this and so I did a little investigation, which led me to a very simple solution. The wrong gasket was being purchased, by the score, every month like clockwork. When I pointed this out, and gave the part number for the correct gasket, I was met with anger. Since this fascinated me even more, at the time, than the 15-year habit of purchasing the wrong gasket, I did not directly mention it again and moved my investigating to Gary himself. In a nutshell, what I found out was that he was, for whatever reason, constantly afraid of losing his job, so he made himself indispensable by performing such contrived and visible activities so he could be seen as “saving the day”. This man’s outlook on the permanence of his job was unfounded and based on his experiences when he was fairly new to the company, but it seriously affected the company’s bottom line at the time.
    I have a large number of similar stories “banked” in my little processor and often use them in my “production function”, to quote Herbert Simon. They actually are supplemental inputs to the functions which I could be perceived as an expert in (I purposely leave out the world class that is prepended to expert in the article).

    Anyway, I must stop with this one example and leave by soliciting feedback on it. I recently read “Bounded Rationality and Organizational Learning” by the previously quoted Herbert Simon. In the article he talks about the location of an organization’s stored knowledge and how it may or may not be available at the decision points where it would be relevant. In my example I strongly believe that the correct information was there but just wasn’t used because the individual with the information had a distorted view of reality. This, I have seen more times than I can recall right now.

    Here is where the solicited feedback comes in: If you are operating something like an extrusion line, and this line is capable of making, and has in the past made, a lot of money, wouldn’t you want to pay particular attention to an individual who was nearly the sole source of keeping the thing operating? The whole operation had been set up so that, if Gary met with the proverbial beer truck, the plant would literally shut down. I looked at this as a glaring weakness for the company because within a short period I figured out that not only was the key maintenance guy inept, he was also a little crazy. So why is it that several managers had never drawn a similar conclusion? Did they not care? Was the information gap between Gary and management so large that they were not able to draw conclusions?…
