Wednesday, December 19, 2007

The Nature of Data

Chapter 1 of Joe Celko's Data and Databases: Concepts in Practice

Where is the wisdom?
Lost in the knowledge.
Where is the knowledge?
Lost in the information. - T. S. Eliot

Where is the information?
Lost in the data.
Where is the data?
Lost in the #@%&! database! - Joe Celko

Overview

So I am not the poet that T. S. Eliot was, but then he probably never wrote a computer program in his life. However, I agree with his point about wisdom and information. And if he had known the distinction between data and information, I like to think that he would have agreed with mine.

I would like to define data, without becoming too formal yet, as facts that can be represented with measurements using scales or with formal symbol systems within the context of a formal model. The model is supposed to represent something called the real world in such a way that changes in the facts of the real world are reflected by changes in the database. I will start referring to the real world as the reality for a model from now on.

The reason that you have a model is that you simply cannot put the real world into a computer or even into your own head. A model has to reflect the things that you think are important in the real world and the entities and properties that you wish to manipulate and predict.

I will argue that the first databases were the precursors to written language that were found in the Middle East (see Jean 1992). Shepherds keeping community flocks needed a way to manipulate ownership of the animals, so that everyone knew who owned how many rams, ewes, lambs, and whatever else. Rather than branding the individual animals, as Americans did in the West, each member of the tribe had a set of baked clay tokens that represented ownership of one animal, but not of any animal in particular.

When you see the tokens, your first thought is that they are a primitive internal currency system. This is true in part, because the tokens could be traded for other goods and services. But their real function was as a record keeping system, not as a way to measure and store economic value. That is, the trade happened first, then the tokens were changed, and not vice versa.

The tokens supported all the basic operations you would expect in a database. Tokens were updated when a lamb grew to become a ram or ewe, deleted when an animal was eaten or died, and inserted when new lambs were born in the spring.

One nice feature of this system is that the mapping from the model to the real world is one to one and could be done by a man who could not count or read. He had to pass the flock through a gate and match one token to one animal; we would call this a table scan in SQL. He would hand the tokens over to someone with more math ability- the CPU for the tribe- who would update everyone’s set of tokens. The rules for this sort of updating can be fairly elaborate, based on dowry payments, oral traditions, familial relations, shares owned last year, and so on.

The tokens were stored in soft clay bottles that were pinched shut to ensure that they were not tampered with once accounts were settled; we would call that record locking in database management systems.
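As an aside of my own (not from the book), the token system maps one for one onto modern insert, update, and delete operations. Here is a minimal Python sketch, with hypothetical holdings:

    # A toy model of the clay-token "database": one token per animal,
    # with no identity for any individual animal.
    from collections import Counter

    # Abdul's hypothetical holdings: animal type -> token count
    tokens = Counter({"ewe": 15, "ram": 2, "lamb": 13})

    tokens["lamb"] += 1          # INSERT: a lamb is born in the spring
    tokens["lamb"] -= 1          # UPDATE: a lamb grows into a ewe...
    tokens["ewe"] += 1           # ...so swap one token type for another
    tokens["ram"] -= 1           # DELETE: a ram is eaten

    # The "table scan": pass the flock through the gate and check that
    # the tokens match the animals one for one.
    def reconcile(tokens, flock):
        return tokens == Counter(flock)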

Data versus Information

Information is what you get when you distill data. A collection of raw facts does not help anyone to make a decision until it is reduced to a higher-level abstraction. My sheepherders could count their tokens and get simple statistical summaries of their holdings (Abdul owns 15 ewes, 2 rams, and 13 lambs), which is immediately useful, but it is very low-level information.

If Abdul collected all his data and reduced it to information for several years, then he could move up one more conceptual level and make more abstract statements like, "In the years when the locusts come, the number of lambs born is less than in the following two years"- a statement of a different nature than a simple count. There is both a long time horizon into the past and an attempt to make predictions for the future. The information is qualitative and not just quantitative.

Please do not think that qualitative information is to be preferred over quantitative information. SQL and the relational database model are based on sets and logic. This makes SQL very good at finding set relations, but very weak at finding statistical and other relations. A set relation might be an answer to the query "Do we have people who smoke, drink, and have high blood pressure?" that gives an existence result. A similar statistical query, "How are smoking and drinking correlated with high blood pressure?", gives a numeric result that is more predictive of future events.
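To make the contrast concrete, here is a small Python sketch; the patient records and field layout are mine, invented purely for illustration (statistics.correlation needs Python 3.10 or later):

    import statistics

    # Hypothetical records: (smokes, drinks, systolic blood pressure)
    patients = [
        (1, 1, 150), (1, 0, 140), (0, 1, 130),
        (0, 0, 120), (1, 1, 160), (0, 0, 118),
    ]

    # Set relation: DO we have people who smoke, drink, and have high
    # blood pressure? The answer is an existence result: yes or no.
    exists = any(s and d and bp > 140 for (s, d, bp) in patients)

    # Statistical relation: HOW is smoking correlated with blood
    # pressure? The answer is a number that has predictive value.
    r = statistics.correlation([s for (s, _, _) in patients],
                               [bp for (_, _, bp) in patients])

    print(exists, round(r, 3))   # True 0.894 on this toy data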

Information versus Wisdom

Wisdom does not come out of the database or out of the information in a mechanical fashion. It is the insight that a person must add to information to handle totally new situations. I teach data and information processing; I don’t teach wisdom. However, I can make a few remarks about the improper use of data that comes from bad reasoning.

Innumeracy

Innumeracy is a term coined by John Allen Paulos in his 1990 best-seller of the same title. It refers to the inability to do simple mathematical reasoning to detect bad data or bad reasoning. Having data in your database is not the same thing as knowing what to do with it. In an article in Computerworld, Roger L. Kay does a very nice job of giving examples of this problem in the computer field (Kay 1994).

Bad Math

Bruce Henstell (1994) stated in the Los Angeles Times: "When running a mile, a 132 pound woman will burn between 90 to 95 calories but a 175 pound man will drop 125 calories. The reason seems to be evolution. In the dim pre-history, food was hard to come by and every calorie had to be conserved- particularly if a woman was to conceive and bear a child; a successful pregnancy requires about 80,000 calories. So women should keep exercising, but if they want to lose weight, calorie count is still the way to go."

Calories are a measure of the energy produced by oxidizing food. In the case of a person, calorie consumption depends on the amount of oxygen they breathe and the body material available to be oxidized.

Let’s figure out how many calories per pound of human flesh the men and women in this article were burning: (95 calories/132 pounds) ≈ 0.72 calories per pound of woman and (125 calories/175 pounds) ≈ 0.71 calories per pound of man. Gee, there is essentially no difference at all! Based on these figures, human flesh consumes calories at a constant rate when it exercises, regardless of gender. This does not support the hypothesis that women have a harder time losing fat through exercise than men- if anything, just the opposite. Mostly, it shows that reporters cannot do simple math.
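The check is trivial to mechanize; a three-line sanity test of the article’s own numbers:

    # Sanity-check the Los Angeles Times numbers.
    woman = 95 / 132    # calories burned per pound of woman
    man = 125 / 175     # calories burned per pound of man
    print(f"{woman:.2f} vs {man:.2f}")   # 0.72 vs 0.71: no real difference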

Another example is the work of Professor James P. Allen of Northridge University and Professor David Heer of USC. In late 1991, they independently found out that the 1990 census for Los Angeles was wrong. The census showed a rise in Black Hispanics in South Central Los Angeles from 17,000 in 1980 to almost 60,000 in 1990. But the total number of Black citizens in Los Angeles had been dropping for years as they moved out to the suburbs (Stewart 1994).

Furthermore, the overwhelming source of the Latino population is Mexico and then Central America, which have almost no Black population. In short, the apparent growth of Black Hispanics did not match the known facts.

Professor Allen attempted to confirm this growth with field interviews, but when he went to the bilingual coordinator for the district’s schools, he could not find any Black Hispanic children.

Professor Heer did it with just the data. The census questionnaire asked for race as White, Black, or Asian, but not Hispanic. Most Latinos would not answer the race question- "Hispanic" is the root word of "spic," an ethnic slur in Southern California. He found that the Census Bureau program would assign an ethnic group when it was faced with missing data. The algorithm was to look at the makeup of the neighbors and assume that the missing data was the same ethnicity.

If only they had NULLs to handle the missing data, they might have been saved. Speaker’s Idea File (published by Ragan Publications, Chicago) lost my business when they sent me a sample issue of their newsletter that said, "On an average day, approximately 140,000 people die in the United States." Let’s work that out: 365.2422 days per year times 140,000 deaths per day gives a total of 51,133,908 deaths per year.

Since there are a little less than 300 million Americans as of the last census, we are looking at about 17% of the entire population dying every year- one person in every five or six. This seems a bit high. The actual figure is about 2,500,000 deaths per year.
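Again, the arithmetic fits in a few lines of Python:

    # Sanity-check the newsletter's claim of 140,000 deaths per day.
    deaths_per_year = 140_000 * 365.2422
    print(f"{deaths_per_year:,.0f}")          # 51,133,908 deaths per year
    print(f"{deaths_per_year / 300e6:.0%}")   # ~17% of 300 million people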

There have been a series of controversial reports and books using statistics as their basis. Tainted Truth: The Manipulation of Facts in America, by Cynthia Crossen, a reporter for the Wall Street Journal, is a study of how political pressure groups use false facts to advance their agendas (Crossen 1996). So there are reporters who care about mathematics, after all!

Who Stole Feminism?, by Christina Hoff Sommers, points out that feminist authors were quoting a figure of 150,000 deaths per year from anorexia when the actual figure was no higher than 53. Some of the more prominent feminist writers who used this figure were Gloria Steinem ("In this country alone. . . about 150,000 females die of anorexia each year," in Revolution from Within) and Naomi Wolf ("When confronted by such a vast number of emaciated bodies starved not by nature but by men, one must notice a certain resemblance [to the Nazi Holocaust]," in The Beauty Myth). The same false statistic also appears in Fasting Girls: The Emergence of Anorexia Nervosa as a Modern Disease, by Joan Brumberg, former director of Women’s Studies at Cornell, and in hundreds of newspapers that carried Ann Landers’s column. But the press never questioned the number, in spite of its being almost three times the death toll of the entire 10 years of the Vietnam War (approximately 58,000) or of one year of auto accidents (approximately 48,000).

You might be tempted to compare this to the Super Bowl Sunday scare that went around in the early 1990s (the deliberate lie that more wives are beaten on Super Bowl Sunday than any other time). The original study only covered a very small portion of a select group- African Americans living in public housing in one particular part of one city. The author also later said that her report stated nothing of the kind, remarking that she had been trying to get the urban myth stopped for many months without success. She noted that the increase was considered statistically insignificant and could just as easily have been caused by bad weather that kept more people inside.

The broadcast and print media repeated it without even attempting to verify its accuracy, and even broadcast public warning messages about it. But at least the Super Bowl scare was not obviously false on the face of it. And the press did do follow-up articles showing which groups created and knowingly spread a lie for political reasons.

Causation and Correlation

People forget that correlation is not cause and effect. A necessary cause is one that must be present for an effect to happen- a car has to have gas to run. A sufficient cause will bring about the effect by itself- dropping a hammer on your foot will make you scream in pain, but so will having your hard drive crash. A contributory cause is one that helps the effect along, but would not be necessary or sufficient by itself to create the effect. There are also coincidences, where one thing happens at the same time as another, but without a causal relationship.

A correlation between two measurements, say, X and Y, is basically a formula that allows you to predict one measurement given the other, plus or minus some error range. For example, if I shot a cannon locked at a certain angle, based on the amount of gunpowder I used, I could expect to place the cannonball within a 5-foot radius of the target most of the time. Once in a while, the cannonball will be dead on target; other times it could be several yards away.

The formula I use to make my prediction could be a linear equation or some other function. The strength of the prediction is called the coefficient of correlation and is denoted in statistics by the variable r, where -1 ≤ r ≤ 1. A coefficient of correlation of -1 is absolute negative correlation- when X happens, then Y never happens. A coefficient of correlation of +1 is absolute positive correlation- when X happens, then Y also happens.

A zero coefficient of correlation means that X and Y happen independently of each other. The confidence level is related to the coefficient of correlation, but it is expressed as a percentage. It says that x% of the time, the relationship you found would not have happened by pure chance.
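To make r concrete, here is the Pearson coefficient computed straight from its definition, on toy data I invented for illustration; whatever the inputs, the result always lands between -1 and +1:

    import math

    def pearson_r(xs, ys):
        """Pearson coefficient of correlation; always in [-1, +1]."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # Toy data: a strong, but not perfect, positive relationship.
    print(round(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]), 3))  # 0.853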

The study of secondhand smoke (or environmental tobacco smoke, ETS) by the EPA, which was released jointly with the Department of Health and Human Services, is a great example of how not to do a correlation study. First they gathered 30 individual studies and found that 24 of them would not support the premise that secondhand smoke is linked to lung cancer. Next, they combined 11 handpicked studies that used completely different methods into one sample- a technique known as meta-analysis, or more informally as the apples-and-oranges fallacy. Still no link. It is worth mentioning that one of the rejected studies had recently been sponsored by the National Cancer Institute- hardly a friend of the tobacco lobby- and it also showed no statistical significance.

The EPA then lowered the confidence level from 98% to 95%, and finally to 90%, where they got a relationship. No responsible clinical study has ever used less than 95% for its confidence level. Remember that a confidence level of 95% says that 5% of the time, this could just be a coincidence. A 90% confidence level doubles the chances of an error.

Alfred P. Wehner, president of Biomedical and Environmental Consultants Inc. in Richland, Washington, said, "Frankly, I was embarrassed as a scientist with what they came up with. The main problem was the statistical handling of the data." Likewise, Yale University epidemiologist Alvan Feinstein, who is known for his work in experimental design, said in the Journal of Toxicological Pathology that he heard a prominent leader in epidemiology admit, "Yes, it’s [EPA’s ETS work] rotten science, but it’s in a worthy cause. It will help us get rid of cigarettes and to become a smoke-free society." So much for scientific truth versus a political agenda.

Another way to test a correlation is to look at the real world. For example, if ETS causes lung cancer, then why do rats who are put into smoke-filled boxes for most of their lives not have a higher cancer rate? Why aren’t half the people in Europe and Japan dead from cancer?

There are five ways two variables can be related to each other. The first case is that X causes Y. You can estimate the temperature in degrees Fahrenheit from the chirp rate of a cricket, degrees = (chirps + 137.22)/3.777, with r = 0.9919; however, nobody believes that crickets cause temperature changes (a small sketch of this formula appears after the five cases). The second case is that Y causes X.

The third case is that X and Y interact with each other. Supply and demand curves are an example, where, as one goes up, the other goes down (negative feedback in computer terms). A more horrible example is drug addiction, where the user requires larger and larger doses to get the desired effect (positive feedback in computer terms), as opposed to habituation, where the usage hits an upper level and stays there.

The fourth case is that any relationship is pure chance. Any two trends in the same direction will have some correlation, so it should not surprise you that once in a while two will match very closely.

The final case is where the two variables are effects of another variable that is outside the study. The most common unseen variables are changes in a common environment.

For example, severe hay fever attacks go up when corn prices go down. They share a common element- good weather. Good weather means a bigger corn crop and hence lower prices, but it also means more ragweed and pollen and hence more hay fever attacks. Likewise, spouses who live pretty much the same lifestyle will tend to have the same medical problems from a common shared environment and set of habits.
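Returning to the first case for a moment, the cricket formula is easy to play with. A minimal sketch (the formula is from the text; the chirps-per-minute unit is my assumption):

    def fahrenheit_from_chirps(chirp_rate):
        """Estimate temperature from cricket chirp rate (r = 0.9919)."""
        return (chirp_rate + 137.22) / 3.777

    # High correlation, yet the causation runs the other way:
    # temperature drives the chirping, not vice versa.
    print(round(fahrenheit_from_chirps(100), 1))   # 62.8 degrees F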

Testing the Model against Reality

The March 1994 issue of Discover magazine had a commentary column entitled "Counting on Dyscalculia" by John Allen Paulos. His particular topic was health statistics, since those create a lot of pop dread when they get played up in the media.

One of his examples in the article was a widely covered lawsuit in which a man alleged a causal connection between his wife’s frequent use of a cellular phone and her subsequent brain cancer. Brain cancer is a rare disease that strikes approximately 7 out of 100,000 people per year. Given the large population of the United States, this is still about 17,500 new cases per year- a number that has held pretty steady for years.

There are an estimated 10 million cellular phone users in the United States. At the base rate of 7 per 100,000, those 10 million users should produce about 700 cases per year. If there were a causal relationship, then there would be an increase in cases as cellular phone usage increased. On the other hand, if we found that there were fewer than 700 cases among cellular phone users, we could use the same argument to prove that cellular phones prevent brain cancer.

Perhaps the best example of testing a hypothesis against the real world was the bet between the late Julian Simon and Paul Ehrlich (author of The Population Bomb and a whole raft of other doomsday books) in 1980. They took an imaginary $1,000 and let Ehrlich pick commodities. The bet was whether the real price would go up or down, depending on the state of the world, in the next 10 years. If the real price (i.e., adjusted for inflation) went down, then Simon would collect the adjusted real difference in current dollars; if the real costs went up, then Ehrlich would collect the difference adjusted to current dollars.

Ehrlich picked metals- copper, chrome, nickel, tin, and tungsten- and invested $200 in each. In the fall of 1990, Ehrlich paid Simon $576.07 and did not call one of his press conferences about it. What was even funnier is that if Ehrlich had paid off in current dollars, not adjusted for inflation, he would still have lost!

Models versus Reality

A model is not reality, but a reduced and simplified version of it. A model that was more complex than the thing it attempts to model would be less than useless. The term "the real world" means something a bit different from what you would intuitively think. Yes, physical reality is one real world, but the term also covers a database of information about the fictional worlds of Star Trek, the "what if" scenarios in a spreadsheet or discrete simulation program, and other abstractions that have no physical form. The main characteristic of the real world is to provide an authority against which to check the validity of the database model.

A good model reflects the important parts of its reality and has predictive value. A model without predictive value is a formal game and not of interest to us.

The predictive value does not have to be absolutely accurate. Realistically, Chaos Theory shows us that a model can never be 100% predictive for any system that has enough structure to be interesting and a feedback loop.

Errors in Models

Statisticians classify experimental errors as Type I and Type II. A Type I error is accepting as false something that is true. A Type II error is accepting as true something that is false. These are very handy concepts for database people, too.

The classic Type I database error is the installation in concrete of bad data, accompanied by the inability or unwillingness of the system to correct the error in the face of the truth. My favorite example of this is a classic science fiction short story written as a series of letters between a book club member and the billing computer. The human has returned an unordered copy of Kidnapped by Robert Louis Stevenson and wants it credited to his account.

When he does not pay, the book club computer turns him over to the police computer, which promptly charges him with kidnapping Robert Louis Stevenson. When he objects, the police computer investigates, and the charge is amended to kidnapping and murder, since Robert Louis Stevenson is dead. At the end of the story, he gets his refund credit and letter of apology after his execution.

While exaggerated, the story hits all too close to home for anyone who has fought a false billing in a system that has no provision for clearing out false data.

The following example of a Type II error involves some speculation on my part. Several years ago a major credit card company began to offer cards in a new designer color with higher limits to their better customers. But if you wanted to keep your old card, you could have two accounts. Not such a bad option, since you could use one card for business and one for personal expenses.

They needed to create new account records in their database (file system?) for these new cards. The solution was obvious and simple: copy the existing data from the old account without the balances into the new account and add a field to flag the color of the card to get a unique identifier on the new accounts.

The first batch of new card orders came in. Some orders were for replacement cards, some were for the new card without any prior history, and some were for the new two accounts option.

One of the fields was the date of first membership. The company thinks that this date is very important since they use it in their advertising. They also think that if you do not use a card for a long period of time (one year), they should drop your membership. They have a program that looks at each account and mails out a form letter to these unused accounts as it removes them from the database.

The brand new accounts were fine. The replacement accounts were fine. But the members who picked the two-card option were a bit distressed. The only date that the system had to use as the date of last card usage was the date the original account was opened- and that was almost always more than one year in the past, since you needed a good credit history with the company to be offered the new card.

Before the shiny new cards had been printed and mailed out, the customers were getting drop letters on their new accounts. The switchboard in customer service looked like a Christmas tree. This is a Type II error- accepting as true the falsehood that the last usage date was the same as the acquisition date of the credit card.
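Here is a minimal sketch of the purge logic as I imagine it; the field names and the one-year rule’s implementation are hypothetical, but the failure mode is exactly the one described above:

    from datetime import date, timedelta

    def should_drop(last_used, today):
        """Purge rule: drop any account unused for more than one year."""
        return today - last_used > timedelta(days=365)

    # The copied account's only "last usage" date is the date the
    # ORIGINAL account was opened, which is years in the past.
    original_open = date(2003, 6, 1)
    print(should_drop(original_open, date(2007, 12, 19)))
    # True: the brand-new account gets a drop letter before its
    # shiny new card has even been mailed.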

Assumptions about Reality

The purpose of separating the formal model and the reality it models is to first acknowledge that we cannot capture everything about reality, so we pick a subset of the reality and map it onto formal operations that we can handle.

This assumes that we can know our reality, fit it into a formal model, and appeal to it when the formal model fails or needs to be changed.

This is an article of faith. In the case of physical reality, you can be sure that there are no logical contradictions or the universe would not exist. However, that does not mean that you have full access to all the information in it. In a constructed reality, there might well be logical contradictions or vague information. Just look at any judicial system that has been subjected to careful analysis for examples of absurd, inconsistent behavior.

But as any mathematician knows, you have to start somewhere and with some set of primitive concepts to be able to build any model.
