# Using Statistics

#### Statistics matter

I recently chatted with my older granddaughter about statistics. She is planning to take a course in the subject and is a little concerned. Statistical math can be quite challenging, but for the rest of us, the math is not the problem. Everyone should understand statistics better because people use them to lie.

You should have an intuition about what people present to you.

“Facts are stubborn things, but statistics are more pliable.” – Mark Twain

#### Statistics are inferences, not facts

There is nothing wrong with inferences if you know how they came to be. Thus the course in statistics. To draw a reliable inference you need several things:

1. Competent data. You need a method of sampling. You cannot ask every human their opinion or preference on any subject. By the time you finished, many would have changed their mind.
2. Objective data. The form of the question matters. “Do you prefer Coke or Pepsi?” is pretty easy, although you might want to know the gradation of the preference. Some survey questions imply a “correct” answer. No reliable inference can follow from those.
3. Entire population. If you want to infer something about society as a whole, you cannot draw your sample from a tiny part of it. Taxation preferences drawn from poor people are quite different from those drawn from the rich. Surveying teachers would get different results about education than would surveying parents or students.
4. A method of analysis. There are many ways to analyze statistics. They include simple ones like average, regression, standard deviation, correlation, and significance. Average is most commonly seen and most commonly abused. The average human has slightly fewer than two arms. There are no humans who have exactly the average number of arms. When there is no specific observation that matches the average, beware.
5. A method of presentation. Some people like tables of numbers with columns headed by little Greek symbols lying on their side. Others like graphs. Some will want both. The idea of presentation is “As simple as possible but no simpler.”
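The point about averages in item 4 is easy to check. The sketch below uses a made-up population (the counts are assumptions chosen for illustration, not real data) and compares the mean with the median and the mode, then asks whether anyone actually has the "average" number of arms.

```python
from statistics import mean, median, mode

# Hypothetical population: the counts are invented for illustration only.
arms = [2] * 998 + [1] * 2

average_arms = mean(arms)      # pulled slightly below 2 by the two outliers
typical_arms = median(arms)    # the middle observation
most_common = mode(arms)       # the most frequent observation

# Does any individual actually have the "average" number of arms?
anyone_matches_mean = average_arms in arms

print(f"mean={average_arms}, median={typical_arms}, mode={most_common}")
print("someone matches the mean:", anyone_matches_mean)
```

When the mean matches no real observation, the median or the mode usually describes the typical case better.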

#### How easy is it to lie with statistics?

Example #1. Suppose I told you that 20% of American children lived in poverty while only 13% in the OECD did. Would you want government programs, or an explanation of where the numbers came from? The key element is how you define poverty. The liars know people have simple ideas about poverty, so if they can recast the meaning of the word, they can get a reaction. In this case, poverty means a family income less than 50% of the median income in the population. In the United States median household income is \$72,000. In Mexico it is a little under \$10,000. A household in the US just at the border of poverty has nearly four times the income of the typical Mexican household and seven times the threshold of the poor Mexican. You will notice two things:

1. Under this definition poverty cannot be eradicated by making everyone better off. If every income doubled, the threshold would double too, and the same share of people would fall below it.
2. The statistic is for polarizing, not for presenting useful information.
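A relative poverty line moves with the incomes it is measured against, which a few lines of code can show. This is a minimal sketch: the function name `relative_poverty_rate`, its `center` parameter, and the income figures are all assumptions made up for illustration. The reference income defaults to the median, and the mean can be passed in instead.

```python
from statistics import mean, median

def relative_poverty_rate(incomes, fraction=0.5, center=median):
    """Share of households below `fraction` of the central income."""
    threshold = fraction * center(incomes)
    poor = sum(1 for income in incomes if income < threshold)
    return poor / len(incomes)

# Hypothetical household incomes in dollars (invented for illustration).
incomes = [12_000, 20_000, 30_000, 45_000, 60_000,
           72_000, 90_000, 120_000, 200_000, 500_000]

# Everyone becomes twice as well off.
doubled = [2 * income for income in incomes]

rate_before = relative_poverty_rate(incomes)
rate_after = relative_poverty_rate(doubled)

print(f"poverty rate before: {rate_before:.0%}, after doubling: {rate_after:.0%}")
```

The rate is the same before and after, because the threshold scales with the incomes. The measure tracks how incomes are spread, not how much anyone has.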

Example #2. The three richest Americans have more wealth than the bottom half. Technically true but a little meaningless. If you have \$100 in a bank account, you are wealthier than about 50,000,000 Americans. Why? Because many Americans have no wealth accumulated. Very young children, for example. The bottom half is a net number. All of the below-zero amounts are netted with the above-zero amounts. The population of billionaires is nothing like the rest of the population, and a single comparison of totals tells you little about either.
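The netting effect can be seen with a toy example. The ten net-worth figures below are invented for illustration and are not real data. Debts in the bottom half cancel the small positive balances, so the group's total can come out below zero.

```python
# Hypothetical net worth of ten households, sorted from poorest to richest.
# The figures are invented for illustration only.
net_worth = [-40_000, -15_000, -5_000, 0, 2_000,
             30_000, 80_000, 150_000, 400_000, 5_000_000]

bottom_half = net_worth[: len(net_worth) // 2]
bottom_half_total = sum(bottom_half)   # debts are netted against savings

my_wealth = 100
richer_than_bottom_half = my_wealth > bottom_half_total
households_i_beat = sum(1 for w in net_worth if w < my_wealth)

print("bottom half, netted:", bottom_half_total)
print("I hold more than the bottom half combined:", richer_than_bottom_half)
print("households individually poorer than me:", households_i_beat)
```

In this toy population someone with \$100 holds more than the entire bottom half combined, which says more about how the total is computed than about anyone's circumstances.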

Example #3. Ignoring the base rate. Suppose I may have some unusual disease. The disease occurs in one in ten thousand cases. There is a test, and it has no false negatives and 5% false positives. I take the test and it shows positive. What should I do then? Step one: establish the meaning. My chance of having the disease is now about 1 in 500. In 10,000 tests you should expect about 500 positives, only one of them a real case, and I am one of the 500. I suppose I could take the test again, and if it is positive my chance is now about 1 in 25. After a third positive it is pretty close to 50-50. This assumes each retest is independent of the last. Be very cautious relying on one outcome when the incidence is rare to begin with. With a rare condition, the false positive rate drives the answer.
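The arithmetic here is Bayes' rule applied once per test. The sketch below uses the incidence and error rates from the example; the function name `posterior` is made up for illustration, and the loop assumes each retest is independent of the previous one, which a real test may not satisfy.

```python
def posterior(prior, false_positive_rate, false_negative_rate=0.0):
    """Probability of disease after one positive test (Bayes' rule)."""
    true_positive = prior * (1.0 - false_negative_rate)
    false_positive = (1.0 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)

prior = 1 / 10_000   # incidence: one case in ten thousand
fpr = 0.05           # 5% false positives, no false negatives

probabilities = []
p = prior
for _ in range(3):
    # Assumption: every retest is an independent draw with the same error rates.
    p = posterior(p, fpr)
    probabilities.append(p)

for n, prob in enumerate(probabilities, start=1):
    print(f"after {n} positive test(s): {prob:.4f} (about 1 in {1 / prob:.0f})")
```

Under these assumptions the chance is roughly 1 in 500 after one positive, about 1 in 26 after two, and about 44% after three.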

Example #4. Notice the presentation. I once saw a graph where there were two lines, one going up and the other, unexpectedly, going down. Upon examination I noticed the graph had two y-axes. The left one went up as you would expect. The right one, however, went down, with higher numbers at the bottom. In fairness it did, in tiny print, note “inverted axis.” The story that came with the graph was based on both axes being in traditional form. Misleading.

Example #5. Misused regression. Regression often fits a line to data. The trend line. People like trends, but regression analysis is not valid outside the range of the data you have. It does not predict the future. Suppose you had information about the average number of people per car on the highway. The data runs from 1920 to 1970. We observe that the number has fallen each decade. From the data, in what year would the number of people per car fall below 1? Clearly the data predicts driverless cars.
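To see how a trend line misbehaves outside its data, here is a minimal least-squares sketch. The occupancy figures are made up for illustration (they are not real highway data), and `fit_line` is a hypothetical helper written for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical average occupants per car by decade (invented numbers).
years = [1920, 1930, 1940, 1950, 1960, 1970]
occupants = [3.1, 2.9, 2.6, 2.5, 2.2, 2.0]

slope, intercept = fit_line(years, occupants)

# Extrapolate: solve slope * year + intercept = target for the year.
year_below_one = (1.0 - intercept) / slope
year_at_zero = (0.0 - intercept) / slope

print(f"trend: {slope:.3f} occupants per year")
print(f"line crosses 1 occupant around {year_below_one:.0f}")
print(f"line crosses 0 occupants around {year_at_zero:.0f}")
```

The fit is reasonable between 1920 and 1970. Extended far enough, the same line predicts cars with fewer than one occupant and then with none, which is the sign that it has left the range where it means anything.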

Example #6. Absence of evidence is not evidence of absence. In simple terms, nothing proves nothing. You could argue that as there is no evidence of old telephone lines and switchboards in the Sahara, the people in the Sahara must have been the first to use wireless telephones. Like a misused correlation, this treats a pattern that happens to fit as proof of some fact. There are many spurious correlations available.

#### What to do

Assess the purpose of the presentation. If you see an argument based on statistics, assume the provider has a reason to present them as they are and in the way they have done so. Assume they want to convince you of something. Be skeptical and assess whether the methods to collect the data and create the inference make sense. If not, assume they intend to deceive.

Always know you can create statistics to prove anything if you are willing to tinker with their collection and analysis. Correlation is interesting but does not, by itself, imply cause. Trend lines have meaning only within the range of the data.

You can soon become intuitive about what is possible and what is not. Keep in mind that statisticians can prove that 42.3% of all statistics are made up on the spot.