5 Types of Bias in Data
While there has never been more data available for making qualified decisions, decisions are not guaranteed to be successful simply because they are based on data. In this blog post, I'd like to examine one reason why wrong choices are made: bias. I have already talked about the issues and weaknesses that bias causes in AI applications. Now let's turn our focus to the different types of bias.
In this context, bias means that the outcomes of research are altered by predetermined ideas, prejudice, or influence in a specific direction. The standard definition of data bias is that the available data is not representative of the population or phenomenon under study. Or, as Andrew Gelman's dictum puts it: "The most important aspect of a statistical analysis is not what you do with the data, it's what data you use."
The Five Most Common Types of Bias
There are different types of bias. However, there is only one source of bias in data: human misjudgment. This can range from personal (often unnoticed) bias to the inclusion of outliers, and from poor selection of the data sample to a spurious association caused by an "extra" variable that you didn't account for. Let's take a closer look at the five main types of bias:
Confirmation Bias: Confirmation bias is the inclination to look for, interpret, favor, and recall data in a way that affirms or bolsters one's existing convictions or values. It is a powerful type of cognitive bias that can harm society by distorting evidence-based decision-making. Examples are remembering information selectively or interpreting the information given to you in a biased way. Studies have shown that we can even be manipulated into remembering fake childhood events. This goes to show that people sometimes don't even notice when they are analyzing data in a biased way (another psychological phenomenon that fits this category is wishful thinking).
Selection Bias: Selection bias is the bias introduced by selecting individuals, groups, or data for analysis in a way that does not achieve proper randomization, with the result that the sample obtained is not representative of the population to be analyzed. The term "selection bias" usually refers to the distortion of a statistical analysis that results from the sampling method. It is essential to take selection bias into account; otherwise, some conclusions of the study may be wrong.
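To make selection bias concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the population of customer ages is simulated, and the "online survey" is assumed to reach only customers younger than 40. It compares the average age from that biased sample with the average from a properly random sample.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 customers aged 18 to 87.
population = [rng.randint(18, 87) for _ in range(10_000)]

# Biased selection: an online-only survey that, by assumption,
# reaches only customers younger than 40.
online_sample = [age for age in population if age < 40][:500]

# Random selection: every customer has the same chance of being drawn.
random_sample = rng.sample(population, 500)

population_mean = mean(population)
biased_mean = mean(online_sample)
random_mean = mean(random_sample)

print(f"population average age:    {population_mean:.1f}")
print(f"online-only sample:        {biased_mean:.1f}")
print(f"random sample:             {random_mean:.1f}")
```

The random sample should land close to the population average, while the online-only sample is far too young, no matter how many respondents it contains.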
Outliers: An outlier is an extreme data value, for example a customer who is 110 years old, or a consumer with 10 million dollars in their savings account. You can identify outliers by carefully inspecting the data, especially the distribution of the values. Since outliers are extreme data values, it can be dangerous to base decisions on the calculated "average." In other words: extreme behavior can have a significant impact on what is considered average. It is imperative to correct for outliers, for example by basing your conclusions on the median (the middle value) rather than the mean.
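A small Python sketch shows how much a single outlier can move the mean compared with the median. The ages are made up for illustration, and the 1.5 × IQR rule used to flag the outlier is one common convention, not the only option.

```python
from statistics import mean, median, quantiles

# Hypothetical customer ages; 110 is the outlier from the example above.
ages = [23, 31, 35, 38, 42, 47, 110]

mean_age = mean(ages)      # pulled upward by the single extreme value
median_age = median(ages)  # the middle value, unaffected by the extreme

# Flag outliers with the 1.5 * IQR rule (an assumed, common convention).
q1, _, q3 = quantiles(ages, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [age for age in ages if age < lower_fence or age > upper_fence]

print(f"mean:     {mean_age:.1f}")   # mean:     46.6
print(f"median:   {median_age}")     # median:   38
print(f"outliers: {outliers}")       # outliers: [110]
```

Without the 110-year-old, the mean of the remaining six ages is 36, much closer to the median of 38.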
Overfitting and Underfitting: Underfitting means that a model gives an oversimplified picture of reality. Overfitting is the inverse: the model is overcomplicated. Overfitting risks causing a particular assumption to be treated as the truth when it does not hold in practice. How can this bias be counteracted? The most straightforward approach is to ask how the model was validated. If you receive a somewhat glazed expression as a reaction, there is a good chance that the analysis outcomes are unvalidated and, therefore, might not apply to the whole database. Always ask the data analyst whether they used separate training and test samples. If the answer is no, it is highly likely that the analysis outcomes will not apply to all customers.
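The training/test question can be illustrated with a minimal Python sketch. The data is simulated (a straight line plus noise), and the three "models" are deliberately simple stand-ins chosen for illustration: a constant that ignores the input (underfitting), a least-squares line, and a lookup that memorizes the training sample (overfitting).

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: y depends linearly on x, plus random noise.
xs = [i * 0.05 for i in range(400)]
data = [(x, 2.0 * x + rng.gauss(0, 1)) for x in xs]

# Hold out part of the data: fit on the training sample only,
# then judge each model on the untouched test sample.
rng.shuffle(data)
train, test = data[:300], data[300:]

def mse(predict, sample):
    """Mean squared error of a prediction function on a sample."""
    return mean((predict(x) - y) ** 2 for x, y in sample)

# Underfitting: ignore x and always predict the training average.
train_mean_y = mean(y for _, y in train)
def constant_model(x):
    return train_mean_y

# Reasonable fit: ordinary least-squares line through the training data.
train_mean_x = mean(x for x, _ in train)
slope = (sum((x - train_mean_x) * (y - train_mean_y) for x, y in train)
         / sum((x - train_mean_x) ** 2 for x, _ in train))
intercept = train_mean_y - slope * train_mean_x
def linear_model(x):
    return slope * x + intercept

# Overfitting: memorize the training sample and return the y value
# of the training point whose x is closest.
def memorizing_model(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

models = [("constant", constant_model),
          ("linear", linear_model),
          ("memorizing", memorizing_model)]
for name, model in models:
    print(f"{name:<11} train MSE: {mse(model, train):6.2f}   "
          f"test MSE: {mse(model, test):6.2f}")
```

The memorizing model looks perfect on the training sample (error of zero) and only reveals its weakness on the test sample. That gap is what an analysis without a held-out sample cannot show.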
Confounding Variables: Confounding happens when additional factors that you have not accounted for influence your variables. In an experiment, the independent variable usually affects your dependent variable. For example, if you want to investigate whether exercise leads to weight loss, exercise is your independent variable, and weight loss is your dependent variable. Confounding variables are all other factors that also influence your dependent variable. They are additional factors that have a hidden influence on it. Confounding variables can cause two main problems: increased variance and the introduction of bias.
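The following Python sketch simulates a confounder for the exercise example. All numbers are invented, and the setup is an assumption made to isolate the effect: a hidden variable (here called healthy_diet) raises both exercise hours and weight loss, while exercise itself has no direct effect by construction. The raw correlation still looks strong and only disappears when the data is split by the confounder.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical people: (healthy_diet, exercise_hours, weight_loss_kg).
rows = []
for _ in range(5000):
    healthy_diet = rng.random() < 0.5                       # hidden confounder
    exercise_hours = rng.gauss(5 if healthy_diet else 2, 1)
    # By construction, weight loss depends on diet only, not on exercise.
    weight_loss_kg = rng.gauss(4 if healthy_diet else 1, 1)
    rows.append((healthy_diet, exercise_hours, weight_loss_kg))

# Naive analysis: correlate exercise with weight loss across everyone.
overall_r = pearson([r[1] for r in rows], [r[2] for r in rows])

# Controlling for the confounder: correlate within each diet group.
within_r = {}
for diet in (True, False):
    group = [r for r in rows if r[0] == diet]
    within_r[diet] = pearson([r[1] for r in group], [r[2] for r in group])

print(f"overall correlation:        {overall_r:.2f}")
print(f"within healthy-diet group:  {within_r[True]:.2f}")
print(f"within other group:         {within_r[False]:.2f}")
```

In real data the confounder is rarely this clean or even measured, so the sketch shows the mechanism, not a general recipe for removing it.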
It is essential to confirm that the conclusions drawn from research and analysis are not affected by distortions. Uncovering biased results is not the sole responsibility of the analyst concerned. It is the joint responsibility of everyone directly involved (including the marketer and the analyst) to reach a valid conclusion based on the correct data. In the world of marketing, where data and analysis play an increasingly important role, you must be able to rely on accurate facts. A fact is only a fact when it is sufficiently proven.
Before I end my post, keep the following saying in mind: "There are three kinds of lies: lies, damned lies, and statistics."
For this article, I used the following sources:
A picture is worth a thousand lies: Using false photographs to create false childhood memories by Kimberley A. Wade and Maryanne Garry
Confounding Variable: Simple Definition and Example, Bias in Statistics: Definition, Selection Bias & Survivorship Bias, and Variance: Simple Definition, Step by Step Examples by Stephanie Glen
5 Types of Bias in Data & Analytics by Sebastiaan de Vries
Understanding Data Bias by Prabhakar Krishnamurthy