Exposing the many biases in data
How machines can learn to make the wrong prediction
In 2022, I had an article published in the Business Information Review - Exposing the many biases of machine learning. Citation:
Richardson, S. (2022). Exposing the many biases in machine learning. Business Information Review, 39(3), 82-89. https://doi.org/10.1177/02663821221121024
I presented the many different ways bias can be present in data and machine learning, focusing on the use of real-world data. It was not intended to be an exhaustive list. Rather, it was to show that bias can occur throughout the data science process, from capture to curation to creating models and making predictions.
The image below summarises the biases in the paper, grouped at three stages - during data capture, data curation, and model predictions
I had the opportunity to present the paper to the Swiss Federal Statistics Office in 2022. It wasn’t recorded at the time but below is a rerecording, containing examples to help explain each bias.
To summarise,
Capturing data
Completeness bias occurs when you assume the data you have is complete when the reality is that there is missing data that, if known, would improve predictions. A version of this is ‘survivorship bias’ - where you only have the data that survives some selection or filtering process.
Representation bias occurs when you have all available data, but it does not fully represent the scenario or target you want to build a prediction model for. It could be that the dataset is imbalanced. For example, if predicting general human behaviour, does the dataset fairly represent all demographics?
Truthfulness bias occurs then the data contains inaccuracies because participants do not tell the truth. One example of this is ‘social desirability’ bias - we tend to say what we think the other person wants to hear. Another example is ‘adversarial bias’ - the deliberate poisoning of a training dataset or tricking an algorithm into making the wrong prediction.
Subjective bias can occur when the observers capturing data inadvertently create an imbalanced dataset through the methods they adopt to generate or acquire data. When capturing data about the real-world, it is positioned in space and time. And often reflects the culture of the era when the data was collected.
Curating data
Editing/cleaning bias occurs when errors are introduced into the dataset during processing. This can occur due to rekeying errors if data is being manually moved between systems, and it can occur during cleaning to fix incomplete or invalid records.
Transformation bias occurs when data is recoded to enable machine learning or scaled for comparison. For example, converting categorical data into numerical. The distances between numeric values will create weights that may or may not be valid.
Annotation bias occurs when data is annotated or labelled inappropriately. Much of labelling is still performed by humans, and perceptions and beliefs can affect decisions about what labels should be applied to different data. Automating with algorithms brings its own challenges - the algorithms still need training first.
Enrichment bias can occur when enriching datasets to create new features. This can often be beneficial when combining different datasets. But it can create spurious correlations or inappropriately amplify the weights of features.
Data predictions
Feature selection bias can occur when the selected features amplify bias within the dataset, or maintain a bias that was supposed to have been removed. Features are selected for their predictive power. But that power is based on past performance.
Sampling methods to split data into training and test sets can create bias. If the set has important outliers, you will want some to be present in training, and some to held back unseen for testing. Random sampling or arbitrary cut-offs risk all being in training or test. And failing to split at all means there is no testing of predictions…
Algorithm choice for building a model can have an impact on predictions. Back to our outliers - do we want to include or exclude them? For example, to perform clustering, you can choose a technique that includes outliers. You can choose a technique that excludes outliers. Each will produce different outcomes.
And finally, how the model is used in application matters. Models built based on real-world data are located in space and time. They have limited predictive power outside that environment. And their predictive performance will decay - the model will ‘drift’.
Closing thoughts
The potential for bias in machine learning is a concern because, at the time of writing the paper, and arguably it is still true today, machine learning models lacked the rigour of traditional statistical methods when it comes to evaluating performance. It is even more of a challenge when working with ‘black box’ algorithms, where it can be difficult to explain how the model learned from the data to make the prediction.
Since the paper was published, a new concern is the arrival of publicly available tools such as ChatGPT. Large language models have made it much easier for people to build powerful prediction models without needing to first study advanced analytical skills. It could lead to a rise in biased prediction models that could have serious real-world consequences. One example I’ve talked about before is the risk with using algorithms to detect emotion. For more on that topic - Can an AI detect emotion?


