Outliers



In data science, when analyzing a data set, you may notice values in an attribute that deviate greatly from the observed pattern; these are called outliers. Outliers can occur for a few reasons:

  • measurement fluctuations
  • improper data entry
  • experimental errors
  • natural variance

Identifying outliers matters because they can affect the accuracy of your end result. The effect is especially pronounced in machine learning, where a few extreme values can skew what a model learns. Finding outliers and identifying their cause is therefore an important step in data analysis.

Outliers can be classified into three different types:

  • Global outliers - a data point that deviates greatly from the rest of the values for the same attribute in the dataset.
  • Collective outliers - a group of data points that, taken together, deviate from the rest of the dataset, even if each point looks unremarkable on its own. These can be difficult to identify without any background knowledge or context behind the data.
  • Contextual outliers - a data point that differs greatly from the rest of the data in one context, but would be considered normal in another context. For example, hospitalizations have fluctuated greatly during the Covid-19 pandemic. Let's say Ontario's ICU count on a given day is 20 people across the province. This would be considered normal outside the pandemic, but it is an outlier in the context of the pandemic because it falls so far below the roughly 100 patients we have generally seen being treated on most given days.
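
To make the contextual case concrete, here is a minimal Python sketch that checks each value only against other values from the same context. The function name `contextual_outliers`, the context labels, and the ICU counts are all hypothetical and invented for illustration; they are not real Ontario figures. The check uses the modified z-score (based on the median and the median absolute deviation) with a cutoff of 3.5, which is one common convention and not the only reasonable choice.

```python
from collections import defaultdict
import statistics


def contextual_outliers(records, threshold=3.5):
    """Flag (context, value) pairs that are unusual within their own context.

    Each value is scored with the modified z-score,
    0.6745 * |x - median| / MAD, computed separately per context.
    """
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, values in by_context.items():
        median = statistics.median(values)
        # Median absolute deviation: a spread measure that a single
        # extreme value cannot inflate the way it inflates the std dev.
        mad = statistics.median(abs(v - median) for v in values)
        if mad == 0:
            continue  # no spread in this context, nothing to compare against
        for v in values:
            if 0.6745 * abs(v - median) / mad > threshold:
                flagged.append((context, v))
    return flagged


# Hypothetical daily ICU counts, labelled by context (illustrative only).
icu_counts = [
    ("pre-pandemic", 18), ("pre-pandemic", 20), ("pre-pandemic", 21),
    ("pre-pandemic", 19), ("pre-pandemic", 22),
    ("pandemic", 98), ("pandemic", 102), ("pandemic", 100),
    ("pandemic", 20), ("pandemic", 97), ("pandemic", 103),
]

print(contextual_outliers(icu_counts))  # [('pandemic', 20)]
```

A count of 20 appears in both contexts, but it is flagged only in the pandemic group, where the typical value is close to 100.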

Scatter plots, histograms, and box plots are among the most useful visualizations for spotting outliers.
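
A box plot conventionally draws its whiskers at 1.5 times the interquartile range (IQR) beyond the first and third quartiles, and points outside the whiskers are drawn individually as potential outliers. The same rule can be applied numerically. The sketch below assumes that 1.5 × IQR convention; the function name `iqr_outliers` and the sample readings are hypothetical, and plotting libraries may compute quartiles slightly differently.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sensor readings with one obviously extreme value.
readings = [12, 14, 13, 15, 14, 13, 98, 12, 15, 14]

print(iqr_outliers(readings))  # [98]
```

Raising `k` makes the rule more tolerant and lowering it flags more points, so the threshold is worth tuning against what counts as unusual in the data at hand.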

Source: Data Science Foundation: Knowing All About Outliers in Machine Learning