Missing Data

Data analysts commonly encounter data sets in which some records lack a value for one or more attributes. This is called missing data. This page explains the different types of missing data and how to deal with it from a data science perspective.

Types of Missing Data

There are three types of missing data:

  • Missing not at random (MNAR) - if the chance that a value is missing depends on the missing value itself, or on something the study did not measure, the data is usually classified as MNAR. The missing data follows a predictable pattern throughout either the entirety or a portion of the dataset, because the reason the value is missing is tied directly to the value itself. ex: A survey about mental illness is administered. A person with depression is less likely to complete the survey than someone without depression.
  • Missing at random (MAR) - it is similar to MNAR, except that the missing data is related to data that is measured by the study rather than to the missing values themselves. Usually when data is MAR, there is a noticeable pattern within subgroups of the dataset. ex: a study about grocery store prices records the race of each participant, and a particular ethnic group is less likely to finish the survey. This example is MAR because the missingness is related to an observed attribute (race) and not to the grocery prices themselves.
  • Missing completely at random (MCAR) - if the data is missing regardless of both the value it would have had and the values of the other, complete attributes, the data is considered MCAR. This can occur for external reasons unrelated to anything the study measures. ex: a batch of completed survey forms is lost in the mail.
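
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the group labels, score distributions, and missingness probabilities are assumptions invented for the example, and the function names simulate and observed_mean are hypothetical.

```python
import random
from statistics import mean


def simulate(n=10_000, seed=0):
    """Return (group, score) rows plus three copies of score with gaps.

    Every number here (group means, missingness rates) is made up
    for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.choice(["A", "B"])  # always observed
        score = rng.gauss(60 if group == "B" else 50, 10)  # may go missing
        rows.append((group, score))

    def knock_out(prob_missing):
        # Replace each score with None using a row-specific probability.
        return [None if rng.random() < prob_missing(g, s) else s
                for g, s in rows]

    # MCAR: every row has the same chance of losing its score.
    mcar = knock_out(lambda g, s: 0.2)
    # MAR: the chance depends only on the observed group.
    mar = knock_out(lambda g, s: 0.5 if g == "B" else 0.1)
    # MNAR: the chance depends on the score itself.
    mnar = knock_out(lambda g, s: 0.5 if s > 60 else 0.1)
    return rows, mcar, mar, mnar


def observed_mean(values):
    """Mean of the values that are not missing."""
    return mean(v for v in values if v is not None)


if __name__ == "__main__":
    rows, mcar, mar, mnar = simulate()
    print("full data:", round(mean(s for _, s in rows), 2))
    for name, column in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
        print(name, round(observed_mean(column), 2))
```

Under these assumed settings, the observed mean of the MCAR column should stay close to the full-data mean, while the MAR and MNAR columns should drift lower. For MAR the gap should largely disappear when means are compared within each group, because group is observed. For MNAR it should not.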

How to Handle Missing Data

There are several techniques for handling missing data:

  • Imputation - if the missing data cannot be removed without affecting the reliability of the results, it is better to estimate the missing values with statistical methods, such as time-series methods, interpolation, or substituting the mean, median, or mode of the observed values.
  • Deletion - There are three types of deletion:
    • Listwise: if any attribute in a record is missing, the entire record is excluded. The drawback is that the sample size available for analysis shrinks every time a record is omitted.
    • Pairwise: the record is still used for its non-missing values and is left out only of calculations that need the missing attribute. The disadvantage is that different calculations are then based on different numbers of records, which can produce a larger margin of error.
    • Drop the attribute: if a large enough portion (more than half) of an attribute's values is missing and the attribute is not important to the analysis, it may be worth omitting the attribute completely.
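
The techniques above can be sketched in plain Python, using None to mark a missing value. This is a minimal illustration, not a reference implementation: the sample records, the attribute names, and the helper functions are all hypothetical, and the 0.5 threshold simply mirrors the "more than half" rule of thumb.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
ROWS = [
    {"age": 25,   "income": 40_000, "notes": None},
    {"age": None, "income": 52_000, "notes": None},
    {"age": 31,   "income": None,   "notes": "ok"},
    {"age": 47,   "income": 61_000, "notes": None},
]


def drop_sparse_attributes(rows, threshold=0.5):
    """Drop any attribute whose share of missing values exceeds threshold."""
    keep = [k for k in rows[0]
            if sum(r[k] is None for r in rows) / len(rows) <= threshold]
    return [{k: r[k] for k in keep} for r in rows]


def listwise_delete(rows):
    """Exclude every record that has at least one missing attribute."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise_values(rows, *attrs):
    """For one calculation, use every record that has the attributes it needs."""
    return [tuple(r[a] for a in attrs) for r in rows
            if all(r[a] is not None for a in attrs)]


def impute_mean(rows, attr):
    """Replace missing values of attr with the mean of the observed values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]


def interpolate(series):
    """Linearly fill interior gaps in an evenly spaced series."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out  # leading or trailing gaps are left as None


if __name__ == "__main__":
    trimmed = drop_sparse_attributes(ROWS)       # "notes" is mostly missing
    print(len(listwise_delete(trimmed)))         # records complete in age and income
    print(len(pairwise_values(trimmed, "age")))  # records usable for an age-only statistic
    print(impute_mean(trimmed, "age"))
    print(interpolate([1.0, None, None, 4.0, None]))
```

In this sample, listwise deletion keeps fewer records than a pairwise calculation on age alone, which is the trade-off described above: pairwise deletion uses more of the data, but each calculation may rest on a different subset of records.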
Source: Wikipedia: Missing Data