Factors influencing the integrity of data

Currency

Currency refers to how current the data is. Is the data up to date? Outdated data can negatively impact analysis and decision-making.

Some examples of currency:

Weather data – Historical averages of temperature and rainfall are less useful if they don’t include recent years, which may differ due to climate change.

Population statistics – Using census data from 5 years ago to analyse current populations will give inaccurate results, because populations change over time. Current yearly population estimates should be used instead.
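
A currency check can be sketched in a few lines of Python. This is a minimal illustration, not a fixed rule: the function name is_current, the example dates and the one-year age limit are all assumptions chosen for the example.

```python
from datetime import date

def is_current(collected_on: date, today: date, max_age_days: int) -> bool:
    """Return True if the data is no older than max_age_days."""
    return (today - collected_on).days <= max_age_days

# Hypothetical dates: a census collected five years before the analysis date.
census_date = date(2019, 6, 30)
analysis_date = date(2024, 6, 30)

# Assumed limit of 365 days; a suitable limit depends on how fast the data changes.
census_is_current = is_current(census_date, analysis_date, max_age_days=365)
print(census_is_current)
```

How old is too old depends on the data: population estimates may tolerate a year, while weather readings may be stale within hours.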

Authenticity

Authenticity refers to whether the data comes from a reliable source – can it be trusted? This means ensuring that the source of the data is trusted and that the data itself is genuine. Data that comes from unreliable sources or that has been tampered with lacks integrity.

Some examples of authenticity:

Survey responses – If some responses are fake or given by bots, they skew the results. Screening for authentic human respondents improves integrity.

Web traffic – Invalid clicks or bot traffic should be filtered out before being analysed to ensure authentic human activity is measured.
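
Filtering bot traffic before analysis might look like the sketch below. The visit records, the user_agent field and the list of bot markers are hypothetical, and real bot detection usually needs more than a keyword match.

```python
# Assumed markers that often appear in bot user agents; not an exhaustive list.
BOT_MARKERS = ("bot", "crawler", "spider")

def looks_human(user_agent: str) -> bool:
    """Return True if the user agent contains none of the assumed bot markers."""
    ua = user_agent.lower()
    return not any(marker in ua for marker in BOT_MARKERS)

# Hypothetical visit records for illustration only.
visits = [
    {"page": "/home", "user_agent": "Mozilla/5.0 (Windows NT 10.0)"},
    {"page": "/home", "user_agent": "ExampleBot/2.1"},
    {"page": "/sofas", "user_agent": "Mozilla/5.0 (Macintosh)"},
]

# Keep only visits that look like human activity before analysing them.
human_visits = [v for v in visits if looks_human(v["user_agent"])]
print(len(human_visits))
```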

Relevance

Relevance refers to how relevant the data is to its purpose: does it reflect what is meant to be recorded?

For example, if we hold data about customers in order to sell them furniture but also record their eye colour, the eye colour has no relevance.
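
One way to keep only relevant data is to store just the fields the purpose needs. The sketch below assumes a hypothetical customer record and an assumed set of relevant field names for a furniture seller.

```python
# Assumed set of fields a furniture seller needs; chosen for illustration.
RELEVANT_FIELDS = {"name", "email", "delivery_address"}

def keep_relevant(record: dict) -> dict:
    """Return a copy of the record containing only the relevant fields."""
    return {key: value for key, value in record.items() if key in RELEVANT_FIELDS}

# Hypothetical customer record that includes an irrelevant field.
customer = {
    "name": "A. Buyer",
    "email": "buyer@example.com",
    "delivery_address": "1 Example Street",
    "eye_colour": "green",
}

trimmed = keep_relevant(customer)
print(sorted(trimmed))
```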

Accuracy

Accuracy refers to how correct or free from error the data is. For example, if a database records how fast runners finish a race but does not record the milliseconds, the data might not be accurate enough to separate runners who finish very close together.

Data should accurately reflect the real-world state it represents. Inaccurate data will lead to incorrect conclusions. Accuracy can be improved through data validation and error checking.
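
Validation and error checking can be as simple as rejecting values that cannot be right. The sketch below checks a race finish time entered as text; the function name validate_finish_time and the 10-hour upper limit are assumptions made for the example.

```python
MAX_SECONDS = 36000.0  # assumed upper limit of 10 hours for any finish time

def validate_finish_time(raw: str) -> float:
    """Parse a finish time in seconds, raising ValueError if it is invalid."""
    value = float(raw)  # raises ValueError for non-numeric input
    # Range check: also rejects NaN and infinity, which fail the comparison.
    if not (0 < value <= MAX_SECONDS):
        raise ValueError(f"finish time out of range: {raw!r}")
    # Keep millisecond precision so close finishes stay distinguishable.
    return round(value, 3)

for entry in ["612.345", "-5", "abc"]:
    try:
        print(entry, "->", validate_finish_time(entry))
    except ValueError as error:
        print(entry, "-> rejected:", error)
```

Validation of this kind catches impossible values, but it cannot confirm that a plausible value is the true one.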

Outliers (cleaning)

Outliers are data points that fall outside the expected range, or at the extremes, and that are potentially erroneous or don’t fit the overall pattern in the data. Identifying and handling outliers, such as through cleaning or smoothing, helps improve integrity. Their removal should be carefully considered so as not to discard valid but unusual data.

Some examples of outliers:

In weather data, a temperature reading of 0 degrees on a day when temperatures were in the 30s is an outlier that is likely an error, possibly caused by a faulty sensor.

In exam scores with a typical range of 50-100, a score of 200 would be an outlier, possibly due to a data entry error.
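
Outliers like these can be flagged automatically for review. The sketch below uses the interquartile range (IQR) rule, which is one common technique rather than the only option; the exam scores are made up, and the 1.5 multiplier is the conventional default, not a requirement.

```python
import statistics

def flag_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values lying more than k * IQR outside the middle 50% of the data."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical exam scores with one suspicious entry.
scores = [55, 62, 70, 71, 78, 84, 90, 95, 200]

flagged = flag_outliers(scores)
print(flagged)
```

The function only flags values and leaves the data unchanged, so a person can decide whether each flagged value is an error or a valid but unusual result.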