Dark data

IONOS editorial team, 2021-08-31

In the information age, organizations constantly collect massive amounts of data. In most cases, however, the collected data is stored without ever being analyzed. Data that exists but is not used is referred to as dark data.

What is dark data?

Dark data is data that is not used by an organization to gain insight; in other words, it is hidden data. This may include data that is incomplete, has not been evaluated, exists in secret, or has not (yet) been recorded at all. Essential to our understanding of the term is that it is relative. Whether data is “dark” or not depends on the relationship of the data to a particular organization.

Dark data becomes particularly apparent in connection with big data. Often, so much data is generated that it cannot be processed and analyzed in a timely manner. In the words of British statistician David Hand:

“In the era of big data, it is easy to imagine that we have all the information we need to make good decisions. But in fact the data we have are never complete, and may be only the tip of the iceberg.” - David Hand

The vast majority of data in an organization lies hidden beneath the surface as dark data.

To exemplify what dark data encompasses, let’s look at four scenarios:

Data of unknown existence
Data that is subject to uncertainties
Data that is stored unused
Data not yet recorded at all

In all four scenarios, we further differentiate two distinct cases:

The organization is aware that data is missing, incomplete, or subject to uncertainty.

This case is less problematic. If the organization is aware that the available data may represent only the tip of an iceberg, it can take countermeasures. It can try to obtain more complete data or analyze the available data with its uncertainties in mind.

The organization is unaware that data is missing, or assumes that the available data is complete.

This case is more dangerous. If an organization assumes that the available data gives it a complete picture of the situation, it operates contrary to reality. Conclusions drawn from incomplete data lead to suboptimal decisions.

In times of big data and data mining, organizations strive to get everything they can out of data.

What is data?

Since the explosive spread of information technology, the term data has been widely used. Frequently mentioned by politicians, business representatives and scientists alike, the term remains nebulous. This is because data is non-physical in nature – it is an abstract concept.

Data is not synonymous with information

First, let us note that data is a manifestation of information. In fact, data points are the smallest units of which information is composed, in much the same way that atoms are the basic building blocks of matter and photons are the basic units (quanta) of electromagnetic energy.

Note

We use the term “information” as an abstract concept, like matter and energy.

A single data point is often meaningless on its own. Only when several data points are interpreted together does usable information result. Think of data as individual letters. A single letter, for example the letter “P”, has no meaning. Only when several letters are combined does a word result, e.g. “apple”. Here, moreover, the sequence is decisive.

Information is data that is summarized in structures and delimited from one another. The process of interpretation depends on the context. This means that a set of data can be interpreted differently, possibly resulting in several different meanings. Think again of the word “apple”. Instead of combining the individual letters into one word, we could count the letters. This would result in different information based on the same data.
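The “apple” example can be sketched in a few lines of Python. This is only an illustration, and the variable names are made up for it. The same list of letters is interpreted twice: once as an ordered sequence that forms a word, and once as a tally of how often each letter occurs.

```python
from collections import Counter

# The raw data: individual letters, meaningless on their own.
letters = ["a", "p", "p", "l", "e"]

# Interpretation 1: read the letters in sequence to form a word.
word = "".join(letters)

# Interpretation 2: ignore the order and count how often each letter occurs.
letter_counts = Counter(letters)

print(word)                # apple
print(letter_counts["p"])  # 2
```

Both results are derived from identical data. Only the interpretation, and with it the resulting information, differs.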

Picture the totality of an organization's data as a mountain. The challenge is to extract useful information from this mountain of data. In contrast to a physical mountain, where valuable materials are gone once they have been extracted, useful information can, in principle, be extracted from a mountain of data several times. It all depends on the context and the perspective.

The hierarchy of information

If information is composed of data, like matter is composed of atoms, it is fair to assume that higher structures exist. In fact, there is a hierarchy of information: at the bottom is data, followed by information, and then knowledge.

Knowledge is linked information. The individual pieces of information are weighted: some are primary, others secondary. Crucial for knowledge is the concept of reference, known today as the (hyper)link: information that points to another unit of knowledge. Examples of knowledge are Wikipedia entries, recipes, and documented processes.

Building on knowledge, intelligence follows. From learned knowledge and accumulated experience, conclusions may be drawn and patterns can be recognized. New knowledge is synthesized by creating and testing hypotheses. Crucial for intelligence is executable information, or in other words: code, which can take on the form of algorithms or heuristics. Whereas data, information, and knowledge are inert, intelligence requires an environment in which it is executed. Cells, organisms, computers, and networks are all systems that exhibit intelligence.

The highest level in the information hierarchy is wisdom. Wisdom is the totality of knowledge and intelligence. It makes it possible to weigh different perspectives and find a balanced solution. The interesting questions are not so much “what” (data, information) or “how” (knowledge, intelligence), but “why” and “what for”. A good example of wisdom is a library. It contains not only knowledge in the form of books and other media, but also intelligence in the form of staff and index systems.

How is dark data created?

Organizational processes supported by modern information processing continuously produce data. Some proportion of this data will be dark data: either the knowledge that the data exists is lost or was missing from the outset, or the knowledge of how to analyze the data is not available.

Dark data comes in many forms. In the words of marketing expert Sky Cassidy:

“So as for dark data, it’s all the information companies collect in their regular business processes, don’t use, have no plans to use, but will never throw out. It’s web logs, visitor tracking data, surveillance footage, email correspondences from past employees, and so much more.” - Sky Cassidy

Dark data arises from forgotten or no longer accessible data

A large share of dark data consists of data that has been forgotten or can no longer be accessed.

Employees continuously store data on their private and company devices. Such data is quickly forgotten and becomes dark data. This includes data on USB sticks and portable hard drives, on the internal storage of decommissioned desktop and mobile devices, in email attachments, and in unused databases.

Near-endless scalability is one of the advantages of the cloud, but it is also a curse: cloud storage makes it possible to keep accumulating data without hitting a fixed limit. This tempts employees to collect data without restraint. If the collection frenzy takes place outside of strictly regulated processes, the result is usually dark data.

Data security and data protection must be ensured when storing data digitally. Encryption protects stored data against unauthorized access. But what happens when the login password is forgotten or the key can no longer be found? In both cases, the data can no longer be accessed and the information may be lost forever.

But there is another way to lose access to existing data: when it is no longer stored in a readable form. For example, data in a proprietary file format may require a special program to read it. If that software can no longer be run or is no longer available in the required version, the data remains trapped in vendor lock-in.

Dark data arises due to incomplete or outdated data

Dark data is not just data that is no longer accessible. It also includes incomplete or outdated data. To quote statistician David Hand again:

“Dark data are data you don't have. This might be because you want today's data, but all you have is yesterday's. It might be because your sample is distorted, perhaps certain types of cases are missing. It might be because the recorded values are inaccurate – after all, no measurement instrument is perfect.” - David Hand

Remember that data is the lowest level of the information hierarchy. Data inaccuracies and deviations manifest themselves in the higher information levels. This usually results in cascading effects: small deviations lead to large changes. Thus, incomplete data can have serious effects.
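The effect of a distorted sample can be illustrated with a small sketch. The satisfaction scores below are invented for this example, as is the assumption that unhappy customers never respond to the survey. The organization only sees the scores that were actually submitted and therefore arrives at an average that is too optimistic.

```python
from statistics import mean

# Hypothetical satisfaction scores (1 = very unhappy, 5 = very happy)
# for all customers, including those who never answered the survey.
all_scores = [1, 2, 2, 3, 4, 5, 5, 5]

# Assumption for this sketch: customers with a score below 3 do not respond,
# so their data never reaches the organization.
observed_scores = [score for score in all_scores if score >= 3]

true_mean = mean(all_scores)
observed_mean = mean(observed_scores)

print(true_mean)      # 3.375
print(observed_mean)  # 4.4
```

The missing cases are dark data: they are invisible in the observed sample, yet they shift the result by a full point on a five-point scale.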

The situation is similar with obsolete data. Consider, for example, the geolocation of a user, which is stored as part of a data set. Since the geolocation changes as the user moves, the information it contains may only be useful if the data is analyzed in real time. For example, if you want to make a user a location-based offer, this must be done while the user is still on-site.
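Whether a stored position is still usable could be checked roughly as follows. This is a minimal sketch: the function name `is_fresh` and the ten-minute threshold are assumptions made for illustration, and a sensible threshold depends entirely on the use case.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold for this sketch: a position older than 10 minutes
# is treated as outdated for location-based offers.
MAX_AGE = timedelta(minutes=10)


def is_fresh(recorded_at: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Return True if the record is recent enough to act on."""
    return now - recorded_at <= max_age


now = datetime(2021, 8, 31, 12, 0, tzinfo=timezone.utc)
recent = datetime(2021, 8, 31, 11, 55, tzinfo=timezone.utc)
stale = datetime(2021, 8, 30, 12, 0, tzinfo=timezone.utc)

print(is_fresh(recent, now))  # True
print(is_fresh(stale, now))   # False
```

A record that fails such a check still exists and still takes up storage, but it no longer carries the information it was collected for.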

Dark data arises from unanalyzed data

A large class of dark data consists of data that has been collected and stored, but not analyzed. A particularly high volume of dark data comes from sources that generate data automatically. These include sensors, log files, and website visit statistics. The data generated is often stored for long periods of time without being analyzed.
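A first step toward finding such data can be to list files that nobody has touched for a long time. The sketch below is one possible approach, not an established method: the function name `find_untouched_files` is hypothetical, and it uses the last modification time as a rough proxy. A file that has not changed for a year has not necessarily gone unanalyzed, so the result is only a list of candidates.

```python
import time
from pathlib import Path
from typing import List, Optional


def find_untouched_files(root: Path, max_age_days: float, now: Optional[float] = None) -> List[Path]:
    """List files under root whose last modification is older than max_age_days."""
    if now is None:
        now = time.time()
    # Convert the age limit from days to seconds and compute the cutoff timestamp.
    cutoff = now - max_age_days * 24 * 60 * 60
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

Called as `find_untouched_files(Path("logs"), max_age_days=365)`, where `logs` stands for any directory of your choice, the function returns all files below that directory that were last modified more than a year ago.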

Some data is available in formats that require complex procedures for analysis. This includes text contained in image files and spoken text in audio files. In general, digital images contain a wealth of information that can only be extracted automatically using modern artificial intelligence methods. Pattern recognition and classification are used to identify and assign objects depicted in image data. Since these are still relatively new approaches, the majority of image materials stored worldwide likely contain dark data.

Dark data also arises when data is stored and kept only to satisfy audit requirements, without any need to evaluate it. Statistician David Hand sums up the problem:

“It might even be that the data are available, but unexamined, gently decaying in a giant data warehouse, unlooked at because they were collected purely for compliance reasons.” - David Hand

Dark data arises from data not yet recorded

There is one more scenario from which dark data arises. This is of a more theoretical nature, because it involves data that has not yet been collected. Of course, this data (which does not yet exist) is outside the view of the organization. Therefore, it also counts as dark data.

Statistician David Hand draws an analogy to “dark matter”:

“Just as much of the universe is composed of dark matter, invisible to us but nonetheless present, the universe of information is full of dark data that we overlook at our peril.” - David Hand

Why dark data is a problem

There are various reasons why dark data is a problem for businesses and other organizations. Below we discuss cases where data actually exists. We exclude cases where data does not yet exist.

Storing dark data is inefficient

Storing data requires resources, in particular storage space and energy on the part of the storage operator. This creates costs for the organization that owns the data. In other words, effort is expended to store the data.

Efficiency is defined as the ratio of benefit to effort. If a high benefit is achieved with little effort, efficiency is high. Conversely, a low benefit achieved with a high effort means that efficiency is low.

Efficiency = benefit / effort

Data is supposed to be useful. With dark data, utility is limited. Nevertheless, a continuous effort must be expended to store the data. Consequently, the storage of dark data is inefficient.
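The formula can be made concrete with a short sketch. The figures below are purely hypothetical and only serve to show the effect: both data sets cost the same to store, but only one of them yields any benefit.

```python
def efficiency(benefit: float, effort: float) -> float:
    """Efficiency as the quotient of benefit and effort."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return benefit / effort


# Hypothetical figures, both in the same currency unit per year.
analyzed_data = efficiency(benefit=5000.0, effort=1000.0)
dark_data = efficiency(benefit=0.0, effort=1000.0)

print(analyzed_data)  # 5.0
print(dark_data)      # 0.0
```

As long as the storage effort stays constant and the benefit stays at or near zero, the efficiency of keeping dark data remains at or near zero as well.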

Finding the information needle in the dark data haystack

Let's imagine the entirety of an organization's data as an iceberg. The majority of the data is dark data. Unfortunately, the useful data does not sit neatly on the surface. Rather, it is mixed in with dark data and cannot be easily separated. To find useful data, you have to search the entire iceberg.

Because of the sheer mass of dark data, information that is useful remains hidden. Often, it is unclear whether data is of any value at all. Missing or incorrect data leads to incorrect information. Thus, dark data influences what conclusions are drawn from the information at hand. This limits how intelligently the organization can behave.

No one knows what dark data contains

Dark data is opaque by definition. You can never be sure whether it contains anything useful. It also cannot be ruled out that the data contains sensitive information that must not fall into the wrong hands.

Data is usually stored for long periods of time. Because dark data offers little benefit to the organization, there is often little motivation to secure it. Unused data is easily forgotten, which makes it more likely to be inadequately protected.

In principle, data can always include information that is subject to special protection. Individual data points are usually harmless, but sensitive information can be extracted from large volumes of data. For example, movement profiles can be created from location data collected over long periods of time. The loss or leak of dark data therefore poses a high risk.

Another risk associated with dark data arises during disaster recovery, because data may not be recoverable after a failure. Imagine a system that ran cleanly, in which all components were seemingly known and cloud backups were made. But what if one of the components depended on dark data that was never backed up? When the system is restored, a critical part is missing. In the worst case, important systems fail as a result.

Dark data is hard to get rid of

A mountain of data is hard to keep track of. Dark data could contain useful or sensitive information. In some cases, retention periods are prescribed for the data. This means that the data cannot simply be disposed of.

This situation can be compared to hazardous waste, which is hard or impossible to separate. If a ton of waste contains one gram of highly toxic material, the entire ton is treated as hazardous waste. So data continues to be stored, and the mountain of data continues to grow. This also increases the costs incurred to store it.
