What is metadata?
The term metadata has been on everyone’s lips for a few years now. Today, billions of people around the world use digital media. Large amounts of metadata are constantly being generated in the process. The term “transparent citizen” is sometimes used to describe the resulting data protection risk.
The evaluation of metadata by artificial intelligence makes it possible to predict people’s behavior. In the long run, this poses a serious threat to the privacy of citizens and to democracy in practice. Yet metadata is not a bad thing in itself. In this article, we explain what metadata actually is.
What’s the difference between metadata and data?
The term metadata refers to information that supplements the actual data. Often, metadata provides more details about the context of the content or gives instructions on how to handle the data. In this way, metadata plays a major role in both computing and traditional data processing (including things like library catalogs or the postal system).
To become more familiar with the term metadata, imagine a simple example: You send a letter through the mail. The document inside the envelope is the actual, primary data. This data is private and protected by law against access by third parties – the secrecy of correspondence applies.
The envelope contains the metadata of the letter. This is additional data that accompanies the primary data:
- Address and sender
- Stamp and post mark
- Where required, additional identifiers like bar codes
As you can see, this is the data that makes sending the letter possible in the first place. The metadata of the letter is visible to anyone. This means that it is not specially protected by the secrecy of correspondence, although postal secrecy does apply.
So, what is the danger posed by metadata? It’s not a problem if individual pieces of metadata can be read. If, for example, a third party learns that an individual envelope exists, that is usually no cause for concern. However, this changes when more data is at stake, as is the case when data is stored and evaluated on a massive scale. On a larger scale, patterns emerge that reveal a lot about a person’s behavior: Who communicated with whom and when? Networks and chains of communication can be identified.
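To make this concrete, here is a minimal sketch of how a pattern can emerge from envelope metadata alone. The records, names, and the helper `contact_counts` are hypothetical and invented for illustration; no letter contents are involved.

```python
from collections import Counter

# Hypothetical envelope metadata: (sender, recipient, date).
# The contents of the letters are never looked at.
envelopes = [
    ("alice", "bob", "2021-02-01"),
    ("alice", "bob", "2021-02-03"),
    ("alice", "clinic", "2021-02-03"),
    ("bob", "alice", "2021-02-04"),
    ("alice", "clinic", "2021-02-10"),
]


def contact_counts(records):
    """Count how often each sender wrote to each recipient."""
    return Counter((sender, recipient) for sender, recipient, _date in records)


counts = contact_counts(envelopes)
for (sender, recipient), number in counts.most_common():
    print(f"{sender} -> {recipient}: {number}")
```

A single record says little. The aggregated counts, however, already show who is in regular contact with whom.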
The distinction between data and metadata is fluid. The classification depends on the context and on perspective. Here’s another example. A book contains primary data, such as the title of the book and its content. Furthermore, a set of metadata is available for the publication of a book:
- Author
- Publisher
- Time and place the book was published
- Edition
- ISBN
Let’s imagine that the metadata of many publications is collected in a database. From the perspective of this database, the publication information is primary data. In addition, there is a new set of metadata for each entry. For example, the database could store when each entry was added and by which user.
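The shift in perspective can be sketched with two small record types. This is only an illustration: the class names `Publication` and `CatalogEntry`, the field names, and the sample values (including the placeholder ISBN) are assumptions, not part of any real catalog system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Publication:
    """Metadata of a book, but primary data from the database's point of view."""
    title: str
    author: str
    publisher: str
    year: int
    isbn: str


@dataclass
class CatalogEntry:
    """A publication plus the metadata the database keeps about the entry itself."""
    publication: Publication
    added_by: str
    # Time at which the entry was created, stored in UTC.
    added_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


entry = CatalogEntry(
    publication=Publication(
        title="An Example Book",
        author="A. Author",
        publisher="Example Press",
        year=2021,
        isbn="000-0-00-000000-0",  # placeholder, not a real ISBN
    ),
    added_by="librarian01",  # hypothetical user name
)
print(entry.added_by, entry.added_at.isoformat())
```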
What types of metadata exist and how are these used?
Metadata is found in all areas of data storage and processing, so its uses cannot be listed exhaustively. Here are three major areas of use:
1. To provide context for information.
Metadata often describes the process that led to the creation of information. Think, for example, of the geographic coordinates with which digital photos are tagged. Once lost, this context cannot be reconstructed, which is why it is stored.
2. To provide information that would otherwise be difficult to find.
Here, consider the length of a video. This length is stored as a value in the video file. If the duration were not saved, it would have to be calculated. One possible approach would be to count the number of frames and divide it by the frame rate, which takes considerable effort.
3. To link information and make it easily retrievable and searchable.
The main goal here is to supplement human-readable information with machine-readable data. The aim is to use automated processes to establish relationships between pieces of information. This applies in particular to structured data, which, when connected, creates a so-called “semantic web”.
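The calculation mentioned for the length of a video can be sketched as follows. The function name `duration_seconds` and the sample numbers are illustrative assumptions. In practice, the expensive part is counting the frames, which requires reading through the whole file.

```python
def duration_seconds(frame_count: int, frames_per_second: float) -> float:
    """Derive the running time of a video from its frame count and frame rate."""
    if frames_per_second <= 0:
        raise ValueError("frame rate must be positive")
    return frame_count / frames_per_second


# Hypothetical clip: 4500 frames recorded at 25 frames per second.
length = duration_seconds(4500, 25.0)
print(length)  # 180.0 seconds, i.e. three minutes
```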
Metadata that describes images
Images taken with digital cameras and smartphones contain a large amount of metadata. Part of this is technical data, such as image dimensions, the camera used, focal length, etc. These fields are defined in the EXIF standard and are written automatically by the camera. In addition, the IPTC standard defines metadata that describes the content of the photo and is entered by the user.
Standard | Image metadata | Creation |
---|---|---|
EXIF | Image information like dimensions, color space, color channels, etc.; photographic information, such as exposure time, aperture, ISO, etc. | Automatic when recording |
IPTC | Keywords, copyrights, location and time information, content descriptions, etc. | Manually done by user |
When sharing digital images, you should be careful: the image metadata can contain private information about the author. Many apps and social networks automatically strip the metadata from images when they are uploaded. But it’s best not to rely on this. In certain instances, it’s better to use a special tool to delete the image metadata beforehand.
Metadata that is embedded in digital videos
A video file typically consists of a container that holds various data. The primary data of a video includes the encoded video and audio content. Additional metadata that is embedded includes:
- Length of the video
- Data rate and image dimensions
- Details of the audio and video codec used
- Subtitles, where applicable in several languages
Metadata that is assigned to files
A file in a digital system includes two primary pieces of data: the contents of the file and its name. In addition, each file has a set of metadata associated with it. This file metadata is managed by the operating system and is also known as “file attributes”. Here is an overview of common file metadata:
File metadata | Description |
---|---|
Timestamps | Time of creation, last modification, and last access |
Saved location | File path in the file system |
Ownership | Owner and group |
File permissions | Read, write, execute: for user, group, and others |
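Most of these attributes can be read with Python's standard library. The following is a minimal sketch: the helper name `file_metadata` and the dictionary keys are our own choices. The creation time is left out because `os.stat` does not report it in the same way on every platform.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def file_metadata(path):
    """Collect common file attributes as reported by the operating system."""
    info = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
        "owner_id": info.st_uid,   # numeric IDs; always 0 on Windows
        "group_id": info.st_gid,
        "permissions": stat.filemode(info.st_mode),  # e.g. "-rw-r--r--"
    }


# Create a throwaway file so the example does not depend on existing paths.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello")
    sample_path = handle.name

meta = file_metadata(sample_path)
os.remove(sample_path)

for key, value in meta.items():
    print(f"{key}: {value}")
```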
In addition to file attributes, some file types include specific metadata. These are managed by the respective application. Even with this metadata, there is a risk of disclosing confidential information when sharing it.
Metadata that is created when an email is sent
An email includes – analogous to the classic postal letter – two key parts:
- Email body
- Email header
The body contains the actual message, which corresponds to the document in the envelope. Like the envelope, the header contains the addresses of the sender and recipient. As with the envelope, some information in the header can be easily forged. For the recipient, it then appears as if an email came from a different sender. This is a trick that is often used in spoofing attacks.
The email header usually contains a lot of other metadata, such as:
- Various timestamps
- Information on the formatting and coding of the message
- Stages the email has passed through during transmission
- Evaluation of the email by spam filters
- Note on whether the email was checked by a virus scanner
The metadata of the email header is written and read by server software and application programs. The information generated in the process reveals a lot about an email and the path it has taken through the Internet. Among other things, statements can be made about the authenticity and confidentiality of an email. Furthermore, the header can contain the host name of the user’s own device and reveal the location from which an email was sent.
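As an illustration, the header fields of a message can be read with Python's standard `email` package. The message below is invented for this sketch, and all addresses and host names are placeholders. Real headers are usually much longer.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical raw message: header lines, one blank line, then the body.
raw_message = "\n".join([
    "Received: from mail.example.org by mx.example.com; Mon, 01 Feb 2021 12:13:40 +0000",
    "Received: from laptop.example.org by mail.example.org; Mon, 01 Feb 2021 12:13:35 +0000",
    "From: Alice <alice@example.org>",
    "To: Bob <bob@example.com>",
    "Subject: Hello",
    "Date: Mon, 01 Feb 2021 12:13:34 +0000",
    'Content-Type: text/plain; charset="utf-8"',
    "X-Spam-Status: No",
    "",
    "Hi Bob, this is the body.",
])

message = message_from_string(raw_message)

sender = message["From"]
sent_at = parsedate_to_datetime(message["Date"])
# Each server that handles the email adds its own "Received" line.
hops = message.get_all("Received", [])

print("From:", sender)
print("Sent:", sent_at.isoformat())
print("Stations passed:", len(hops))
print("Spam filter verdict:", message["X-Spam-Status"])
```

Note that the parser simply reports what the header says. As described above, fields such as the sender address can be forged, so reading a header is not the same as verifying it.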
Metadata that is generated when you visit a website
From a technical point of view, visiting a website is retrieving an HTML document. The user’s browser retrieves the document from a server at the specified address. The HTTP or HTTPS protocol is used for this.
In addition to the actual HTML document that is displayed in the browser, metadata known as HTTP headers is transmitted. The HTTP headers are comparable to the fields of the email header. They contain information about the encoding, transmission, encryption, and compression of the HTTP connection.
Furthermore, metadata is generated during the transfer and accumulates on the server. This includes log files in which accesses to the server are recorded and which are needed for log file analyses. For each access, another line is written to the log file. In addition, the browser usually sends queries to a DNS server. This also generates metadata, which may be stored and evaluated by the operator of that server.
Confusingly, in addition to the HTTP header already mentioned, there is also the HTML head. While the former refers to the connection, the latter contains metadata describing the contents of the document. Below is an overview of an HTTP server response. The introductory lines are the HTTP header. This is followed by the HTML source code with the HTML head and HTML body elements:
HTTP/1.1 200 OK
Date: Mon, 01 Feb 2021 12:13:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 148
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
Accept-Ranges: bytes
Connection: close
<html>
<head>
<title>An Example Page</title>
</head>
<body>
<p> The human readable text is in the body of the document</p>
</body>
</html>
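The separation between the two kinds of metadata can be sketched in a few lines. In a real HTTP response, an empty line separates the HTTP header from the document. The helper `split_response` is a hypothetical name, and the sketch deliberately ignores details such as repeated header fields.

```python
# Shortened version of the response above, with the blank line
# that separates the HTTP header from the HTML document.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)\r\n"
    "Connection: close\r\n"
    "\r\n"
    "<html><head><title>An Example Page</title></head>"
    "<body><p>The human readable text is in the body of the document</p></body></html>"
)


def split_response(raw):
    """Split a raw response into status line, header fields, and document."""
    head, _, document = raw.partition("\r\n\r\n")
    status, *header_lines = head.split("\r\n")
    fields = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so they are stored in lower case.
        fields[name.strip().lower()] = value.strip()
    return status, fields, document


status_line, headers, body = split_response(raw_response)
print(status_line)
print(headers["content-type"])
print(body[:6])
```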
What metadata means for online marketing and search engine optimization
In this section, we focus on metadata that is embedded in an HTML document. We’ll leave out the HTTP metadata already mentioned, as well as server-side metadata such as log files. Usually, HTML metadata is embedded in the head of the HTML document.
Many of the elements used in the HTML head are directly relevant for search engine optimization. Search engine bots crawl the content of an HTML document. The human-readable part in the HTML body is extracted and indexed. In addition, there is special metadata that is intended exclusively for bots. Here, we distinguish between “classic” and “modern” variants.
Website metadata illustrated with classic HTML head elements
The classic HTML head elements include the title and a handful of critical meta tags. The title is also visible to the user in various forms. For example, it is displayed in bookmarks or in the browser tab header. The other classic “<meta>” tags are used exclusively for search engine optimization. Here is an overview of the most important classic HTML head elements:
Tag | Description | Importance |
---|---|---|
<title> | Title of the document, displayed in the search results | Critical |
<meta name="description"> | Description of the document, displayed in the search results | Critical |
<meta name="keywords"> | Keywords of the document, not displayed in search results | Minimal |
<meta name="robots"> | Directions for search engine bots for processing the document | Critical |
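To show how a program might read these elements, here is a minimal sketch based on Python's built-in `html.parser` module. The class name `HeadMetadataParser` and the sample page are assumptions for illustration. Real search engine bots are far more elaborate.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the title and the <meta name="..."> tags of a document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attributes:
            self.meta[attributes["name"].lower()] = attributes.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only collected while the parser is inside <title>.
        if self._in_title:
            self.title += data


# Hypothetical page used only for this example.
page = """<html><head>
<title>An Example Page</title>
<meta name="description" content="A short summary shown in search results.">
<meta name="robots" content="index, follow">
</head><body><p>Visible text</p></body></html>"""

parser = HeadMetadataParser()
parser.feed(page)
print(parser.title)
print(parser.meta)
```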
Website metadata displayed with modern HTML head elements
In addition to the classic HTML head elements, a variety of other elements are used today to include metadata on a website. Search engine operators and large technology groups are constantly defining new metadata. The elements “<meta>” and “<link>” are ideal for this, as they can be expanded. Here is an overview of frequently used modern website metadata:
Tag | Description | Importance |
---|---|---|
<link rel="canonical"> | Canonical tag to avoid duplicate content | Critical, if duplicate content exists |
<link rel="alternate" hreflang="en"> | Points to alternative language versions of the same document via hreflang | Optional |
<meta property="og:…"> | Open Graph for publication on social media | Optional |
For the “<meta>” element, the “name” attribute specifies the type of metadata. For the “<link>” element, the “rel” attribute is used in a similar way. Depending on the metadata standard used, two alternative attributes can take the place of “name” on the “<meta>” element. We summarize all three notations here:
How it’s written | Metadata standard |
---|---|
<meta name=""> | HTML5 |
<meta property=""> | RDFa |
<meta itemprop=""> | HTML Microdata |
Website metadata defined with the Open Graph
Open Graph is a protocol developed by Facebook to enrich a web document with metadata. The Open Graph data provides the information that is displayed as a preview when the document is shared on social networks. In this way, optimized images, titles, and descriptive texts can be specified. This makes sense, since depending on the platform, specific restrictions apply in terms of length of texts, dimensions of images, and the like. The protocol is used extensively by Facebook and Twitter. Here is an overview of the essential Open Graph metadata:
Open Graph metadata | Explanation |
---|---|
<meta property="og:title"> | Title of the object |
<meta property="og:type"> | The type of the object, e.g., image, web document, video, etc. |
<meta property="og:image"> | An image that represents an object |
<meta property="og:url"> | The canonical URL of the object |
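The four properties from the table can be generated with a small helper. This is a sketch under stated assumptions: the function name `open_graph_tags`, the URLs, and the values are placeholders, and the tags carry their values in the “content” attribute.

```python
from html import escape


def open_graph_tags(properties):
    """Render a dict of Open Graph properties as <meta> tags for the HTML head."""
    lines = []
    for name, value in properties.items():
        # Escaping keeps quotes or angle brackets in a value from breaking the tag.
        lines.append(
            f'<meta property="og:{escape(name, quote=True)}" '
            f'content="{escape(value, quote=True)}">'
        )
    return "\n".join(lines)


# Placeholder values for a hypothetical page.
tags = open_graph_tags({
    "title": "An Example Page",
    "type": "website",
    "image": "https://example.com/preview.png",
    "url": "https://example.com/page",
})
print(tags)
```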
If your content is displayed incorrectly when it is shared on Facebook, the problem is often caused by faulty Open Graph entries. In this case, a simple trick can fix the error: log in to your Facebook account and try the Sharing Debugger. This tells Facebook to read the Open Graph information again.
Website metadata defined with Rich Cards
Besides Open Graph, Rich Cards are another way to add metadata to a website. They were developed by Google. Rich Cards enrich a web document with structured metadata. For example, the website of a restaurant can be supplemented with information on geographical location, prices, opening hours, etc. The Rich Card information can be placed in the HTML head or in the HTML body.
Technically, Rich Cards are based on the Schema.org metadata standard. Various formats are used to mark up the metadata. Besides the older standards RDFa and microdata, JSON-LD is also available today. Google officially recommends the use of JSON-LD.
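For the restaurant example, a JSON-LD block could be produced roughly as follows. The restaurant, its address, and its coordinates are invented, and the helper `json_ld_script` is a hypothetical name. The property names follow the Schema.org vocabulary, but which properties a search engine actually evaluates is not covered here.

```python
import json

# Invented restaurant data using Schema.org types and properties.
restaurant = {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "Example Bistro",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "1 Example Street",
        "addressLocality": "Exampletown",
    },
    "geo": {"@type": "GeoCoordinates", "latitude": 52.0, "longitude": 13.0},
    "openingHours": "Mo-Fr 11:00-22:00",
    "priceRange": "$$",
}


def json_ld_script(data):
    """Wrap structured data in a script element that can be placed in the HTML."""
    payload = json.dumps(data, indent=2)
    # "</" is escaped so that a value cannot end the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = json_ld_script(restaurant)
print(snippet)
```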