Fun with Container Formats – Part 1

Written by: Armin Trattnig
June 18th, 2019

Welcome to the first post in our three-part series, Fun with Container Formats. Over the course of the next three weeks, we will publish one blog post per week breaking down the four most common container formats and why they matter to you. This week, we’ll kick off with terminology and the handling of containers in players.

Before we get started, some notes about terminology …

A codec is used to store a media signal in binary form. Most codecs compress the raw media signal in a lossy way. The most common media signals are video, audio and subtitles. A movie consists of different media signals: most movies have audio and subtitles in addition to moving pictures. Examples of video codecs are H.264, HEVC, VP9 and AV1. For audio there are AAC, MP3 and Opus. There are many different codecs for each type of media signal.

A single media signal is also often called an Elementary Stream, or just a Stream. People usually mean the same thing whether they call a video stream the Codec Stream, the Media Stream or the H.264 Stream.

What is a media container?

Container Format := meta file format specification describing how different multimedia data elements (streams) and metadata coexist in files.

A container format provides the following:

  • Stream Encapsulation
    One or more media streams can exist in one single file.
  • Timing/Synchronization
    The container adds data on how the different streams in the file can be used together, e.g. the correct timestamps for synchronizing lip movement in a video stream with speech in the audio stream.
  • Seeking
    The container provides information about which points in time of a movie can be jumped to, e.g. when the viewer wants to watch only a part of the whole movie.
  • Metadata
    There are many flavours of metadata, and a container format makes it possible to attach them to a movie, e.g. the language of an audio stream.
    Sometimes subtitles are also considered metadata.

Common container formats are MP4, MPEG2-TS and Matroska; each can carry many different video and audio codecs. Every container format has its strengths and weaknesses, for example with regard to compatibility, streaming capabilities and size overhead.
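
To make these concepts more concrete, here is a minimal Python sketch of the logical model behind a container. All class and field names are hypothetical and only for illustration; real container formats store this information in format-specific binary structures.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """One elementary stream, e.g. an H.264 video or AAC audio stream."""
    codec: str      # e.g. "h264", "aac"
    kind: str       # "video", "audio" or "subtitle"
    language: str   # metadata, e.g. "en"


@dataclass
class Sample:
    """One coded frame plus the timing the container attaches to it."""
    stream_index: int   # which stream this sample belongs to
    timestamp_ms: int   # presentation time, used for synchronization
    data: bytes         # the coded payload produced by the codec


@dataclass
class Container:
    """Streams, timed samples and file-level metadata in one file."""
    streams: list = field(default_factory=list)
    samples: list = field(default_factory=list)   # interleaved, ordered by timestamp
    metadata: dict = field(default_factory=dict)

    def seek(self, time_ms: int) -> int:
        """Seeking: index of the first sample at or after the target time."""
        for i, sample in enumerate(self.samples):
            if sample.timestamp_ms >= time_ms:
                return i
        return len(self.samples)
```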

More Terminology…


Encoding: Converting a raw media signal into the binary form of a codec, for example encoding a series of raw images with the video codec H.264.
Decoding: The reverse operation. If someone wants to look at the encoded images, they need to decode the H.264 codec stream to get actual viewable images back.
Transcoding: Converting from one codec to another (or the same) codec. Transcoding involves both operations: decoding the source codec stream and then encoding it again into the target codec stream.
Muxing: Putting one or more codec streams into a container format.
Demuxing: Extracting a codec stream from a container format.
Transmuxing: Extracting streams from one container format and putting them into a different (or the same) container format.
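
All of these operations can be tried out with a tool like ffmpeg. The following sketch drives ffmpeg from Python; the input and output file names are hypothetical, and ffmpeg (here with a libx265 build for the transcoding step) is assumed to be installed and on the PATH.

```python
import subprocess

# Transcoding: decode the source codec stream, then re-encode it to the
# target codec (here H.264 -> HEVC; the container stays MP4).
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-c:v", "libx265", "-c:a", "copy", "hevc.mp4"],
    check=True,
)

# Demuxing: extract the raw H.264 elementary stream from the container
# ("-an" drops the audio, "-c:v copy" skips decoding/encoding).
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-c:v", "copy", "-an", "video.h264"],
    check=True,
)

# Transmuxing: move the streams into a different container (MP4 -> MPEG-TS)
# without touching the codec data ("-c copy" means no decoding/encoding).
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-c", "copy", "output.ts"],
    check=True,
)
```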


Containers in OTT Media Delivery

Containers are present pretty much anywhere there is digital media. For example, if you record a video using your smartphone, the captured audio and video are both stored in one container file, e.g. an MP4 file. Another example of containers in the wild is media streaming over the internet. From end to end, containers are the main unit of media data being handled: at the end of content generation, the packager multiplexes the encoded media data into containers, which are then transported over the network as requested by the client device on the other end. There the containers are demuxed, the content is decoded and finally presented to the end user.
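
As an illustration of the packager’s muxing step, here is a minimal sketch that combines two encoded elementary streams into one MP4 container, again using ffmpeg (assumed to be installed; the file names are hypothetical):

```python
import subprocess

# Muxing at the packager: combine an encoded video elementary stream and an
# encoded audio elementary stream into a single MP4 container.
# "-c copy" leaves the coded data untouched; only the container is written.
subprocess.run(
    ["ffmpeg",
     "-i", "video.h264",   # encoded video elementary stream (hypothetical file)
     "-i", "audio.aac",    # encoded audio elementary stream (hypothetical file)
     "-c", "copy",
     "movie.mp4"],
    check=True,
)
```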

Handling Containers in the Player

Metadata Extraction

On the client side, the player needs to extract some basic information about the media from the container, for example the segment’s playback time, duration and codecs.
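
To give a rough idea of what such extraction looks like for MP4, here is a minimal sketch that reads the overall movie duration from the moov/mvhd box of an ISO BMFF file. It ignores 64-bit box sizes and other edge cases; a real player uses a complete parser.

```python
import struct


def mp4_duration_seconds(path: str) -> float:
    """Read the overall movie duration from an MP4's moov/mvhd box."""
    with open(path, "rb") as f:
        data = f.read()

    def find_box(buf: bytes, box_type: bytes):
        # Boxes are [4-byte size][4-byte type][payload]; size includes the header.
        pos = 0
        while pos + 8 <= len(buf):
            size, btype = struct.unpack(">I4s", buf[pos:pos + 8])
            if btype == box_type:
                return buf[pos + 8:pos + size]
            if size < 8:   # 64-bit sizes (size==1) are not handled in this sketch
                return None
            pos += size
        return None

    moov = find_box(data, b"moov")
    if moov is None:
        raise ValueError("no moov box found")
    mvhd = find_box(moov, b"mvhd")   # mvhd is a direct child of moov
    if mvhd is None:
        raise ValueError("no mvhd box found")

    version = mvhd[0]                # first byte of the FullBox header
    if version == 1:                 # version 1: 64-bit creation/modification times
        timescale, duration = struct.unpack(">IQ", mvhd[20:32])
    else:                            # version 0: 32-bit fields
        timescale, duration = struct.unpack(">II", mvhd[12:20])
    return duration / timescale
```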

Additionally, there is often metadata present in the container that most browsers would not extract or handle out of the box. This requires the player implementation to have the desired handling in place. Some examples are CEA-608/708 captions, inband events (emsg boxes of fMP4), etc., where the player has to parse the relevant data from the media container format, keep track of a timeline and further process the data at the correct time (like displaying the right captions at the right time).
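
For instance, a player that wants to handle inband events has to locate and decode emsg boxes itself. The following is a minimal sketch of parsing a version-0 emsg payload as defined for DASH (ISO/IEC 23009-1); it assumes the 8-byte box header has already been stripped and omits the error handling a real player would need.

```python
import struct


def parse_emsg_v0(payload: bytes) -> dict:
    """Parse a version-0 'emsg' payload (starting at the FullBox version byte)."""
    version = payload[0]
    if version != 0:
        raise ValueError("this sketch only handles emsg version 0")
    pos = 4  # skip version (1 byte) + flags (3 bytes)

    def read_cstring(buf: bytes, start: int):
        end = buf.index(b"\x00", start)          # strings are null-terminated UTF-8
        return buf[start:end].decode("utf-8"), end + 1

    scheme_id_uri, pos = read_cstring(payload, pos)
    value, pos = read_cstring(payload, pos)
    timescale, delta, duration, event_id = struct.unpack(">IIII", payload[pos:pos + 16])
    return {
        "scheme_id_uri": scheme_id_uri,          # identifies the event scheme
        "value": value,
        # The player maps these onto its media timeline to fire the event on time:
        "presentation_time_s": delta / timescale,
        "duration_s": duration / timescale,
        "id": event_id,
        "message_data": payload[pos + 16:],      # opaque event payload
    }
```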

Client-side Transmuxing

Browsers often lack support for certain container formats. One prime example where this becomes a problem is Chrome, Firefox, Edge and IE not (properly) supporting the MPEG-TS container format. The MPEG (Moving Picture Experts Group, a working group formed by ISO and IEC) Transport Stream format was specifically designed for Digital Video Broadcasting (DVB) applications. You can find more details on this format in one of the next parts of this blog series. With MPEG-TS still being a commonly used format, the only solution is to convert the media from MPEG-TS to a container format that these browsers do support (i.e. fMP4). This conversion step can be done at the client right before forwarding the content to the browser’s media stack for demuxing and decoding. It basically consists of demultiplexing the MPEG-TS and then re-multiplexing the elementary streams into fMP4. This process is often referred to as transmuxing.
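
In a web player this conversion happens in JavaScript before the segments are appended to the browser’s Media Source Extensions buffers. The equivalent operation is easy to demonstrate offline with ffmpeg; a minimal sketch, assuming ffmpeg is installed and using a hypothetical segment name:

```python
import subprocess

# Transmuxing an MPEG-TS segment to fragmented MP4 (fMP4):
# demux the TS, then re-mux the elementary streams without re-encoding.
subprocess.run(
    ["ffmpeg",
     "-i", "segment.ts",                         # MPEG-TS input segment
     "-c", "copy",                               # no decoding/encoding involved
     "-movflags", "frag_keyframe+empty_moov",    # write fragmented MP4 output
     "segment.mp4"],
    check=True,
)
```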

Read Part 2 where we dive into MP4 and CMAF!