Welcome to the first post of our three-part series, Fun with Container Formats. Over the series we will cover the four most common container formats and why they matter to you. This post kicks off with some basic terminology and a look at how players handle containers.
Before we get started, some notes about terminology …
A codec is used to store a media signal in binary form. Most codecs compress the raw media signal in a lossy way. The most common types of media signal are video, audio and subtitles. A movie consists of several media signals: most movies have audio and subtitles in addition to the moving pictures. Examples of video codecs are H.264, HEVC, VP9 and AV1; for audio there are AAC, MP3 and Opus. There are many different codecs for each type of media signal.
A single media signal is also often called an Elementary Stream, or just Stream. The terms Video Stream, Codec Stream, Media Stream and H.264 Stream are usually used interchangeably to refer to the same thing.
What is a media container?
Container Format = meta file format specification describing how different multimedia data elements (streams) and metadata coexist in files.
A container format provides the following:
- Stream Encapsulation: One or more media streams can exist in a single file.
- Synchronization: The container adds data on how the different streams in the file are meant to be used together, e.g. the timestamps needed to keep lip movement in the video stream in sync with the speech in the audio stream.
- Seeking: The container provides information about the points in time a movie can jump to, e.g. when the viewer wants to watch only a part of the whole movie.
- Metadata: There are many flavours of metadata, and a container format makes it possible to attach them to a movie, e.g. the language of an audio stream. Sometimes subtitles are also considered metadata.
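To make these features concrete, here is a toy model of a container in Python. This is a sketch, not a real file format: the `Stream` and `Container` classes, the `keyframe_index` field and the `seek_target` helper are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Stream:
    codec: str          # e.g. "h264", "aac"
    kind: str           # "video", "audio" or "subtitle"
    # (timestamp, payload) pairs; timestamps let the player sync streams
    samples: list = field(default_factory=list)

@dataclass
class Container:
    streams: list
    metadata: dict = field(default_factory=dict)        # e.g. audio language
    keyframe_index: list = field(default_factory=list)  # seekable points (seconds)

    def seek_target(self, t):
        """Return the last seekable point at or before time t."""
        candidates = [k for k in self.keyframe_index if k <= t]
        return max(candidates) if candidates else 0

movie = Container(
    streams=[Stream("h264", "video"), Stream("aac", "audio")],
    metadata={"audio_language": "en"},
    keyframe_index=[0, 2, 4, 6, 8],
)
print(movie.seek_target(5.3))  # → 4
```

A real container does the same jobs with a binary layout instead of objects: it interleaves samples from all streams, stamps them with timestamps, and keeps an index of positions where playback can resume.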
Common container formats are MP4, MPEG2-TS and Matroska; each can carry many different video and audio codecs. Every container format has its strengths and weaknesses, for example with regard to compatibility, streaming support and size overhead.
- Encoding: The process of converting a raw media signal into the binary form of a codec, for example encoding a series of raw images into the video codec H.264. Encoding can also refer to the process of converting a very high quality raw video file into a mezzanine format for simpler sharing and transmission. Ex: taking uncompressed RGB 16-bit frames with a size of 12.4MB each, totalling 17.9GB for 60 seconds of video at 24 frames/sec, and compressing them into 8-bit frames with a size of 3.11MB each, which for the same 60 seconds at 24fps is about 4.5GB in total. That effectively shrinks the video file by roughly 13GB!
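These figures can be sanity-checked with a few lines of arithmetic. The 1920×1080 resolution and the 4:2:0 chroma subsampling of the 8-bit frames are assumptions here — they are the values that reproduce the per-frame sizes above.

```python
# Back-of-the-envelope check of the encoding example.
# Assumption: 1920x1080 frames; 4:2:0 subsampling for the 8-bit variant.
WIDTH, HEIGHT, FPS, SECONDS = 1920, 1080, 24, 60

raw_frame = WIDTH * HEIGHT * 3 * 2          # RGB, 16 bits (2 bytes) per channel
raw_total = raw_frame * FPS * SECONDS

mezz_frame = int(WIDTH * HEIGHT * 1.5)      # 8-bit 4:2:0 = 1.5 bytes per pixel
mezz_total = mezz_frame * FPS * SECONDS

print(f"raw frame:  {raw_frame / 1e6:.2f} MB")   # 12.44 MB
print(f"raw total:  {raw_total / 1e9:.1f} GB")   # 17.9 GB
print(f"mezz frame: {mezz_frame / 1e6:.2f} MB")  # 3.11 MB
print(f"mezz total: {mezz_total / 1e9:.2f} GB")  # 4.48 GB
```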
- Decoding: The opposite of encoding; decoding is the process of converting a binary codec stream back into a raw media signal. Ex: turning an H.264 stream back into viewable images.
- Transcoding: The process of converting one codec to another (or the same) codec. Both decoding and encoding are necessary steps to achieve a transcode; it is best described as decoding the source codec stream and then encoding it again into a new target codec stream. Although encoding is typically lossy, additional techniques like frame interpolation and upscaling can be applied along the way to mitigate the visible quality loss of the conversion.
- Muxing: The process of adding one or more codec streams into a container format.
- Demuxing: Extracting a codec stream from a container format.
- Transmuxing: Extracting streams from one container format and putting them in a different (or the same) container format.
- Multiplexing: The process of interweaving audio and video into one data stream. Ex: the elementary streams (audio & video) from the encoder are turned into Packetized Elementary Streams (PES) and then combined into a Transport Stream (TS).
- Demultiplexing: The reverse operation of multiplexing, i.e. extracting an elementary stream from a media container. E.g.: extracting the MP3 audio data from an MP4 music video.
- In-Band Events: This refers to metadata events that are associated with a specific timestamp. This usually means that these events are synchronized with video and audio streams. E.g.: These events can be used to trigger dynamic content replacement (ad-insertion) or the presentation of supplemental content.
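The muxing-related terms above can be illustrated with a small Python sketch. The `mux`, `demux` and `transmux` functions and the dict-based "container" are purely illustrative stand-ins, not a real container implementation; the point is that transmuxing only repackages streams and never touches the codec payloads.

```python
# Toy illustration of muxing terminology. A "container" here is just a dict
# keyed by stream name -- a stand-in for a real container format.

def mux(container_format, streams):
    """Muxing: wrap one or more codec streams in a container."""
    return {"format": container_format, "streams": dict(streams)}

def demux(container, stream_name):
    """Demuxing: extract a single codec stream from a container."""
    return container["streams"][stream_name]

def transmux(container, new_format):
    """Transmuxing: demux all streams, then remux into a different container.
    The codec payloads are untouched -- no decoding or re-encoding happens."""
    return mux(new_format, container["streams"])

ts = mux("mpeg-ts", {"video": "h264-bitstream", "audio": "aac-bitstream"})
fmp4 = transmux(ts, "fmp4")
print(fmp4["format"])        # fmp4
print(demux(fmp4, "audio"))  # aac-bitstream
```

Note that the audio payload comes out of the fMP4 container byte-identical to what went into the MPEG-TS one, which is exactly why transmuxing is so much cheaper than transcoding.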
Container Formats in OTT Media Delivery
Containers are present pretty much anywhere there is digital media. For example, if you record a video with your smartphone, the captured audio and video are both stored in a single container file, e.g. an MP4 file. Another example of containers in the wild is media streaming over the internet. From end to end, the main unit of media data being handled is the container: at the content-generation end, the packager multiplexes the encoded media data into containers, which are then transported over the network as requested by the client device on the other end. There the containers are demuxed, the content is decoded and finally presented to the end user.
Handling Container Formats in the Player
At the client-side the player needs to extract some basic info about the media from the container, for example, the segment’s playback time, duration and codecs.
Additionally, there is often metadata present in the container that most browsers will not extract or handle out of the box, so the player implementation has to provide the desired handling itself. Examples are CEA-608/708 captions and in-band events (emsg boxes in fMP4), where the player has to parse the relevant data from the media container, keep track of a timeline and process the data at the correct time (like displaying the right captions at the right moment).
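As an example of such container-level parsing, here is a minimal sketch of walking the top-level boxes of an ISO BMFF (fMP4) buffer. The box header layout (a 4-byte big-endian size that includes the header, followed by a 4-byte type code) follows the MP4 box structure; the hand-built buffer and the `ad-cue01` payload are made up, and a real emsg box carries additional fields (version, flags, scheme id, timing) that this sketch ignores, as does the 64-bit-size case.

```python
import struct

def parse_boxes(data: bytes):
    """List (type, payload) pairs for the top-level boxes in an ISO BMFF buffer.
    Each box starts with a 4-byte big-endian size (including the 8-byte header)
    followed by a 4-byte ASCII type code."""
    boxes, offset = [], 0
    while offset + 8 <= len(data):
        size, = struct.unpack_from(">I", data, offset)
        box_type = data[offset + 4:offset + 8].decode("ascii")
        boxes.append((box_type, data[offset + 8:offset + size]))
        offset += size
    return boxes

# A hand-built buffer: a dummy 'styp' box plus an 'emsg'-style box whose
# payload stands in for an ad-insertion cue.
buf = (struct.pack(">I", 12) + b"styp" + b"msdh" +
       struct.pack(">I", 16) + b"emsg" + b"ad-cue01")
for box_type, payload in parse_boxes(buf):
    print(box_type, payload)
```

A player would scan segments like this for emsg boxes, queue the events on its internal timeline, and fire them when playback reaches the event's timestamp.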
Browsers often lack support for certain container formats. One prime example where this becomes a problem is Chrome, Firefox, Edge and IE not (properly) supporting the MPEG-TS container format. The MPEG (Moving Picture Experts Group, a working group formed by ISO and IEC) Transport Stream format was designed specifically for Digital Video Broadcasting (DVB) applications; you can find more details on this format in a later part of this blog series. With MPEG-TS still being a commonly used format, the only solution is to convert the media from MPEG-TS to a container format that these browsers do support (i.e. fMP4). This conversion step can be done at the client right before forwarding the content to the browser's media stack for demuxing and decoding, and essentially consists of demultiplexing the MPEG-TS and then re-multiplexing the elementary streams into fMP4. This process is often referred to as transmuxing.
Read Part 2 where we dive into MP4 and CMAF!
Want to skip ahead? Part 3 covers MPEG-Transport Streams & Matroska (WebM)