This article describes the changes to fMP4 outputs starting from version 2.153.0 of the Bitmovin Encoder. Starting with this version, fMP4 outputs with codecs H.264, AAC, HE-AAC and HE-AACv2 will use an overhauled implementation that aims to improve stability and correctness of ISO-BMFF files. AV1 has already been using the overhauled fMP4 implementation.
In this article, we're going to explicitly show the differences in MP4 outputs, by comparing excerpts from mp4dump outputs of fMP4 encodings up to version 2.152.0 versus encodings starting from 2.153.0 for the same configuration.
Finally, we'll list the devices/platforms used for testing playback.
Changes to ISO-BMFF boxes
Up to version 2.152.0, the ftyp box for an H.264 initialization segment looks like this:
    [ftyp] size=8+20
      major_brand = mp42
      minor_version = 1
      compatible_brand = isom
      compatible_brand = mp42
      compatible_brand = mp41
With version 2.153.0, the ftyp box of an H.264 initialization segment looks like this:
    [ftyp] size=8+32
      major_brand = mp41
      minor_version = 0
      compatible_brand = iso8
      compatible_brand = isom
      compatible_brand = mp41
      compatible_brand = dash
      compatible_brand = avc1
      compatible_brand = cmfc
Audio initialization segments should look the same, except for the avc1 compatible brand, which is not present.
This should not have any practical side effect for demuxers, so we don't expect and have not found any issues with the change.
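To check the brands of your own initialization segments without mp4dump, a small parser is enough. The following is a minimal sketch, assuming the ftyp box is available as a complete byte string; `parse_ftyp` and `build_ftyp` are illustrative helper names, not part of any Bitmovin tooling.

```python
import struct

def parse_ftyp(box: bytes) -> dict:
    """Parse a complete ftyp box (8-byte header followed by the payload)."""
    size, box_type = struct.unpack(">I4s", box[:8])
    if box_type != b"ftyp":
        raise ValueError("not an ftyp box")
    major_brand, minor_version = struct.unpack(">4sI", box[8:16])
    # The remainder of the payload is a list of 4-byte compatible brands.
    brands = [box[i:i + 4].decode("ascii") for i in range(16, size, 4)]
    return {
        "payload_size": size - 8,
        "major_brand": major_brand.decode("ascii"),
        "minor_version": minor_version,
        "compatible_brands": brands,
    }

def build_ftyp(major: str, minor: int, brands: list[str]) -> bytes:
    """Build an ftyp box, used here only to create sample input."""
    payload = major.encode("ascii") + struct.pack(">I", minor)
    payload += b"".join(brand.encode("ascii") for brand in brands)
    return struct.pack(">I4s", 8 + len(payload), b"ftyp") + payload

# Sample input mirroring the 2.153.0 video dump above.
new_video = parse_ftyp(
    build_ftyp("mp41", 0, ["iso8", "isom", "mp41", "dash", "avc1", "cmfc"])
)
print(new_video)
```

The payload size reported for this sample is 32 bytes, which matches the `size=8+32` shown by mp4dump: 4 bytes for the major brand, 4 for the minor version, and 4 for each of the six compatible brands.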
Timescale in mvhd and mdhd boxes
Up to version 2.152.0, the timescale in the Movie Header (mvhd) box had a value of 1000, and the track timescale in the Media Header (mdhd) box depended on:
- frame rate for video
- sampling rate for audio.
From version 2.153.0, the mvhd timescale is the same as the track (mdhd) timescale.
The video timescale is the video frame rate rounded to the nearest integer and multiplied by 1000. Some examples:
- 24 FPS: Timescale 24000
- 23.976 FPS: Timescale 24000
The audio timescale equals the sampling rate. This applies to all audio codecs, including HE-AAC and HE-AACv2 codecs, which up to version 2.152.0 had a timescale of half the sampling rate.
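The two rules above can be written down in a few lines. This is a sketch of the rule as described in this article, not the encoder's actual code; the function names are made up for illustration, and expressing 23.976 FPS as the exact fraction 24000/1001 is an assumption.

```python
from fractions import Fraction

def video_timescale(frame_rate) -> int:
    """Frame rate rounded to the nearest integer, multiplied by 1000."""
    return round(frame_rate) * 1000

def audio_timescale(sampling_rate_hz: int) -> int:
    """From 2.153.0 the audio timescale equals the sampling rate."""
    return sampling_rate_hz

for fps in (24, 23.976):
    print(fps, video_timescale(fps))

# Assumption: 23.976 FPS is really 24000/1001 frames per second.
ntsc_film = Fraction(24000, 1001)
ticks_per_frame = Fraction(video_timescale(ntsc_film)) / ntsc_film
print(ticks_per_frame)
```

With a timescale of 24000, one frame at 24000/1001 FPS lasts exactly 1001 ticks, so fractional frame rates keep an integer sample duration.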
max_bitrate and avg_bitrate in DecoderConfig for audio
We noticed that our audio initialization segments, up to version 2.152.0, always reported 96 kbps as max_bitrate and avg_bitrate under the DecoderConfig box for audio, regardless of the configured bitrate. This was a problem only in muxing. The audio was encoded at the correct bitrate.
    [esds] size=12+27
      [ESDescriptor] size=2+25
        es_id = 0
        stream_priority = 0
        [DecoderConfig] size=2+17
          stream_type = 5
          object_type = 64
          up_stream = 0
          buffer_size = 6144
          max_bitrate = 96000
          avg_bitrate = 96000
          DecoderSpecificInfo = 12 10
        [Descriptor:06] size=2+1
Starting from version 2.153.0, this issue is fixed and the values are correctly signaled depending on the specified bitrate from the codec configuration.
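If you want to verify the signaled bitrates yourself, the fixed part of the DecoderConfig descriptor is easy to decode. The sketch below assumes the 13-byte layout from ISO/IEC 14496-1 (object type, stream type and flags, 24-bit buffer size, then two 32-bit bitrates) and that you have already located the descriptor payload; the helper names and the 128000 bps target are hypothetical.

```python
import struct

def parse_decoder_config(payload: bytes) -> dict:
    """Decode the fixed 13 bytes at the start of a DecoderConfig payload."""
    object_type, packed = struct.unpack(">BB", payload[:2])
    max_bitrate, avg_bitrate = struct.unpack(">II", payload[5:13])
    return {
        "object_type": object_type,
        "stream_type": packed >> 2,       # upper 6 bits
        "up_stream": (packed >> 1) & 1,   # next bit; the last bit is reserved
        "buffer_size": int.from_bytes(payload[2:5], "big"),
        "max_bitrate": max_bitrate,
        "avg_bitrate": avg_bitrate,
    }

def build_decoder_config(object_type: int, stream_type: int, buffer_size: int,
                         max_bitrate: int, avg_bitrate: int) -> bytes:
    """Build sample input with upStream = 0 and the reserved bit set to 1."""
    packed = (stream_type << 2) | 0x01
    return (struct.pack(">BB", object_type, packed)
            + buffer_size.to_bytes(3, "big")
            + struct.pack(">II", max_bitrate, avg_bitrate))

def bitrate_matches(config: dict, configured_bps: int) -> bool:
    """True when both signaled bitrates equal the configured bitrate."""
    return config["max_bitrate"] == config["avg_bitrate"] == configured_bps

# Values taken from the pre-2.153.0 dump above.
old_config = parse_decoder_config(build_decoder_config(64, 5, 6144, 96000, 96000))
# Hypothetical encoding configured for 128 kbps: the old muxer's 96000 mismatches.
print(old_config, bitrate_matches(old_config, 128000))
```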
Sample flags in trun entries
In video segments up to version 2.152.0, the sample flags were always optimized using default sample flags from the tfhd box and first sample flags from the trun box, as shown below:
    [traf] size=8+832
      [tfhd] size=12+8, flags=20020
        track ID = 1
        default sample flags = 0x1010000
      [tfdt] size=12+8, version=1
        base media decode time = 0
      [trun] size=12+780, flags=a05
        sample count = 96
        data offset = 872
        first sample flags = 0x2000000
        entries:
          ( 0) sample_size = 429, sample_composition_time_offset = 2002
          ( 1) sample_size = 72, sample_composition_time_offset = 5005
          ( 2) sample_size = 70, sample_composition_time_offset = 2002
While this optimization is good for reducing muxing overhead, our muxer was optimizing the flags even in situations where it shouldn't, e.g. when a segment contains key frames other than the first frame. This resulted in warnings on the Media inspector tab in Chrome:
    ISO-BMFF container metadata for video frame indicates that the frame is not a keyframe, but the video frame contents indicate the opposite.
Starting from version 2.153.0, sample flags in the trun boxes in a given segment are only optimized if it is possible. Otherwise, the flags are written per sample, like below:
    [traf] size=8+1212
      [tfhd] size=12+12, flags=2000a
        track ID = 1
        sample description index = 1
        default sample duration = 1001
      [tfdt] size=12+4
        base media decode time = 0
      [trun] size=12+1160, flags=e01
        sample count = 96
        data offset = 1252
        entries:
          ( 0) sample_size = 473, sample_flags = 0, sample_composition_time_offset = 2002
          ( 1) sample_size = 78, sample_flags = 0x1000, sample_composition_time_offset = 5005
          ( 2) sample_size = 76, sample_flags = 0, sample_composition_time_offset = 2002
          ( 3) sample_size = 75, sample_flags = 0x1000, sample_composition_time_offset = 0
While this may increase the muxing overhead by 4 extra bytes per sample when the optimization cannot be applied, it provides the most correct outputs, which may also fix playback in older players. Our muxer will still optimize the flags when it is possible.
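The overhead can be worked out from the trun flags in the two dumps. The sketch below uses the flag bits defined in ISO/IEC 14496-12; the `can_share_flags` criterion (all samples after the first carry the same flags) is our assumption about when the compact form is safe, not a description of the muxer's actual check.

```python
# Track run (trun) flag bits from ISO/IEC 14496-12.
DATA_OFFSET_PRESENT = 0x000001
FIRST_SAMPLE_FLAGS_PRESENT = 0x000004
SAMPLE_DURATION_PRESENT = 0x000100
SAMPLE_SIZE_PRESENT = 0x000200
SAMPLE_FLAGS_PRESENT = 0x000400
SAMPLE_CTO_PRESENT = 0x000800

PER_SAMPLE_BITS = (SAMPLE_DURATION_PRESENT, SAMPLE_SIZE_PRESENT,
                   SAMPLE_FLAGS_PRESENT, SAMPLE_CTO_PRESENT)

def trun_payload_size(tr_flags: int, sample_count: int) -> int:
    """Payload size in bytes, i.e. the N in mp4dump's 'size=12+N'."""
    size = 4  # sample_count field
    for bit in (DATA_OFFSET_PRESENT, FIRST_SAMPLE_FLAGS_PRESENT):
        if tr_flags & bit:
            size += 4
    fields_per_sample = sum(1 for bit in PER_SAMPLE_BITS if tr_flags & bit)
    return size + 4 * fields_per_sample * sample_count

def can_share_flags(sample_flags: list[int]) -> bool:
    """Assumed criterion: every sample after the first has identical flags."""
    return len(set(sample_flags[1:])) <= 1

compact = trun_payload_size(0xA05, 96)      # first_sample_flags form
per_sample = trun_payload_size(0xE01, 96)   # per-sample flags form
print(compact, per_sample, per_sample - compact)
```

For the 96-sample segments shown above, this gives 780 and 1160 bytes, matching the dumps. The trun grows by 380 bytes: 4 bytes for each of the 96 samples, minus the 4-byte first sample flags field that is no longer written.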
Furthermore, the values set in sample_flags in version 2.153.0 (0 and 0x10000) differ slightly from the ones set in 2.152.0 (0x2000000 and 0x1010000) because we no longer make use of the sample_depends_on bits (ISO/IEC 14496-12:2012, section 8.8.3.1).
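To see exactly which bits changed, the 32-bit sample flags value can be split into its fields. This is a minimal sketch following the bit layout in ISO/IEC 14496-12:2012; `decode_sample_flags` is an illustrative name.

```python
def decode_sample_flags(flags: int) -> dict:
    """Split a 32-bit sample flags value into its fields (top 4 bits reserved)."""
    return {
        "is_leading": (flags >> 26) & 0x3,
        "sample_depends_on": (flags >> 24) & 0x3,
        "sample_is_depended_on": (flags >> 22) & 0x3,
        "sample_has_redundancy": (flags >> 20) & 0x3,
        "sample_padding_value": (flags >> 17) & 0x7,
        "sample_is_non_sync_sample": bool((flags >> 16) & 0x1),
        "sample_degradation_priority": flags & 0xFFFF,
    }

# The four values mentioned in the paragraph above.
for value in (0x2000000, 0x1010000, 0x0, 0x10000):
    fields = decode_sample_flags(value)
    print(hex(value),
          "depends_on =", fields["sample_depends_on"],
          "non_sync =", fields["sample_is_non_sync_sample"])
```

In the old values, key frames carry sample_depends_on = 2 and other frames carry sample_depends_on = 1 together with the non-sync bit. In the new values only the non-sync bit distinguishes the two.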
Edit Lists with the edts box
We noticed that up to version 2.152.0, an edit list could be missing for H.264 streams that make use of B-frames. This edit list is required due to a delay between decoding and presenting frames (signaled in trun entries via sample_composition_time_offset). Without it, the first segment of some streams could have a non-zero reported start time, for example when checking it with ffprobe:
    Duration: 00:00:04.00, start: 0.083417, bitrate: 4396 kb/s
From version 2.153.0, the edit list is placed in the initialization segment when needed:
    [edts] size=8+28
      [elst] size=12+16
        entry_count = 1
        entry/segment duration = 0
        entry/media time = 2002
        entry/media rate = 1
With this, the first segment correctly starts at 0 now:
    Duration: 00:00:04.00, start: 0.000000, bitrate: 4398 kb/s
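The relationship between the composition time offset, the edit list and the reported start time can be reproduced with simple arithmetic. The sketch below assumes a track timescale of 24000 (the value this article gives for 23.976 FPS content) and that the first decoded sample is also the first one presented; the function name is illustrative.

```python
from fractions import Fraction

def first_presentation_time(first_cto: int, timescale: int,
                            edit_media_time: int = 0) -> Fraction:
    """Start time in seconds of the first sample (decode time 0).

    The edit list's media time is subtracted from the composition time,
    which is how a single-entry edit list shifts the presentation timeline.
    """
    return Fraction(first_cto - edit_media_time, timescale)

TIMESCALE = 24000  # assumption: 23.976 FPS video, see the timescale section

without_edit = first_presentation_time(2002, TIMESCALE)
with_edit = first_presentation_time(2002, TIMESCALE, edit_media_time=2002)
print(f"{float(without_edit):.6f}", f"{float(with_edit):.6f}")
```

Under these assumptions, an offset of 2002 ticks at timescale 24000 is about 0.083417 seconds, the start time ffprobe reported before the fix. With the edit list's media time of 2002 applied, the start time becomes 0.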
To avoid the use of edit lists, it is also possible to use ALIGN_ZERO_NEGATIVE_CTO in the "ptsAlignMode" configuration of fMP4 muxings, which makes use of trun v1 boxes that allow for negative composition time offsets.
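The reason version 1 trun boxes are needed can be illustrated by shifting the composition time offsets so that the earliest presented frame lands at time 0. This is only a sketch of the idea, not the encoder's algorithm; `align_zero` and `encode_cto_v1` are hypothetical helpers, and the sample values are the first four entries of the 2.153.0 dump above.

```python
import struct

def align_zero(durations: list[int], ctos: list[int]) -> list[int]:
    """Shift all composition offsets so the smallest presentation time is 0."""
    dts = 0
    presentation_times = []
    for duration, cto in zip(durations, ctos):
        presentation_times.append(dts + cto)
        dts += duration
    shift = min(presentation_times)
    return [cto - shift for cto in ctos]

def encode_cto_v1(cto: int) -> bytes:
    """trun version 1 stores the offset as a signed 32-bit integer."""
    return struct.pack(">i", cto)

shifted = align_zero([1001] * 4, [2002, 5005, 2002, 0])
print(shifted)
print([encode_cto_v1(cto).hex() for cto in shifted])
```

After the shift, the fourth sample has an offset of -2002. A version 0 trun stores offsets as unsigned 32-bit integers and cannot represent that value, which is why this mode relies on trun v1.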
Before releasing this change, Bitmovin conducted extensive device testing to make sure that the new outputs do not introduce playback regressions. The following devices/platforms were tested with non-DRM and DRM outputs, using the Bitmovin player:
- Chrome and Edge (stable, beta and dev) on macOS, Linux and Windows
- Firefox on macOS, Linux and Windows
- Safari on iPad Air 2 (iOS 13), iPad Mini 6 (iOS 15), iPhone 11 (iOS 14), iPhone 8+ (iOS 12)
- Samsung Tizen TVs, from 2016 to 2022 models
- LG webOS TVs, from 2016 to 2022 models
- Panasonic TV 2018
- Xbox One and Xbox Series S
- PlayStation 5
- Chromecast and Chromecast Ultra
- Android Pixel 2 with browsers Chrome, Firefox and Samsung Internet
- Fire TV Stick 4K and Fire TV Stick 4K Max
- Roku Streaming Stick, Roku Streaming Stick 4K