Build an ESDS Box for an MP4 that Firefox Can Play


I am generating MP4 files (with H.264 video and AAC audio) by transmuxing from MPEG-TS in JavaScript, to be played in the browser via blob URLs. Everything works fine in Chrome, and if I grab the blob URLs out of the developer console and download them, the generated files play fine in Windows Media Player as well. Firefox, however, claims that they are corrupted.

I've narrowed the issue down to a problem with the ESDS box in the audio metadata. If I repackage the source MPEG-TS files by some other means (like ffmpeg), and hand-edit my generated files in a hex editor to paste in the ESDS box from the equivalent file generated by other software, then Firefox is happy.

My code that builds the ESDS box. (And I'm tracking the issue)

I attempted to write it by a pretty straightforward transcribe-stuff-from-the-MPEG-specs process, but that is no guarantee that I did not screw it up. Since Chrome and Windows Media play my files just fine, I'm not sure if it's actually an error in my file that they are somehow capable of ignoring, or if it's a problem with Firefox. I suspect the former, but I'm just not sure.

Anyone got any insight, or perhaps a straightforward, easy-to-understand reference for how to build a proper ESDS box?

EDIT: Here are some different ESDS sections produced for the same input file (as hex bytes, copied out of my hex editor):

Mine:

00 00 00 27 65 73 64 73 00 00 00 00 03 22 00 00
02 04 14 40 15 00 00 00 00 00 3a f1 00 00 2d e6
05 02 12 10 06 01 02

mpegts:

00 00 00 33 65 73 64 73 00 00 00 00 03 80 80 80
22 00 02 00 04 80 80 80 14 40 15 00 00 00 00 00
00 00 00 00 00 00 05 80 80 80 02 12 10 06 80 80
80 01 02

ffmpeg:

00 00 00 2c 65 73 64 73 00 00 00 00 03 80 80 80
1b 00 02 00 04 80 80 80 0d 40 15 00 00 00 00 01
5f 42 00 00 00 00 06 80 80 80 01 02

Oddly, and I did not notice this before, Firefox will play the video with ffmpeg's output, but neither Firefox nor Windows Media will actually play the sound (Chrome does). Firefox and Windows Media are both happy to play the video with sound using the output from mpegts, though. With mine, Chrome and Windows Media will play the video with sound, but Firefox doesn't play it at all, and claims the video is corrupted.


There are 4 best solutions below

3
On

You have now found your solution by adding three bytes of 0x80 each after the ES Descriptor Tag number. Glad that worked out for all browsers.

Let me share one insight that may help you or future users of your code:

"..I can find no mention in the MPEG specs for MP4 files or ISOBMFF of why those bytes should be there, but adding them in makes it work.."

Looking at the linked mp4ESDSbox.java, we see that the ESDS atom is broken into five sections, and each section is padded by the bytes 80 80 80. These three bytes are described there as an "optional extended descriptor type tag string", with possible values of 80, 81, or FE.

You're on the right path, but you have only padded the first section.

MP4Muxer.js : (A) What you currently have...

00 00 00 27 65 73 64 73 00 00 00 00 03 80 80 80
22 00 00 02 04 14 40 15 00 00 00 00 00 3A F1 00
00 2D E6 05 02 12 10 06 01 02

MP4Muxer.js: (B) What it should be...

00 00 00 33 65 73 64 73 00 00 00 00 03 80 80 80
22 00 00 02 04 80 80 80 14 40 15 00 00 00 00 00
3A F1 00 00 2D E6 05 80 80 80 02 12 10 06 80 80
80 01 02

FFmpeg ESDS for a random AAC track: compare against the new (B) version

00 00 00 33 65 73 64 73 00 00 00 00 03 80 80 80
22 00 01 00 04 80 80 80 14 40 15 00 00 00 00 01
F4 74 00 01 F4 74 05 80 80 80 02 12 10 06 80 80
80 01 02

Comparing the byte structure of version (B) against the one made by FFmpeg, we now see perfect alignment. Some values are slightly different because they are not made from the same audio data.

Notice that we have changed the first four bytes (the size integer) from the original 0x27 (39 bytes in decimal) to 0x33 (51 bytes in decimal).
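
The padded layout can be sketched in JavaScript as follows. This is only an illustration, not the code from MP4Muxer.js: the function names (length4, descriptor, u32, buildEsds) and the parameter names are made up here, and the field values in the usage example are taken from the "mpegts" dump in the question. Each descriptor length is written as exactly four bytes, and the box size is derived from the final byte count rather than hard-coded.

```javascript
// Encode a descriptor length as exactly four bytes: three bytes with the high
// bit set, then a final byte. For small sizes this yields 80 80 80 xx.
function length4(size) {
  if (!Number.isInteger(size) || size < 0 || size > 0x0fffffff) {
    throw new RangeError("descriptor size out of range");
  }
  return [
    0x80 | ((size >>> 21) & 0x7f),
    0x80 | ((size >>> 14) & 0x7f),
    0x80 | ((size >>> 7) & 0x7f),
    size & 0x7f,
  ];
}

// One descriptor: tag byte, four-byte length, payload bytes.
function descriptor(tag, payload) {
  return [tag, ...length4(payload.length), ...payload];
}

// Big-endian 32-bit integer as four bytes.
function u32(n) {
  return [(n >>> 24) & 0xff, (n >>> 16) & 0xff, (n >>> 8) & 0xff, n & 0xff];
}

// Hypothetical builder for an AAC esds box; parameter names are assumptions.
function buildEsds({ esId, bufferSize, maxBitrate, avgBitrate, audioSpecificConfig }) {
  const decoderSpecificInfo = descriptor(0x05, audioSpecificConfig);
  const decoderConfig = descriptor(0x04, [
    0x40,                        // objectTypeIndication: MPEG-4 audio
    0x15,                        // streamType 5 (audio) << 2, upStream 0, reserved bit 1
    ...u32(bufferSize).slice(1), // 24-bit buffer size
    ...u32(maxBitrate),
    ...u32(avgBitrate),
    ...decoderSpecificInfo,
  ]);
  const slConfig = descriptor(0x06, [0x02]); // predefined value used in MP4 files
  const esDescriptor = descriptor(0x03, [
    (esId >>> 8) & 0xff, esId & 0xff,
    0x00,                        // flags: no dependency, URL, or OCR stream
    ...decoderConfig,
    ...slConfig,
  ]);
  const body = [0, 0, 0, 0, ...esDescriptor]; // version and flags, then descriptors
  const size = 8 + body.length;               // 4 size bytes + 4 type bytes + body
  return Uint8Array.from([...u32(size), 0x65, 0x73, 0x64, 0x73, ...body]);
}

// Usage with the values visible in the "mpegts" dump from the question.
const box = buildEsds({
  esId: 2,
  bufferSize: 0,
  maxBitrate: 0,
  avgBitrate: 0,
  audioSpecificConfig: [0x12, 0x10],
});
console.log(box.length); // 51, i.e. 0x33
```

Computing every length from the payload it wraps keeps the nested descriptor lengths and the box size consistent with each other when a field is added or removed.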

0
On

Well, I found an answer to my own question. Upon pondering the differences between my ESDS boxes and those produced by other software, it became apparent that the biggest difference was the presence of these 0x80 padding bytes: three of them after every ES Descriptor tag number. Add those in, and most everything else lines up and looks pretty much the same.

I can find no mention in the MPEG specs for MP4 files or ISOBMFF of why those bytes should be there, but adding them in makes it work: Firefox no longer thinks the files are corrupted.
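
One possible reading of the dumps in the question, offered as an observation about the bytes rather than a confirmed account of how Firefox parses the box: the length fields in the original box (0x22 and 0x14) are the same as in the mpegts output, which suggests they were computed as if every length took four bytes. With one-byte lengths, the outer descriptor then claims to run past the end of the box. A minimal sketch that checks this follows; the helper names parseHex, readLength and checkEsds are made up for illustration.

```javascript
// Parse a hex dump such as "00 00 00 27 65 ..." into an array of byte values.
function parseHex(dump) {
  return dump.trim().split(/\s+/).map((h) => parseInt(h, 16));
}

// Read a descriptor length: 7 value bits per byte, high bit set means
// another length byte follows (at most four bytes in total).
function readLength(bytes, offset) {
  let value = 0;
  let used = 0;
  let byte;
  do {
    byte = bytes[offset + used];
    used += 1;
    value = (value << 7) | (byte & 0x7f);
  } while ((byte & 0x80) && used < 4);
  return { value, used };
}

// Compare where the outer ES_Descriptor claims to end with where the box ends.
function checkEsds(bytes) {
  const boxSize =
    ((bytes[0] << 24) | (bytes[1] << 16) | (bytes[2] << 8) | bytes[3]) >>> 0;
  const tagOffset = 12; // 4 size bytes + 4 type bytes + 4 version/flags bytes
  if (bytes[tagOffset] !== 0x03) {
    throw new Error("expected ES_Descriptor tag 0x03");
  }
  const len = readLength(bytes, tagOffset + 1);
  const declaredEnd = tagOffset + 1 + len.used + len.value;
  return { boxSize, declaredEnd, fits: declaredEnd <= boxSize };
}

const mine = parseHex(
  "00 00 00 27 65 73 64 73 00 00 00 00 03 22 00 00 " +
  "02 04 14 40 15 00 00 00 00 00 3a f1 00 00 2d e6 " +
  "05 02 12 10 06 01 02"
);
const mpegts = parseHex(
  "00 00 00 33 65 73 64 73 00 00 00 00 03 80 80 80 " +
  "22 00 02 00 04 80 80 80 14 40 15 00 00 00 00 00 " +
  "00 00 00 00 00 00 05 80 80 80 02 12 10 06 80 80 " +
  "80 01 02"
);

console.log(checkEsds(mine));   // { boxSize: 39, declaredEnd: 48, fits: false }
console.log(checkEsds(mpegts)); // { boxSize: 51, declaredEnd: 51, fits: true }
```

If that reading is right, either fix would be self-consistent: keep one-byte lengths and shrink the length values, or keep the values and write four-byte lengths.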

1
On

The 0x80 bytes do not belong to the tag before them, but to the length value after them. Version 2 of the ISO spec changed the interpretation of the length value so it can wrap more than 255 bytes by making it a 'VarInt32' type. The high bit in each byte denotes that another length byte follows; the lower 7 bits encode the value.

You could use this to encode arbitrarily large values, but the ISO spec limits this to 4 bytes at most, or 0...2^(4*7)-1.

I.e.:

0x80,0x80,0x80,0x0E = 0x80,0x0E = 0x0E => 14
0x80,0x80,0x84,0x7f = 0x84,0x7f => (0x4 << 7) + 0x7f = 0x27f = 639

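A similar encoding is used by Google's Protocol Buffers, where it is named Base 128 Varint. The byte order differs, though: protobuf stores the least significant 7-bit group first, while MPEG-4 descriptors store the most significant group first.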
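
The decoding rule above can be sketched in JavaScript like this. The function name readDescriptorLength and its return shape are invented for the example; the 4-byte limit follows the limit mentioned above.

```javascript
// Decode an MPEG-4 descriptor length starting at `offset` in `bytes`.
// Returns the decoded value and the number of bytes the length occupied.
function readDescriptorLength(bytes, offset = 0) {
  let value = 0;
  let used = 0;
  let byte;
  do {
    if (used === 4) {
      throw new RangeError("descriptor length longer than 4 bytes");
    }
    if (offset + used >= bytes.length) {
      throw new RangeError("truncated descriptor length");
    }
    byte = bytes[offset + used];
    used += 1;
    // Shift in the low 7 bits; the most significant group comes first.
    value = (value << 7) | (byte & 0x7f);
  } while (byte & 0x80); // high bit set: another length byte follows
  return { value, used };
}

console.log(readDescriptorLength([0x80, 0x80, 0x80, 0x0e])); // { value: 14, used: 4 }
console.log(readDescriptorLength([0x0e]));                   // { value: 14, used: 1 }
console.log(readDescriptorLength([0x80, 0x80, 0x84, 0x7f])); // { value: 639, used: 4 }
```

Returning the byte count along with the value lets the caller skip to the payload regardless of whether the writer padded the length.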

0
On

For people looking for a reference to the relevant parts of the ISO 14496-1 specification:

  • Section 8.6.5 specifies the format of an ES_Descriptor class, which inherits from BaseDescriptor.
  • Section 8.2.2 defines BaseDescriptor to be an expandable class:
abstract aligned(8) expandable(2^28 - 1) class BaseDescriptor : bit(8) tag=0
{
    // empty. To be filled by classes extending this class.
}
  • Section 14.3.3 describes expandable classes to have the following length encoding:
int sizeOfInstance = 0;
bit(1) nextByte;
bit(7) sizeOfInstance;
while (nextByte)
{
    bit(1) nextByte;
    bit(7) sizeByte;
    sizeOfInstance = sizeOfInstance << 7 | sizeByte;
}

As can be seen from this definition, and the maximum expandable size of a BaseDescriptor, sizeOfInstance may be encoded with up to 4 bytes. When more bytes are used than strictly required, the size indeed seems to be prefixed with 0x80 bytes.
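
Going the other way, a writer can emit either the shortest form or a padded four-byte form of sizeOfInstance. The sketch below assumes a made-up function name (writeDescriptorLength) and a made-up padTo4 flag; it is an illustration of the definition quoted above, not code from any of the muxers mentioned.

```javascript
// Encode sizeOfInstance as 7-bit groups, most significant group first.
// Every byte except the last has its high bit set (nextByte = 1).
// With padTo4 = true, leading zero groups are added so the result is always
// four bytes, which produces the 80 80 80 xx pattern for small sizes.
function writeDescriptorLength(size, padTo4 = false) {
  if (!Number.isInteger(size) || size < 0 || size > 0x0fffffff) {
    throw new RangeError("size must be an integer in 0..2^28-1");
  }
  const groups = [];
  let rest = size;
  do {
    groups.unshift(rest & 0x7f);
    rest >>>= 7;
  } while (rest > 0);
  while (padTo4 && groups.length < 4) {
    groups.unshift(0);
  }
  return groups.map((g, i) => (i < groups.length - 1 ? g | 0x80 : g));
}

console.log(writeDescriptorLength(14));        // [ 14 ]
console.log(writeDescriptorLength(14, true));  // [ 128, 128, 128, 14 ]
console.log(writeDescriptorLength(639));       // [ 132, 127 ]
```

Both forms decode to the same value under the loop in the specification, so the choice between them only affects how many bytes the descriptor header takes.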