Adventures with Motion JPEG

I recently added support for Motion JPEG streams to my MP4 mux and demux. (Update: these changes are now available here). There were a few interesting little twists in getting this to work correctly, so I thought I would write these down in case they were of use to someone. I realise that the set of people interested in this who don’t already know it is quite small, but since my dog retired, I have to tell someone else.

I want the files that my MP4 mux creates to be compatible with QuickTime. It’s not enough to create files that can be played back by my demux (although that in itself is useful in some cases, such as CC 608 closed caption data). So I’ve been testing these files with QuickTime, VLC and a handful of different decoders on DirectShow, including the BlackMagic decoder.

Supporting progressive images was straightforward. I found I could use the FOURCC ‘jpeg’ and just put the elementary stream in the file without any changes, and it worked in all players. The issues arose with interlaced Motion JPEG, in which each frame contains two separate JPEG images and a header is needed to identify the boundary between them. There are two types of header structure used by different companies to mark this boundary, and I needed to include both for it to work correctly.

First, a quick refresher on JPEG. The data stream consists of a set of tags, each beginning with a 0xFF byte, with the compressed data in the middle. Any 0xFF in the compressed data is escaped, so you can parse it easily, but unfortunately the length of the compressed data is not marked in a header. To parse it, you follow these rules:

0xFF followed by another 0xFF (or at the end) is a padding byte.
0xFF followed by a 0 is an escaped 0xFF that is part of the compressed data.
0xFF followed by a byte in the range 0xD0..0xD9 is a standalone two-byte code with no length field (the restart markers, SOI and EOI).
0xFF followed by any other code will have a two-byte length field followed by header data. The length field does not include the FF xx, but it does include the length field itself.
0xFF DA (Start Of Scan) marks the start of the compressed data. There is a length field, which only describes the length of the SOS record itself. The compressed data which follows is of unknown length.
The frame starts with an SOI (FF D8) and ends with an EOI (FF D9).
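To make those rules concrete, here is a minimal sketch of a scanner that walks one image from its SOI to its EOI. It is an illustration rather than the code in my mux: the name find_image_end and the choice to return 0 on malformed or truncated input are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the offset one past the EOI (FF D9) of the JPEG image that starts
// at `pos`, or 0 if the data is malformed or truncated.
// (Hypothetical helper for illustration only.)
std::size_t find_image_end(const std::vector<std::uint8_t>& d, std::size_t pos = 0) {
    const std::size_t n = d.size();
    // The image must begin with SOI (FF D8).
    if (pos + 2 > n || d[pos] != 0xFF || d[pos + 1] != 0xD8) return 0;

    std::size_t i = pos + 2;
    bool in_scan = false;  // true once an SOS record has been skipped
    while (i < n) {
        if (d[i] != 0xFF) {
            if (!in_scan) return 0;  // outside scan data only markers are expected
            ++i;                     // ordinary compressed byte
            continue;
        }
        if (i + 1 >= n) return 0;    // trailing FF with no EOI
        const std::uint8_t code = d[i + 1];
        if (code == 0xFF) { ++i; continue; }                     // padding byte
        if (code == 0x00) { i += 2; continue; }                  // escaped FF in the data
        if (code == 0xD9) return i + 2;                          // EOI: image ends here
        if (code >= 0xD0 && code <= 0xD8) { i += 2; continue; }  // two-byte code
        // Anything else carries a length field that counts itself but not FF xx.
        if (i + 4 > n) return 0;
        const std::size_t len = (static_cast<std::size_t>(d[i + 2]) << 8) | d[i + 3];
        if (len < 2 || i + 2 + len > n) return 0;
        in_scan = (code == 0xDA);    // compressed data follows an SOS record
        i += 2 + len;
    }
    return 0;
}
```

For an interlaced frame, calling find_image_end(chunk) gives the end of the first field, and calling it again with that offset as pos gives the end of the second. Every byte of scan data has to be examined, which is why this is slow.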

In the MP4 index, each frame is indexed as a single chunk of data. For interlaced frames, the decoder needs to separate the two images, but as you can see from that brief description, that means parsing the whole scan data to find the EOI code. To avoid this, decoders require an additional header at the start of each image, giving the length of the whole image.

Motion JPEG AVI files typically have an APP0 header (FF E0), with the tag ‘AVI1’. The structure of this data can be seen as the APP0 struct in typehandler.cpp, but essentially it just contains the field size for the whole JPEG image. QuickTime requires an APP1 header (FF E1), with a ‘mjpg’ tag. This structure (struct APP1 in typehandler.cpp) is more complicated, as it contains offsets to key fields within the image header as well as the overall length.
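As a rough sketch of how such a header can be located, the function below walks the segments between SOI and SOS and returns the payload offset of the first segment with a given marker byte and four-character tag. The name find_tagged_segment and the tag_offset parameter are hypothetical. The offsets in the usage note (‘AVI1’ at the start of the APP0 payload, ‘mjpg’ after four reserved bytes in APP1) are my assumptions from commonly published descriptions, so treat the structs in typehandler.cpp as the authority.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns the offset of the payload (the byte after the length field) of the
// first header segment whose marker byte is `marker` and whose payload holds
// the four-character `tag` at `tag_offset`. Returns 0 if there is none.
// (Hypothetical helper; the tag position is supplied by the caller.)
std::size_t find_tagged_segment(const std::vector<std::uint8_t>& d,
                                std::uint8_t marker,
                                const char* tag,
                                std::size_t tag_offset) {
    const std::size_t n = d.size();
    if (n < 2 || d[0] != 0xFF || d[1] != 0xD8) return 0;  // must start with SOI

    std::size_t i = 2;
    while (i + 4 <= n && d[i] == 0xFF) {
        const std::uint8_t code = d[i + 1];
        if (code == 0xFF) { ++i; continue; }       // padding byte
        if (code == 0xDA) break;                   // SOS: the headers are over
        if (code >= 0xD0 && code <= 0xD9) break;   // unexpected two-byte code
        const std::size_t len = (static_cast<std::size_t>(d[i + 2]) << 8) | d[i + 3];
        if (len < 2 || i + 2 + len > n) break;     // malformed length
        const std::size_t payload = i + 4;
        const std::size_t payload_len = len - 2;
        if (code == marker && tag_offset + 4 <= payload_len &&
            std::memcmp(&d[payload + tag_offset], tag, 4) == 0) {
            return payload;
        }
        i += 2 + len;                              // skip to the next segment
    }
    return 0;
}
```

Under those assumptions, find_tagged_segment(image, 0xE0, "AVI1", 0) would locate the AVI-style header and find_tagged_segment(image, 0xE1, "mjpg", 4) the QuickTime one.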

My mux will create or fix up both APP0 and APP1 headers, creating files that are playable by all the players I tried. If one of them is present, I use it to determine the length of the image. If neither is present, the mux will scan the entire image to find the EOI. This works correctly, but slows down the multiplexing process considerably.

There’s another catch. Motion JPEG files normally omit the Huffman table if the default table is used. However, QuickTime will not decompress images that omit the Huffman table, so the multiplexor will insert the default table if it is not present. This, unfortunately, adds 420 bytes to every image (840 bytes for interlaced frames).
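A minimal sketch of that check follows, assuming the caller supplies the complete default table segment (the FF C4 marker, length field and table data, 420 bytes in all). The name ensure_huffman_table and the choice to insert immediately before the SOS record are assumptions for illustration, not a description of what my mux does internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Returns `image` unchanged if a DHT segment (FF C4) appears before SOS, or if
// the headers cannot be parsed. Otherwise returns a copy with `dht_segment`
// inserted just before the SOS marker. (Hypothetical helper for illustration.)
Bytes ensure_huffman_table(const Bytes& image, const Bytes& dht_segment) {
    const std::size_t n = image.size();
    if (n < 2 || image[0] != 0xFF || image[1] != 0xD8) return image;

    std::size_t i = 2;
    while (i + 4 <= n && image[i] == 0xFF) {
        const std::uint8_t code = image[i + 1];
        if (code == 0xFF) { ++i; continue; }              // padding byte
        if (code == 0xC4) return image;                   // table already present
        if (code == 0xDA) {                               // SOS reached, no table seen
            const auto split = image.begin() + static_cast<std::ptrdiff_t>(i);
            Bytes out(image.begin(), split);
            out.insert(out.end(), dht_segment.begin(), dht_segment.end());
            out.insert(out.end(), split, image.end());
            return out;
        }
        if (code >= 0xD0 && code <= 0xD9) return image;   // unexpected two-byte code
        const std::size_t len = (static_cast<std::size_t>(image[i + 2]) << 8) | image[i + 3];
        if (len < 2 || i + 2 + len > n) return image;     // malformed length
        i += 2 + len;
    }
    return image;
}
```

Inserting the table moves everything after it, so any field size or offsets recorded in the APP0 and APP1 headers have to be recalculated afterwards.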

The final point is that any MP4 containing Motion JPEG will be rejected by QuickTime. It needs to be labelled as a MOV file. I’ve already met this when adding support for uncompressed PCM audio, and fortunately the fix is very simple. The first four bytes of the payload of the ‘ftyp’ atom at the start of the file (the major brand, which follows the atom’s size and type fields) contain the file type. I’ve been using ‘mp42’ for MP4 files. Changing this to ‘qt  ’ (that is, ‘qt’ followed by two spaces) means that QuickTime will accept it as a valid MOV file. My mux will do this if any of the contained streams are in a non-MP4 format (that is, Motion JPEG or uncompressed PCM audio).
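For reference, here is a small sketch of building an ‘ftyp’ atom with either brand. The atom layout (32-bit big-endian size, the type ‘ftyp’, major brand, minor version, then a list of compatible brands) is standard, but the names make_ftyp and choose_major_brand are hypothetical, and the minor version and compatible brands in the usage note are placeholders rather than the values my mux writes.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Picks the major brand: 'qt  ' when any stream is not a native MP4 format,
// 'mp42' otherwise. (Hypothetical helper for illustration.)
std::string choose_major_brand(bool has_non_mp4_stream) {
    return has_non_mp4_stream ? "qt  " : "mp42";
}

// Builds a complete 'ftyp' atom. Returns an empty vector if any brand is not
// exactly four characters long.
std::vector<std::uint8_t> make_ftyp(const std::string& major_brand,
                                    std::uint32_t minor_version,
                                    const std::vector<std::string>& compatible) {
    if (major_brand.size() != 4) return {};
    for (const std::string& b : compatible) {
        if (b.size() != 4) return {};
    }

    std::vector<std::uint8_t> atom;
    // Appends a 32-bit value in big-endian byte order.
    auto put32 = [&atom](std::uint32_t v) {
        atom.push_back(static_cast<std::uint8_t>(v >> 24));
        atom.push_back(static_cast<std::uint8_t>(v >> 16));
        atom.push_back(static_cast<std::uint8_t>(v >> 8));
        atom.push_back(static_cast<std::uint8_t>(v));
    };
    // Appends the four characters of a brand or atom type.
    auto put_tag = [&atom](const std::string& s) {
        atom.insert(atom.end(), s.begin(), s.end());
    };

    // size + type + major brand + minor version = 16 bytes, plus 4 per brand.
    put32(static_cast<std::uint32_t>(16 + 4 * compatible.size()));
    put_tag("ftyp");
    put_tag(major_brand);
    put32(minor_version);
    for (const std::string& b : compatible) put_tag(b);
    return atom;
}
```

A call such as make_ftyp(choose_major_brand(true), 0, {"qt  "}) produces a 20-byte atom whose major brand, at bytes 8 to 11, is ‘qt  ’.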

2nd May 2013