Hardware Video Encoding on iPhone — RTSP Server example

On iOS, the only way to use hardware acceleration when encoding video is to use AVAssetWriter, and that means writing the compressed video to file. If you want to stream that video over the network, for example, it needs to be read back out of the file.

I’ve written an example application that demonstrates how to do this, as part of an RTSP server that streams H264 video from the iPhone or iPad camera to remote clients. The end-to-end latency, measured using a low-latency DirectShow client, is under a second. Latency with VLC and QuickTime playback is a few seconds, since these clients buffer somewhat more data at the client side.

The whole example app is available in source form here under an attribution license. It’s a very basic app, but is fully functional. Build and run the app on an iPhone or iPad, then use Quicktime Player or VLC to play back the URL that is displayed in the app.

Details, Details

When the compressed video data is written to a MOV or MP4 file, it is written to an mdat atom and indexed in the moov atom. However, the moov atom is not written out until the file is closed, and without that index, the data in mdat is not easily accessible. There are no boundary markers or sub-atoms, just raw elementary stream. Moreover, the data in the mdat cannot be extracted or used without the data from the moov atom (specifically the lengthSize and SPS and PPS param sets).

My example code takes the following approach to this problem:

Only video is written using the AVAssetWriter instance, or it would be impossible to distinguish video from audio in the mdat atom.
Initially, I create two AVAssetWriter instances. The first frame is written to both, and then one instance is closed. Once the moov atom has been written to that file, I parse the file and assume that the parameters apply to both instances, since the initial conditions were the same.
Once I have the parameters, I use a dispatch_source object to trigger reads from the file whenever new data is written. The body of the mdat chunk consists of H264 NALUs, each preceded by a length field. Although the length of the mdat chunk is not known, we can safely assume that it will continue to the end of the file (until we finish the output file and the moov is added).
For RTP delivery of the data, we group the NALUs into frames by parsing the NALU headers. Since there are no AUDs marking the frame boundaries, this requires looking at several different elements of the NALU header.
Timestamps arrive with the uncompressed frames from the camera and are stored in a FIFO. These timestamps are applied to the compressed frames in the same order. Fortunately, the AVAssetWriter live encoder does not require re-ordering of frames. Update this is no longer true, and I now have a version that supports re-ordered frames.
When the file gets too large, a new instance of AVAssetWriter is used, so that the old temporary file can be deleted. Transition code must then wait for the old instance to be closed so that the remaining NALUs can be read from the mdat atom without reading past the end of that atom into the subsequent metadata. Finally, the new file is opened and timestamps are adjusted. The resulting compressed output is seamless.

A little experimentation suggests that we are able to read compressed frames from file about 500ms or so after they are captured, and these frames then arrive around 200ms after that at the client app.

Rotation

For modern graphics hardware, it is very straightforward to rotate an image when displaying it, and this is the method used by AVFoundation to handle rotation of the camera. The buffers are captured, encoded and written to file in landscape orientation. If the device is rotated to portrait mode, a transform matrix is written out to the file to indicate that the video should be rotated for playback. At the same time, the preview layer is also rotated to match the device orientation.

This is efficient and works in most cases. However, there isn’t a way to pass this transform matrix to an RTP client, so the view on a remote player will not match the preview on the device if it is rotated away from the base camera orientation.

The solution is to rotate the pixel buffers after receiving them from the capture output and before delivering them to the encoder. There is a cost to this processing, and this example code does not include this extra step.

Update (April 2014)

On iOS 7, and particularly with an iPhone 5s, the output written to the file has changed slightly and I needed to update the code to fix this. The updated code is available here, and there is a complete explanation here.