Hardware Video Encoding on iPhone — RTSP Server example

On iOS, the only way to use hardware acceleration when encoding video is to use AVAssetWriter, and that means writing the compressed video to file. If you want to stream that video over the network, for example, it needs to be read back out of the file.

I’ve written an example application that demonstrates how to do this, as part of an RTSP server that streams H264 video from the iPhone or iPad camera to remote clients. The end-to-end latency, measured using a low-latency DirectShow client, is under a second. Latency with VLC and QuickTime playback is a few seconds, since these clients buffer somewhat more data at the client side.

The whole example app is available in source form here under an attribution license. It’s a very basic app, but is fully functional. Build and run the app on an iPhone or iPad, then use Quicktime Player or VLC to play back the URL that is displayed in the app.

Details, Details

When the compressed video data is written to a MOV or MP4 file, it is written to an mdat atom and indexed in the moov atom. However, the moov atom is not written out until the file is closed, and without that index, the data in mdat is not easily accessible. There are no boundary markers or sub-atoms, just raw elementary stream. Moreover, the data in the mdat cannot be extracted or used without the data from the moov atom (specifically the lengthSize and SPS and PPS param sets).

My example code takes the following approach to this problem:

A little experimentation suggests that we are able to read compressed frames from file about 500ms or so after they are captured, and these frames then arrive around 200ms after that at the client app.

Rotation

For modern graphics hardware, it is very straightforward to rotate an image when displaying it, and this is the method used by AVFoundation to handle rotation of the camera. The buffers are captured, encoded and written to file in landscape orientation. If the device is rotated to portrait mode, a transform matrix is written out to the file to indicate that the video should be rotated for playback. At the same time, the preview layer is also rotated to match the device orientation.

This is efficient and works in most cases. However, there isn’t a way to pass this transform matrix to an RTP client, so the view on a remote player will not match the preview on the device if it is rotated away from the base camera orientation.

The solution is to rotate the pixel buffers after receiving them from the capture output and before delivering them to the encoder. There is a cost to this processing, and this example code does not include this extra step.

Update (April 2014)

On iOS 7, and particularly with an iPhone 5s, the output written to the file has changed slightly and I needed to update the code to fix this. The updated code is available here, and there is a complete explanation here.