Video codecs past, present, and future

Yesterday Fippo talked about applying the principles of the open web to realtime communication with WebRTC. Consider this post the first installment in a series of in-depth descriptions of what that means in practice.

Where codecs fit in

A key component of realtime audio and video systems (like Talky) consists of methods for encoding the light waves and sound waves that our eyes and ears can understand into digital information that computers can process. Such a method is called a codec (shorthand for coder/decoder).

Because there are many different audio codecs and video codecs, if we want interoperability, then it's important for different software implementations to "speak the same language" by having at least one codec in common. In the parlance of industry standards, such a codec is specified as "mandatory to implement" or MTI. Often an MTI codec is a lowest common denominator, but software can use better codecs if available, a model that has also worked well for cipher suites in SSL/TLS and other technologies.

Unfortunately, because a lot of big companies have been granted a huge number of patents over codec technologies (and often jealously guard those patents under restrictive licenses), it can be difficult to settle on one MTI audio codec or video codec for communication over the Internet.

In the case of audio, way back in 2009 I helped spearhead an effort at the Internet Engineering Task Force (IETF) to create a patent-clear and technically superior audio codec optimized for the Internet.

In a spirit of true collaboration, core developers from Skype, Xiph.org, and other teams came together and combined the best of the Silk and CELT codecs to produce something that was better than the sum of its parts.

The result was Opus, which has become the go-to audio codec for Internet voice apps and the primary MTI audio codec for WebRTC (yes, the old G.711 codec is also mandatory to implement in WebRTC, but primarily for interoperability with legacy systems - Opus is the codec that matters here).

Video killed the audio star

In the case of video, the story is not so happy.

For almost three years now, the RTCWEB Working Group at the IETF has been fighting on and off about which video codec to specify as mandatory to implement. There have been two main contenders in this battle:

  • VP8, proposed by Google and currently implemented by Google Chrome, Mozilla Firefox, Opera and the open source webrtc.org stack provided by Google. Google claims it is not subject to patent licenses or royalties.
  • H.264, currently supported by Firefox through a plugin provided by Cisco. H.264 is subject to royalties under certain conditions (which are avoided by Firefox in a clever way). It is a widely established codec supported in hardware such as phones, switches, PBXs, and the like.

The VP8 and H.264 camps have been at odds for years, and it appeared that a solution would never be found. Some folks even proposed the use of H.261, an ancient video codec that appears to be patent-clear, because at least everyone could deploy it without fear of licenses and royalties.

A new hope?

Yet at the IETF meeting a few weeks ago in Honolulu, a novel proposal emerged and quickly received broad support among the browser vendors that currently implement WebRTC.

The compromise is that WebRTC "user agents" (browsers like Chrome and Firefox) and "devices" (apps like Talky iOS) are supposed to implement both VP8 and H.264. Thus there will be at least one codec which all participants in a video chat can use.

Building something better

That's all fine, and I'm not about to rain on this particular parade.

However, I have to ask:

Is this really the best we can do for the open web?

Far from it.

What we need is a truly open and technically superior codec, like Opus but for video.

In 2012 I once again worked with proponents of open codecs at the IETF (including the folks working on Daala), this time in an attempt to form a working group devoted to creating a standard video codec. Sadly, that effort did not bear fruit.

Yet.

But I take the long view, perhaps because I've been working on open communication technologies for 15 years, ever since I joined the Jabber instant messaging project in late 1999.

The way I see it, 15 years from now we'll probably still be using WebRTC, but we won't be using creaky old codecs like H.264 and VP8. By then, I'm confident that we'll have a technically superior and truly open video codec.

Paradoxically, the Great Video Compromise worked out in Honolulu might even move us a bit closer in that direction, despite the fact that it seems to entrench H.264 and VP8 forever.

As my former Cisco colleague Mo Zanaty pointed out, supporting two video codecs also entails support for codec negotiation, capabilities discovery, and other technical features that will make it easier for software to handle emerging codecs such as VP9 (recently deployed in Chrome) and H.265, as well as future codecs we can only dream about today.
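
To make the negotiation idea a little more concrete, here's a rough sketch in Python of how an endpoint might pick a codec: walk its own preference list, best first, and take the first codec the other side also supports. The function name pick_codec and the example codec lists are purely hypothetical, and real WebRTC endpoints negotiate through SDP offer/answer rather than by comparing plain lists, so treat this as an illustration of the principle and not as how any particular stack does it.

```python
from typing import Optional, Sequence


def pick_codec(local_prefs: Sequence[str],
               remote_supported: Sequence[str]) -> Optional[str]:
    """Return the first codec in our preference order that the peer supports.

    Hypothetical helper for illustration only. Codec names are compared
    case-insensitively; None means the two sides share no codec at all.
    """
    remote = {name.lower() for name in remote_supported}
    for name in local_prefs:
        if name.lower() in remote:
            return name
    return None


# Assumed example endpoints (not taken from any real deployment):
# we prefer a newer codec but still carry both mandatory ones as fallbacks.
ours = ["VP9", "VP8", "H.264"]
legacy_peer = ["H.264", "VP8"]
modern_peer = ["H.264", "VP8", "VP9"]

print(pick_codec(ours, legacy_peer))  # VP8
print(pick_codec(ours, modern_peer))  # VP9
```

Once that selection logic exists, adding a future codec is mostly a matter of putting a new name at the front of the preference list, while the mandatory codecs stay at the back as the guaranteed fallback.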

So am I enthusiastic about supporting both VP8 and H.264 in Talky? Not really. But it's the best we can do right now, until those of us who care about the open web build something better.

Maybe ask me again in 15 years. :-)
