Talky gets Real-time Text

Today I'm pleased to announce a new feature for our video conferencing service, Talky.io. It's called Real-time Text (or RTT for short), and it lets other people in your Talky room see what you're typing into the room chat, as you type. Here's a screen capture of RTT in action:

RTT is not enabled by default, since most people are inclined to hide their typos and half-finished thoughts, but you can opt in by ticking the "Send as I type" box beneath the chat input.

So why add RTT to Talky? Because at &yet we believe that software is about people, and there are people who use Talky who are not able to use the voice half of our voice and video service.

One of &yetConf's opening-night speakers, Aimee Chou, talked about her experiences as a deaf person, both with conferences and with software. (In fact, she gave her presentation in sign language, and it was the audience who had an interpreter.) As she put it, deaf users are treated as "edge cases".

Many software developers think it's OK not to bother supporting people who are hearing impaired, or are simply unaware of how the user experience of their app changes when a person can't hear.

Given that audio and voice are a large part of what I work on, her plea to be considered when building apps hit me especially hard. So I ended up skipping most of the conference talks to focus on bringing RTT to Talky.

Real-time Text is an important feature for deaf and hard-of-hearing folks, because it gives them a close approximation of "voice". Without RTT, the options are to use some form of sign language or to rely on normal text chat. But text chat is slow for live conversations. It usually doesn't feel that way, but the lag becomes readily apparent once video is involved. There are lots of starts and stops, pausing the conversation while you type or read a message. And much of the time you can't even maintain eye contact, because you're both always looking off to the side at the chat to read the latest message.

The purpose of RTT is to make text chat mimic the speed and bandwidth of voice. You don't have to finish a paragraph for the other person to respond; they're already answering before you've finished! You can see the person's emotions both from their face on video and from the cadence and intensity of their typing -- the fast keypresses of heightened emotion, the slow adding and changing of words when the speaker is unsure. All of the little cues that we are accustomed (even if unknowingly) to interpreting and applying to our understanding of the other person in a conversation are important, and with RTT those cues can be recreated, at least in part. Just as importantly, we can place the live text on top of the person's video, so that you keep looking at each other and can see the emotions on each other's faces as you type.

Real-time text is not a new idea. You might remember similar features from the heyday of AIM and ICQ. For Talky, we use XMPP for our signaling, so we now use the In-Band Real Time Text standard (XEP-0301), written by Mark Rejhon and Gunnar Hellstrom. As a member of the XMPP Standards Foundation, I was involved in the standardization process for that document back in 2011 and 2012, providing feedback on various portions. So when we converted Talky to use XMPP in 2014, I actually added support for RTT to our stanza.io XMPP library. Sadly, I never got around to actually integrating RTT into Talky, which made Aimee's words strike me all the harder.

The naive approach to implementing some form of RTT would be to send keystrokes as they happen; that's what it sounds and looks like you're doing, after all. But text can change in more ways than keystrokes (for example, your OS or browser might perform spell correction automatically). So whenever the chat input text is modified in any way, we perform a simple diff calculation to see what text was added or removed. I say "simple" because we can generally assume there is only one cursor, and therefore only one region of the text can realistically change between diff checks. That diff is then saved in a queue, along with the time elapsed since the last diff event. Every 700 milliseconds or so, the queue is drained and all of its events are sent to the other conversation participants, who can then "replay" those events, including the delays, to recreate how the text was originally typed.
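To make that concrete, here's a minimal sketch of the diff-queue-replay loop, assuming a single-cursor text input. The names (computeDiff, sendRttEvents, and so on) are illustrative, not Talky's actual code:

```typescript
type RttEvent =
  | { type: "insert"; pos: number; text: string; wait: number }
  | { type: "erase"; pos: number; count: number; wait: number };

// Hypothetical transport call (e.g., sends a batch in an XMPP message).
declare function sendRttEvents(events: RttEvent[]): void;

// Single-region diff: with one cursor, only one contiguous span can change,
// so finding the common prefix and suffix of the two strings is enough.
function computeDiff(oldText: string, newText: string, wait: number): RttEvent[] {
  let prefix = 0;
  while (prefix < Math.min(oldText.length, newText.length) &&
         oldText[prefix] === newText[prefix]) {
    prefix++;
  }
  let suffix = 0;
  while (suffix < Math.min(oldText.length, newText.length) - prefix &&
         oldText[oldText.length - 1 - suffix] === newText[newText.length - 1 - suffix]) {
    suffix++;
  }

  const events: RttEvent[] = [];
  const removed = oldText.length - prefix - suffix;
  if (removed > 0) {
    events.push({ type: "erase", pos: prefix + removed, count: removed, wait });
  }
  const inserted = newText.slice(prefix, newText.length - suffix);
  if (inserted.length > 0) {
    events.push({ type: "insert", pos: prefix, text: inserted, wait });
  }
  return events;
}

// Sender side: diff on every input change, queue the events with their
// inter-event delays, then drain the queue on a timer.
const input = document.getElementById("chat-input") as HTMLInputElement;
let queue: RttEvent[] = [];
let lastText = "";
let lastEventTime = Date.now();

input.addEventListener("input", () => {
  const now = Date.now();
  queue.push(...computeDiff(lastText, input.value, now - lastEventTime));
  lastText = input.value;
  lastEventTime = now;
});

setInterval(() => {
  if (queue.length === 0) return;
  sendRttEvents(queue);
  queue = [];
}, 700);

// Receiver side: replay the batch, honoring the recorded delays, so the
// text appears with the original typing cadence.
async function replay(events: RttEvent[], apply: (e: RttEvent) => void) {
  for (const event of events) {
    await new Promise((resolve) => setTimeout(resolve, event.wait));
    apply(event);
  }
}
```

Batching on a timer rather than sending every keystroke keeps the message rate bounded no matter how fast someone types, while the recorded delays preserve the typing cadence for the replay on the other side.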

There are, of course, more implementation details and potential Unicode issues, all covered by the XEP-0301 spec, but the fundamental model is remarkably simple and fun to implement.
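For reference, this is roughly what a drained batch looks like on the wire in XEP-0301: a <w/> element carries a wait interval in milliseconds, <t/> inserts text at a position, and <e/> erases characters (the exact attributes, defaults, and edge cases are in the spec):

```xml
<message to='room@example.com' type='groupchat'>
  <rtt xmlns='urn:xmpp:rtt:0' seq='42' event='edit'>
    <w n='350'/>        <!-- wait 350 ms -->
    <e p='12' n='4'/>   <!-- erase 4 characters, ending at position 12 -->
    <t p='8'>typo</t>   <!-- insert "typo" at position 8 -->
  </rtt>
</message>
```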

Since Talky's earliest days, we've received many kind and encouraging notes from members of the deaf community. We're thrilled to be able to add this feature for those users and more.

Give real-time text a try the next time you use Talky (it also pairs very well with our new small-screen support).

We're looking forward to your feedback!
