State of the Browser

Exploring the Potential of the Web Speech API in Karaoke

Isn't it frustrating when the song you want isn't available at karaoke? Let's see if we can solve this. We will look at the current state of the Web Speech API, see what's possible, what isn't and what we have to look forward to.


[MUSIC PLAYING] Hello, everyone.

I'm really excited to be here at the state of the browser.

So a while back, the following quote came up in conversation.

"We are at the mercy of the browser.

" And I just chuckled, and I thought, that's an incredible quote.

I want to use it somewhere.

And here I am.

What better place to be at the mercy of the browser.

So my talk today is about exploring the potential of the Web Speech API in karaoke.

So first things first, hello.

I'm Anna and I do not represent any browser vendor.

I work as a front-end developer at Hactar.

And fun fact, I haven't been to karaoke since before the pandemic.

And the truth is, the only reason this talk even popped into my head is because I'm a big fan of the Rasmus.

Do you know them?


I'm not a stan.

I need to under-- I need to make sure people understand that, but I'm a big fan of them.

And because of that, I have a problem that I needed to fix.

So whenever I go to karaoke, the only song from the Rasmus they have available is In the Shadows.

It's always In the Shadows, and it isn't my favorite.

It's a solid bop, but I need more.

And while we can look up the instrumentals of the songs we like in places like YouTube, I wondered if I could build something.

And then my brain immediately started to escalate the idea.

And I found myself asking, what if we had more than just lyrics on the screen?

What if we could gamify the experience?

What if we could match what we're saying to the lyrics?

Because I know I would win.

And I truly know all the lyrics by heart.

So I searched speech to text.

And I realized that the results on the first page were all from private companies.

And we know that there is a browser native speech recognition.

And we are at the state of the browser and not at the state of private companies.

So we're going to focus on what the browser is currently giving us.

So let's talk about the Web Speech API.

The speech recognition-- I mean, the Web Speech API is split into two: speech recognition and speech synthesis.

So while it's starting to be obvious where I'm going to be using the speech recognition, one of the core ideas of why it was built was actually not karaoke: it was to enable developers to use speech recognition as an input for forms, continuous dictation, and control.

In fact, there's quite an interesting old draft of how this API should work on input fields.

It's very, very old, bear in mind, but it is quite interesting.

So how is the browser support for it?

The screenshot was taken this week.

It currently isn't supported in Firefox.

And in the other browsers, it still requires a vendor prefix or a different name.

It's not ideal, but we'll work with what we have.

After all, we are at the mercy of the browser.

And I'll circle back to this topic soon.
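As a sketch of what "a vendor prefix or a different name" means in practice: in Chromium-based browsers the constructor is exposed as `webkitSpeechRecognition` rather than `SpeechRecognition`, so feature detection looks roughly like this (the helper name is mine, not from the talk):

```javascript
// Return whichever speech recognition constructor the environment
// exposes, or null when neither name is available.
function getSpeechRecognition(globalObj) {
  return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

// In a browser, pass `window`. Anywhere else this safely yields null.
const Recognition =
  typeof window !== 'undefined' ? getSpeechRecognition(window) : null;
```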

So let's give this a go.

For the state of the browser, I decided to use what is available in a browser for free.

I decided to just use HTML, CSS, and vanilla JavaScript.

No libraries.

[APPLAUSE] On a note, the speech recognition does not work offline, and it requires HTTPS to send the data.

So let's build our base HTML.

Obviously, this is a bit shortened because of the amount of code.

For the sake of readability, I won't paste it all.

But let's add two buttons, a place for us to see the transcription coming up, load our audio, load our lyrics and our JS.
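A minimal skeleton along those lines might look like this; all the ids, class names, and file names here are my guesses for illustration, not the talk's actual markup:

```html
<!-- Two buttons, a transcript area, the audio, the lyrics, and our JS. -->
<button id="start">Start</button>
<button id="stop">Stop</button>

<div id="transcript"></div>

<audio id="song" src="twinkle-twinkle.mp3"></audio>

<ol id="lyrics">
  <li class="lyric-line">Twinkle, twinkle, little star</li>
  <li class="lyric-line">How I wonder what you are</li>
  <!-- … -->
</ol>

<script src="karaoke.js"></script>
```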

Ideally, I would use a song by the Rasmus, obviously.

But copyright doesn't let me.

So I played safe, and I picked a song that doesn't have any royalties attached to it.

So I had to look it up.

And turns out nursery rhymes are one of them.

And the Happy Birthday song only recently became available, because you weren't allowed to use that, apparently.

We'll add some base styling with gradients, and it should look something like this.

OK, so the good thing here is that I don't need design skills, because I'm sure that whatever I come up with will be a bit better than what your local bar has to offer.

[LAUGHTER] Right, so apologies for the block of text.

But we will be initiating our speech recognition.

We'll ask it to show the interim results.

So once we initiate it, we'll set recognition.interimResults.

And we'll set it to true just for the sake of the demo.

We'll set English as the language and append to our doc once we create the-- sorry, apologies.

We'll append the transcription to the place in our doc where we are putting it.

And there are some quirks here.

For example, the continuous option doesn't actually work.

I set it to false because I gave up making it work.

Because initially, I set it to true.

But it doesn't work.

And I looked it up, and apparently-- the speech recognition is a bit of a black box; we don't really know how it works.

But the theory is that you are spending quite a bit of data on their servers as well, constantly sending it.

So they're just like, oh, we'll just shut it off after a while.

So it's not really working.

But there is a trick.

At the end, you can see that we are forcing it to start once it ends.

So it's like, oh, you ended?

No, no, no, not on my watch.

You'll start again.

And unfortunately, on mobile devices, if you do this, you keep hearing the microphone constantly turning on and off.

So let's just stick to the desktop browser for now.
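Putting the pieces above together, here is a sketch of the setup being described; the `#transcript` selector is an assumption, and as noted, `continuous` and the restart trick behave differently across browsers:

```javascript
// Join the top alternative of every recognition result into one string.
function collectTranscript(results) {
  let text = '';
  for (const result of results) {
    text += result[0].transcript;
  }
  return text;
}

if (typeof window !== 'undefined') {
  const Ctor = window.SpeechRecognition || window.webkitSpeechRecognition;
  if (Ctor) {
    const recognition = new Ctor();
    recognition.interimResults = true; // show guesses as they firm up
    recognition.continuous = false;    // true is unreliable, as noted
    recognition.lang = 'en-GB';
    recognition.onresult = (event) => {
      document.querySelector('#transcript').textContent =
        collectTranscript(event.results);
    };
    // The restart trick: oh, you ended? Not on my watch. Start again.
    recognition.onend = () => recognition.start();
    recognition.start();
  }
}
```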


Let's add our lyrics.

I chose "Twinkle, Twinkle, Little Star," because it's short and slow.

And we'll use the same logic as subtitles: each line will have the second it starts and the second it ends.

So in this case, we'll start at around five seconds and end at around 11.2 seconds.
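In code, that subtitle-style timing might look like this; the timings beyond the first line are invented for illustration:

```javascript
// Each lyric line carries the second it starts and the second it ends,
// exactly like subtitles.
const lyrics = [
  { text: 'Twinkle, twinkle, little star', start: 5, end: 11.2 },
  { text: 'How I wonder what you are', start: 11.2, end: 17 },
  // …and so on for the rest of the song.
];

// Which line (if any) is active at a given playback time?
function lineIndexAt(time, lines) {
  return lines.findIndex((line) => time >= line.start && time < line.end);
}
```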

I was like, OK, right.

What is the logic behind this?

So when the speech recognition begins, I want the song to also start.

I was like, OK, that makes sense.

Once I hit Start, both things will begin.

We'll check if the current time falls between the start and end of that song line.

And if yes, we'll just add a class that makes it the current line.

It looks roughly like this.

So once we click in our button, we start our recognition.

We play our song.

And actually, the audio has an event listener, timeupdate, which tells you exactly which second of the audio you are at right now.

So we save our transcript temporarily.

And we'll check, OK, for each line: if the current second falls within it, we'll add a class saying, this is the line being selected right now.

This is a bit too much, but we'll see it playing in a second.

And if the current time is bigger than the end time of a song line, we'll add a class that marks it as being in the past.

And once that line is in the past, we'll do some checks.
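A sketch of that timeupdate logic; the class names, the line timings, and the selectors are my assumptions, not the actual implementation:

```javascript
// Hypothetical line timings in seconds, for illustration only.
const lines = [
  { start: 5, end: 11.2 },
  { start: 11.2, end: 17 },
];

// Decide which state a line is in at the current playback time.
function classForLine(line, currentTime) {
  if (currentTime >= line.end) return 'past';      // already sung
  if (currentTime >= line.start) return 'current'; // being sung now
  return '';                                       // still to come
}

if (typeof document !== 'undefined') {
  const audio = document.querySelector('audio');
  audio.addEventListener('timeupdate', () => {
    document.querySelectorAll('.lyric-line').forEach((el, i) => {
      const state = classForLine(lines[i], audio.currentTime);
      el.classList.toggle('current', state === 'current');
      el.classList.toggle('past', state === 'past');
    });
  });
}
```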

All right.

So, moment of shame.

Here we go.

So we'll have something like this.

Please don't fail me.

So in this example here, I'm only showing you what I'm saying right now.

It's not very styled, because that's not my end goal.

I just want you to see that whatever I'm saying is showing up on the screen right now continuously, which shows that, OK, it's working.

The funny thing about web speech recognition is that it actually returns-- I'm only printing the result, but the actual API returns the confidence level of what it thinks the words are.

And because I'm showing the interim bit, it is thinking about what I could be saying and adjusting it at the same time.

If I turned off the interim, it would only give me the words it's confident about.
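That confidence score comes back on each result alternative alongside its transcript; a small sketch of reading both, where the mock result shape mirrors a recognition result's list of alternatives:

```javascript
// Take the best (first) alternative of one recognition result and
// report its transcript with the engine's confidence, 0 to 1.
function describeResult(result) {
  const { transcript, confidence } = result[0];
  return `${transcript} (confidence ${confidence.toFixed(2)})`;
}
```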

But here it is.

OK, I got this bit working.

I got it printing.

Now I was just thinking, OK, I need to match things, right?

Here we go.


I decided to spice things up a little bit, and I decided to do a theme-switching button as well.

I think it looks pretty good as it is, but I was like, all right.

We have a bar.

We have a flamingo, and this is because one of my favorite bands actually did a random thing on Instagram: they posted coordinates on their Instagram stories, and it was a karaoke bar.

This was last week, and they said, fans, come meet us. And a bunch of people just went to this karaoke bar, and they had a party doing karaoke of their own songs with fans. And I've never had so much FOMO in my entire life.

And I noticed on their videos that their karaoke had a flamingo in the background.

I was like, okay, I'll do a little tribute over here.

So we have a flamingo, and now we have our classic theme.

So this is like the one you'll probably see in really old bars.

Alright, so first of all I'm not gonna sing in this bit.

Unfortunately I will sing in the next one.

So let's just see this speech recognition bit just by saying it versus singing, because it does make a difference.

So Twinkle, twinkle, little star, How I wonder what you are, Up above the world so high, Like a diamond in the sky.

Twinkle, twinkle, little star, how I wonder what you are.

Let's stop this because I'm so sick and tired of this song.


So I've done some matching validation.

I was like, OK, if you're matching what I'm saying, let's do a little green class.

And I did add some glow, but I don't think it came through very well through the screen.

And then, in theory, I was supposed to say it so well that it would match.

The second sentence should have been all green as well.

But as you can see, it has a couple of flaws.

I know I did say it, but it didn't fully capture it.

But at the time when I was testing it, I thought, well, it's a bit unfair if you don't get the whole sentence, but it did say a few words.

So I did a partial match, which is shown as orange.
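The green/orange idea just described can be sketched as a word-overlap check; the function name and the exact rules are mine, not the talk's actual implementation:

```javascript
// Compare the expected lyric line with what the recogniser heard.
// Full word overlap -> 'match' (green), some -> 'partial' (orange),
// none -> 'miss'.
function matchLevel(expectedLine, heardText) {
  const words = (s) =>
    s.toLowerCase().replace(/[^a-z' ]/g, '').split(/\s+/).filter(Boolean);
  const expected = words(expectedLine);
  const heard = new Set(words(heardText));
  const hits = expected.filter((w) => heard.has(w)).length;
  if (hits === expected.length) return 'match';
  if (hits > 0) return 'partial';
  return 'miss';
}
```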

So my point with the orange bit was, for example, to be-- [MUSIC PLAYING] I'm going to say words incorrectly.

And actually, it's going to capture what I actually say.

So it's going to unmatch it completely, which is unfortunate.

How I wonder, cat, you are.

Okay, how, sorry, I messed up.

But it's fun, it's fun because like it's gonna say, yeah it's wrong, it didn't say it.

So, whatever.

Twinkle, twinkle, little star.

How I wonder, dog, you are.




It works.

And guess what?

It's all in the browser.

There are no libraries.

Even the yellow bit showing which line you're supposed to sing-- it's just a CSS animation.

Everything is CSS, and I literally only had to do a few for loops to match what you're saying.

So I think it's pretty neat.

If I was using probably a lot more things, I could have glitter, I could have, oh, add your name, save it, add points.

I couldn't decide on points, so I decided to park that idea, because I was like, how many points should you get if you get streaks and things?

And I was escalating it so much that I thought, no, I need to rein myself in with a proper goal. Otherwise, I won't stop.


So back to where we were.

There are really a lot of fun demos and projects built on top of this API.

Tony Edwards did a fantastic talk called Beats, Rhymes, and Unit tests.

In his talk, he wanted to see if the Web Speech API could help him jot down the rhymes he comes up with.

Similar to me, Tony noted that it wasn't perfect, but it was a great experiment.

I really recommend watching his talk.

And with that said, I-- oop, there we go.

You'll notice that if I say those words properly, they will come out correctly.

Unfortunately, if I sing them, the results vary.

I'm so sorry I have-- (audience laughing) I am so sorry, but here we go.

Twinkle, twinkle, little star, how I wonder what you are.

Up above the world so high, like a diamond in the sky. Twinkle, twinkle, little star, how I wonder what you are.

[APPLAUSE] If you want to learn more, Stephanie Eckles built a really fun Santa chat for the 12 Days of Web challenge.

I think it was a couple of years ago, though.

And Wes Bos also has a chapter on the Web Speech API for free on his JavaScript course.

And there's plenty of other projects using this technology.

During the pandemic, Web Captioner became a tool for many people.

It's going to sunset soon, and it's actually going to be made open source.

And it was because of this tool, actually, that I found out that there's also microphone information.

And that's when I realized that my microphone actually doesn't function very well.

It gives me a really low volume.

And it could also be the reason why some words may be missing.

I won't find out any time soon, so I'm not going to fix it.

And there are also quite a few polyfills available: implementations with WebRTC and, as mentioned before, lots of private companies that will offer you speech recognition as a service.

So the following pop-up appeared on Safari when I was trying things out.

It reads, Safari would like to access speech recognition.

Speech data from this app will be sent to Apple to process your requests.

This will also help Apple improve its speech recognition technology.

And I said, OK, and I never saw this again.

So this pop-up sort of answers what some of you may be asking right now.

So what is the current status of the Web Speech API recognition in Firefox?

Unfortunately, I wasn't able to get in touch with someone who is currently working on this project at Firefox.


But this is my assumption made from things I've read online.

The Web Speech API needs access to a lot of data to train on.

And it is my understanding that, unlike the other browser vendors, which belong to massive corporations, Firefox doesn't sound like it has unlimited access to data that it can train on.

But I'm optimistic.

It's just-- I don't have any information on this, but I do think it's just a matter of time.

There are projects like the Common Voice from Mozilla, where you can donate your voice, and most importantly, at least to me, your accent, and also do some quality control.

So what's next?

Right, after all this, you might be wondering what's next?

And I don't have any answers.

But I would like to see these types of APIs grow, because they open up a world of creative opportunities, just in a browser, for you to play with for free.

It is also conferences like the State of the Browser that help highlight some wishes and needs.

For example, that's how I learned about the Common Voice project.

And I have the feeling the web speech API will grow.

Meanwhile, I will share and support the call to action for speech recognition's sibling.

Look up Leonie Watson's amazing talk, Designing Voice Interfaces, where a call to action is presented to support CSS speech.

I want to plug this talk to you.

You should go and watch it.

This is a bit of an improvised bit, but you can learn by building useless things.

And what you might be wondering-- we learned so much today.

There were amazing talks on accessibility, the web, all the web stuff.

And what do I do with this?

I only recently gave myself permission to build unproductive things.

They are not for my job.

They don't benefit anyone except my own joy.

Side projects don't need to be monetized in order to be valid.

They don't need to become NPM packages or open source projects.

That's it.

If you learn anything from this, it's that I probably did more JavaScript than I usually do in my day job.

And I have to go back to it to do all the improvements based on all the talks I've seen today.

So, OK, good thing you're not looking at my JavaScript.

Anyways, thank you.

[APPLAUSE] >> Wow.

I am a fan-- not a stan.

You look like a band member.

I know.

I don't know why they didn't invite me to go on stage and sing.

Oh yeah, I know why.

Never mind.


About Ana Rodrigues

Ana Rodrigues

Ana works as a front-end developer for the agency Hactar. She started coding as a teenager building fan sites, and has been working as a front-end developer for the last 11 years. Nowadays, Ana spends most of her free time experimenting on her personal blog and is particularly interested in ethics, IndieWeb, sustainability, plants, cooking, privacy and all things CSS.