Building Advanced AI Voice Assistants Using Google Gemini 2.0 and Angular
Explore the Gemini Live API (Experimental) — a real-time, multimodal API for creating advanced, voice-first user experiences.
This article presents a quick-start project demonstrating Gemini 2.0’s real-time, multimodal AI capabilities within an Angular web application.
This project originated as an Angular adaptation of the Multimodal Live API — Web console, which is presently only available in React. The Gemini Live API is in active development and may change; there is currently no official JavaScript client or SDK.
What’s Gemini Live?
The Gemini Live API is part of Google Gemini 2.0, enabling a new generation of dynamic, multimodal, real-time AI experiences, as showcased on Pixel 9 phones and in the Gemini mobile app (iOS and Android).
Gemini Live powers innovative applications across devices and platforms:
- Hands-free AI Assistance: Users interact naturally through voice while cooking, driving, or multitasking.
- Real-time Visual Understanding: Get instant AI responses as you show objects, documents, or scenes through your camera.
- Smart Home Automation: Control your environment with natural voice commands — from adjusting lights to managing thermostats.
- Seamless Shopping: Browse products, compare options, and complete purchases through conversation.
- Live Problem Solving: Share your screen to get real-time guidance, troubleshooting, or explanations.
- Integration with Google services: Leverage existing Google services such as Search or Maps to extend the assistant's capabilities.
Try Gemini Live in the Gemini app (for iOS and Android)
Gemini is now available on iPhone and Android, so you can enjoy free, natural conversations with Gemini Live and get the most out of Google’s personal AI assistant.
Try Gemini Live in Google AI Studio
Before diving into the code, explore Gemini Live’s capabilities — voice interactions, webcam, and screen sharing — by using Google AI Studio Live. This interactive playground will help you understand the available features and integration options before adding them to your projects.
First step: Obtaining your API key from Google AI Studio
Navigate to aistudio.google.com and generate an API key. You can check the global availability of the API here.
Cloning the Angular Application
Run the following command in a folder of your choice to create a local copy of the project:
$ git clone https://github.com/gsans/gemini-2-live-angular.git
$ cd gemini-2-live-angular
$ npm install
These commands will download the complete repository and install its dependencies.
System Requirements
- Node.js and npm (latest stable version)
- Angular CLI (installed globally via npm install -g @angular/cli)
Setting up the project
Add a new environment by running the following command:
ng g environments
This creates the necessary files for the development and production environments:
src/environments/environment.development.ts
src/environments/environment.ts
Modify the development file, replacing <YOUR-API-KEY> with your actual API key:
// src/environments/environment.development.ts
export const environment = {
  API_KEY: "<YOUR-API-KEY>",
  WS_URL: "wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent",
};
To run the application, execute the following command in the terminal and navigate to localhost:4200 in your browser:
ng serve
Usage Guide: Getting Started
- Launch the application and click the Connect button under Connection Status.
- This project requires a WebSocket connection, which will be used to send and receive real-time voice, images, and video from the Gemini Live API.
- Monitor the browser’s Developer Tools Console for connection issues.
- You can also use the Control Tray, located in the top-right corner, to toggle the microphone, share your screen or webcam, and start or pause a conversation with the play/pause button.
A few examples using Gemini’s Tools
You can try any of the prompts you already use with Gemini. The following examples explore Gemini’s Tools that access real-time data via Google Search, execute Python code, or execute actions through external APIs.
Gemini real-time access via Google Search
Gemini can use Google Search to provide real-time information beyond its training data cut-off date and to ground responses, minimizing hallucinations.
Tell me the scores for the last 3 games of FC Barcelona.
Gemini using a Python sandbox via Code Execution
Gemini can access a Python sandbox to generate and execute code as part of its response. This sandbox has access to libraries such as altair, chess, cv2, matplotlib, mpmath, numpy, pandas, pdfminer, reportlab, seaborn, sklearn, statsmodels, striprtf, sympy, and tabulate. matplotlib can generate 2D/3D diagrams as images.
You can try this option in Google AI Studio. Enable Code execution in the Tools section of the Run settings panel.
What’s the 50th prime number?
What’s the square root of 342.12?
Gemini running an external weather API (mock) via Function Calling
The function calling feature in Gemini allows you to provide users with access to tools: external APIs that provide real-time data or perform actions (e.g., making restaurant reservations or ordering food). These tools remain available throughout the conversation.
This example demonstrates this capability using a mock weather API. The model doesn’t directly call the function; instead, it generates structured output specifying the function name and suggested arguments. You’ll need to add code to handle the user request (as identified by Gemini) by calling the external API. Finally, provide the API’s output to Gemini, which formats the final answer for the user.
What’s the weather in London?
For a more complex example of Function Calling implementing CRUD, see GenList.
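The sketch below illustrates the round trip described above: Gemini emits a structured function call, your code runs the (here, mocked) external API, and the result is sent back so the model can phrase the final answer. Treat it as a minimal sketch rather than the project's implementation — the declaration shape follows the Gemini function-calling format, but handleToolCall, getMockWeather, and the commented sendToolResponse() call are hypothetical names for illustration.
// Hypothetical sketch: declaring a weather tool and handling Gemini's function call.
const weatherTool = {
  functionDeclarations: [{
    name: "get_weather",
    description: "Returns the current weather for a given city.",
    parameters: {
      type: "OBJECT",
      properties: { city: { type: "STRING" } },
      required: ["city"],
    },
  }],
};

// Mock weather API (stands in for a real external service).
function getMockWeather(city: string) {
  return { city, condition: "cloudy", temperatureC: 14 };
}

// Called when Gemini returns a structured function call instead of text.
function handleToolCall(call: { id: string; name: string; args: { city: string } }) {
  if (call.name === "get_weather") {
    const result = getMockWeather(call.args.city);
    // Send the API output back to Gemini so it can format the final answer.
    // sendToolResponse() is illustrative; adapt it to the project's client.
    // multimodalLiveService.sendToolResponse({ functionResponses: [{ id: call.id, response: result }] });
    return result;
  }
  return null;
}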
Configuration Options
The main configuration is handled in src/app/app.component.ts. You can toggle between audio and text modalities:
let config: LiveConfig = {
  model: "models/gemini-2.0-flash-exp",
  generationConfig: {
    // For text responses in chat window
    responseModalities: "text",
    // For audio responses (uncomment to enable)
    // responseModalities: "audio",
    // speechConfig: {
    //   voiceConfig: { prebuiltVoiceConfig: { voiceName: "Aoede" } },
    // },
  },
};
Usage Limits. Daily and session-based limits apply. Token count restrictions are in place to prevent abuse. If limits are exceeded, wait until the next day to resume.
At the moment, the audio option does not provide transcripts of your voice commands or of Gemini's voice responses. You can add this with an external speech-to-text API, such as Deepgram. This exercise is left to the reader to avoid adding further complexity to the project. Let me know if you are unsure how to approach it.
Technical Overview: Data Flow, Events and UI
Compared to a standard Gemini client, this project introduces significant additional complexity due to its real-time streaming capabilities for audio (microphone and Gemini's voice) and video (screen and webcam). This requires two clients: one for Gemini and another for the WebSocket connection. The following diagram illustrates the methods, observables, and events involved in the data flow.
Managing the real-time connection (over WebSocket)
Unlike a REST API, which follows a request-response protocol, WebSockets are bidirectional, meaning that either side of the channel can initiate or interrupt communication. This creates three stages: the setup handshake (open connection), bidirectional message exchange (connection remains open), and termination (one side closes the connection).
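As a rough illustration of these three stages, the sketch below opens a raw WebSocket against the Live API endpoint from the environment file, sends a setup message, and logs incoming frames. The real project encapsulates all of this in MultimodalLiveService; the query-parameter API key and the shape of the setup payload are simplifying assumptions, so check the service for the exact handshake.
// Minimal sketch of the WebSocket lifecycle (setup, message exchange, termination).
import { environment } from "../environments/environment";

const ws = new WebSocket(`${environment.WS_URL}?key=${environment.API_KEY}`);

ws.onopen = () => {
  // Stage 1: setup handshake — send the session configuration once the socket opens.
  ws.send(JSON.stringify({ setup: { model: "models/gemini-2.0-flash-exp" } }));
};

ws.onmessage = async (event) => {
  // Stage 2: bidirectional exchange — parse incoming frames (string or Blob) as JSON.
  const raw = typeof event.data === "string" ? event.data : await (event.data as Blob).text();
  console.log("Received", JSON.parse(raw));
};

ws.onclose = (event) => {
  // Stage 3: termination — either side can close the connection.
  console.log("Connection closed", event.code, event.reason);
};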
Processing Audio (Web Audio API)
This section of the project manages both user microphone input (outgoing audio) and Gemini's voice responses (incoming audio) using the AudioRecorder and AudioStreamer classes, respectively.
Audio processing within the Web Audio API occurs in a separate thread from the main UI thread, ensuring smooth performance, even under heavy load.
- The AudioRecorder captures audio from the user's microphone and sends it to the Gemini client via the sendRealtimeInput() method (see the sketch after this list). This is accomplished using two audio worklets: worklet.audio-processing.ts and worklet.audio-meter.ts. The first worklet, worklet.audio-processing.ts, handles the core audio capture from navigator.mediaDevices.getUserMedia({ audio: true }), buffering the audio data and converting it to a format suitable for transmission. The second worklet, worklet.audio-meter.ts, provides volume-level data, used by the audio-pulse control to visually represent the audio levels from both the AudioRecorder (microphone) and the AudioStreamer (Gemini's voice).
- The AudioStreamer receives Gemini's voice data through the ws.audio event handler (indicating audio data received from the WebSocket). It manages buffering and scheduling of audio chunks to ensure smooth, gap-free playback, even with network fluctuations. The Gemini client calls the AudioStreamer's addPCM16() method to pass the received audio data for playback.
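The sketch below shows, under simplifying assumptions, how microphone capture with an audio worklet can feed the Gemini client. getUserMedia() and AudioWorkletNode are standard Web Audio APIs, but the "audio-processing" processor name, the chunk encoding, and the sendRealtimeInput() signature are illustrative stand-ins for the project's AudioRecorder internals, not its exact code.
// Hedged sketch of microphone capture feeding sendRealtimeInput().
async function startMicrophoneCapture(
  sendRealtimeInput: (chunks: { mimeType: string; data: string }[]) => void,
) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext({ sampleRate: 16000 });

  // Load the worklet that buffers and converts raw audio into 16-bit PCM chunks.
  await audioContext.audioWorklet.addModule("worklet.audio-processing.js");
  const source = audioContext.createMediaStreamSource(stream);
  const recorderNode = new AudioWorkletNode(audioContext, "audio-processing");

  // Each processed chunk arrives from the worklet thread via its message port.
  recorderNode.port.onmessage = (event: MessageEvent<ArrayBuffer>) => {
    const base64 = btoa(String.fromCharCode(...new Uint8Array(event.data)));
    sendRealtimeInput([{ mimeType: "audio/pcm;rate=16000", data: base64 }]);
  };

  source.connect(recorderNode);
}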
Processing Video (WebRTC)
This part of the project uses WebRTC to stream the screen or webcam to a video element. In a separate process, a frame is rendered to a canvas element and sent to the Gemini Live API using sendRealtimeInput().
The current frame rate is set to 0.5 FPS, or one frame every 2 seconds.
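A minimal sketch of that loop is shown below: drawing the current video frame to a canvas, encoding it as a JPEG, and passing it to sendRealtimeInput(). The 2-second interval matches the 0.5 FPS mentioned above; the scaling factor, JPEG quality, and sendRealtimeInput() signature are assumptions for illustration.
// Hedged sketch: send one video frame every 2 seconds (0.5 FPS) as a base64 JPEG.
function startFrameCapture(
  video: HTMLVideoElement,
  canvas: HTMLCanvasElement,
  sendRealtimeInput: (chunks: { mimeType: string; data: string }[]) => void,
) {
  const ctx = canvas.getContext("2d")!;

  const sendFrame = () => {
    // Scale the frame down (25% here, an assumption) to keep payloads small.
    canvas.width = video.videoWidth * 0.25;
    canvas.height = video.videoHeight * 0.25;
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);

    // toDataURL returns "data:image/jpeg;base64,<data>"; strip the prefix.
    const base64 = canvas.toDataURL("image/jpeg", 0.8).split(",")[1];
    sendRealtimeInput([{ mimeType: "image/jpeg", data: base64 }]);
  };

  return setInterval(sendFrame, 2000); // 0.5 FPS
}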
User Interface: connection, chat window, and control tray
The user interface is split into three main blocks: the connection, the chat window, and the control tray. The chat window is followed by the video and canvas elements used by the control tray component. Below is a diagram showing the main methods behind each button, component inputs, and events.
Gemini Live API Setup and Tools
The Gemini Live API configuration can be found in the connect method in src/app/app.component.ts and, if left empty, in the config property in the MultimodalLiveService. I recommend starting with a simpler project where you can learn each Gemini Tool separately. You can get started with the article below.
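For reference, a LiveConfig enabling the Tools covered earlier might look roughly like the sketch below. The googleSearch, codeExecution, and functionDeclarations entries follow the Gemini API's tool configuration format, but treat the exact shape as an assumption and check it against the project's connect method.
// Hedged sketch: a LiveConfig with Google Search, Code Execution, and Function Calling enabled.
const config: LiveConfig = {
  model: "models/gemini-2.0-flash-exp",
  generationConfig: {
    responseModalities: "text",
  },
  tools: [
    { googleSearch: {} },   // real-time data via Google Search
    { codeExecution: {} },  // Python sandbox
    {
      functionDeclarations: [{
        name: "get_weather", // the mock weather tool from the earlier example
        description: "Returns the current weather for a given city.",
        parameters: {
          type: "OBJECT",
          properties: { city: { type: "STRING" } },
          required: ["city"],
        },
      }],
    },
  ],
};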
Conclusion
Congratulations! You have successfully accessed the Gemini Live API. Use the completed GitHub project as a reference to create your own features.
This tutorial covered the following:
- Introduction to the Gemini Live API.
- Project setup and getting started.
- A few examples to showcase Gemini’s Tools (Code Execution, Google Search, and Function Calling).
This project shows how to create a Gemini Live Client using WebSockets to stream images, documents, audio (microphone and Gemini Live voices), and video (screen sharing and webcam) from an Angular project.
Building your own Voice-First Assistant Applications
Use this project to start familiarizing yourself with real-time voice user interactions and build your own. These are some ideas to get you started:
- An AI booking assistant with access to your schedule, capable of handling dentist bookings and cancellations (via voice or chatbot interface).
- An automated robot cafe that takes orders and relays them to a robot barista (using voice, text, or visual interactive controls).
More examples, created by the Gemini team using React, are also available.
Thanks for reading!
Do you have any questions? Feel free to leave your comments below or contact me on Twitter at @gerardsans.
Final Warning
The official version of the Gemini Live API in the JavaScript SDK, once released, will likely have a different implementation and API design, with improvements in security and performance (possibly using WebAssembly). This project is at the forefront of innovation and may already be outdated by the time you read this. However, it is still a great way to get ahead and learn about the future of voice AI assistants today.