Cloud Speech-to-Text Basics

This conceptual guide covers:

1 the types of requests available to Speech-to-Text,

2 how to construct those requests, and

3 how to handle their responses.

It is recommended that all users of Speech-to-Text read this guide and one of the associated tutorials before diving into the API itself.

Speech Requests

Speech-to-Text has three main methods to perform speech recognition.

  • Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.

  • Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a Long Running Operation. Using this operation, you can periodically poll for recognition results (a minimal sketch of this polling flow follows this list). Use asynchronous requests for audio data of any duration up to 180 minutes.

  • Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.
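For example, the asynchronous polling flow might look like the minimal sketch below, which uses the google-cloud-speech Python client library rather than raw REST or gRPC calls; the gs:// URI is a placeholder and the 90-second timeout is an arbitrary choice for illustration.

# Minimal sketch: asynchronous recognition with the google-cloud-speech
# Python client. The gs:// URI is a placeholder; the timeout is arbitrary.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
audio = speech.RecognitionAudio(uri="gs://bucket-name/path_to_audio_file")

# long_running_recognize returns a Long Running Operation handle.
operation = client.long_running_recognize(config=config, audio=audio)

# result() blocks, polling the operation until recognition completes.
response = operation.result(timeout=90)

for result in response.results:
    print(result.alternatives[0].transcript)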

Requests contain configuration parameters as well as audio data. The following sections describe these types of recognition requests, the responses they generate, and how to handle those responses in more detail.

Speech-to-Text API recognition

Speech-to-Text can process up to 1 minute of speech audio data sent in a synchronous request.

Speech-to-Text typically processes audio faster than real time, processing 30 seconds of audio in 15 seconds on average.

Speech-to-Text offers both REST and gRPC methods for synchronous and asynchronous Speech-to-Text API requests.

This guide demonstrates the REST API because it is simpler for showing and explaining basic use of the API.

The basic makeup of REST and gRPC requests is similar.

Streaming recognition requests are supported only by gRPC.
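As a rough illustration of the streaming flow, the sketch below uses the google-cloud-speech Python client, whose streaming_recognize helper wraps the bi-directional gRPC stream; where the audio chunks come from (for example, a microphone) is left open, and audio_chunks is a stand-in.

# Rough sketch: streaming recognition via the google-cloud-speech Python
# client, which wraps the gRPC bi-directional stream. audio_chunks is a
# stand-in for a live audio source such as a microphone.
from google.cloud import speech

client = speech.SpeechClient()

streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    ),
    interim_results=True,  # surface partial hypotheses while audio is captured
)

audio_chunks = []  # placeholder: an iterable of raw audio byte strings

def requests():
    # Each streaming request after the config carries a slice of audio bytes.
    for chunk in audio_chunks:
        yield speech.StreamingRecognizeRequest(audio_content=chunk)

for response in client.streaming_recognize(streaming_config, requests()):
    for result in response.results:
        tag = "final" if result.is_final else "interim"
        print(tag, result.alternatives[0].transcript)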

Synchronous Speech Recognition Requests

A synchronous Speech-to-Text API request consists of a speech recognition configuration and audio data.

A sample request is shown below:

{
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
    },
    "audio": {
        "uri": "gs://bucket-name/path_to_audio_file"
    }
}
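To send this request body over REST, you might POST it to the v1 speech:recognize endpoint roughly as in the sketch below; it assumes an OAuth 2.0 access token is already available in an ACCESS_TOKEN environment variable, which is just one of several possible authentication setups.

# Sketch: sending the request above over REST. Assumes an OAuth 2.0 access
# token in the ACCESS_TOKEN environment variable.
import os
import requests

body = {
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
    },
    "audio": {"uri": "gs://bucket-name/path_to_audio_file"},
}

resp = requests.post(
    "https://speech.googleapis.com/v1/speech:recognize",
    headers={"Authorization": f"Bearer {os.environ['ACCESS_TOKEN']}"},
    json=body,
)
resp.raise_for_status()
print(resp.json())  # transcription results as JSON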

All Speech-to-Text API synchronous recognition requests must include a speech recognition config field (of type RecognitionConfig [1]). A RecognitionConfig contains the following sub-fields:

encoding - (required) specifies the encoding scheme of the supplied audio (of type AudioEncoding). If you have a choice in codec, prefer a lossless encoding such as FLAC or LINEAR16 for best performance [2]. The encoding field is optional for FLAC and WAV files, where the encoding is included in the file header.

sampleRateHertz - (required) specifies the sample rate (in Hertz) of the supplied audio [3]. The sampleRateHertz field is optional for FLAC and WAV files, where the sample rate is included in the file header.
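For example, a request body for a FLAC file could (hypothetically) omit both of these header-derived fields; the .flac URI below is a placeholder.

# Hypothetical request body for a FLAC file: encoding and sampleRateHertz
# are omitted because both are read from the FLAC file header.
flac_body = {
    "config": {"languageCode": "en-US"},
    "audio": {"uri": "gs://bucket-name/path_to_audio.flac"},
}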

languageCode - (required) contains the language + region/locale to use for speech recognition of the supplied audio. The language code must be a BCP-47 identifier [4]. Note that language codes typically consist of primary language tags and secondary region subtags to indicate dialects (for example, 'en' for English and 'US' for the United States in the above example). For a list of supported languages, see [5].

maxAlternatives - (optional, defaults to 1) indicates the number of alternative transcriptions to provide in the response. By default, the Speech-to-Text API provides one primary transcription. If you wish to evaluate different alternatives, set maxAlternatives to a higher value. Note that Speech-to-Text will only return alternatives if the recognizer determines them to be of sufficient quality; in general, alternatives are more appropriate for real-time requests requiring user feedback (for example, voice commands) and are therefore better suited to streaming recognition requests.
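As a sketch of both sides of this setting, the snippet below requests up to three alternatives with the google-cloud-speech Python client and iterates over whatever the recognizer returns; the count of 3 and the gs:// URI are illustrative.

# Sketch: request up to three alternative transcriptions and inspect each.
# The value 3 and the gs:// URI are illustrative.
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    language_code="en-US",
    max_alternatives=3,  # ask for up to three hypotheses per result
)
audio = speech.RecognitionAudio(uri="gs://bucket-name/path_to_audio.flac")
response = client.recognize(config=config, audio=audio)

for result in response.results:
    # Alternatives are ordered by confidence; fewer than three may come back.
    for alternative in result.alternatives:
        print(alternative.confidence, alternative.transcript)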

profanityFilter - (optional) indicates whether to filter out profane words or phrases. Filtered words are returned with their first letter followed by asterisks for the remaining characters (e.g. f***). The profanity filter operates on single words; it does not detect abusive or offensive speech that is a phrase or a combination of words.

speechContext - (optional) contains additional contextual information for processing this audio. A context contains the following sub-field:

phrases - contains a list of words and phrases that provide hints to the speech recognition task.
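As a sketch, such phrase hints can be attached to the recognition config with the Python client as shown below; the phrases are invented for illustration.

# Sketch: attach phrase "hints" via a speech context. The phrases are
# invented for illustration.
from google.cloud import speech

config = speech.RecognitionConfig(
    language_code="en-US",
    speech_contexts=[speech.SpeechContext(phrases=["weather", "whether"])],
)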

[0] https://cloud.google.com/speech-to-text/docs/basics

[1] https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig

[2] https://cloud.google.com/speech-to-text/docs/encoding#audio-encodings

[3] https://cloud.google.com/speech-to-text/docs/basics#sample-rates

[4] https://tools.ietf.org/html/bcp47

[5] https://cloud.google.com/speech-to-text/docs/languages
