Transcribing Long Audio Files

https://cloud.google.com/speech-to-text/docs/async-recognize

Here, a long audio file is one whose audio is longer than 1 minute.

Asynchronous speech recognition starts a long-running audio processing operation.

Asynchronous speech recognition can be used to recognize audio that is longer than a minute.

The audio file can be stored in Google Cloud Storage.

For shorter audio, including audio stored locally (inline), synchronous speech recognition is faster and simpler.

The results of the operation can be retrieved via the google.longrunning.Operations interface.
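As a rough illustration of retrieving the result of a long-running operation, the sketch below polls an operation object until it reports completion. It assumes only that the object exposes done() and result(), as the operation returned by long_running_recognize does; the function name wait_for_operation and its parameters are hypothetical and not part of the client library.

```python
import time


def wait_for_operation(operation, poll_interval=5.0, timeout=600.0,
                       sleep=time.sleep):
    """Poll a long-running operation until it finishes, then return its result.

    `operation` is assumed to expose done() and result(). `sleep` is
    injectable so the loop can be exercised without real waiting.
    """
    waited = 0.0
    while not operation.done():
        if waited >= timeout:
            raise TimeoutError(
                "operation did not finish within %.0f seconds" % timeout)
        sleep(poll_interval)
        waited += poll_interval
    # The operation has finished, so result() returns without blocking.
    return operation.result()
```

Calling operation.result(timeout=...) directly is simpler when blocking is acceptable; explicit polling like this is mainly useful when the caller wants to log progress or do other work between checks.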

Audio content can be sent directly to Cloud Speech-to-Text, or the service can process audio content that already resides in Google Cloud Storage.
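The two ways of supplying audio can be sketched with a small helper that picks the field to fill. This is only an illustration: the helper name audio_source is hypothetical, and it assumes that any location not starting with gs:// is a readable local file.

```python
def audio_source(location):
    """Return the keyword arguments for a recognition-audio message.

    A gs:// location is passed by reference in the `uri` field. Anything
    else is treated as a local file path (an assumption of this sketch)
    and its bytes are sent inline in the `content` field.
    """
    if location.startswith("gs://"):
        return {"uri": location}
    with open(location, "rb") as audio_file:
        return {"content": audio_file.read()}
```

The returned dictionary is meant to be unpacked into RecognitionAudio, for example RecognitionAudio(**audio_source('gs://my-bucket/audio.flac')), where the bucket and file names are placeholders.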

The following Python code requires that you have created and activated a service account. (The "Quickstart" guide explains how to set up gcloud and how to create and activate a service account.)

def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    # Written for google-cloud-speech 2.0 or later, where the separate
    # enums and types modules no longer exist.
    from google.cloud import speech

    client = speech.SpeechClient()

    # Reference the audio in Cloud Storage instead of sending it inline.
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US')

    # Start the long-running operation. The call returns an operation
    # handle without waiting for the transcript.
    operation = client.long_running_recognize(config=config, audio=audio)

    print('Waiting for operation to complete...')
    # Block for up to 90 seconds; longer audio may need a larger timeout.
    response = operation.result(timeout=90)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print('Transcript: {}'.format(result.alternatives[0].transcript))
        print('Confidence: {}'.format(result.alternatives[0].confidence))

For the audio limits on asynchronous speech recognition requests, see [1].

The maximum quotas can be edited in the Google Cloud Platform dashboard.

Content Limits

Audio data can be provided either directly in the content field of the request or by reference, as a Google Cloud Storage URI in the uri field of the request.

The API imposes the following limits on the size of this content:

Content Limit           Audio length
synchronous request     1 minute
asynchronous request    180 minutes
streaming request       1 minute

Audio longer than 1 minute must use the uri field to reference an audio file in Google Cloud Storage.
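The content limits above can be turned into a small pre-flight check. The sketch below is a hypothetical helper, not part of the API: it takes a known audio duration and reports which request type fits, using only the limits listed in this section. The constant and function names are assumptions made for the example.

```python
SYNC_MAX_SECONDS = 60           # synchronous request: 1 minute
ASYNC_MAX_SECONDS = 180 * 60    # asynchronous request: 180 minutes


def choose_request_type(duration_seconds, in_cloud_storage):
    """Pick a request type for audio of the given length.

    Returns "synchronous" or "asynchronous", or raises ValueError when
    the audio cannot be sent under the limits listed above.
    """
    if duration_seconds <= SYNC_MAX_SECONDS:
        return "synchronous"
    if duration_seconds > ASYNC_MAX_SECONDS:
        raise ValueError("audio exceeds the 180 minute asynchronous limit")
    if not in_cloud_storage:
        # Audio longer than 1 minute must be referenced through the uri field.
        raise ValueError("audio longer than 1 minute must be in Cloud Storage")
    return "asynchronous"
```

For example, choose_request_type(45, False) selects a synchronous request, while a ten-minute file only passes the check once it has been uploaded to Cloud Storage.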

The current API usage limits for Cloud Speech-to-Text are as follows:

Type of limit                 Usage limit
Requests per 100 seconds      500
Processing per 100 seconds    10,800 seconds of audio
Processing per day            480 hours of audio

Each StreamingRecognize session is considered a single request, even though it includes multiple frames of StreamingRecognizeRequest audio within the stream.

Requests and/or attempts at audio processing in excess of these limits will produce an error.

These limits apply to each Cloud Speech-to-Text developer project, and are shared across all applications and IP addresses using a given developer project.
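Because requests over the limits produce an error, a client may want to track its own usage before sending. The sketch below is one possible client-side tracker for the two per-100-second limits. It assumes a sliding window, which may not match how the service actually counts, and it only sees requests made by this one process, whereas the real quota is shared across the whole project. The class name QuotaWindow and its methods are hypothetical.

```python
import time
from collections import deque


class QuotaWindow:
    """Track requests and audio seconds sent in the last `window` seconds."""

    def __init__(self, max_requests=500, max_audio_seconds=10800,
                 window=100.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.max_audio_seconds = max_audio_seconds
        self.window = window
        self.clock = clock
        self._events = deque()   # (timestamp, audio_seconds) per request
        self._audio_total = 0

    def _expire(self, now):
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window:
            _, audio_seconds = self._events.popleft()
            self._audio_total -= audio_seconds

    def try_acquire(self, audio_seconds):
        """Record a request if it fits both limits; return True if it did."""
        now = self.clock()
        self._expire(now)
        if len(self._events) + 1 > self.max_requests:
            return False
        if self._audio_total + audio_seconds > self.max_audio_seconds:
            return False
        self._events.append((now, audio_seconds))
        self._audio_total += audio_seconds
        return True
```

A caller would check try_acquire(duration) before each request and wait or queue the work when it returns False. The daily processing limit is not covered by this sketch.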

[0]

https://cloud.google.com/speech-to-text/docs/async-recognize

[1]

https://cloud.google.com/speech-to-text/quotas
