Google Speech-to-Text API Experiment

Purpose: transcribe audio waveforms to text

Figure 1: waveform annotated with the transcribed text and segment times.

choice file structure: choice 1, choice 2, choice 3, choice 4

Sentence Segmentation: [pauses between words] vs [gaps between paragraphs]
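The distinction between short pauses and longer gaps can be sketched as a threshold on the silence between consecutive words. This is only a minimal illustration: it assumes each word already comes with start and end times in seconds (for example, from the API's word time offsets), and the function name `split_on_pauses` and the threshold values are hypothetical, not part of the experiment.

```python
from typing import List, Tuple

# (word, start_sec, end_sec); assumed to be derived from word-level timestamps.
Word = Tuple[str, float, float]


def split_on_pauses(words: List[Word], max_gap: float) -> List[List[Word]]:
    """Group words into segments, starting a new segment whenever the
    silence before a word is longer than max_gap seconds."""
    segments: List[List[Word]] = []
    for word in words:
        if segments and word[1] - segments[-1][-1][2] <= max_gap:
            # Gap to the previous word is short: same segment.
            segments[-1].append(word)
        else:
            # First word, or the gap exceeds the threshold: new segment.
            segments.append([word])
    return segments


words = [("a", 0.0, 0.3), ("b", 0.4, 0.7), ("c", 2.0, 2.3)]
sentences = split_on_pauses(words, max_gap=0.5)   # small threshold: word-level pauses
paragraphs = split_on_pauses(words, max_gap=2.0)  # large threshold: paragraph-level gaps
```

A small `max_gap` splits at pauses between words, while a larger one only splits at longer gaps; suitable values would have to be tuned on the actual recordings.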

Selecting models

| Type | Enum constant | Description | Supported languages |
| --- | --- | --- | --- |
| Video | video | Transcribing audio in video clips. For best results, audio is recorded at a 16,000 Hz or greater sampling rate. | en-US only |
| Phone call | phone_call | Transcribing audio from phone calls. Typically, phone audio is recorded at an 8,000 Hz sampling rate. | en-US only |
| Command and search | command_and_search | Transcribing shorter audio clips. | All available languages |
| Default | default | Use this model if your audio does not fit one of the previously described models. Ideally, audio is high-fidelity, recorded at a 16,000 Hz or greater sampling rate. | All available languages |
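The table can be read as a small decision rule. The sketch below encodes it under the language restrictions listed above; the function name `choose_model` and its parameters (`source`, `short_clip`) are hypothetical helpers for illustration, and only the returned strings correspond to the enum constants in the table.

```python
def choose_model(source: str, language_code: str, short_clip: bool = False) -> str:
    """Pick a model enum constant following the table above.

    source: "video", "phone_call", or anything else (hypothetical labels).
    """
    en_us_only = {"video", "phone_call"}
    # The specialized models are listed as en-US only, so fall through otherwise.
    if source in en_us_only and language_code == "en-US":
        return source
    if short_clip:
        return "command_and_search"
    return "default"


# The chosen value goes into the "model" field of the recognition config.
config = {
    "languageCode": "en-US",
    "sampleRateHertz": 8000,  # typical phone audio, per the table
    "model": choose_model("phone_call", "en-US"),
}
```

For non-English audio, such as the Chinese phrases used later in these notes, this rule always ends up with `command_and_search` or `default`.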

Phrase hints

A speechContext can be passed in RecognitionConfig to provide information that aids in processing the given audio. A speechContext can hold a list of phrases that act as "hints" to the recognizer; these phrases boost the probability that such words or phrases will be recognized. Phrase hints can be used to:

  • Improve the accuracy for specific words and phrases that tend to be overrepresented in your audio data. For example, if specific "commands" are typically spoken by the user, you can provide these as phrase hints. Such additional phrases may be particularly useful if the supplied audio contains noise or the speech is not very clear.
  • Add additional words to the vocabulary of the recognition task. Speech-to-Text includes a very large vocabulary. However, if proper names or domain-specific words are out-of-vocabulary, you can add them to the phrases provided to your request's speechContext.

Implementation

    "speechContexts": [
      {
        "phrases": ["四", "三", "二", "一"]
      }
    ]
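For context, the phrase-hint fragment above can be placed into a complete request body for the REST `speech:recognize` method. The following is a minimal sketch, not the code used in this experiment: the helper name `build_recognize_request`, the `LINEAR16` encoding, the 16,000 Hz sample rate, and the `zh-TW` language code are all assumptions chosen for illustration.

```python
import base64
import json
from typing import Dict, List


def build_recognize_request(audio_bytes: bytes, phrases: List[str],
                            language_code: str,
                            sample_rate_hertz: int = 16000) -> Dict:
    """Build a JSON-serializable body for a speech:recognize request."""
    return {
        "config": {
            "encoding": "LINEAR16",          # assumed: uncompressed 16-bit PCM
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            # speechContexts is a repeated field: a list of objects,
            # each holding its own list of phrases.
            "speechContexts": [{"phrases": list(phrases)}],
        },
        # Inline audio is sent as base64-encoded text.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


body = build_recognize_request(b"\x00\x01", ["四", "三", "二", "一"], "zh-TW")
payload = json.dumps(body, ensure_ascii=False)
```

The resulting `payload` string would be sent as the POST body; authentication and the HTTP call itself are left out of this sketch.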

original [output]

[0] https://cloud.google.com/speech-to-text/docs/basics?hl=zh-tw
