Google Speech-to-Text API Experiment
Purpose: transcribe an audio waveform to text.
Figure 1: waveform with the transcribed text and segment times annotated above it.
File structure choices: choice 1, choice 2, choice 3, choice 4
Sentence segmentation: [pauses between words] vs [paragraph gaps]
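One way to act on the [pauses between words] idea is to request word-level timestamps (e.g. with enable_word_time_offsets in RecognitionConfig) and start a new segment whenever the silence between consecutive words exceeds a threshold. The sketch below is an assumption about how that could be wired up, not part of the original notes; the 0.5 s threshold and the (word, start, end) tuple shape are placeholders.

```python
# Minimal sketch: split recognized words into sentences by pause length.
# Assumes each word comes with (text, start_seconds, end_seconds), e.g. obtained
# by setting enable_word_time_offsets=True. The 0.5 s gap is an arbitrary placeholder.
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start time in seconds, end time in seconds)

def split_by_pause(words: List[Word], max_gap: float = 0.5) -> List[List[Word]]:
    segments: List[List[Word]] = []
    current: List[Word] = []
    prev_end = None
    for word in words:
        _, start, end = word
        # Start a new segment when the silence since the previous word is long enough.
        if prev_end is not None and start - prev_end > max_gap:
            segments.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        segments.append(current)
    return segments

# Example: "四 三 二" spoken quickly, then a long pause before "一".
words = [("四", 0.0, 0.3), ("三", 0.4, 0.7), ("二", 0.8, 1.1), ("一", 2.0, 2.3)]
print(split_by_pause(words))  # -> two segments: [四, 三, 二] and [一]
```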
Selecting models
| Type | Enum constant | Description | Supported languages |
|---|---|---|---|
| Video | video | Best for transcribing audio from video clips. For best results, the audio should be recorded at a sampling rate of 16,000 Hz or higher. | en-US only |
| Phone call | phone_call | Best for transcribing audio from phone calls. Phone audio is typically recorded at an 8,000 Hz sampling rate. | en-US only |
| Command and search | command_and_search | Best for transcribing shorter audio clips. | All available languages |
| Default | default | Use this model if your audio does not fit one of the previously described models. Ideally, the audio is high-fidelity and recorded at a sampling rate of 16,000 Hz or higher. | All available languages |
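As a concrete illustration of picking one of these models, the snippet below sets the model field on the request's RecognitionConfig using the Python client library (google-cloud-speech). This is a minimal sketch: the bucket URI, sampling rate, and language code are placeholders, not values from this experiment.

```python
# Sketch: choosing a recognition model via RecognitionConfig.model.
# Assumes `pip install google-cloud-speech` and application default credentials;
# the URI, sample rate, and language code below are placeholders.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,   # video/default models prefer >= 16,000 Hz
    language_code="zh-TW",     # note: video and phone_call are en-US only per the table
    model="default",           # one of: video, phone_call, command_and_search, default
)

audio = speech.RecognitionAudio(uri="gs://your-bucket/countdown.wav")
response = client.recognize(config=config, audio=audio)

for result in response.results:
    print(result.alternatives[0].transcript)
```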
Phrase hints
A speechContext can be passed in the RecognitionConfig to provide information that aids in processing the given audio. A speechContext holds a list of phrases that act as "hints" to the recognizer; these phrases can boost the probability that such words or phrases will be recognized.
- Improve the accuracy for specific words and phrases that may tend to be overrepresented in your audio data. For example, if specific "commands" are typically spoken by the user, you can provide these as phrase hints. Such additional phrases may be particularly useful if the supplied audio contains noise or the speech is not very clear.
- Add additional words to the vocabulary of the recognition task. Speech-to-Text includes a very large vocabulary. However, if proper names or domain-specific words are out-of-vocabulary, you can add them to the phrases provided to your request's speechContext.
Realization
"speechContexts": {
"phrases":["四","三","二","一"]
}
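In the JSON request body this fragment sits inside the request's config object. With the Python client library the same hint can be passed as speech_contexts; a minimal sketch, reusing placeholder audio settings rather than the experiment's actual values:

```python
# Sketch: the Python-client equivalent of the "speechContexts" JSON above.
# Only the phrase hints matter here; the other fields are placeholders.
from google.cloud import speech

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="zh-TW",
    speech_contexts=[speech.SpeechContext(phrases=["四", "三", "二", "一"])],
)
```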
original [output]