from google.cloud import speech_v1p1beta1 as speech


def transcribe_with_model_adaptation(
    project_id, location, storage_uri, custom_class_id, phrase_set_id
):
    """
    Create a `PhraseSet` to provide a custom list of similar
    items that are likely to occur in your input data.
    """

    # Create the adaptation client
    adaptation_client = speech.AdaptationClient()

    # The parent resource where the custom class and phrase set will be created.
    parent = f"projects/{project_id}/locations/{location}"

    # Create the phrase set resource
    phrase_set_response = adaptation_client.create_phrase_set(
        {
            "parent": parent,
            "phrase_set_id": phrase_set_id,
            "phrase_set": {
                "boost": 10,
                "phrases": [
                    {"value": "fare"}
                ],
            },
        }
    )
    phrase_set_name = phrase_set_response.name

    # The next section shows how to use the newly created
    # phrase set to send a transcription request with speech adaptation.

    # Speech adaptation configuration
    speech_adaptation = speech.SpeechAdaptation(phrase_set_references=[phrase_set_name])

    # Speech configuration object
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=24000,
        language_code="en-US",
        adaptation=speech_adaptation,
    )

    # The name of the audio file to transcribe
    # storage_uri: URI of the audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
    audio = speech.RecognitionAudio(uri=storage_uri)

    # Create the speech client
    speech_client = speech.SpeechClient()

    response = speech_client.recognize(config=config, audio=audio)

    for result in response.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))
from google.cloud import speech_v1p1beta1 as speech


def transcribe_with_model_adaptation(
    project_id, location, storage_uri, custom_class_id, phrase_set_id
):
    """
    Create a `PhraseSet` that references the prebuilt `$ADDRESSNUM` class so
    that street address numbers are more likely to be recognized in your input data.
    """

    # Create the adaptation client
    adaptation_client = speech.AdaptationClient()

    # The parent resource where the custom class and phrase set will be created.
    parent = f"projects/{project_id}/locations/{location}"

    # Create the phrase set resource
    phrase_set_response = adaptation_client.create_phrase_set(
        {
            "parent": parent,
            "phrase_set_id": phrase_set_id,
            "phrase_set": {
                "boost": 10,
                "phrases": [
                    {"value": "my address is $ADDRESSNUM"},
                    {"value": "$ADDRESSNUM"}
                ],
            },
        }
    )
    phrase_set_name = phrase_set_response.name

    # The next section shows how to use the newly created
    # phrase set to send a transcription request with speech adaptation.

    # Speech adaptation configuration
    speech_adaptation = speech.SpeechAdaptation(phrase_set_references=[phrase_set_name])

    # Speech configuration object
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=24000,
        language_code="en-US",
        adaptation=speech_adaptation,
    )

    # The name of the audio file to transcribe
    # storage_uri: URI of the audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
    audio = speech.RecognitionAudio(uri=storage_uri)

    # Create the speech client
    speech_client = speech.SpeechClient()

    response = speech_client.recognize(config=config, audio=audio)

    for result in response.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))
from google.cloud import speech_v1p1beta1 as speech


def transcribe_with_model_adaptation(
    project_id, location, storage_uri, custom_class_id, phrase_set_id
):
    """
    Create a `PhraseSet` and a `CustomClass` to provide custom lists of similar
    items that are likely to occur in your input data.
    """

    # Create the adaptation client
    adaptation_client = speech.AdaptationClient()

    # The parent resource where the custom class and phrase set will be created.
    parent = f"projects/{project_id}/locations/{location}"

    # Create the custom class resource
    adaptation_client.create_custom_class(
        {
            "parent": parent,
            "custom_class_id": custom_class_id,
            "custom_class": {
                "items": [
                    {"value": "sushido"},
                    {"value": "altura"},
                    {"value": "taneda"},
                ]
            },
        }
    )
    custom_class_name = (
        f"projects/{project_id}/locations/{location}/customClasses/{custom_class_id}"
    )

    # Create the phrase set resource
    phrase_set_response = adaptation_client.create_phrase_set(
        {
            "parent": parent,
            "phrase_set_id": phrase_set_id,
            "phrase_set": {
                "boost": 10,
                "phrases": [
                    {"value": f"Visit restaurants like ${{{custom_class_name}}}"}
                ],
            },
        }
    )
    phrase_set_name = phrase_set_response.name

    # The next section shows how to use the newly created custom
    # class and phrase set to send a transcription request with speech adaptation.

    # Speech adaptation configuration
    speech_adaptation = speech.SpeechAdaptation(phrase_set_references=[phrase_set_name])

    # Speech configuration object
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=24000,
        language_code="en-US",
        adaptation=speech_adaptation,
    )

    # The name of the audio file to transcribe
    # storage_uri: URI of the audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
    audio = speech.RecognitionAudio(uri=storage_uri)

    # Create the speech client
    speech_client = speech.SpeechClient()

    response = speech_client.recognize(config=config, audio=audio)

    for result in response.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))
For example, suppose you have many recordings in which speakers ask about the "fare to get into the county fair", in a setting where the word "fair" occurs more often than "fare". In this case, you can use model adaptation to increase the probability that the model recognizes both "fair" and "fare" by adding them as phrases to a PhraseSet resource. This tells Speech-to-Text to recognize "fair" and "fare" more often than, say, "hare" or "lair".
However, because "fair" occurs more frequently in the audio, it should be recognized more often than "fare". You may already have transcribed the audio with the Speech-to-Text API and found many errors in which the correct word ("fair") was missed. In this case, we recommend additionally using phrases with boost and assigning a higher boost value to "fair" than to "fare". Assigning a higher weight to "fair" biases the Speech-to-Text API toward choosing "fair" more often than "fare". Without boost values, the recognition model recognizes "fair" and "fare" with equal probability.
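As a sketch of that idea (the parent path, phrase set ID, and boost values below are illustrative placeholders, not tuned recommendations), each phrase in a `PhraseSet` can carry its own boost, so "fair" can be weighted more heavily than "fare":

from google.cloud import speech_v1p1beta1 as speech

adaptation_client = speech.AdaptationClient()

# Hypothetical project and location; adjust to your own resources.
parent = "projects/my-project-id/locations/global"

phrase_set_response = adaptation_client.create_phrase_set(
    {
        "parent": parent,
        "phrase_set_id": "fair-fare-phrase-set",  # hypothetical ID
        "phrase_set": {
            "phrases": [
                # Per-phrase boost values bias recognition toward "fair"
                # more strongly than "fare"; 15 and 10 are illustrative only.
                {"value": "fair", "boost": 15},
                {"value": "fare", "boost": 10},
            ],
        },
    }
)

The resulting phrase set can then be referenced from a recognition request exactly as in the earlier samples, via `SpeechAdaptation(phrase_set_references=[phrase_set_response.name])`.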
from google.cloud import speech_v1p1beta1 as speech


def sample_recognize(storage_uri, phrase):
    """
    Transcribe a short audio file with speech adaptation.

    Args:
        storage_uri: URI of the audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
        phrase: Phrase "hints" help recognize the specified phrases in your audio.
    """

    client = speech.SpeechClient()

    # storage_uri = 'gs://cloud-samples-data/speech/brooklyn_bridge.mp3'
    # phrase = 'Brooklyn Bridge'

    phrases = [phrase]

    # Hint boost. This value increases the probability that a specific
    # phrase will be recognized over other similar-sounding phrases.
    # The higher the boost, the higher the chance of false-positive
    # recognition as well. It accepts a wide range of positive values.
    # Most use cases are best served with values between 0 and 20.
    # Using a binary search approach may help you find the optimal value.
    boost = 20.0
    speech_contexts_element = {"phrases": phrases, "boost": boost}
    speech_contexts = [speech_contexts_element]

    # Sample rate in hertz of the audio data sent
    sample_rate_hertz = 44100

    # The language of the supplied audio
    language_code = "en-US"

    # Encoding of the audio data sent. This sample sets it explicitly.
    # This field is optional for FLAC and WAV audio formats.
    encoding = speech.RecognitionConfig.AudioEncoding.MP3

    config = {
        "speech_contexts": speech_contexts,
        "sample_rate_hertz": sample_rate_hertz,
        "language_code": language_code,
        "encoding": encoding,
    }
    audio = {"uri": storage_uri}

    response = client.recognize(config=config, audio=audio)

    for result in response.results:
        # The first alternative is the most probable result.
        alternative = result.alternatives[0]
        print("Transcript: {}".format(alternative.transcript))