🎙️Spirit LM: Meta's Latest Speech Model with Rich Emotional Expression

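Hello! Today we'll look at Spirit LM, a new speech model released by Meta. Spirit LM processes language by combining speech and text: it starts from an existing text language model and extends it to the speech modality, so it can handle tasks ranging from basic semantic understanding to generating emotionally expressive speech. This shows its potential for areas such as ASR (Automatic Speech Recognition), TTS (Text-to-Speech), and speech classification. In this post we'll go over Spirit LM's overview, characteristics and key features, and benchmark results, and then test its actual performance.

"This post is part of the Coupang Partners program, and I receive a commission through it."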

Spirit LM Overview

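Spirit LM is a multimodal language model that freely mixes text and speech. It is based on a pretrained text language model and is extended to the speech modality by continued training on text and speech units, so it shows both the semantic abilities of text models and the expressiveness of speech models.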

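Spirit LM comes in two versions: a BASE version that uses speech semantic units, and an EXPRESSIVE version that models expressiveness by using pitch and style units in addition to the semantic units.

Spirit LM Base: uses speech phonetic units (HuBERT tokens).
Spirit LM Expressive: models expressiveness with pitch and style units in addition to the phonetic units.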

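Spirit LM is a single language model that can generate both text and speech. Like text LLMs, it can learn new tasks from a small number of examples in text, speech, and cross-modal settings (e.g., speech-to-text and text-to-speech).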

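Key features:

Text and speech generation: generates both text and speech from a text or speech prompt.
Few-shot learning: the model learns a new task from only a handful of examples, and this is supported in text, speech, and cross-modal settings.
Cross-modal inference: converts text to speech and speech to text.
Expressive generation: generates expressive speech and text that carry emotion.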
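
To make the few-shot idea more concrete, here is a minimal sketch of how a cross-modal few-shot prompt (speech to text, as in ASR) could be laid out. The `Segment` type and the `build_few_shot_prompt` helper are hypothetical names I made up for illustration, not part of the Spirit LM library, and the file names are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One piece of an interleaved prompt (hypothetical helper type)."""
    modality: str  # "speech" or "text"
    content: str   # audio file path for speech, plain string for text


def build_few_shot_prompt(examples: List[Tuple[str, str]], query_audio: str) -> List[Segment]:
    """Lay out (audio, transcript) demonstrations followed by the query audio.

    The model is then expected to continue in the text modality,
    producing a transcript for the final speech segment.
    """
    prompt: List[Segment] = []
    for audio_path, transcript in examples:
        prompt.append(Segment("speech", audio_path))
        prompt.append(Segment("text", transcript))
    prompt.append(Segment("speech", query_audio))
    return prompt


# Placeholder file names, for illustration only
shots = [("hello.wav", "hello there"), ("bye.wav", "goodbye")]
prompt = build_few_shot_prompt(shots, "query.wav")
for seg in prompt:
    print(f"[{seg.modality.upper()}] {seg.content}")
```

Swapping the order inside each demonstration (text first, then speech) would give the text-to-speech direction instead.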

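SPIRIT-LM-EXPRESSIVE, the expressive version of Spirit LM, is designed to convey emotion and nuance better. It can maintain emotion in both text and speech, preserving the sentiment of a spoken or written prompt not only within the same modality (e.g., speech -> speech) but also across modalities (e.g., text -> speech).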

How Spirit LM Works

Next, let's look at how Spirit LM works. The model is trained on a mix of text-only sequences, speech-only sequences, and interleaved speech-text sequences.
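
As a rough illustration of what an interleaved speech-text sequence looks like, the toy sketch below switches modality at word boundaries and emits a `[TEXT]` or `[SPEECH]` marker on every switch. The switching probability, the word-level alignment, and the `[Hu…]` token names are assumptions for demonstration, not the actual training pipeline.

```python
import random
from typing import Dict, List


def interleave(words: List[str], speech_units: Dict[int, List[str]],
               switch_prob: float = 0.3, seed: int = 0) -> List[str]:
    """Build one interleaved sequence from word-aligned data (toy version).

    words: the transcript, one entry per word.
    speech_units: word index -> speech tokens for that word (made-up alignment).
    At each word boundary the modality flips with probability switch_prob,
    and a [TEXT] or [SPEECH] marker is emitted whenever the modality changes.
    """
    rng = random.Random(seed)
    modality = rng.choice(["TEXT", "SPEECH"])
    sequence = [f"[{modality}]"]
    for i, word in enumerate(words):
        if i > 0 and rng.random() < switch_prob:
            modality = "SPEECH" if modality == "TEXT" else "TEXT"
            sequence.append(f"[{modality}]")
        if modality == "TEXT":
            sequence.append(word)
        else:
            sequence.extend(speech_units[i])
    return sequence


words = ["the", "cat", "sat"]
units = {0: ["[Hu3]", "[Hu7]"], 1: ["[Hu12]"], 2: ["[Hu5]", "[Hu9]"]}
print(interleave(words, units, switch_prob=0.5, seed=1))
```

With `switch_prob=0.0` the function produces a text-only or speech-only sequence, which corresponds to the other two kinds of training data mentioned above.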

SPIRIT-LM is a new kind of language model for text and speech, built to learn from both modalities together. As the concept diagram shows, SPIRIT-LM works as follows:

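a. SPIRIT-LM base architecture: SPIRIT-LM encodes text and speech and is trained with next-token prediction. The encoder (Enc) takes text and speech data and converts it into a token sequence, and the decoder (Dec) converts that sequence back into text or speech. Text data is handled as text-only sequences, speech data as speech-only sequences, and mixed speech-text data as interleaved sequences. The model extends the LLaMA 2 language model by adding the speech modality.
b. Speech-text interleaving: The model encodes speech into HuBERT tokens (a representation of the speech content), Pitch tokens (how high or low the voice is), and Style tokens (speaking style). Text is tokenized with BPE (Byte Pair Encoding), which splits long words and sentences into shorter subword units. Speech and text are processed together, with the special tokens [TEXT] and [SPEECH] marking which modality follows. During training, the sequence switches between speech and text at certain boundaries, so the two are interleaved. For example, in the sentence "Eat your raisins outdoors", part of the sentence appears as text tokens after [TEXT] and part as speech tokens after [SPEECH].
c. Expressive speech tokens: SPIRIT-LM-EXPRESSIVE, the version with enhanced emotional expression, uses HuBERT, Pitch, and Style tokens together to generate emotionally rich speech. These tokens are produced at different time intervals (40 ms, 80 ms, and 1000 ms) and interleaved, duplicate tokens are removed, and the result is finally merged with the text data. Through this process, SPIRIT-LM handles text and speech in an integrated way and can understand and generate information across the two modalities.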
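
The merging step in (c) can be sketched roughly as below. This is a toy version under stated assumptions: I assume the three intervals map to HuBERT (40 ms), pitch (80 ms), and style (1000 ms) in that order, that duplicates are removed within each stream, and that tokens sharing a timestamp are ordered style, pitch, HuBERT. The token IDs are made up.

```python
from typing import List, Tuple

# Assumed tie-break order for tokens that start at the same timestamp
PRIORITY = {"St": 0, "Pi": 1, "Hu": 2}


def dedup(tokens: List[int], step_ms: int, prefix: str) -> List[Tuple[int, str, int]]:
    """Timestamp each token and drop consecutive repeats within one stream."""
    events = []
    previous = None
    for i, tok in enumerate(tokens):
        if tok != previous:
            events.append((i * step_ms, prefix, tok))
        previous = tok
    return events


def merge_streams(hubert: List[int], pitch: List[int], style: List[int]) -> List[str]:
    """Merge the three deduplicated streams into one sequence ordered by time."""
    events = dedup(style, 1000, "St") + dedup(pitch, 80, "Pi") + dedup(hubert, 40, "Hu")
    events.sort(key=lambda e: (e[0], PRIORITY[e[1]]))
    return [f"[{prefix}{tok}]" for _, prefix, tok in events]


hubert = [5, 5, 9, 9, 9, 2]  # 6 frames x 40 ms
pitch = [3, 3, 4]            # 3 frames x 80 ms
style = [7]                  # one style token covering the first second
print(merge_streams(hubert, pitch, style))
# → ['[St7]', '[Pi3]', '[Hu5]', '[Hu9]', '[Pi4]', '[Hu2]']
```

Nine input tokens shrink to six after deduplication, which shows why removing repeats keeps the merged sequence short.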

Testing Spirit LM

Next, let's test Spirit LM ourselves. Spirit LM has 7B (7 billion) parameters, and testing it requires a machine with at least 16 GB of VRAM, so I ran the tests on Lightning AI, a free GPU cloud. For more on Lightning AI, see my earlier post linked below.

2024.06.02 - [AI Tools] - ⚡️🆓Lightning AI: A Free GPU Cloud-Based AI Development Platform, Ollama Guide
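
As a rough check on the 16 GB figure, the sketch below estimates the memory taken by the weights alone. It assumes the model is loaded in half precision (2 bytes per parameter), which is my assumption rather than something verified here, and it ignores activations, the KV cache, and the speech tokenizer, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)


PARAMS = 7e9  # 7B parameters, as stated in the post

for name, nbytes in [("float32", 4), ("float16/bfloat16", 2)]:
    print(f"{name}: {weight_memory_gib(PARAMS, nbytes):.1f} GiB")
# → float32: 26.1 GiB
# → float16/bfloat16: 13.0 GiB
```

At about 13 GiB for half-precision weights, a 16 GB GPU leaves only a few gigabytes of headroom for everything else.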

1. Request the download: To download the model, fill in a short form at the link below and request approval.

https://ai.meta.com/resources/models-and-libraries/spirit-lm-downloads/

Spirit LM access request form - Meta AI

2. Download approval: A few hours after the request, an approval email arrives.

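3. Download the model: Click the link in the email to download Spirit LM. The download includes the tokenizer and the two model variants, and totals about 20 GB. The inference code is in the GitHub repository below.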

https://github.com/facebookresearch/spiritlm

GitHub - facebookresearch/spiritlm: Inference code for the paper "Spirit-LM Interleaved Spoken and Written Language Model".

The downloaded model files are organized as follows: the main directory contains a checkpoints directory, which in turn contains the speech_tokenizer and spiritlm_model directories.
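
Before going further, it can help to confirm that the files landed in the right place. The sketch below checks for the directories named above. `missing_checkpoint_dirs` is a hypothetical helper of mine, and the demo runs on a throwaway temporary directory so it works without the real download.

```python
import tempfile
from pathlib import Path
from typing import List

# Subdirectory names taken from the layout described above
EXPECTED_SUBDIRS = ["speech_tokenizer", "spiritlm_model"]


def missing_checkpoint_dirs(root: Path) -> List[str]:
    """Return the expected subdirectories that are missing under root/checkpoints."""
    checkpoints = root / "checkpoints"
    return [name for name in EXPECTED_SUBDIRS if not (checkpoints / name).is_dir()]


# Demo on a temporary directory: check, create the layout, check again
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    missing_before = missing_checkpoint_dirs(root)
    for name in EXPECTED_SUBDIRS:
        (root / "checkpoints" / name).mkdir(parents=True)
    missing_after = missing_checkpoint_dirs(root)

print(missing_before)  # → ['speech_tokenizer', 'spiritlm_model']
print(missing_after)   # → []
```

For a real check, call `missing_checkpoint_dirs(Path("."))` from the main directory and make sure it returns an empty list.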

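4. Install packages: Log in to Lightning AI, select a GPU machine, and install the packages Spirit LM needs. Run the first command from the root of the cloned spiritlm repository, since it installs the package in editable mode from the current directory.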

pip install -e '.[dev]'
pip install streamlit torchaudio transformers torch numpy
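
After installation, a quick way to confirm that everything is importable is to look up each package without actually loading it. This is a minimal sketch using only the standard library. The package list mirrors the imports used in the app code in the next step, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util
from typing import List

# Top-level modules imported by the Streamlit app in the next step
REQUIRED = ["spiritlm", "streamlit", "torchaudio", "transformers", "torch", "numpy"]


def missing_packages(names: List[str]) -> List[str]:
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If `spiritlm` shows up as missing, the editable install was probably run outside the repository root.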

5. Write the Streamlit Python code: Create a new Python file and paste in the code below. The code is a Streamlit web application that generates text or speech with the Spirit LM model.

import os
import streamlit as st
import torchaudio
import torch
import tempfile
import numpy as np

from spiritlm.model.spiritlm_model import Spiritlm, OutputModality, GenerationInput, ContentType
from transformers import GenerationConfig

# Set the environment variable for PyTorch memory management
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Load the Spirit LM model once; st.cache_resource reuses it across Streamlit reruns
@st.cache_resource
def load_model():
    return Spiritlm("spirit-lm-expressive-7b")

spirit_lm = load_model()

def generate_output(input_type, input_content_text, input_content_audio, output_modality, temperature, top_p, max_new_tokens, do_sample, speaker_id):
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
    )

    if input_type == "text":
        interleaved_inputs = [GenerationInput(content=input_content_text, content_type=ContentType.TEXT)]
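    elif input_type == "audio":
        # Load the audio file, mix it down to mono, and resample to 16 kHz
        # (assumption: the speech tokenizer expects 16 kHz mono, matching the 16 kHz output below)
        waveform, sample_rate = torchaudio.load(input_content_audio)
        waveform = waveform.mean(dim=0)
        if sample_rate != 16000:
            waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
        interleaved_inputs = [GenerationInput(content=waveform, content_type=ContentType.SPEECH)]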
    else:
        raise ValueError("Invalid input type")

    outputs = spirit_lm.generate(
        interleaved_inputs=interleaved_inputs,
        output_modality=OutputModality[output_modality.upper()],
        generation_config=generation_config,
        speaker_id=speaker_id,
    )

    text_output = ""
    audio_output = None

    for output in outputs:
        if output.content_type == ContentType.TEXT:
            text_output = output.content
        elif output.content_type == ContentType.SPEECH:
            if isinstance(output.content, np.ndarray):
                if len(output.content.shape) == 1:
                    audio_data = torch.from_numpy(output.content).unsqueeze(0)
                else:
                    audio_data = torch.from_numpy(output.content)

                with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_audio_file:
                    torchaudio.save(temp_audio_file.name, audio_data, 16000)
                    audio_output = temp_audio_file.name
            else:
                raise TypeError("Expected output.content to be a NumPy array, but got {}".format(type(output.content)))

    return text_output, audio_output

# Streamlit UI setup
st.title("Spirit LM WebUI Demo")
st.write("Demo for generating text or audio using the Spirit LM model.")

# Collect user input
input_type = st.radio("Input Type", ["text", "audio"], index=0)

if input_type == "text":
    input_content_text = st.text_area("Input Content (Text)")
    input_content_audio = None
else:
    input_content_audio = st.file_uploader("Input Content (Audio)", type=["wav", "mp3"])
    input_content_text = None

output_modality = st.radio("Output Modality", ["TEXT", "SPEECH", "ARBITRARY"], index=1)
# Temperature must be strictly positive when sampling, so the slider starts at 0.1
temperature = st.slider("Temperature", 0.1, 1.0, 0.9)
top_p = st.slider("Top P", 0.0, 1.0, 0.95)
max_new_tokens = st.slider("Max New Tokens", 1, 800, 500)
do_sample = st.checkbox("Do Sample", value=True)
speaker_id = st.selectbox("Speaker ID", [0, 1, 2, 3], index=0)

if st.button("Generate"):
    audio_path = None
    if input_type == "audio":
        if input_content_audio is None:
            st.warning("Please upload an audio file first.")
            st.stop()
        # Save the uploaded file to a temporary file, keeping its original extension
        suffix = os.path.splitext(input_content_audio.name)[1] or ".wav"
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as temp_audio_file:
            temp_audio_file.write(input_content_audio.read())
            audio_path = temp_audio_file.name

    output_text, output_audio = generate_output(input_type, input_content_text, audio_path, output_modality, temperature, top_p, max_new_tokens, do_sample, speaker_id)

    # Display the results
    st.text_area("Generated Text", value=output_text, height=200)
    if output_audio:
        st.audio(output_audio, format="audio/wav")

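The main flow of the code is as follows:

1. Set the environment variable: `os.environ` sets an environment variable that controls PyTorch's CUDA memory management.
2. Initialize the model: the `Spiritlm` class initializes the Spirit LM model (here, `spirit-lm-expressive-7b`).
3. Define `generate_output`: prepares the input content depending on whether the input is text or audio, passes it to the model to generate text or speech output, and, if speech is generated, saves it to a temporary file with `torchaudio`.
4. Set up the Streamlit interface: radio buttons let the user choose the input type (text or audio). Text input is entered in a text area, and audio input is uploaded through a file uploader. The user also chooses the output modality (text, speech, or a mix of both) and adjusts the generation settings with sliders and a checkbox.
5. Generate and display the output: pressing the 'Generate' button calls `generate_output` with the input and settings. The generated text is shown in a text area, and any generated speech can be played in an audio player. In this way the user can feed text or speech to Spirit LM and check the result in the web UI.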
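
Since the generation settings come straight from UI widgets, it can be worth sanity-checking them before calling the model. The sketch below is a hypothetical helper with conservative rules of my own (not the library's validation logic), using the same four settings as the app.

```python
from typing import List


def check_generation_settings(temperature: float, top_p: float,
                              max_new_tokens: int, do_sample: bool) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if do_sample and temperature <= 0:
        problems.append("temperature must be > 0 when do_sample is True")
    if not 0 < top_p <= 1:
        problems.append("top_p should be in (0, 1]")
    if max_new_tokens < 1:
        problems.append("max_new_tokens must be at least 1")
    return problems


print(check_generation_settings(0.9, 0.95, 500, True))  # → []
print(check_generation_settings(0.0, 0.95, 500, True))
# → ['temperature must be > 0 when do_sample is True']
```

In the app, the returned messages could be shown with `st.warning` before the Generate step runs.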

6. Run the Streamlit web application on Lightning AI: Click the Streamlit icon on the right side of the Lightning AI screen, select the file written above, and click the Run icon at the bottom right.

7. Test the features: On the web page that opens, choose text or audio input, enter a text prompt or upload an audio file, and click the Generate button at the bottom of the page.

Text input, Text output: For the prompt "The largest country in the world is", Spirit LM gave a different answer each time it was asked, such as Russia or Canada.

Test screen: Spirit LM text output for a text input prompt

Input	The largest country in the world is
Output	Canada. It is a country that is also the second largest country in the western hemisphere and is located in North America. The population in Canada is 33,903,2677,2053. Canada is not a member of the European Union and is also the world’s second largest country in size. Canada has a large amount of natural resources such as oil, minerals, metals and forest. The capital of Canada is Ottawa and the largest city is Toronto. Canada is situated in North America and the country is a parliamentary democracy, with a constitutional monarchy. The country is divided into ten provinces, with each province being ruled by the queen. Canada is a member of NATO and has a free trade agreement with the United States. The official languages in Canada are both French and English, although the majority of the population speaks English. Canada was once a French colony and was only recently independent and it became independent in 18671. (omitted)

Text input, Audio output: For the text prompt "I am absolutely thrilled to be embarking on this new journey, it's going to be an incredible adventure and ", which expresses excitement about a new journey, Spirit LM's audio reply had natural intonation and conveyed emotion well, sounding close to a real person's answer.

Test case	Text input, Audio output
Input	I am absolutely thrilled to be embarking on this new journey, it's going to be an incredible adventure and
Output (audio transcript)	I can not wait to get onboard......to the adventures. So, thanks for having me here, sorry for being a little late. But I hope this still got a good time together. Thank you again also for listening...

Audio input, Text output: For the audio prompt "Ha, Ha, No..I think you're thinking of 'Days of Thunder', which is a Tom Cruise racing movie.", Spirit LM responded by framing the situation as a film podcast, as shown below.

Test case	Audio input, Text output
Input (audio transcript)	Ha, Ha, No..I think you're thinking of "Days of Thunder", which is a Tom Cruise racing movie.
Output	(omitted) This is a brand new podcast hosted by me. Michael bozell. This is a podcast devoted to all things film related. If you're here for the first time welcome. I'm really happy to have you if you're coming back. I'm really glad you're back. If this is your first time. I hope you enjoy the rest of your time here. I'll be your host today. I'll be discussing the second part of the series. I'm also joined by Ben. (omitted)

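Audio input, Audio output: For the same audio prompt as in the third test (Audio input, Text output), Spirit LM gave a relevant reply, but the pronunciation was imprecise and the speech was so fast in places that parts could not be understood.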

Test case	Audio input, Audio output
Input (audio transcript)	Ha, Ha, No..I think you're thinking of "Days of Thunder", which is a Tom Cruise racing movie.
Output (audio transcript)	But I'm talking about Days of Thunder, the picture that was in actual horrific. I heard the instrumentalist of the company that play ....... That makes no sense. That's how they explain what happens in the movie. I was sleeping by myself in ..... I get the nightmare for starting in first...

Closing Remarks

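Spirit LM is a multimodal language model that can generate both text and speech. With features such as few-shot learning, cross-modal inference, and expressive generation, it has the potential to advance speech and language processing considerably, and I expect it to be used in a variety of fields.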

Here are my impressions after testing Spirit LM.

It answers only in English, even for Korean prompts.
Its speech output is fairly natural and expresses emotion well.
Some parts are pronounced inaccurately or spoken too fast.

Today we looked at Meta's Spirit LM. I think generating some text and speech yourself would be a worthwhile experience. I'll be back next time with more useful information. Thank you.
