
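Image created with DALL·E 3

In this tutorial, we will build a personal, local LLM assistant that you can talk to. You record your voice with a microphone and send it to the LLM, and the LLM returns its answer as both text and speech.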

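Here is how the application works: your recorded speech is transcribed to text, the text is sent to the local LLM, and the LLM's answer is converted back to speech.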

You can find the code in this GitHub repository: https://github.com/amirarsalan90/personal_llm_assistant

The main components of the application are:

Local LLM (served by llama-cpp-python)
Speech-to-text (Whisper)
Text-to-speech (Bark)

llama-cpp-python

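llama-cpp-python provides Python bindings for the excellent llama.cpp, a C/C++ implementation of inference for many large language models. I chose it for this tutorial because it is widely adopted in the open-source community.

Note: I tested this application on a system with an Nvidia RTX 4090 GPU.

First, let's create a new conda environment: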

conda create --name assistant python=3.10
conda activate assistant

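Next, we need to install llama-cpp-python. As its description explains, llama.cpp supports several hardware acceleration backends to speed up inference. To run the LLM on the GPU, we will build llama-cpp-python with cuBLAS. I ran into some problems getting the model to load on the GPU, and eventually found the correct installation steps in this post: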

export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
pip install llama-cpp-python[server]

We also need to install a few other packages for this application:

pip install gradio
pip install openai
pip install huggingface_hub  # also provides the huggingface-cli command
pip install torch
pip install transformers
pip install nltk
pip install optimum

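The next step is to download the weights of the model we want to serve. For this tutorial I use Mistral-7B-Instruct-v0.2 (Mistral 7B is my preferred model among the 7B models). llama.cpp uses the GGUF model format. If you are not familiar with it, TheBloke's Hugging Face account is a go-to source for quantized and GGUF-converted models.

We can download the model weights with the following commands: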

mkdir models
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models/ --local-dir-use-symlinks False
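After the download finishes, a quick sanity check can confirm that the file really is a GGUF file and not, say, a truncated download or an HTML error page. The sketch below relies on the GGUF header layout (a 4-byte `GGUF` magic followed by a little-endian 32-bit version number); the helper name `read_gguf_version` is made up for this illustration.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if the file is not GGUF."""
    with Path(path).open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The magic is followed by a little-endian uint32 version number.
    (version,) = struct.unpack("<I", header[4:8])
    return version


model_path = Path("./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf")
if model_path.exists():
    print("GGUF version:", read_gguf_version(model_path))
```

If the function returns None for the downloaded file, the download is probably incomplete or the path is wrong.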

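Once the weights are downloaded, we are ready to test our LLM. To do so, we can run an LLM server (this is why we installed llama-cpp-python[server] earlier). Open a terminal, activate your assistant conda environment, and start the server: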

conda activate assistant
python3 -m llama_cpp.server --model ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf --n_gpu_layers -1 --chat_format chatml

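In this command, n_gpu_layers sets how many of the model's layers are offloaded to the GPU. My 4090 has 24 GB of VRAM, which is enough to hold this quantized model, so I pass -1 to offload all layers. The second argument, chat_format, selects the chat template used to format the conversation for the model; since we are using a Mistral model, we choose the chatml template. You can read more about chat templates here.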
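To make the idea of a chat template concrete, here is a minimal sketch of how a list of chat messages could be rendered into a single ChatML prompt string. The function `to_chatml` is a hypothetical helper written for this illustration; the formatter inside llama-cpp-python may differ in details such as whitespace.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render OpenAI-style chat messages as a ChatML prompt string."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # An open assistant turn tells the model it should answer next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml(
    [
        {"role": "system", "content": "You are a helpful AI."},
        {"role": "user", "content": "Hi"},
    ]
)
print(prompt)
```

The server performs this step for you; the sketch only shows what the model ends up seeing as input.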

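Now it is time to send requests to the model from Python. One more thing worth mentioning: llama-cpp-python exposes an OpenAI-compatible API. This means you can send requests to your local LLM the same way you would send them to the OpenAI API to use GPT models such as GPT-3.5 or GPT-4.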

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx")
response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful AI."},
        {"role": "user", "content": "In what city were the 2000 olympics taken place?"}
    ],
)

print(response)
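Because the API is OpenAI-compatible, the client call above boils down to a JSON POST to the `/chat/completions` endpoint under the base URL. The following sketch builds that request with the standard library only and shows how the answer text would be pulled out of the JSON response. The helper names `build_chat_request` and `extract_reply` are made up for this illustration, and the request is only constructed, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same base URL as the client above


def build_chat_request(messages, model="mistral", base_url=BASE_URL):
    """Build (but do not send) a POST request for a chat completion."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant's text out of a chat completion response dict."""
    return response_body["choices"][0]["message"]["content"]


request = build_chat_request([{"role": "user", "content": "Hello"}])
print(request.full_url)

# To actually send it (requires the server from the previous step to be running):
# with urllib.request.urlopen(request) as resp:
#     print(extract_reply(json.load(resp)))
```

With the OpenAI client, the equivalent of `extract_reply` is `response.choices[0].message.content`.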

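Speech-to-text (Whisper)

For the speech-to-text component we use the well-known Whisper model, an open-source, Transformer-based speech-to-text model. Whisper takes an audio file as input and returns a transcription of the spoken words. I use the Hugging Face transformers implementation; with its pipeline API, running inference with Whisper takes only a few lines: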

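import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0"
)

# audio_file_path is the path to the recorded audio file
transcription = pipe(audio_file_path)['text']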

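Here, transcription holds the text transcribed from the input audio file.

Text-to-speech (Bark)

To convert text to speech I use Bark, a Transformer-based text-to-speech model. It can generate realistic multilingual speech as well as other audio, including music, background noise, and simple sound effects. The model can also produce non-verbal sounds such as laughing, sighing, and crying.

As with Whisper, we use the Hugging Face implementation of Bark. Using it through transformers is straightforward: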

from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("suno/bark")
model = AutoModel.from_pretrained("suno/bark")

inputs = processor(
    text=["Hello, my name is Suno. And, uh - and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
)

speech_values = model.generate(**inputs, do_sample=True)
from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
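The `Audio` widget above only plays sound inside a Jupyter notebook. Outside a notebook, one option is to write the samples to a WAV file. The sketch below does this with the standard library; `write_wav` is a hypothetical helper, and the default of 24000 Hz is an assumption, so in practice pass the value read from `model.generation_config.sample_rate`.

```python
import struct
import wave


def write_wav(path, samples, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2  # number of samples written
```

With the Bark snippet above, this could be called as `write_wav("bark_out.wav", speech_values.cpu().numpy().squeeze(), sampling_rate)`.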

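Web UI (Gradio)

To create a simple user interface, we use Gradio. Gradio is a great library that lets you build a user interface for a data science project in minutes.

Here is the complete code for the Gradio application: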

import gradio as gr
from transformers import pipeline
from transformers import AutoProcessor, BarkModel
import torch
from openai import OpenAI
import numpy as np
# sent_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab')
from nltk.tokenize import sent_tokenize


WORDS_PER_CHUNK = 25

def split_sentence_into_chunks(sentence, n):
    words = sentence.split()
    if len(words) <= n:
        return [sentence]
    else:
        chunks = [' '.join(words[i:i+n]) for i in range(0, len(words), n)]
        return chunks


# Setup Whisper client
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0"
)

voice_processor = AutoProcessor.from_pretrained("suno/bark")
voice_model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to("cuda:0")

voice_model =  voice_model.to_bettertransformer()
voice_preset = "v2/en_speaker_9"


system_prompt = "You are a helpful AI"


client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx")  # placeholder key; the OpenAI client requires one
sample_rate = 48000

def transcribe_and_query_llm_voice(audio_file_path):

    transcription = pipe(audio_file_path)['text']
    
    response = client.chat.completions.create(
        model="mistral",
        messages=[
            {"role": "system", "content": system_prompt},  # Update this as per your needs
            {"role": "user", "content": transcription + "\n Answer briefly."}
        ],
    )
    llm_response = response.choices[0].message.content

    sampling_rate = voice_model.generation_config.sample_rate
    silence = np.zeros(int(0.25 * sampling_rate))

    BATCH_SIZE = 12
    model_input = sent_tokenize(llm_response)

    pieces = []
    for i in range(0, len(model_input), BATCH_SIZE):
        # i already advances in steps of BATCH_SIZE, so slice from i directly
        inputs = model_input[i:i + BATCH_SIZE]
        
        if len(inputs) != 0:
            inputs = voice_processor(inputs, voice_preset=voice_preset)
            
            speech_output, output_lengths = voice_model.generate(**inputs.to("cuda:0"), return_output_lengths=True, min_eos_p=0.2)
            
            speech_output = [output[:length].cpu().numpy() for (output,length) in zip(speech_output, output_lengths)]
            
            pieces += [*speech_output, silence.copy()]
        
        
    whole_output = np.concatenate(pieces)

    audio_output = (sampling_rate, whole_output)

    return llm_response, audio_output


def transcribe_and_query_llm_text(text_input):

    transcription = text_input
    
    response = client.chat.completions.create(
        model="mistral",
        messages=[
            {"role": "system", "content": system_prompt},  # Update this as per your needs
            {"role": "user", "content": transcription + "\n Answer briefly."}
        ],
    )

    llm_response = response.choices[0].message.content

    sampling_rate = voice_model.generation_config.sample_rate
    silence = np.zeros(int(0.25 * sampling_rate))

    BATCH_SIZE = 12
    model_input = sent_tokenize(llm_response)

    pieces = []
    for i in range(0, len(model_input), BATCH_SIZE):
        # i already advances in steps of BATCH_SIZE, so slice from i directly
        inputs = model_input[i:i + BATCH_SIZE]

        if len(inputs) != 0:
            inputs = voice_processor(inputs, voice_preset=voice_preset)

            speech_output, output_lengths = voice_model.generate(**inputs.to("cuda:0"), return_output_lengths=True, min_eos_p=0.2)

            speech_output = [output[:length].cpu().numpy() for (output,length) in zip(speech_output, output_lengths)]

            pieces += [*speech_output, silence.copy()]

    whole_output = np.concatenate(pieces)

    audio_output = (sampling_rate, whole_output)

    return llm_response, audio_output



with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            text_input = gr.Textbox(label="Type your request", placeholder="Type here or use the microphone...")
            audio_input = gr.Audio(sources=["microphone"], type="filepath", label="Or record your speech")
        with gr.Column():
            output_text = gr.Textbox(label="LLM Response")
            output_audio = gr.Audio(label="LLM Response as Speech", type="numpy")
    
    submit_btn_text = gr.Button("Submit Text")
    submit_btn_voice = gr.Button("Submit Voice")

    submit_btn_voice.click(fn=transcribe_and_query_llm_voice, inputs=[audio_input], outputs=[output_text, output_audio])
    submit_btn_text.click(fn=transcribe_and_query_llm_text, inputs=[text_input], outputs=[output_text, output_audio])

demo.launch(ssl_verify=False,
            share=False,
            debug=False)

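Some notes on the Python code above:

Bark offers several voices to choose from. We use "v2/en_speaker_9". You can find the full list of options here: https://huggingface.co/suno/bark/tree/main/speaker_embeddings/v2
We attach the transcribe_and_query_llm_voice function to submit_btn_voice so that the models run on the user's voice input.
We attach the transcribe_and_query_llm_text function to submit_btn_text so that the models run on the user's text input.
In the loop over model_input, we split the LLM response into chunks, run Bark on each chunk, and then concatenate the results into a single audio clip, because Bark does not perform well on very long text inputs.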
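The chunking idea in the last note can be sketched without any model. The example below splits a text into sentences and groups them into batches of at most 12, so that every sentence lands in exactly one batch. The regex-based `split_sentences` is a naive stand-in for nltk's `sent_tokenize`, and `batched` is a helper name chosen for this illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter; a stand-in for nltk's sent_tokenize."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    # range() already steps by batch_size, so start is the slice offset.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


text = " ".join(f"This is sentence number {n}." for n in range(1, 31))
sentences = split_sentences(text)
batches = list(batched(sentences, 12))
print([len(batch) for batch in batches])  # [12, 12, 6]
```

Each batch would then be passed to the Bark processor and model, with a short silence appended after the generated audio.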

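Your PC needs a microphone to submit speech to the model. Without one, you can still type into the text box and get both text and audio output from the model.

Finally, to run the UI, start the LLM server first and then run the following in a new terminal: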

python gradio_tts.py

The terminal will print a link to the Gradio application, most likely http://127.0.0.1:7860. Your application is then ready to use!
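Since the UI depends on the LLM server being up, it can help to check that both services accept connections. The sketch below tries a TCP connection to each port; `is_port_open` is a hypothetical helper, and the ports are the ones used in this tutorial (8000 for the LLM server, 7860 for Gradio).

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in [("LLM server", 8000), ("Gradio UI", 7860)]:
        state = "up" if is_port_open("127.0.0.1", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

A successful connection only shows that something is listening on the port, not that the model has finished loading.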

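How to access the application from your phone over home WiFi

Accessing the application from your phone takes a few extra steps. Assuming the application runs on a local PC that is connected to your home WiFi, and you want to reach it from your phone while the PC hosts it, follow the instructions below.

To get the PC's local IP address on Ubuntu, you can use the ip addr command in a terminal: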

ip addr | grep 'inet ' | grep -v ' lo' | awk '{print $2}' | cut -d'/' -f1
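The same lookup can be sketched in Python. The snippet below uses a common trick: connecting a UDP socket sends no packets, but it makes the operating system choose the local interface it would use for outbound traffic. `get_lan_ip` is a hypothetical helper, and it falls back to 127.0.0.1 when no route is available.

```python
import socket


def get_lan_ip():
    """Return this machine's outbound IPv4 address, or 127.0.0.1 as a fallback."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No data is sent; connect() on a UDP socket only selects the
        # local interface the OS would route through.
        sock.connect(("8.8.8.8", 80))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        sock.close()


print(f"http://{get_lan_ip()}:7860")
```

On a machine with several network interfaces this returns only the address on the default route, so the output of ip addr remains the more complete view.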

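For example, in my case the local IP address is 192.168.0.231. With the Gradio application running, I can reach it from Chrome on my iPhone by navigating to the following URL. Note that Gradio listens only on 127.0.0.1 by default, so the app must be launched with server_name="0.0.0.0" to be reachable from other devices.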

http://192.168.0.231:7860

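Here, 192.168.0.231 is my PC's local IP address and 7860 is the port. There is one caveat, however: when the application is accessed over HTTP, Chrome's security restrictions block access to the phone's microphone. To use the microphone, the application must be accessed over HTTPS.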
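The restriction can be approximated with a small rule: browsers expose the microphone only to pages served over HTTPS or from a loopback address such as localhost. The sketch below encodes that simplified rule; `microphone_likely_allowed` is a hypothetical helper, and real browsers apply further conditions such as user permission prompts.

```python
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def microphone_likely_allowed(url):
    """Approximate the browser rule: HTTPS or a loopback host is required."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    return parts.scheme == "http" and parts.hostname in LOOPBACK_HOSTS


for url in [
    "http://127.0.0.1:7860",
    "http://192.168.0.231:7860",
    "https://192.168.0.231:7860",
]:
    print(url, microphone_likely_allowed(url))
```

This is why the microphone works on the PC itself at http://127.0.0.1:7860 but not from the phone at the plain HTTP address.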

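Setting up HTTPS for the Gradio application so that you can reach it securely over home WiFi, in particular from an iPhone, takes three steps: generating a self-signed SSL certificate, configuring the Gradio application to use it, and then opening the application from the iPhone.

First, make sure OpenSSL is installed on your PC, since it is needed to generate the SSL certificate. OpenSSL usually comes preinstalled on Linux and macOS. Open a terminal and run the following command, which generates a new self-signed certificate (cert.pem) and private key (key.pem):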

openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365 -nodes

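When you run this command, OpenSSL asks a few questions (such as your country, state, and organization). You can fill these in, but since this is for personal use on a home network, the details do not matter.

After creating cert.pem and key.pem, you can reference them in the demo.launch call in gradio_app.py. Replace the existing demo.launch call with the following: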

demo.launch(ssl_verify=False,
            share=False,
            debug=False,
            server_name="0.0.0.0",
            ssl_certfile="cert.pem",
            ssl_keyfile="key.pem")

Once the application is running, open Chrome on your phone and go to this address:

https://192.168.0.231:7860

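After accepting the browser's warning about the self-signed certificate, you will be able to use your phone's microphone in the application.

Don't forget to follow for more AI content. More hands-on articles are coming soon!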