This article was initially published on my Medium Page.
Over the last few weeks, the whole Internet has been turned upside down. A new actor arrived and shook up the AI game: ChatGPT. If you have tried it, you have probably figured out that ChatGPT is incredible: it can give you a detailed answer to almost any question, write poems or jokes, and help developers program, among other things.
After playing with the platform for a while, I remembered that years ago, I built a talking chatbot for a client. What about making ChatGPT talk? Is it even possible? Let’s see.
Spoiler alert: I made it work! Here is a video showing the final result:
ChatGPT
The ChatGPT interface is fantastic, but even better, OpenAI has an API and an official library for Python, which is a gold mine for developers. Once logged in, we find out that OpenAI gives us two months of free trial usage with an $18 credit. Thanks, dudes, that’s cool.
The OpenAI console is straightforward; the only thing you have to do is generate a new API key to allow any external program to connect:
General Architecture
Here is the general architecture of the project:
Notes:
- Do you remember my article about the hotel platform? We will reuse the main components for authentication: we will connect a Unity app to AWS thanks to a login system with Cognito and API Gateway.
- We will create a user pool and a new user with a name and a password in Cognito.
- We will create an endpoint in API Gateway and an associated Authorizer so that only the users of the user pool can consume the API.
- The Lambda function will receive the text from Unity and call the OpenAI API.
- Once the OpenAI API has answered, we call Polly, the text-to-voice converter of AWS, which will convert the answer into a voice stream.
- We keep the audio file in an S3 bucket and generate a pre-signed URL to restrict access to the file.
AWS implementation
S3
As in my previous article, we create a private bucket:
Lambda
First, we create a Lambda layer with the OpenAI library. Do you remember my previous article about making a homemade CCTV? I explain there in detail how to create a Lambda layer from a local environment, so we will follow the same method with the OpenAI library:
Now, we create our Lambda function:
Don’t forget to add the openai layer to the function:
Inside the Lambda function, we define a new environment variable called openai_api_key with the OpenAI API key value.
Inside the function’s role, we create 2 inline policies, one for Polly, and the other for S3.
And here is the function:
Notes:
- We store the OpenAI API key in a Lambda environment variable called openai_api_key, and we read it in Lambda with the os.getenv function.
- We parse the Lambda function’s entry parameters and retrieve the message sent from Unity.
- We call the Create completion function of the OpenAI API with the message sent from Unity. We extract the answer, as specified in the OpenAI documentation, and trim it with the strip function to avoid spaces or line breaks.
- When we call the Create completion function, we concatenate the message with the sentence “Please give me a short answer.” to be sure that the response given by ChatGPT will not be too elaborate.
- We call the synthesize_speech function of Polly with boto3, passing the answer as a parameter. We chose the ogg format, the best choice to work with in Unity, as I demonstrated in this previous article.
- I chose Aria, a friendly New Zealand vocal option of Polly, but it’s up to you to choose your favorite one!
- We keep the audio stream locally as a file thanks to the open and write functions, and we upload it to an S3 bucket thanks to the upload_file function of the boto3 library. After finishing, we remove the local file thanks to the os.remove function.
- We generate a pre-signed URL with a lifetime of 1 minute thanks to the generate_presigned_url function of the boto3 library, so only the user of the Unity app will be able to access the audio file.
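The steps in these notes can be sketched roughly as follows. This is a minimal sketch under assumptions, not the original listing: it assumes the pre-1.0 openai library (where Create completion is exposed as openai.Completion.create), and the model name, max_tokens value, the "message" field of the request body, the audio_bucket environment variable, and the helper names are all hypothetical. The Polly and S3 clients are passed in as arguments so that the helpers can be tried without AWS access.

```python
import json
import os

SHORT_ANSWER_HINT = " Please give me a short answer."


def parse_message(event):
    """Read the user's message from the API Gateway event (field name is an assumption)."""
    body = event.get("body") or "{}"
    if isinstance(body, str):
        body = json.loads(body)
    return body["message"]


def ask_openai(completion_api, message):
    """Call Create completion and return the answer without surrounding whitespace."""
    response = completion_api.create(
        model="text-davinci-003",  # assumption: the article does not name the model
        prompt=message + SHORT_ANSWER_HINT,
        max_tokens=100,  # assumption
    )
    return response["choices"][0]["text"].strip()


def speech_to_presigned_url(polly, s3, text, bucket, key, tmp_dir="/tmp"):
    """Synthesize text with Polly, upload the file to S3, return a 60-second URL."""
    speech = polly.synthesize_speech(
        Text=text,
        OutputFormat="ogg_vorbis",
        VoiceId="Aria",
        Engine="neural",  # assumption: Aria is offered as a neural voice
    )
    local_path = os.path.join(tmp_dir, key)
    with open(local_path, "wb") as audio_file:
        audio_file.write(speech["AudioStream"].read())
    try:
        s3.upload_file(local_path, bucket, key)
    finally:
        os.remove(local_path)  # clean up the local copy in every case
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=60,  # 1 minute
    )


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without AWS access.
    import boto3
    import openai

    openai.api_key = os.getenv("openai_api_key")
    answer = ask_openai(openai.Completion, parse_message(event))
    url = speech_to_presigned_url(
        boto3.client("polly"),
        boto3.client("s3"),
        answer,
        bucket=os.getenv("audio_bucket"),  # hypothetical environment variable
        key=f"{context.aws_request_id}.ogg",
    )
    return {"statusCode": 200, "body": json.dumps({"answer": answer, "url": url})}
```

Returning both the text answer and the URL is a design choice of this sketch: the client can display the message in the chat while it downloads the audio.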
Cognito
As in the hotel platform article, we create a new user pool in Cognito:
In the pool, we create a new user with a name and a password:
API Gateway
In API Gateway, we create a new REST API with a POST method that points to our Lambda function, and we deploy it:
And we create an authorizer to allow access to the endpoint only for the Cognito users from the Pool we have created:
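To check the protected endpoint outside Unity, a request can be built along these lines. This is only a sketch: it assumes the authorizer reads the Cognito ID token from the Authorization header (the usual token source for a Cognito authorizer) and that the body carries a "message" field; the URL in the usage note and the function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(endpoint_url, id_token, message):
    """Build the POST request; the authorizer validates the ID token in the header."""
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=payload,
        headers={"Authorization": id_token, "Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=30):
    """Send the request and decode the JSON body returned by the endpoint."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, send_chat_request(build_chat_request("https://example.invalid/prod/chat", id_token, "Hello")) would reach the Lambda function only if id_token belongs to a user of the pool; otherwise API Gateway is expected to reject the call before Lambda runs.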
Unity3D Implementation
The Audio Component
Our app will be able to talk, so we need an AudioSource component in our project!
Note: We leave the AudioClip parameter empty; we will fill it with the audio file from S3.
The UI components
I usually say little about how I build the UI of my Unity apps because I’m neither a designer nor a front-end developer. Still, in this case, I found it interesting to explain it, mainly because of the complex layout of the chat.
Here is how I built the client app:
Notes:
- We use a Canvas with a ScrollView (without scrollbars) to show the messages.
- We use a vertical Content Size Fitter to resize the content of the ScrollView automatically, and a Vertical Layout Group to place the messages vertically.
- We use a combination of horizontal and vertical Content Size Fitter and Layout Groups to resize the box containing the message.
- We use a sliced image for the box, so all the messages will always have the same rounded corners, no matter the text size.
The code
Okay, so we have a functional chat in Unity. Let’s connect it with the backend!
First of all, we log in to Cognito when the application starts, and we store the ID token returned by Cognito in a PlayerPrefs parameter:
Please refer to my previous article for an extensive explanation of the above code.
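To test the backend without Unity, the same login step can be approximated in Python. This sketch assumes the app client allows the USER_PASSWORD_AUTH flow, which may not match the configuration used in the article; the function name is hypothetical, and the Cognito client is passed in so that the function can be tried with a stand-in object.

```python
def fetch_id_token(cognito_idp, client_id, username, password):
    """Log in against a Cognito user pool and return the ID token."""
    response = cognito_idp.initiate_auth(
        AuthFlow="USER_PASSWORD_AUTH",  # assumption: this flow is enabled on the app client
        AuthParameters={"USERNAME": username, "PASSWORD": password},
        ClientId=client_id,
    )
    # If Cognito answers with a challenge (for example a forced password change),
    # there is no AuthenticationResult and this line raises KeyError.
    return response["AuthenticationResult"]["IdToken"]
```

With boto3, the client would be created with boto3.client("cognito-idp"), and the returned token is the value to send in the Authorization header of the API call.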
Then, we write the functions to show and hide the user’s device keyboard:
Notes:
- Unity works with the native keyboard of the device where the application is running. That means the keyboard will look different depending on whether you run it on iOS or Android.
- We use the class TouchScreenKeyboard as specified in the Unity documentation and the related function Open.
Then, here is the most exciting part: we call our endpoint and pass the written message as a parameter:
Our endpoint returned the URL of the audio file, so we now use it to retrieve the file and play it:
Notes:
- We use the function GetAudioClip of UnityWebRequestMultimedia to retrieve the audio stream in ogg format.
- We assign the audio stream to the clip parameter of our AudioSource object.
And now, we can add the message to the chat:
Notes:
- We instantiate the user message object or the friend message object according to the needs.
- We use the function ForceRebuildLayoutImmediate to refresh the ScrollView content and avoid graphical bugs.
- We set the verticalNormalizedPosition parameter of the ScrollView to 0, so the scroll position is at the bottom, and we can see the last messages.
Costs
Let’s check with the AWS Calculator what our system would cost in a very pessimistic scenario: you love the app, and you perform 100 requests a day, about 3,000 requests a month.
- Cognito: We only have one MAU (monthly active user) for this project. Cost: 0.00 USD
- API Gateway: With 3,000 requests to our REST API, the monthly cost is 0.01 USD.
- S3: Suppose that 50 KB could be the average size of an audio file; we would have 150 MB stored each month. Additionally, we would have 3,000 put requests and 3,000 get requests, leading to a monthly cost of 0.02 USD.
- Lambda: With 3,000 requests with an average time of 3 seconds and a 1,024 MB of memory allocation, the monthly cost is 0.00 USD; excellent!
- Polly: Polly is undoubtedly the most expensive AWS service here. Let’s suppose ChatGPT answers have an average of 100 characters, so 300,000 characters a month; the monthly bill will be 1.20 USD.
- OpenAI: Based on the OpenAI tokenizer tool, suppose that every question we ask ChatGPT represents 15 tokens, so we use 45,000 tokens monthly. According to the OpenAI pricing, this gives us a total of 0.90 USD monthly.
Total: The total bill for our system would be 2.13 USD monthly. It’s totally affordable, taking into account that this is a very pessimistic scenario.
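The arithmetic behind this total can be checked with a few lines of Python. The per-service amounts are the ones listed above; the unit rates are not quoted from the pricing pages, they are simply what those amounts imply for the assumed volumes.

```python
from decimal import Decimal

REQUESTS_PER_MONTH = 100 * 30  # 100 requests a day

# Monthly amounts as listed in the breakdown above (USD).
monthly_costs_usd = {
    "Cognito": Decimal("0.00"),
    "API Gateway": Decimal("0.01"),
    "S3": Decimal("0.02"),
    "Lambda": Decimal("0.00"),
    "Polly": Decimal("1.20"),
    "OpenAI": Decimal("0.90"),
}
total = sum(monthly_costs_usd.values(), Decimal("0"))

# Volumes assumed in the article: 100 characters per answer, 15 tokens per question.
polly_characters = REQUESTS_PER_MONTH * 100
openai_tokens = REQUESTS_PER_MONTH * 15

# Unit rates implied by the amounts above (derived here, not official prices).
polly_usd_per_million_chars = monthly_costs_usd["Polly"] / polly_characters * 1_000_000
openai_usd_per_thousand_tokens = monthly_costs_usd["OpenAI"] / openai_tokens * 1_000

print(f"Total: {total} USD")
print(f"Polly: {polly_characters} characters, OpenAI: {openai_tokens} tokens")
```

Polly and OpenAI together account for 2.10 USD of the total, so the length of the answers and the number of tokens are the two figures worth watching if usage grows.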
Closing Thoughts
In this article, we saw how to build an entire cloud architecture on AWS and how easy it is to integrate the OpenAI API with a Lambda function. We also had the opportunity to discover Polly, the text-to-speech service of AWS. Furthermore, we evaluated the cost of the entire system thanks to the AWS Calculator.
All the code in this article has been tested using Unity 2021.3.3 and Visual Studio Community 2022 for Mac. The mobile device I used to run the Unity app is a Galaxy Tab A7 Lite with Android 11.
All ids and tokens shown in this article are fake or expired; if you try to use them, you will not be able to establish any connections.
You can download the Unity package of the client app specially designed for this article.
A special thanks to Gianca Chavest for designing the amazing illustration.