{"id":2814183,"date":"2023-08-07T12:19:32","date_gmt":"2023-08-07T16:19:32","guid":{"rendered":"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/plato-data\/aws-performs-fine-tuning-on-a-large-language-model-llm-to-classify-toxic-speech-for-a-large-gaming-company-amazon-web-services\/"},"modified":"2023-08-07T12:19:32","modified_gmt":"2023-08-07T16:19:32","slug":"aws-performs-fine-tuning-on-a-large-language-model-llm-to-classify-toxic-speech-for-a-large-gaming-company-amazon-web-services","status":"publish","type":"station","link":"https:\/\/platodata.io\/plato-data\/aws-performs-fine-tuning-on-a-large-language-model-llm-to-classify-toxic-speech-for-a-large-gaming-company-amazon-web-services\/","title":{"rendered":"AWS performs fine-tuning on a Large Language Model (LLM) to classify toxic speech for a large gaming company | Amazon Web Services"},"content":{"rendered":"
The video gaming industry has an estimated user base of over 3 billion worldwide [1]. It consists of massive numbers of players virtually interacting with each other every single day. Unfortunately, as in the real world, not all players communicate appropriately and respectfully. In an effort to create and maintain a socially responsible gaming environment, AWS Professional Services was asked to build a mechanism that detects inappropriate language (toxic speech) within online gaming player interactions. The overall business outcome was to improve the organization’s operations by automating an existing manual process and to improve the user experience by increasing the speed and quality of detecting inappropriate interactions between players, ultimately promoting a cleaner and healthier gaming environment.

The customer’s ask was to create an English-language detector that classifies voice and text excerpts into its own custom-defined toxic language categories. The customer wanted to first determine whether a given language excerpt is toxic, and then classify the excerpt into a specific customer-defined category of toxicity, such as profanity or abusive language.

AWS ProServe solved this use case through a joint effort between the Generative AI Innovation Center (GAIIC) and the ProServe ML Delivery Team (MLDT). The AWS GAIIC is a group within AWS ProServe that pairs customers with experts to develop generative AI solutions for a wide range of business use cases using proof of concept (PoC) builds. AWS ProServe MLDT then takes the PoC through to production by scaling, hardening, and integrating the solution for the customer.

This customer use case is showcased in two separate posts. This post (Part 1) serves as a deep dive into the scientific methodology: it explains the thought process and experimentation behind the solution, including the model training and development process. Part 2 delves into the productionized solution, explaining the design decisions, the data flow, and the model training and deployment architecture.

This post covers the data challenge the team faced, how LLMs and transfer learning were used to address it, and the proof of concept that AWS GAIIC built and fine-tuned.

Data challenge

The main challenge AWS ProServe faced in training a toxic language classifier was obtaining enough labeled data from the customer to train an accurate model from scratch. AWS received about 100 samples of labeled data from the customer, far fewer than the 1,000 samples typically recommended in the data science community for fine-tuning an LLM.

As an added inherent challenge, natural language processing (NLP) classifiers are historically known to be very costly to train and to require a large body of text, known as a corpus, to produce accurate predictions. A rigorous and effective NLP solution, given sufficient labeled data, would be to train a custom language model using the customer’s labeled data. The model would be trained solely on the players’ in-game vocabulary, making it tailored to the language observed in the games. The customer had both cost and time constraints that made this solution unviable. AWS ProServe therefore needed a way to train an accurate language toxicity classifier with a relatively small labeled dataset. The solution lay in what’s known as transfer learning.

The idea behind transfer learning is to use the knowledge of a pre-trained model and apply it to a different but relatively similar problem.
For example, if an image classifier was trained to predict whether an image contains a cat, you could use the knowledge the model gained during its training to recognize other animals, like tigers. For this language use case, AWS ProServe needed to find a previously trained language classifier that was trained to detect toxic language and fine-tune it using the customer’s labeled data.

The solution was to find and fine-tune an LLM to classify toxic language. LLMs are neural networks that have been trained on unlabeled data and have a massive number of parameters, typically on the order of billions. Before going into the AWS solution, the following section provides an overview of the history of LLMs and their historical use cases.

Tapping into the power of LLMs

LLMs have recently become the focal point for businesses looking for new applications of ML, ever since ChatGPT captured the public mindshare by becoming the fastest-growing consumer application in history [2], reaching 100 million active users by January 2023, just 2 months after its release. However, LLMs are not a new technology in the ML space. They have been used extensively to perform NLP tasks such as analyzing sentiment, summarizing corpora, extracting keywords, translating speech, and classifying text.

Due to the sequential nature of text, recurrent neural networks (RNNs) had been the state of the art for NLP modeling. Specifically, the encoder-decoder network architecture was formulated because it created an RNN structure capable of taking an input of arbitrary length and generating an output of arbitrary length. This was ideal for NLP tasks like translation, where an output phrase in one language is predicted from an input phrase in another language, typically with differing numbers of words between input and output. The Transformer architecture [3] (Vaswani 2017) was a breakthrough improvement on the encoder-decoder; it introduced the concept of self-attention, which allows the model to focus its attention on different words across the input and output phrases. In a typical encoder-decoder, each word is interpreted by the model in an identical fashion: as the model sequentially processes each word in an input phrase, the semantic information from the beginning may be lost by the end of the phrase. The self-attention mechanism changed this by adding an attention layer to both the encoder and decoder blocks, so that the model can put different weightings on certain words from the input phrase when generating a certain word in the output phrase. Thus the basis of the transformer model was born.
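To make the self-attention mechanism described above concrete, the following is a minimal sketch of scaled dot-product attention, the core operation of the Transformer. The random matrices stand in for learned query, key, and value projections and are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention in the style of Vaswani et al. (2017).

    Q, K, V: arrays of shape (sequence_length, d_k) / (sequence_length, d_v).
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional projections (illustrative values only)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)  # (4, 8) (4, 4): each token attends to all 4 tokens
```

Each row of the returned weight matrix shows how strongly one token attends to every other token when its representation is computed, which is precisely what lets the model emphasize different input words while producing each output word.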
The transformer architecture was the foundation for two of the most well-known and popular LLMs in use today, the Bidirectional Encoder Representations from Transformers (BERT) [4] (Devlin 2018) and the Generative Pretrained Transformer (GPT) [5] (Radford 2018). Later versions of the GPT model, namely GPT-3 and GPT-4, are the engines that power the ChatGPT application. The final piece of the recipe that makes LLMs so powerful is the ability to distill information from vast text corpora without extensive labeling or preprocessing, through self-supervised pre-training (an approach popularized for NLP by methods such as ULMFiT). In this pre-training phase, general text can be gathered and the model is trained on the task of predicting the next word based on the previous words; the benefit is that any input text used for training comes inherently pre-labeled based on the order of the text. LLMs are truly capable of learning from internet-scale data. For example, the original BERT model was pre-trained on the BookCorpus and the entire English Wikipedia text datasets.

This new modeling paradigm has given rise to two new concepts: foundation models (FMs) and generative AI. As opposed to training a model from scratch with task-specific data, which is the usual case for classical supervised learning, LLMs are pre-trained to extract general knowledge from a broad text dataset before being adapted to specific tasks or domains with a much smaller dataset (typically on the order of hundreds of samples). The new ML workflow now starts with a pre-trained model dubbed a foundation model. It’s important to build on the right foundation, and there are an increasing number of options, such as the new Amazon Titan FMs, to be released by AWS as part of Amazon Bedrock. These new models are also considered generative because their outputs are human-interpretable and of the same data type as the input data. While past ML models were discriminative, such as classifying images of cats vs. dogs, LLMs are generative because their output is the next set of words based on the input words. That allows them to power interactive applications such as ChatGPT that can be expressive in the content they generate.

Hugging Face has partnered with AWS to democratize FMs and make them easy to access and build with. Hugging Face has created a Transformers API that unifies more than 50 different transformer architectures across different ML frameworks, including access to pre-trained model weights in its Model Hub, which has grown to over 200,000 models as of writing this post. In the next sections, we explore the proof of concept, the solution, and the FMs that were tested and chosen as the basis for solving this toxic speech classification use case for the customer.
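As a rough sketch of what working with the Transformers API looks like, the snippet below loads a pre-trained checkpoint from the Model Hub and runs a sample phrase through it. The vinai/bertweet-base checkpoint (the publicly released BERTweet base model) is used here purely as an example, and the number of labels is a placeholder for the customer-defined toxicity categories.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# vinai/bertweet-base is the publicly released BERTweet base model; any other
# Hub checkpoint ID could be substituted here.
model_name = "vinai/bertweet-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels is a placeholder for the customer-defined toxicity categories;
# a new, untrained classification head is added on top of the pre-trained encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("sample in-game chat excerpt", return_tensors="pt", truncation=True)
logits = model(**inputs).logits
print(logits.shape)  # (1, 2): scores are meaningless until the model is fine-tuned
```

Because the classification head on top of the pre-trained encoder is newly initialized, its outputs carry no signal until the model is fine-tuned on labeled data, which is exactly what the proof of concept below set out to do.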
AWS GAIIC proof of concept

AWS GAIIC chose to experiment with LLM foundation models built on the BERT architecture to fine-tune a toxic language classifier. A total of three models from Hugging Face’s Model Hub were tested: bertweet-base, bertweet-base-offensive, and bertweet-base-hate.

All three model architectures are based on the BERTweet architecture. BERTweet is trained using the RoBERTa pre-training procedure. The RoBERTa pre-training procedure is the outcome of a replication study of BERT pre-training that evaluated the effects of hyperparameter tuning and training set size on the recipe for training BERT models [6] (Liu 2019). The experiment sought to find a pre-training method that improved the performance of BERT without changing the underlying architecture. The study concluded that the following pre-training modifications substantially improved the performance of BERT: training the model longer, with bigger batches, over more data; removing the next-sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.

The bertweet-base model uses the preceding pre-training procedure from the RoBERTa study to pre-train the original BERT architecture using 850 million English tweets. It is the first public large-scale language model pre-trained for English tweets.

Pre-trained FMs using tweets were thought to fit the use case because tweets, like in-game chat messages, are short, informal, user-generated text, full of the slang, abbreviations, and misspellings the classifier would need to handle.

AWS decided to first fine-tune BERTweet with the customer’s labeled data to establish a baseline. It then chose to fine-tune two other FMs, bertweet-base-offensive and bertweet-base-hate, which were further pre-trained on more directly relevant toxic tweets, to achieve potentially higher accuracy. The bertweet-base-offensive model uses the base BERTweet FM and is further pre-trained on 14,100 annotated tweets that were deemed offensive [7] (Zampieri 2019). The bertweet-base-hate model also uses the base BERTweet FM but is further pre-trained on 19,600 tweets that were deemed hate speech [8] (Basile 2019).
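A minimal sketch of that baseline fine-tuning step might look like the following, assuming the customer’s roughly 100 labeled excerpts have been gathered into a CSV file with text and label columns (the file name, column names, and hyperparameters are hypothetical). The same recipe would then be repeated with the bertweet-base-offensive and bertweet-base-hate checkpoints to compare the three FMs.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "vinai/bertweet-base"  # swap in the offensive/hate checkpoints to compare FMs
num_labels = 2                      # placeholder: toxic vs. non-toxic for the baseline

# Hypothetical CSV holding the customer's ~100 labeled excerpts ("text", "label" columns)
dataset = load_dataset("csv", data_files="labeled_excerpts.csv", split="train")
dataset = dataset.train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

args = TrainingArguments(
    output_dir="bertweet-toxicity-baseline",
    num_train_epochs=5,             # small dataset, so a handful of epochs at a low learning rate
    learning_rate=2e-5,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
print(trainer.evaluate())           # reports held-out loss; add a compute_metrics fn for accuracy
```

Because only about 100 examples are available, most of the model’s knowledge comes from pre-training; fine-tuning mainly trains the new classification head and lightly adjusts the encoder weights.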
To further enhance the performance of the PoC model, AWS GAIIC made two design decisions: