{"id":1013329,"date":"2021-08-13T12:00:32","date_gmt":"2021-08-13T16:00:32","guid":{"rendered":"https:\/\/www.kdnuggets.com\/?p=131194"},"modified":"2021-08-13T12:00:32","modified_gmt":"2021-08-13T16:00:32","slug":"how-to-train-a-bert-model-from-scratch","status":"publish","type":"station","link":"https:\/\/platodata.io\/plato-data\/how-to-train-a-bert-model-from-scratch\/","title":{"rendered":"How to Train a BERT Model From Scratch"},"content":{"rendered":"\n
\n

How to Train a BERT Model From Scratch
Tags: BERT, Hugging Face, NLP, Python, Training


Meet BERT's Italian cousin, FiliBERTo.


By James Briggs, Data Scientist


BERT, but in Italy (image by author)
Many of my articles have focused on BERT, the model that arrived and dominated the world of natural language processing (NLP), marking a new age for language models.

For those of you who may not have used transformer models (BERT being one example) before, the process looks a little like this (sketched in code just after the list):

• `pip install transformers`
• Initialize a pre-trained transformer model with `from_pretrained`.
• Test it on some data.
• Maybe fine-tune the model (train it some more).
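In code, that typical workflow looks something like the sketch below. The checkpoint name and the example sentence are illustrative assumptions; any pre-trained BERT-style model from the Hugging Face hub would work the same way.

```python
# Step 1 is run in the shell: pip install transformers

from transformers import pipeline

# Load a pre-trained BERT checkpoint behind a fill-mask pipeline.
# "bert-base-uncased" is simply a common example checkpoint.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Test it on some data: BERT predicts the word hidden by [MASK].
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```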

Now, this is a great approach, but if we only ever do this, we lack the understanding behind creating our own transformer models.

And if we cannot create our own transformer models, we must rely on there being a pre-trained model that fits our problem, which is not always the case:


A few comments asking about non-English BERT models

So in this article, we will explore the steps we must take to build our own transformer model, specifically a further-developed version of BERT called RoBERTa.

An Overview

     
There are a few steps to the process, so before we dive in, let's first summarize what we need to do. In total, there are four key parts:

• Getting the data
• Building a tokenizer
• Creating an input pipeline
• Training the model

Once we have worked through each of these sections, we will take the tokenizer and model we have built and save them both, so that we can then use them in the same way we usually would with `from_pretrained`.
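As a rough preview of that final step, loading our saved artifacts back in looks just like loading any other checkpoint. The directory name and the RoBERTa classes below are assumptions for illustration rather than the article's exact code:

```python
from transformers import RobertaTokenizerFast, RobertaForMaskedLM

# Assume the tokenizer and model were saved earlier with save_pretrained()
# into this directory; the path itself is only an example.
tokenizer = RobertaTokenizerFast.from_pretrained("./filiberto")
model = RobertaForMaskedLM.from_pretrained("./filiberto")

# From here on, both behave exactly like any pre-trained checkpoint.
inputs = tokenizer("ciao, come va?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```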

Getting The Data

       
As with any machine learning project, we need data. In terms of data for training a transformer model, we really are spoilt for choice; we can use almost any text data.
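As one illustration (not necessarily the exact corpus used later in the article), the Hugging Face `datasets` library can stream large web-text collections such as OSCAR, which has per-language subsets that suit a project like an Italian BERT:

```python
from datasets import load_dataset

# Stream the Italian subset of the OSCAR web corpus so nothing huge is
# downloaded up front; the dataset and config names here are examples,
# and almost any large body of raw text would work just as well.
dataset = load_dataset("oscar", "unshuffled_deduplicated_it",
                       split="train", streaming=True)

# Peek at the first document.
sample = next(iter(dataset))
print(sample["text"][:200])
```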

