Beyond Accuracy: Evaluating & Improving a Model with the NLP Test Library

Published 2023-04-12
Sponsored Post
Most importantly, it got us thinking:

Shortly after, the answer to this last question became a resounding **Yes**. The aptly named Beyond Accuracy paper by Ribeiro et al. won Best Overall Paper at the ACL 2020 conference by showing major robustness issues with the public text analysis APIs of Amazon Web Services, Microsoft Azure, and Google Cloud, as well as with the popular BERT and RoBERTa open-source language models. For example, the sentiment analysis models of all three cloud providers failed over 90% of the time on certain types of negation ("I thought the plane would be awful, but it wasn't" should have neutral or positive sentiment), and over 36% of the time on certain temporality tests ("I used to hate this airline, but now I like it" should have neutral or positive sentiment).

This was followed by a flurry of corporate messaging on Responsible AI that created committees, policies, templates, and frameworks, but few tools to actually help data scientists build better models. That work was instead taken on by a handful of startups and many academic researchers. The most comprehensive publication to date is Holistic Evaluation of Language Models by the Center for Research on Foundation Models at Stanford. Most of the work so far has focused on identifying the many types of issues that different natural language processing (NLP) models can have, and on measuring how pervasive they are.

If you have any experience with software engineering, you'd consider the fact that software performs poorly on features it was never tested on to be the least surprising news of the decade. And you would be correct.

The nlptest library aims to share these tools with the open-source community. We believe that such a library should be:

The goal of this article is to show you what's available now and how you can put it to good use.
We'll run tests on one of the world's most popular Named Entity Recognition (NER) models to showcase the tool's capabilities.

Let's say you've just trained a model on the CoNLL 2003 dataset. You can check out this notebook for details on how we did that. The next step is to create a test Harness:

This will create a test Harness with default test configurations. Next, generate your test cases and take a look at them:

At this point, you can easily export these test cases to re-use them later on:

It looks like on this short series of tests, our model is severely lacking in **robustness**. **Bias** is looking shaky as well, so we should investigate the failing cases further. Other than that, **accuracy**, **representation** and **fairness** seem to be doing well. Let's take a look at the failing test cases for robustness, since they seem quite bad:

Let's also take a look at the failing cases for bias:

Even the simplest tests for robustness, which involve uppercasing or lowercasing the input text, were able to impair the model's ability to make consistent predictions. We also notice that replacing random country names with low-income country names, or random names with Asian names (based on US census data), manages to bring the model to its knees.

This means that if your company had deployed this model for business-critical applications, you may well have encountered an unpleasant surprise.
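To make the uppercasing failures above concrete, here is a minimal sketch of the idea behind such a robustness test. This is not the nlptest implementation, and `toy_ner` is a hypothetical stand-in for a real NER model: we perturb the input, re-run the model, and check that the predicted labels are unchanged.

```python
def uppercase_robustness_test(predict, texts):
    """predict: a callable mapping a text to a list of (token, label) pairs.

    Returns the texts whose predicted labels change when uppercased.
    """
    failures = []
    for text in texts:
        original = predict(text)
        perturbed = predict(text.upper())
        # Compare labels only, since the tokens themselves were uppercased
        if [lab for _, lab in original] != [lab for _, lab in perturbed]:
            failures.append(text)
    return failures


# Toy model that tags capitalized words as entities. It fails this test,
# because uppercasing makes *every* word look like an entity.
def toy_ner(text):
    return [(tok, "ENT" if tok[0].isupper() else "O") for tok in text.split()]


print(uppercase_robustness_test(toy_ner, ["Wei Wu is diabetic"]))
# -> ['Wei Wu is diabetic']
```

A real harness runs many such perturbations (uppercasing, typos, name swaps, and so on) and reports a pass rate per test type.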
The NLP Test library attempts to bring awareness to, and minimize, such surprises.

The NLP Test library provides an augmentation method which can be called on the original training set:

This gives any user a starting point to fine-tune their model on an augmented version of their training dataset and make sure it is ready to perform when deployed into the real world. It applies automated augmentations based on the pass rate of each test.

A couple of minutes later, after a quick training process, let's check what the report looks like once we re-run our tests.

We notice massive increases in the previously failing robustness pass rates (+47% and +23%) and a moderate increase in the previously failing bias pass rate (+5%). The other tests stay exactly the same, which is expected, since augmentation does not address the fairness, representation and accuracy test categories. Here's a visualization of the post-augmentation improvement in pass rates for the relevant test types:

And just like that, the model has been made more resilient. This process is meant to be iterative, and it gives users confidence that each subsequent model is safer to deploy than the previous version.

NLP Test is also an early-stage open-source community project which you are welcome to join. John Snow Labs has a full development team allocated to the project and is committed to improving the library for years to come, as we do with our other open-source libraries. Expect frequent releases, with new test types, tasks, languages, and platforms added regularly. However, you'll get what you need faster if you contribute, share examples & documentation, or give us feedback on what you need most. Visit nlptest on GitHub to join the conversation.

We look forward to working together to make safe, reliable, and responsible NLP an everyday reality.
NLP Test: Deliver Safe & Effective Models
## The need to test Natural Language Processing models
A few short years ago, one of our customers notified us about a bug. Our medical data de-identification model had near-perfect accuracy in identifying most patient names, as in "Mike Jones is diabetic", but was only around 90% accurate when encountering Asian names, as in "Wei Wu is diabetic". This was a big deal, since it meant that the model made 4 to 5 times *more* mistakes for one ethnic group. It was also easy to fix, by augmenting the training dataset with more examples from this (and other) groups.
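The "4 to 5 times" figure follows from comparing error rates rather than accuracies. A quick back-of-the-envelope calculation, assuming "near-perfect" means roughly 98% accuracy (that 98% is an illustrative assumption, not the measured number):

```python
# Illustrative numbers: "near-perfect" accuracy assumed to be ~98%,
# versus the ~90% accuracy observed on Asian names.
overall_accuracy = 0.98
asian_names_accuracy = 0.90

overall_error = 1 - overall_accuracy      # ~2% of names missed
asian_error = 1 - asian_names_accuracy    # ~10% of names missed

# An 8-point accuracy gap hides a 5x difference in error rate
print(round(asian_error / overall_error, 1))  # -> 5.0
```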
## Introducing the open-source nlptest library
John Snow Labs primarily serves the healthcare and life science industries, where AI safety, equity and reliability are not nice-to-haves. In some cases it's illegal to go to market and "fix it later". This means that we've learned a lot about testing and delivering Responsible NLP models: not only in terms of policies and goals, but by building day-to-day tools for data scientists.
The various tests available in the NLP Test library

## Evaluating a spaCy NER model with NLP Test
Let's shine a light on the NLP Test library's core features. We'll start by training a spaCy NER model on the CoNLL 2003 dataset. We'll then run tests on 5 different fronts: robustness, bias, fairness, representation and accuracy. Finally, we'll run the automated augmentation process, retrain a model on the augmented data, and hopefully see performance improve. All code and results displayed in this blog post are available to reproduce right here.

### Generating test cases
To start off, install the `nlptest` library by simply calling:
```shell
pip install nlptest
```
```python
from nlptest import Harness

h = Harness(model=spacy_model, data="sample.conll")
```
The `sample.conll` dataset represents a trimmed version of the CoNLL 2003 test set. The configuration can be customized by creating a `config.yml` file and passing it to the Harness `config` parameter, or simply by using the `.config()` method. More details on that right here.
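As an illustration, such a `config.yml` might look something like the sketch below, selecting a couple of test types and a minimum pass rate for each. Treat the exact keys and test names as assumptions and check the nlptest documentation for the supported options:

```yaml
# Hypothetical sketch of a test configuration; verify keys against the docs
tests:
  defaults:
    min_pass_rate: 0.65
  robustness:
    uppercase:
      min_pass_rate: 0.75
    lowercase:
      min_pass_rate: 0.75
```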
```python
# Generate test cases
h.generate()

# View test cases
h.testcases()
```
```python
h.save("saved_testsuite")
```
### Running test cases
Let's now run the test cases and print a report:

```python
# Run and get a report on the test cases
h.run().report()
```
```python
# Get detailed generated results
generated_df = h.generated_results()

# Get a sample of failing robustness tests
generated_df[(generated_df['category'] == 'robustness') & (generated_df['pass'] == False)].sample(5)
```
```python
# Get a sample of failing bias tests (e.g. Asian last name replacements)
generated_df[(generated_df['category'] == 'bias') & (generated_df['pass'] == False)].sample(5)
```
### Fixing your model automatically
The immediate reaction we get at this point is usually: "Okay, so now what?". Although conventional software test suites offer no automated fixing features, we decided to implement such capabilities in an attempt to answer that question.

```python
h.augment(input="conll03.conll", output="augmented_conll03.conll")
```
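Since the augmentations are driven by the pass rate of each test, a failing test contributes more augmented examples than a passing one. The toy sketch below conveys that general idea only; `augmentation_budget` is a hypothetical helper, not nlptest's actual algorithm:

```python
def augmentation_budget(pass_rates, max_copies=3):
    """Map each test's pass rate to a number of augmented data copies.

    The lower the pass rate, the more augmented examples that test gets.
    """
    return {
        test: round((1 - rate) * max_copies)
        for test, rate in pass_rates.items()
    }


print(augmentation_budget({"uppercase": 0.30, "lowercase": 0.55, "asian_names": 0.85}))
# -> {'uppercase': 2, 'lowercase': 1, 'asian_names': 0}
```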
```python
# Create a new Harness and load the previous test cases
new_h = Harness.load("saved_testsuite", model=augmented_spacy_model)

# Run and get a report
new_h.run().report()
```
## Get Started Now
The nlptest library is live and freely available to you right now. Start with `pip install nlptest` or visit nlptest.org to read the docs and tutorials.