Beyond Accuracy: Evaluating & Improving a Model with the NLP Test Library

Published 2023-04-12
Sponsored Post
Most importantly, it got us thinking:

Shortly after, the answer to this last question became a resounding **Yes**. The aptly named Beyond Accuracy paper by Ribeiro et al. won Best Overall Paper at the ACL 2020 conference by showing major robustness issues with the public text analysis APIs of Amazon Web Services, Microsoft Azure, and Google Cloud, as well as with the popular BERT and RoBERTa open-source language models. For example, the sentiment analysis models of all three cloud providers failed over 90% of the time on certain types of negation ("I thought the plane would be awful, but it wasn't" should have neutral or positive sentiment), and over 36% of the time on certain temporality tests ("I used to hate this airline, but now I like it" should have neutral or positive sentiment).

This was followed by a flurry of corporate messaging on Responsible AI that created committees, policies, templates, and frameworks, but few tools to actually help data scientists build better models. That work was instead taken on by a handful of startups and many academic researchers. The most comprehensive publication to date is Holistic Evaluation of Language Models by the Center for Research on Foundation Models at Stanford. Most of the work so far has focused on identifying the many types of issues that different natural language processing (NLP) models can have, and on measuring how pervasive they are.

If you have any experience with software engineering, you'd consider the fact that software performs poorly on features it was never tested on to be the least surprising news of the decade. And you would be correct.

The nlptest library aims to share these tools with the open-source community. We believe that such a library should be:

The goal of this article is to show you what's available now and how you can put it to good use.
We'll run tests on one of the world's most popular Named Entity Recognition (NER) models to showcase the tool's capabilities.

Let's say you've just trained a model on the CoNLL 2003 dataset. You can check out this notebook for details on how we did that. The next step is to create a test Harness:

This will create a test Harness with default test configurations. Next, generate your test cases and take a look at them:

At this point, you can easily export these test cases to re-use them later on:

It looks like on this short series of tests, our model is severely lacking in **robustness**. **Bias** is looking shaky as well, so we should investigate the failing cases further. Other than that, **accuracy**, **representation** and **fairness** seem to be doing well. Let's take a look at the failing test cases for robustness, since they seem quite bad:

Let's also take a look at the failing cases for bias:

Even the simplest tests for robustness, which involve uppercasing or lowercasing the input text, were able to impair the model's ability to make consistent predictions. We also notice that replacing random country names with low-income country names, or random names with Asian names (based on US census data), manages to bring the model to its knees.

This means that if your company had deployed this model for business-critical applications, you may well have encountered an unpleasant surprise.
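To make the uppercasing failures above concrete, here is a minimal sketch of the idea behind such a robustness test. This is not the nlptest implementation, and `toy_ner` is a hypothetical stand-in for a real NER model: we perturb the input, re-run the model, and check that the predicted labels are unchanged.

```python
def uppercase_robustness_test(predict, texts):
    """predict: a callable mapping a text to a list of (token, label) pairs.

    Returns the texts whose predicted labels change when uppercased.
    """
    failures = []
    for text in texts:
        original = predict(text)
        perturbed = predict(text.upper())
        # Compare labels only, since the tokens themselves were uppercased
        if [lab for _, lab in original] != [lab for _, lab in perturbed]:
            failures.append(text)
    return failures


# Toy model that tags capitalized words as entities. It fails this test,
# because uppercasing makes *every* word look like an entity.
def toy_ner(text):
    return [(tok, "ENT" if tok[0].isupper() else "O") for tok in text.split()]


print(uppercase_robustness_test(toy_ner, ["Wei Wu is diabetic"]))
# -> ['Wei Wu is diabetic']
```

A real harness runs many such perturbations (uppercasing, typos, name swaps, and so on) and reports a pass rate per test type.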
The NLP Test library attempts to bring awareness to, and minimize, such surprises.

The NLP Test library provides an augmentation method which can be called on the original training set:

This gives any user a starting point to fine-tune their model on an augmented version of their training dataset and make sure it is ready to perform when deployed into the real world. It applies automated augmentations based on the pass rate of each test.

A couple of minutes later, after a quick training process, let's check what the report looks like once we re-run our tests.

We notice massive increases in the previously failing robustness pass rates (+47% and +23%) and a moderate increase in the previously failing bias pass rate (+5%). The other tests stay exactly the same, which is expected, since augmentation does not address the fairness, representation and accuracy test categories. Here's a visualization of the post-augmentation improvement in pass rates for the relevant test types:

And just like that, the model has been made more resilient. This process is meant to be iterative, and it gives users confidence that each subsequent model is safer to deploy than the previous version.

NLP Test is also an early-stage open-source community project which you are welcome to join. John Snow Labs has a full development team allocated to the project and is committed to improving the library for years to come, as we do with our other open-source libraries. Expect frequent releases, with new test types, tasks, languages, and platforms added regularly. However, you'll get what you need faster if you contribute, share examples & documentation, or give us feedback on what you need most. Visit nlptest on GitHub to join the conversation.

We look forward to working together to make safe, reliable, and responsible NLP an everyday reality.
NLP Test: Deliver Safe & Effective Models
## The need to test Natural Language Processing models
A few short years ago, one of our customers notified us about a bug. Our medical data de-identification model had near-perfect accuracy in identifying most patient names, as in "Mike Jones is diabetic", but was only around 90% accurate when encountering Asian names, as in "Wei Wu is diabetic". This was a big deal, since it meant that the model made 4 to 5 times *more* mistakes for one ethnic group. It was also easy to fix, by augmenting the training dataset with more examples from this (and other) groups.
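The "4 to 5 times" figure follows from comparing error rates rather than accuracies. A quick back-of-the-envelope calculation, assuming "near-perfect" means roughly 98% accuracy (that 98% is an illustrative assumption, not the measured number):

```python
# Illustrative numbers: "near-perfect" accuracy assumed to be ~98%,
# versus the ~90% accuracy observed on Asian names.
overall_accuracy = 0.98
asian_names_accuracy = 0.90

overall_error = 1 - overall_accuracy      # ~2% of names missed
asian_error = 1 - asian_names_accuracy    # ~10% of names missed

# An 8-point accuracy gap hides a 5x difference in error rate
print(round(asian_error / overall_error, 1))  # -> 5.0
```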
## Introducing the open-source nlptest library
John Snow Labs primarily serves the healthcare and life science industries, where AI safety, equity and reliability are not nice-to-haves. In some cases it's illegal to go to market and "fix it later". This means that we've learned a lot about testing and delivering Responsible NLP models: not only in terms of policies and goals, but by building day-to-day tools for data scientists.
The various tests available in the NLP Test library

## Evaluating a spaCy NER model with NLP Test
Let's shine a light on the NLP Test library's core features. We'll start by training a spaCy NER model on the CoNLL 2003 dataset. We'll then run tests on 5 different fronts: robustness, bias, fairness, representation and accuracy. Finally, we'll run the automated augmentation process, retrain a model on the augmented data, and hopefully see performance improve. All code and results displayed in this blog post are available to reproduce right here.

### Generating test cases
To start off, install the `nlptest` library by simply calling:
```shell
pip install nlptest
```
```python
from nlptest import Harness

h = Harness(model=spacy_model, data="sample.conll")
```
The `sample.conll` dataset represents a trimmed version of the CoNLL 2003 test set. The configuration can be customized by creating a `config.yml` file and passing it to the Harness `config` parameter, or simply by using the `.config()` method. More details on that right here.
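As an illustration, such a `config.yml` might look something like the sketch below, selecting a couple of test types and a minimum pass rate for each. Treat the exact keys and test names as assumptions and check the nlptest documentation for the supported options:

```yaml
# Hypothetical sketch of a test configuration; verify keys against the docs
tests:
  defaults:
    min_pass_rate: 0.65
  robustness:
    uppercase:
      min_pass_rate: 0.75
    lowercase:
      min_pass_rate: 0.75
```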
```python
# Generate test cases
h.generate()

# View test cases
h.testcases()
```
```python
h.save("saved_testsuite")
```
### Running test cases
Let's now run the test cases and print a report:

```python
# Run and get a report on the test cases
h.run().report()
```
```python
# Get detailed generated results
generated_df = h.generated_results()

# Get a sample of failing robustness tests
generated_df[(generated_df['category'] == 'robustness') & (generated_df['pass'] == False)].sample(5)
```
```python
# Get a sample of failing bias tests (e.g. Asian last name replacements)
generated_df[(generated_df['category'] == 'bias') & (generated_df['pass'] == False)].sample(5)
```
### Fixing your model automatically
The immediate reaction we get at this point is usually: "Okay, so now what?". Although conventional software test suites offer no automated fixing features, we decided to implement such capabilities in an attempt to answer that question.

```python
h.augment(input="conll03.conll", output="augmented_conll03.conll")
```
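Since the augmentations are driven by the pass rate of each test, a failing test contributes more augmented examples than a passing one. The toy sketch below conveys that general idea only; `augmentation_budget` is a hypothetical helper, not nlptest's actual algorithm:

```python
def augmentation_budget(pass_rates, max_copies=3):
    """Map each test's pass rate to a number of augmented data copies.

    The lower the pass rate, the more augmented examples that test gets.
    """
    return {
        test: round((1 - rate) * max_copies)
        for test, rate in pass_rates.items()
    }


print(augmentation_budget({"uppercase": 0.30, "lowercase": 0.55, "asian_names": 0.85}))
# -> {'uppercase': 2, 'lowercase': 1, 'asian_names': 0}
```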
```python
# Create a new Harness and load the previous test cases
new_h = Harness.load("saved_testsuite", model=augmented_spacy_model)

# Run and get a report
new_h.run().report()
```
## Get Started Now
The nlptest library is live and freely available to you right now. Start with `pip install nlptest` or visit nlptest.org to read the docs and tutorials.