Should You Rely on That AI?

A workshop to explore the multi-domain challenge of AI operational testing

An online event, January 28, 2021
9:45am-5:30pm (all times EST)


Note: Recordings of this event are posted here; slides are here and linked below. A summary report is being generated and will be available upon request by Summer 2021.


National security organizations have long recognized the need to test and evaluate the people, teams, and systems they rely upon to accomplish their missions. From evaluating knowledge, skills, abilities, and other traits (KSAOs) and personality, to mission practice and operational testing, organizations have spent enormous resources and time developing a wide range of methods, capabilities, and testbeds to better understand whether someone, or some team, is mission ready. Likewise, the systems engineering community has a long history of developing standardized techniques to test engineered systems: consequently, verification and validation is straightforward when we have full access to the design, and when the design’s performance is continuous against clearly defined metrics in a well-understood operational context. However, with AI’s rapid rise and emergence (and imminent operational use), we have neither the benefit of a long research history nor the luxury of time to develop methods, techniques, and testbeds to evaluate whether an AI is “mission ready.” Today’s approaches are applicable to component-level AI testing (that is, to the question of whether an algorithm meets certain performance metrics as a function of its inputs). Component-level performance is necessary but not sufficient, however: relying on it alone is akin to predicting a human’s performance in complex environments using only a single score from a classroom exam.

Conventional engineering techniques are insufficient for several reasons. AIs are increasingly black boxes, whether due to proprietary code or the sheer volume of source code. With rapid releases of software versions, we cannot hope to build experience-based confidence. Compounding this challenge, AIs are increasingly being applied to tasks beyond their initially conceived operational concept, as missions, data streams, and/or users change. Finally, AIs are being developed to make higher-level (more human-like) decisions: instead of just determining the steps to accomplish a defined task, AIs are being asked to decide what task to do and to explain the causal reasoning behind their choice. Beyond the mission context, AIs must recognize and operate within ethical and legal bounds. And beyond the conventional engineering system, AIs are biased both by their training data and by the viewpoints held by their (now distant) development teams. Given the increasingly influential roles that AIs will play in our sociotechnical systems, finding a solution to AI operational testing will require leveraging the best from engineering, design, and the social sciences.

In this workshop, we will bring together leading thinkers spanning this space, to identify the most critical challenges and potential solutions.

"For example, if U.S. Special Operations Command uses a deep learning algorithm to translate documents from a raid on a terrorist compound and finds time-sensitive information, how do you measure operational impact? Determining impact isn’t just about statistical analysis on the level of precision-recall, but the impact compared to a human being’s ability and the efficiency created for the operator."

-- Flournoy 2020



Introduction of the day

Dr. William Regli, Executive Director, ARLIS [video]


Panel 1: Role of simulation, test, training, qualifications, assurance cases in operational testing

[video; slides]

What is the best use of the array of assessment tools we have to choose from, to determine that an AI-enabled team is ready to perform a mission? Rather than thinking in terms of conventional technology unit testing, we anticipate that AI testing will be one element of a bigger-picture evaluation of the readiness of a human-machine team. Such a demonstration will require the best use of diverse methods, including simulation, component testing, individual and small unit training, personnel qualifications, wargaming, mission rehearsals, and assurance cases. The demonstration will need to be accomplished efficiently, in support of mission cadence.

This session will explore the integrated technology solutions that will be required to demonstrate that a team is mission-ready.

Invited Speakers:

  • Prof. John Dickerson
    Assistant Professor, Computer Science and University of Maryland Institute for Advanced Computer Studies (UMIACS), University of Maryland;
    Chief Scientist, ArthurAI
  • Dr. Sandeep Neema
    Program Manager, Information Innovation Office, Defense Advanced Research Projects Agency (DARPA/I2O)
  • Lieutenant General (ret) Darsie Rogers
    Former Deputy Director, Defense Threat Reduction Agency and Commander, Special Operations Command for CENTCOM;
    Professor of the Practice, Applied Research Laboratory for Intelligence & Security, University of Maryland
  • Prof. Hava Siegelmann
    Professor, Computer Science, Neuroscience and Behavior Program, University of Massachusetts;
    Former Program Manager, Information Innovation Office, Defense Advanced Research Projects Agency (DARPA/I2O)
  • Dr. Greg Zacharias
    Chief Scientist, Operational Test and Evaluation, Office of the Secretary of Defense
  • Moderated by: Dr. Brian Pierce
    Former Director (and Deputy Director), Information Innovation Office, former Deputy Director, Strategic Technology Office, Defense Advanced Research Projects Agency (DARPA/I2O, DARPA/STO);
    Visiting Research Scientist, Applied Research Laboratory for Intelligence & Security, University of Maryland


Panel 2: Moving to a full-lifetime testing approach

[video; slides]

Rather than a milestone, we anticipate that operational testing will be a continuous process, from development through deployment through sustainment. How can we leverage each use of an AI as an opportunity to improve? Continuous training will enable “online learning” of AIs as they are deployed into new missions, and as part of new systems-of-systems. Shifting to full-lifecycle testing will require capturing all of the relevant contextual data as part of an “after action report,” including the role of other systems operating in theater, user actions and metadata, operational data, outputs, decisions, impacts, and reactions.

How do we collect, store, and apply this data, and protect the privacy of both the personnel involved and other persons in-theater, in such a way that the operators are not negatively impacted and the data is used constructively to improve the system during maintenance and future requirements cycles?

The invited speakers for this session are:

  • Prof. Jeffrey Herrmann
    Professor, Mechanical Engineering, Institute for Systems Research, Center for Risk and Reliability, and Maryland Robotics Center, University of Maryland
  • Dr. Mikael Lindvall
    Technology Director, Fraunhofer USA Center MidAtlantic
  • Prof. Douglas Schmidt
    Cornelius Vanderbilt Professor of Engineering (Computer Science), Associate Provost of Research, and Data Science Institute Co-Director, Vanderbilt University
  • Moderated by: Prof. Adam Porter
    Scientific Director and Executive Director, Fraunhofer USA Center MidAtlantic;
    Professor, Computer Science, UM Institute for Advanced Computer Studies and the Institute for Systems Research, University of Maryland;
    Affiliate, Applied Research Laboratory for Intelligence & Security, University of Maryland


Panel 3: A new look at policy, standards, and requirements specification

[video; slides]

How do DOD policy, standards, and metrics need to evolve to enable assessment of AI? There is an increasing realization that requirements for AI-based systems must be considered in the context of acceptable risk (where risk must be considered along many dimensions, such as performance, safety, and ethics). It is impossible to test the technology against all cases that it may encounter in the field, so there is a tension between the depth of testing and the level of risk that is acceptable. Further, the level of acceptable risk depends upon the mission and environment in which the system is being applied. We anticipate that gaining warfighter trust in, and adoption of, AI will require assessing performance against realistic missions in the setting of realistic mission workflows. This demonstration will need to assess both performance and risk metrics, which may be framed in probabilistic (or otherwise fuzzy) terms. Using test outcomes to assess mission readiness implies that metrics must be framed in user-centered terminology, and should provide guidance on acceptable performance, risk, and the limits needed to maintain those bounds.

This session will explore how the requirements process needs to evolve – in concert with policy and standards – in support of fielding AI.

Invited speakers:

  • Dr. Chad Bieber
    Director, Test and Evaluation, Project Maven, Johns Hopkins University Applied Physics Laboratory
  • Lieutenant General (ret) Edward “Ed” Cardon
    Former Commander of the Second United States Army/United States Army Cyber Command;
    Former Director of the United States Army Office of Business Transformation;
    Professor of the Practice, Applied Research Laboratory for Intelligence & Security, University of Maryland
  • Prof. Michael Horowitz
    Richard Perry Professor of Political Science, Director, Perry World House, University of Pennsylvania
  • Dr. Jane Pinelis
    Chief, Test and Evaluation of AI/ML, Joint Artificial Intelligence Center
  • Prof. Ben Shneiderman
    Professor Emeritus, Computer Science and University of Maryland Institute for Advanced Computer Studies (UMIACS);
    Founding Director, Human-Computer Interaction Lab; Affiliate, Institute for Systems Research and College of Information Studies, University of Maryland
  • Moderated by: Dr. Craig Lawrence
    Director of Systems Research, Applied Research Laboratory for Intelligence and Security and Visiting Research Scientist, Institute for Systems Research, Clark School of Engineering, University of Maryland


17:00-17:30: Conclusions of the Day



Flournoy, Michèle A., Avril Haines, and Gabrielle Chefitz. “Building Trust through Testing.” (2020).