Test automation is probably the single most important factor in an agile software project. Providing a customer with small, working increments of a product, delivered frequently, requires the pipeline from feature design to product delivery to move very fast.
The biggest hurdle in delivering a small change quickly is regression testing. The trouble stems from the fact that regression testing scales with the scope of all existing features, not merely the scope of newly added features. Test suites will therefore inevitably run slower as an app grows larger. But tests need to run fast, or they slow developers down.
Automation is the key to achieving speed. But this insight says nothing about how to automate testing, or even what exactly should be automated.
A mistake that teams tend to make is to approach automation like this:
- Take the exact same tests you have now, the ones run manually by human testers, and turn them into automated tests.
This naive testing strategy reliably fails, and can cause a team to revolt against automation altogether. Why does it fail, and what is a proper automation strategy instead? This article will explore the answer.
The primary problem with the naive testing strategy is summed up by Moravec’s Paradox:
- What is easy for humans is hard for computers, and vice versa.
One of the first “intelligent” tasks we managed to make a computer do was playing chess. Today, AI chess programs are world-class players and can reliably beat the best human chess players. On the other hand, we still do not have robots with AI capable of the basic task of perceiving and moving around a 3D environment.
Perception and motion in 3D space involves capturing the surroundings with cameras, then intelligently processing the raw image into 3D objects laid out in 3D space, and then selecting an open path toward a destination that avoids collision with the surrounding objects. Once a path is selected, the AI has to generate a sequence of joint movements, using continual feedback from the pressure on its joints, to actually move itself along the chosen path.
This problem is orders of magnitude more complex than playing chess — and yet every human child masters it at an early age. But only a handful of humans will ever become master chess players.
Let’s think now about how human testers test a GUI app. They do so in the same way that human users use a GUI app: by interacting with the UI. There is a continuous feedback loop: from the computer, through a sequence of images, into the mind of the tester/user, then back out through the tester/user’s hands or fingers, into the input devices, and back into the computer.
Obviously this feedback loop is primarily one of perception and mobility. These are the very things Moravec told us are very difficult to make an AI do well. In fact, they are so difficult that no test automation strategy really automates the things human testers do directly.
Consider mobile apps that provide a multi-touch screen. If a robot were interacting with such an app the way a human does, it would need a camera and a movable hand. The robot would “see” the mobile device’s screen through its camera, using computer vision algorithms to process the captured image into data structures (using, for example, edge detection, pattern recognition, etc.). Then, with a model of the on-screen objects in its AI program, the robot would decide how to interact with that model, resulting in robotic hand gestures to trigger UI behavior. The robot would continue capturing the device’s screen and analyzing it, building a model of the on-screen objects, and deciding further how to interact with those objects.
Doing this kind of testing would turn your company from an app developer into a cutting-edge robot and AI design company. If you even marginally succeeded in automating your tests this way, the robots you built to do it would be enormously more valuable than whatever software you are using those robots to test. Correspondingly, all your money and resources would be poured into researching, developing and maintaining the robots. If your goal is simply to test your app, doing this makes no sense from a financial perspective. There are already machines available that outperform these robots at a fraction of the cost. They’re called humans.
Why Test Robots Don’t Need Eyes and Hands
Obviously, in real life it’s completely unnecessary to test in this way if the tests are being run by machines. So why is that?
We posit a machine, the test robot, testing another machine, the mobile device (let’s say an iPhone). This requires the video signal to get from the iPhone to the robot, and touch gestures to get from the robot to the iPhone. The iPhone generates an electronic signal that represents the on-screen image, and uses a display device to turn that into visible light (electromagnetic radiation). The robot, in our scenario, has a digital camera, which does the reverse: takes visible light and turns it back into an electronic signal.
It is obvious why we convert the video signal to light when the user is a human: it’s because that’s the sensory input available for humans. If we could beam the signal directly into a human user’s brain, converting the signal to light wouldn’t be necessary. We just don’t know how to do that.
But we obviously do know how to beam the signal directly into the robot’s brain! That’s exactly what we’re doing with the camera signal. So let’s just skip the camera, and the display device on the iPhone, and hook the iPhone’s video signal directly to the robot instead.
The same consideration applies to the robot touching the touchscreen. The reason the iPhone needs a touchscreen, with the necessary electronics to detect touches and determine where physically on the screen the touches occurred, is because this is the only way humans can interact with an iPhone. (A keyboard or some other input device ultimately works the same way, namely by providing something for a human to physically touch.) We don’t know how to beam a human’s thoughts directly into an iPhone, and have it detect that the human wants to interact with some on-screen object.
But again, we do know how to beam the AI’s thoughts directly into the iPhone. We can skip the robotic hand and fingers, and the iPhone’s touchscreen electronics, and just have the robot send touch events directly to the iPhone.
The Vision Problem
We’ve now removed the need to actually build a robot. We don’t need any camera, or moving parts. Everything is just an electronic signal, going directly between the iPhone and the tester, and the tester is now merely another computer with input and output ports.
That certainly makes things easier. But the biggest challenge remaining is that if the testing machine is “experiencing” the app through a video signal, it still has to intelligently process the incoming images. It has to solve the computer vision problem of analyzing a grid of pixels and finding the “objects” in that image.
A major reason why vision is so challenging for a computer, at least with the way we program computers, is that it is not an exact science. The human mind is very good at detecting “near” matches for certain edges, shapes, patterns, and so on. It would be easy enough to find a perfect square or rectangle in an image. But what if it’s off by a few pixels? Or even many pixels, but it still matches a square or rectangle better than it matches any other shape? What does “better” mean? How does a computer measure “how much” of a grid of pixels matches a certain shape? These are the very complex problems that computer vision involves. The best we’ve been able to do is write trainable computer programs, and have humans, already equipped with this pattern matching capability, train them to properly detect shapes.
The simple task of finding out what basic objects, like buttons, images or text, are on the screen, becomes probably the hardest part of programming a testing machine. And yet it is so trivial for human testers that no mention of it is made in manual test scripts. The scripts assume, quite reasonably, that the tester already knows how to tell what they’re looking at on the screen; so the script just tells the tester to look for something in particular, like a button with a certain text.
But even this is overly simplified. A test script might instruct a tester to look for “the cancel button”. A human has the kind of intelligence to make an educated guess as to whether some visible button is a “cancel” button. Maybe the button has the text “Cancel” written on it. Maybe it is a left-pointing chevron, or an X button. We would have to train an AI tester to know what the English term “cancel button” might mean. Again, what’s trivial for humans becomes the primary challenge for AI testers.
Why We Don’t Need To Solve the Vision Problem
Okay, so let’s apply the same reasoning we did before. Why is it necessary for a tester, human or AI, to process a 2D image and look for patterns? The answer, as before, is that images are how we get information into the mind of a human user. Vision is the available sensory input, and vision works by processing images.
Previously, however, we realized that converting the app experience to visible light isn’t necessary with a robot, because we can just beam the electronic video signal directly in. Well, the same is true with converting the app experience to a video signal.
The iPhone comes with an operating system that takes a list of visual “elements” or “widgets” and converts them into pixels in an image by means of the compositing and graphics libraries in iOS. This means that in the code running on the iPhone, there already exists a higher abstraction of “visual elements”. These are, in fact, what iOS developers primarily deal with. They are writing code that is concerned with the buttons, and text, and so on, and their placement on the screen. They are not writing code that is concerned with pixels in an image.
So, just as before, with our testing machine, we can skip the process of breaking these elements down into pixels in order to allow them to be transmitted visually to a human user, and instead we can expose the testing machine directly to the abstract model of widgets in the iPhone. Now we no longer have to solve the computer vision problem. The testing machine has direct access to a list of visual elements, instead of having to reconstruct them from the iPhone’s video signal.
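To make the contrast concrete, here is a minimal sketch, in plain Python rather than any real framework’s API, of what “direct access to a list of visual elements” looks like: the test queries an abstract element tree instead of reconstructing objects from pixels. All names here are invented for illustration.

```python
from dataclasses import dataclass, field

# A hypothetical, simplified model of the visual element tree that a
# UI test framework might expose. Real frameworks (XCUITest, Espresso,
# Appium) expose richer trees, but the idea is the same.
@dataclass
class Element:
    kind: str                      # e.g. "button", "label", "list"
    label: str = ""                # visible text, if any
    children: list = field(default_factory=list)

def find_elements(root, kind=None, label=None):
    """Walk the tree and return every element matching the query."""
    matches = []
    stack = [root]
    while stack:
        node = stack.pop()
        if (kind is None or node.kind == kind) and \
           (label is None or node.label == label):
            matches.append(node)
        stack.extend(node.children)
    return matches

# A toy screen: a scrolling list containing two buttons.
screen = Element("screen", children=[
    Element("list", children=[
        Element("button", "Checkout"),
        Element("button", "Cancel"),
    ]),
])

# The test queries the abstract model directly; no pixels involved.
cancel_buttons = find_elements(screen, kind="button", label="Cancel")
assert len(cancel_buttons) == 1
```

The pixel-analysis problem has disappeared entirely: finding the cancel button is a tree search, which is trivial for a machine.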
We’ve Invented UI Tests
At last we have arrived at the kind of testing that teams do run when they simply take their manual tests and try to automate them. The various platforms like iOS and Android expose ways to access the model of on-screen visual elements and to trigger interactions with them. That is how so-called UI test frameworks work!
On iOS, XCUITest piggybacks on the accessibility framework for iOS, which was initially designed to provide additional ways for the visually impaired to use iOS devices. Using this, another device can communicate with the iPhone over a network, to download the tree of visual elements, and can then issue touch commands to one of those elements.
The same can be done on Android with frameworks like Espresso. And tools like Appium provide a cross-platform web interface for interacting with mobile apps in this way.
The Trouble With UI Tests
We have removed the complicated, and really unnecessary, problems of moving robotic joints, cameras, and intelligent image processing from our automated testers. We have arrived at the notion of an automated UI test.
But, as I said at the beginning, the resulting strategy of running all your manual tests as this kind of UI test, by searching for on-screen elements and issuing interaction commands, still tends to go very poorly. The trouble is that we really haven’t removed enough of the “human” elements of testing that are hard to automate. Any tester who has gone down this path can tell you about the typical problems.
For one thing, navigating around an app is an inexact process. Small variations in timing and in the position of elements on the screen can break these kinds of tests. And for mobile touch devices, any gestures beyond simple tapping, particularly scrolling collections, become a constant pain with automated scripts.
As a result, the verification that can be done in this way is brittle, and is often of questionable value. If a test needs to confirm that a certain button appears on screen, what exactly should it look for? A button with the right text? What if the text isn’t the essential definition of the button, and can change?
More often than not, I’ve seen testers ask developers to assign a certain “element ID” to the button, and then have the test just confirm that a visual element with that ID is on screen at the right time. But what is this testing? The “element ID” is there only to facilitate testing. It serves no purpose except to let the test scripts find elements. Having a test’s pass/fail criteria be based on “test” information like this is arguably a tautological, and therefore useless, test!
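A minimal sketch of the tautology, using an invented element model: the assertion can only confirm that the ID planted for the test’s benefit is present, not that the button looks right or does anything.

```python
# A sketch of the tautological test described above. The "element ID"
# exists only so the test script can find the element, so the assertion
# checks nothing the product designers actually care about.
button = {"id": "cancel-button-for-tests", "action": None}  # broken: no behavior wired up
screen = [button]

def find_by_id(elements, element_id):
    """Return the first element with the given test ID, or None."""
    return next((e for e in elements if e["id"] == element_id), None)

# The test "passes" even though the button does nothing at all.
assert find_by_id(screen, "cancel-button-for-tests") is not None
```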
After all, no one, certainly not the product designers, really cares what this “element ID” is. They care about what the button looks like, or even more likely what it does. How would you test even a simple case like a back button? How can the test be sure that the new state of the screen is really the “previous” screen? It’s possible, but can be very complicated in terms of analyzing the visual element tree. Such a check is trivial, on the other hand, for a human tester.
Another major issue is the dynamic nature of data. Let’s say there’s a test script that confirms that, when a detail page for an item in a list is opened, the right details for that item are displayed. A human tester knows how to look at the item in the list and intelligently determine what the details for that item should be. A test script cannot do that. If there is no way for the test script to predictably control what items appear in the list before selecting one to tap on, then it will not reliably be able to check that the details displayed on the detail page are correct for that item.
A typical workaround, which is very brittle, is to rely on one particular item that has been there for a while, and simply assume it will be there forever. Once something in the data changes, though, such tests break for no apparent reason; worse yet, their failures don’t indicate that anything is really broken.
When Good Testers Go Bad
The consequences of having these kinds of tests are predictable:
- They take a long time to write and maintain.
- They take a long time to execute. Searching for elements in a large tree is inefficient, and it is not uncommon for these types of automated tests to take several times longer to execute than a human would take.
- They break frequently, meaning they fail even when no actual bug has been uncovered.
- They miss things or improperly test things, meaning bugs frequently occur that don’t cause the tests to fail.
The result is that the tests are untrustworthy. No one would dare hit the “Release to App Store” button simply because these tests all passed, and we have many times hit that button despite some of the tests failing. The tests serve no purpose and really just waste everyone’s time.
Realizing this, the testers tasked with writing and maintaining these scripts, knowing they’ll have to work overtime to run manual tests anyway, will just slap together something quickly to check off the “80% automation” requirement (or whatever) imposed on them by their leadership. The tests will become mostly meaningless distractions and hurdles, and, if anything, they cause the testers to sour on the idea of automation. They’ll argue, quite convincingly from their experiences, that testing a GUI app simply can’t be automated — not effectively or reliably, at least.
Is Automated Testing Even Possible?
In our thought experiment about test robots earlier, we eliminated the biggest challenges, identifying them as really unnecessary, through bypassing the normal app experience route that human users go through. The UI test scripts have privileged access to more abstract concepts that end users don’t have access to. Now, however, we might be tempted to conclude that the resulting UI tests aren’t really testing the app at all.
After all, one might argue, who cares what the Appium element tree is? Users don’t see that or interact with that. What if that tree isn’t being broken down to pixels in the correct way? The only way to truly test what users experience is to test the very user experience humans will receive. That means images, video signals, and if we’re really being pedantic, displays and cameras.
This line of reasoning always comes up when discussing tests. It needs a dedicated discussion of its own, but let’s just stipulate here that it’s fundamentally confused. All testing occurs in some kind of controlled, artificial environment, a laboratory, that is not exactly the same as how and where paying customers will be using the app. The basic design concept of tests is that the controlled test environment correctly mimics the real environment. The “correctly” part of that concept is crucially important, and makes a significant impact on how we design our tests.
A flawed test, to be sure, may pass in the laboratory but fail to catch an error that occurs “in the wild” (or vice versa). But this does not invalidate the concept of artificially controlling the environment in order to run tests! Without such artificial control, you aren’t testing at all; you’re just monitoring.
So the argument that states, “This isn’t a real test because it isn’t truly end-to-end,” is not a valid argument. Even “end-to-end” tests aren’t really end-to-end, because the “end” is actually the paying customer, not testers working for the organization and logging into test accounts.
Understanding UI Tests
Based on this discussion, we can give a specific characterization of the so-called “UI tests” that use Appium/XCUITest/Espresso. They are black box tests. They attempt, as closely as reasonably possible, to mimic the actions and checks of an actual end user.
Black box tests penetrate into the app just far enough to avoid the unnecessary problems of image processing. But they don’t go further than that. Whatever actions they take, even if they do them more “directly” than a human, should still be the same actions available to a user. In principle, these tests should see only what a human user sees (even if in a more direct, abstract manner), and should touch or manipulate only what a human user can touch or manipulate. This manifests as the tests testing the exact app binary that will be shipped to customers.
So UI tests are valid within the realm that they are adequate to test. The way to respond to their inadequacies is not to reject automated testing as invalid but to devise some tests that test more of what really needs testing.
Bypassing the UI
The answer to the problem of effective automation is not to back away just because UI tests are inadequate, but rather to continue forward with the process of penetrating deeper into the “guts” of a GUI app and dealing with even higher levels of abstraction. The key here is to give our tests direct access to the same level of abstraction as the business requirements that we are trying to test.
Our business requirements, after all, aren’t specified in terms of pixels on a screen, or the state of the electromagnetic field (imagine that)! That’s way too low-level. In fact, for much of the business requirements, even the UI and UX (what interactions, like touch gestures, the user can make) are too low level. There’s a higher level of abstraction “behind” all of this that is usually of more interest to us. We want to break open the black box to a degree. We want, in some way, white box tests.
The UI, after all, is merely a layer we provide to users to give them access to a more abstract “virtual” world of various information concepts in our app. For example, consider the app GrubHub. The UI, even when abstracted from the level of pixels on the screen to visual elements, still involves various buttons, scrolling lists, text, selection tables, images and so on. But these are themselves just low-level abstractions used to convey higher-level abstractions to the user. Those higher abstractions are: restaurants, menus (not a UI menu, a restaurant menu), food items, pending orders, order status, delivery times, and so on.
To see that this is true, consider that an app with the very same functionality and purpose, meaning a food delivery app, could be written without a GUI. It could be a 1960’s style terminal app with green text on a black background, and users interact with it by typing a list of known commands to request information or start building an order. There is no GUI anymore. No visual elements, no buttons, no images, nothing. There are no touch gestures either. Instead, there’s text output and text input. Yet, behind this completely different user interface, the same app exists. There is still the same world of restaurants, menus, orders, delivery times, and so on that was there before. We’re just using a different method of conveying this virtual world to the user.
To be sure, there are certainly business requirements for the UI of an app. The product designers design the UI. They decide what the screens will look like, the colors and fonts, the animations, how a screen is laid out, what information is displayed to a user and where, and so on. The UI is a major part of building a GUI app. And the UI needs to be tested.
However, it may make sense to leave the pure UI requirements to manual, human testers. Strictly speaking, testing the GUI of an app with a machine is a computer vision problem no matter how you approach it. The Appium black box tests bypass the actual visual appearance of the app (though they still have access to some concepts like the position of elements on the screen). If you want to directly test the look and feel of an app, that’s best left to humans.
Beyond this, though, there is a virtual world of entities, actions and storage that an app represents. It is this world of business logic that is ripe for automation. There are tons of business requirements for an app that don’t directly involve the UI. If there weren’t, the UI would exist for its own sake, which is to say it would be pointless. The business logic is what the user interface allows the user to interface with. Behavior-driven development (BDD) can help disentangle these purely informational requirements from visual/UI requirements.
The way to automate the business logic requirements is, analogously to what we did with the cameras and robot joints earlier, to bypass the process of conveying these rules to the user through a UI, and instead to give the test direct access to the “models” being built and maintained behind the scenes by the app. The business concepts are already expressed in code. By writing tests in the same code in which the app is written, the tests immediately get full comprehension of these models, and can interact with and verify them in the same way the app does.
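As a sketch of what this looks like, here is a hypothetical, GrubHub-like model written in Python (all names are invented for illustration), with a functional test exercising a business rule directly, with no UI involved:

```python
# Hypothetical app model code. In a real project these types would be
# whatever domain classes the app itself defines.
class Order:
    def __init__(self, restaurant):
        self.restaurant = restaurant
        self.items = []
        self.status = "building"

    def add_item(self, item):
        # Business rule: a placed order can no longer be modified.
        if self.status != "building":
            raise ValueError("cannot modify a completed order")
        self.items.append(item)

    def complete(self):
        # Business rule: an empty order cannot be placed.
        if not self.items:
            raise ValueError("cannot complete an empty order")
        self.status = "placed"

# The functional test exercises the business rule directly: no UI,
# no element trees, no gestures.
order = Order("Luigi's Pizza")
order.add_item("Margherita")
order.complete()
assert order.status == "placed"
```

Because the test is written in the same code as the app, it gets full comprehension of the model for free, and its pass/fail criteria are stated in terms the business requirements themselves use.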
Unit Tests vs. Acceptance Tests
It may sound like I’m talking here about unit tests. Not quite. Unit tests continue this penetration down into single atoms of functionality, and are therefore primarily concerned with the technical architecture of an application. Unit tests are there to maintain quality and proper technical practices. Instead, we want tests that punch just below the UI of an application, but no further. They deal directly with the abstract concepts that are created by product designers, not architects and programmers. But they deal with them directly as abstract entities in the code, not through their UI representation.
The same is true of actions available to users. Users may invoke actions on the app by making touch gestures, or (on a slightly higher abstraction) by activating certain visual elements. But this is just a gateway to accessing some set of abstract “actions” that may be performed on the virtual world the app manages.
In the example of GrubHub, the user might issue an order to a restaurant by pressing a “Complete Order” button. But the button, or the fact that a user is pressing it, isn’t the essential detail. What’s essential is that there is some abstract action called “complete an order”, no matter how the user performs that action — even if the user does it by clapping three times. We can design tests that are directly about this action, and what exactly is supposed to occur in the abstract world of restaurants, menus, orders, delivery times, etc., when such an action is invoked.
This, then, is where automation is most effective, and most valuable. The name for these kinds of tests is functional tests. In the world of BDD, they become acceptance tests, and are written in the very Gherkin that expresses the business requirement.
Here’s an easy way to remind yourself of the distinction between the two types of test:
- Functional/acceptance tests ensure we build the right thing.
- Unit tests ensure we build the thing right.
Functional/acceptance tests and unit tests complement each other, and neither one can act as a substitute for the other.
Take Control of the Setup
Earlier, I mentioned some of the problems that black box tests present, a major one being that data can change, and test scripts have trouble intelligently adapting to dynamic data. By making tests more “white box”, we mean that part of the app that is “real” when it is out in the wild, or in human testers’ hands, is replaced with an extension of our laboratory. The test environment starts to take over some of the outer layers of the app and control them directly, in ways that a human or even a black box test script wouldn’t be able to.
In the code for the test scripts, this looks like mocking. An automated test can mock the parts of the app that it is not directly testing, but that it needs to be in a certain state before the test even starts.
Every test involves three phases: setup, trigger, and verification. (These are reflected in the Gherkin keywords “Given”, “When” and “Then”.) A manual or black-box test must perform the setup and trigger in the same way that a human user would. If a test requires the setup of being on a certain screen in the app, then a black box test must execute this setup by “manually” navigating to the screen the way a human user would: through a sequence of interactions with the visual elements, typically involving feedback from the visual elements themselves.
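In code, the three phases map naturally onto the body of a test function. A minimal Python sketch (the cart example is invented):

```python
# A sketch of the three phases in a plain test function. In a real
# suite, "Given" would be arranged directly by the test environment
# (e.g. with mocks), and "When" would invoke real app code.
def test_adding_an_item_increments_cart_count():
    # Given: a cart in a known starting state (setup)
    cart = []
    # When: the single action under test (trigger)
    cart.append("Margherita")
    # Then: check the outcome (verification)
    assert len(cart) == 1

test_adding_an_item_increments_cart_count()
```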
This, of itself, is a major source of instability for black box tests. Simply getting to the right part of an app to start the test is ripe for failure, and such tests often fail before they even properly begin. This failure may indicate a bug that was encountered “on the way” to the screen of interest; but that’s not what this test is testing.
If we want to test screen Z, ideally we want a test that passes when screen Z works, and fails when screen Z does not work. If a user has to go through screens X and Y just to get to Z, we would still want a test for screen Z to pass even when screens X and Y are broken. That way, we can reliably know if screen Z needs attention, independently of other screens. (Of course, we would also have tests for screens X and Y that should catch whatever bug would make it impossible to navigate to screen Z.)
This is a problem even with human testing. Testers know that a bug near the “beginning” of the app can actually block testing of large areas (the login server going down, or test accounts getting corrupted, can render testers completely idle). This is an example of a more general problem, which includes the dynamic nature of data that isn’t controlled by the app.
In an ideal test, the “setup” phase should be infallible. It should not be possible for the “setup” procedure to “fail”. The test doesn’t really begin until after setup is complete. The telltale sign of this anti-pattern is if assertions are being made in a test script on the way toward preparing the trigger. Those assertions are really just protecting the test in case it fails to be set up properly and cannot run. They aren’t asserting what the test is actually testing.
The solution to the problem of failable setup is to have the test environment (the laboratory) take over the setup process. If something needs to happen in setup, instead of “attempting” it through black box use of the app, we make it so by mocking whatever components are involved.
Note that mocking does not, in and of itself, make a test into a unit test. (This is a common misunderstanding that leads functional automated tests to be confused with unit tests.) There may still be an entire subsystem of “real” production code running in the test. We are just mocking around what the test is concerned with. If we have broken our business requirements into pieces, this lets us test each business requirement individually, and get reliable answers not only to whether any requirement is broken, but to exactly which requirements are broken.
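Here is a sketch of that idea using Python’s unittest.mock, with a hypothetical MenuScreen that loads its data from a network service (both names invented for illustration). Because the service is mocked, the setup cannot fail, and the data is fully predictable:

```python
from unittest.mock import Mock

# Hypothetical app code: a screen model that loads its data from a
# network service.
class MenuScreen:
    def __init__(self, service):
        self.service = service
        self.items = []

    def load(self, restaurant_id):
        self.items = self.service.fetch_menu(restaurant_id)

# In the test, the service is mocked, so "setup" is infallible: the
# screen always starts from the exact state the test needs.
service = Mock()
service.fetch_menu.return_value = ["Margherita", "Calzone"]

screen = MenuScreen(service)
screen.load(42)

# The real subject of the test: the screen's behavior given known data.
assert screen.items == ["Margherita", "Calzone"]
service.fetch_menu.assert_called_once_with(42)
```

Note that MenuScreen itself is real production-style code; only the component outside the test’s concern (the network service) has been replaced by the laboratory.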
It is no surprise that teams who tend to have the most success with automation are server-side teams. By the very nature of their work, they have already separated themselves completely from UI concerns, and there would never be a question of testing their services “through” a UI. It is more obvious to them that their “acceptance tests” directly test functionality. If they need to reconstruct any abstractions that were broken down, this just involves parsing JSON from an HTTP response, which is trivial for machines to do. The troubles with automation tend to show up more on the client side, where it is more difficult for developers, designers, and especially testers to start separating UI from business logic in their minds.
Testing, like software development, is a craft, especially when it comes to automation. Becoming skilled in effective automation takes time, practice, and patience. As always, the most instructive experiences come from failing by trying, and learning first-hand what tends not to work and why. Hopefully the concepts we’ve discussed here provide a helpful guide for understanding your experiences with automation and getting the most out of them. Brittle, slow and untrustworthy “UI tests” (black box tests, as I call them) are a common experience, but that doesn’t invalidate the goal of automation.
Look for that sweet spot where you’ve penetrated past enough of the outer layers that your tests have direct access to what matters to the business concerns, but not so deeply that they’re getting into the technical architecture. And don’t be afraid to take control and mock components, especially for the purpose of working with consistent, predictable stubbed data.
With those basic principles in mind, you should at least be able to see an effective direction in which to take your automation, and avoid the pain of automation that just gets in your way.