Dynamically expecting failure in unit tests.

Posted by Sjoerd Job on June 15, 2016, 7:37 p.m.

Yesterday I composed my first pull request for the VOC project. The project aims to build a Python bytecode to JVM bytecode transpiler.

While adding some extra test cases, I noticed that the test runner was ignoring the @expectedFailure decorator I had applied. I decided to investigate.

The test case I was working on

After having implemented builtins.sum in Java, I wanted to add some extra test cases. One of the test cases was summing up a list of integers and floating point numbers. It all seemed so simple:

    def test_sum_mix_floats_and_ints(self):
        self.assertCodeExecution("""
            print(sum([1, 1.414, 2, 3.14159]))
        """)

Little did I know:

FAIL: test_sum_mix_floats_and_ints (tests.builtins.test_sum.BuiltinSumFunctionTests)
...
TypeError: unsupported operand type(s) for +: 'float' and 'float'

It turned out that adding floats together was not yet implemented. Fair enough. I don't want the test to go to waste, but I do know that for the time being, it is normal for it to fail. Let us mark it as such.

from unittest import expectedFailure

...
    @expectedFailure
    def test_sum_mix_floats_and_ints(self):
        self.assertCodeExecution("""
            print(sum([1, 1.414, 2, 3.14159]))
        """)

And that should be it, right? We have told the system to expect a failure, and more importantly, we now expect it to shut up about it.

Re-running the test suite now gives us

FAIL: test_sum_mix_floats_and_ints (tests.builtins.test_sum.BuiltinSumFunctionTests)
...
TypeError: unsupported operand type(s) for +: 'float' and 'float'

Uhm. What? That was totally unexpected. Uncalled for, even. Why is the computer not obeying me? Is the world ending? Has the singularity finally occurred? (Probably yes, but the AI is smart enough to not let us know).

A quick solution

Putting all superstition aside, I went to look for a fix (besides removing the test method). In this same test case, there was already another mechanism for marking methods as expected failures (or not).

class BuiltinSumFunctionTests(BuiltinFunctionTestCase, TranspileTestCase):
    functions = ["sum"]

    not_implemented = [
        'test_bytearray',
        'test_bytes',
        'test_class',
        'test_complex',
        'test_dict',
        'test_frozenset',
        'test_set',
        'test_str',
    ]

...

    @expectedFailure  # + not defined on float/float yet.
    def test_sum_mix_floats_and_ints(self):
        self.assertCodeExecution("""
            print(sum([1, 1.414, 2, 3.14159]))
        """)

Adding test_sum_mix_floats_and_ints to the not_implemented list did the trick: the test was marked as 'expected failure' in the output. The world was right again.

A deeper investigation

But a nagging feeling remained: why did "the only obvious way to do it" fail on me? It did not sit right with me, so I decided to investigate further. The not_implemented list gave me a good clue as to what I was looking for. Grepping around in the code base pointed me to where I wanted to look: tests/utils.py contained the only reference to not_implemented that was not a list definition. It was in a method called run, which overrides unittest.TestCase.run.

class BuiltinFunctionTestCase:
    format = ''

    def run(self, result=None):
        # Override the run method to inject the "expectingFailure" marker
        # when the test case runs.
        for test_name in dir(self):
            if test_name.startswith('test_'):
                getattr(self, test_name).__dict__['__unittest_expecting_failure__'] = test_name in self.not_implemented
        return super().run(result=result)

    def assertBuiltinFunction(self, **kwargs):
        substitutions = kwargs.pop('substitutions')
        self.assertCodeExecution(
            """
            f = %(f)s
            x = %(x)s
            print(%(format)s%(operation)s)
            """ % kwargs, "Error running %(operation)s with f=%(f)s, x=%(x)s" % kwargs,
            substitutions=substitutions
        )

    for datatype, examples in SAMPLE_DATA.items():
        vars()['test_%s' % datatype] = _builtin_test('test_%s' % datatype, 'f(x)', examples)

Here we see the following: the test class dynamically creates a set of test methods based on SAMPLE_DATA. This is a really nice way to generate a lot of test cases. Because of how it is used here, one immediately gets test cases for sum(...) on a variety of builtin data types. Most of them are, of course, nonsensical (what is the sum of None?), but still.
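To make the pattern concrete, here is a minimal, self-contained sketch of the same trick. The SAMPLE_DATA contents and the generated assertion below are simplified stand-ins of my own; VOC's real _builtin_test builds transpile-and-compare tests instead.

import unittest

# Simplified stand-in for VOC's SAMPLE_DATA: datatype name -> example literals.
SAMPLE_DATA = {
    'int': ['0', '42'],
    'str': ['""', '"hello"'],
    'none': ['None'],
}

def _make_test(examples):
    # Build one test method that checks something trivial about each example.
    def test(self):
        for example in examples:
            self.assertIsInstance(repr(eval(example)), str)
    return test

class GeneratedTests(unittest.TestCase):
    # This loop runs at class-definition time, so the class ends up with one
    # test_<datatype> method per entry in SAMPLE_DATA.
    for datatype, examples in SAMPLE_DATA.items():
        vars()['test_%s' % datatype] = _make_test(examples)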

Now, what was not_implemented about? It was a way of telling the test runner that a certain case was not fully handled yet. For instance, 'test_list' was marked as not implemented before I provided a basic implementation of sum.

Where the magic happens

So let us take a closer look at where the magic happens, to take away the illusion of magic. The interesting lines are as follows:

        for test_name in dir(self):
            if test_name.startswith('test_'):
                getattr(self, test_name).__dict__['__unittest_expecting_failure__'] = test_name in self.not_implemented

Let's look at it more closely. First, dir(self) asks for the names of 'all interesting things' on self, and these names are then iterated over as test_name.

Second, it checks whether test_name starts with the string 'test_'. If not, we won't touch it.

The last line does so many things, it is mind-boggling. Again, dissecting helps here. First, it looks up the attribute of self named by test_name, and then grabs the __dict__ of that attribute. Finally, it writes the key '__unittest_expecting_failure__' into that dict, with the value of test_name in self.not_implemented.

Now, I'm sorry to say this, but I'm not quite sure that the previous paragraph helped anyone in understanding what was going on.

Anyhow, let's rewrite it a bit more sequentially (less golfed).

            expected_failure = test_name in self.not_implemented
            test_method = getattr(self, test_name)
            test_method.__dict__['__unittest_expecting_failure__'] = expected_failure

The first two lines look quite innocent. The last line explained why my original method was being marked as not expecting failure: because test_sum_mix_floats_and_ints was not in not_implemented, the loop unconditionally overwrote the marker that the @expectedFailure decorator had set. I needed to register the test explicitly in not_implemented.
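To see the clobbering in isolation, here is a minimal sketch (a toy class, not the VOC test suite). In the Python 3.4 series, @expectedFailure simply sets __unittest_expecting_failure__ = True on the function, and a loop like the one above then overwrites that value with False for every method whose name is not in not_implemented:

from unittest import expectedFailure

class Demo:
    not_implemented = []  # our test is not registered here

    @expectedFailure
    def test_sum_mix_floats_and_ints(self):
        pass

demo = Demo()
# The decorator has set the marker on the underlying function:
print(demo.test_sum_mix_floats_and_ints.__unittest_expecting_failure__)  # True

# The run() override then does this for every test method...
test_name = 'test_sum_mix_floats_and_ints'
getattr(demo, test_name).__dict__['__unittest_expecting_failure__'] = (
    test_name in demo.not_implemented)

# ...clobbering the decorator's marker:
print(demo.test_sum_mix_floats_and_ints.__unittest_expecting_failure__)  # False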

Iterating over possible solutions

Then I started experimenting with different ways of "solving" this "problem" (really: the cognitive dissonance I was experiencing).

First, I tried using

            test_method.__dict__.setdefault('__unittest_expecting_failure__', expected_failure)

hoping that would work. Well, it partially did. There was another problem, though: the value seemed to stick around for other unit tests, causing really weird results.

Then I tried setting the value using setdefault only when the failure was to be expected. That helped a bit, with the downside of a lot of "unexpected success" messages.
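Roughly, that second variant looks like this (an illustration of the idea, not the exact diff):

        for test_name in dir(self):
            if test_name.startswith('test_'):
                if test_name in self.not_implemented:
                    # Only mark methods we expect to fail; everything else is
                    # left alone so @expectedFailure keeps working.
                    getattr(self, test_name).__dict__.setdefault(
                        '__unittest_expecting_failure__', True)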

It turns out that the __dict__ of a bound method is the same object as the __dict__ of the underlying function:

>>> class K(object):
...     def foo(self):
...         pass
>>> K.foo
<function K.foo at 0x1059e8400>
>>> k = K()
>>> k.foo
<bound method K.foo of <__main__.K object at 0x1059eb160>>
>>> k.foo.__dict__ is K.foo.__dict__
True

Since the generated test methods all live on the same base class, every run of a subclass caused more and more methods to remain marked as expecting failure, so the tests became more and more lenient. Definitely not what I wanted.
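A toy example (again, not the VOC code; the class names are made up) shows the leak: the generated test method lives on the shared base class, so marking it while running one subclass is still visible when another subclass runs.

class Base:
    def test_generated(self):
        pass

class SumTests(Base):
    not_implemented = ['test_generated']   # expected to fail here

class LenTests(Base):
    not_implemented = []                   # expected to pass here

# "Run" SumTests first: setdefault stores True on the shared function...
SumTests().test_generated.__dict__.setdefault('__unittest_expecting_failure__', True)

# ...so when LenTests "runs", setdefault sees the key and keeps the stale True.
LenTests().test_generated.__dict__.setdefault('__unittest_expecting_failure__', False)
print(LenTests.test_generated.__dict__['__unittest_expecting_failure__'])  # True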

Reading the source

Having had no luck so far, I decided to look into the __unittest_expecting_failure__ marker; specifically, where it was read rather than where it was set. In Python 3.4.4, the interesting region was on lines 566 to 570 of unittest/case.py:

        expecting_failure_method = getattr(testMethod,
                                           "__unittest_expecting_failure__", False)
        expecting_failure_class = getattr(self,
                                          "__unittest_expecting_failure__", False)
        expecting_failure = expecting_failure_class or expecting_failure_method

So whether a failure was expected was calculated from both the method and the class. Reading a bit further gave me another piece of information I needed: the name of the test method is stored in an attribute called _testMethodName.
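As a quick throwaway illustration (not part of the VOC suite), you can observe that attribute from inside any test:

import unittest

class ProbeTests(unittest.TestCase):
    def test_probe(self):
        # While this test runs, _testMethodName holds its own name.
        self.assertEqual(self._testMethodName, 'test_probe')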

Using this new piece of information, I could rewrite the run method as follows:

    def run(self, result=None):
        # Override the run method to inject the "expectingFailure" marker
        # when the test case runs.
        self.__unittest_expecting_failure__ = self._testMethodName in self.not_implemented
        return super().run(result=result)

And all was right in the world. So I did a git push, and awaited the results from Travis-CI.

Oh, while we are waiting: did you know the nice thing about the attribute named _testMethodName? It's guaranteed 100% not to clash with the name of an attribute in a subclass. Why? Because _testMethodName is not PEP 8 compliant, and everybody writes PEP 8 compliant code nowadays.

But Travis-CI did not like it

Back on topic. Travis-CI got back with the results. Failure! To be honest, Travis-CI had been complaining about my work on this for so long that I was starting to feel like a failure myself.

However, I did make progress: the test case now passed on Python 3.4.4, but not on Python 3.4.2. Looking at unittest/case.py from 3.4.2, it was obvious that the check on the class attribute was missing there. Still, seeing the progress I had made, I refused to give up. I knew the solution was close.
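In other words, 3.4.2 presumably consulted only the test method, along the lines of the 3.4.4 snippet above minus the class check:

        # Paraphrased: only the method attribute is consulted, so a marker
        # set on the class (or the instance) is never seen.
        expecting_failure = getattr(testMethod,
                                    "__unittest_expecting_failure__", False)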

I had one last trick up my sleeve, a trick I was really hoping not to use: replacing the test method with a wrapper.

    def run(self, result=None):
        # Override the run method to inject the "expectingFailure" marker
        # when the test case runs.
        if self._testMethodName in self.not_implemented:
            method = getattr(self, self._testMethodName)
            wrapper = lambda *args, **kwargs: method(*args, **kwargs)
            wrapper.__unittest_expecting_failure__ = True
            setattr(self, self._testMethodName, wrapper)
        return super().run(result=result)

I did another quick run, pushed the code back to GitHub, and awaited the results from Travis-CI. It worked! Success! Party-time! (And it makes sense: the wrapper is a fresh function object stored on the instance, so marking it cannot leak into other tests, and the method-level check is present in both Python versions.)

What I left out

There were a couple more complications I did not go into: I tried to reduce duplicate code by creating a mixin class. However, that did not initially go so well, because self.not_implemented was not always defined. In the end I got it working, though.
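For the curious, the mixin idea boils down to something like the following sketch; the class name and the getattr fallback for a missing not_implemented are my illustration here, not necessarily the exact code that ended up in the pull request.

class ExpectedFailureFromNotImplemented:
    # Hypothetical mixin name; the important part is falling back to an
    # empty list when a test class defines no not_implemented at all.
    def run(self, result=None):
        if self._testMethodName in getattr(self, 'not_implemented', []):
            method = getattr(self, self._testMethodName)
            wrapper = lambda *args, **kwargs: method(*args, **kwargs)
            wrapper.__unittest_expecting_failure__ = True
            setattr(self, self._testMethodName, wrapper)
        return super().run(result=result)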

A short remark

One thing I would like to note in particular: even though I found something in the test suite of VOC that annoyed me, I am not annoyed by (the test suite of) VOC itself. I think it is a really interesting project, and it has real potential for bringing Python to the Java world (Android, in particular).

The parts of the codebase I have seen so far also make real sense to me. From what I understand, they welcome any contributors. So if you are interested in getting a Python-to-Java transpiler working, why not help out the VOC project!

Also, the "trick" that was used to auto-generate a lot of test cases is really nice. I've seen it used before (albeit in a different way) in an internal project I was once working on. It has the advantage of generating a lot of test cases with very little effort. The drawback is that it generates a lot of test cases.

Conclusion

In the end, by understanding the problem space and the wiggle room for solutions, and sometimes even by digging into the internals of a tool you are using, you can find an elegant solution to a problem.