Posts Tagged “testing”
Feb 15, 2012
This is the second part of my unit testing advice; see the first part on this blog. If you need an introduction, you should really read that first part. Here I’ll just present the other three ideas I wanted to cover.
Focusing on common cases
This mistake consists of testing only, or mostly, common cases. Such tests rarely fail and give a false sense of security. Tests are better when they also include less common cases, because those are much more likely to break inadvertently. Common cases not only break far less often, but will probably be caught reasonably quickly once someone tries to use the buggy code, so testing them has comparatively less value than testing the less common ones.
The best example I found was in the wrap_string tests. The relevant example was adding the string “A test of string wrapping…”, which wraps not to two lines, but three (the wrapping is done only on spaces, so “wrapping…” is taken as a single unit; in this sense, my test case could have been clearer and used a very long word instead of a word followed by an ellipsis). Most of the cases we’ll deal with will simply wrap a given string into two lines, but wrapping into three must work too, and it’s much more likely to break if we decide to refactor or rewrite the code in that function with the intention of keeping the functionality intact.
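To make this concrete, here is a minimal sketch of the two kinds of test in Python (the real wrap_string tests are in C; I’m using the standard library’s textwrap as a stand-in, purely to illustrate the principle):

```python
import textwrap  # standard-library stand-in for a wrap_string-like function

def test_wrap_two_lines():
    # Common case: the text wraps neatly into two lines. This kind of
    # test almost never fails, so it adds comparatively little value.
    assert textwrap.wrap("A test of string", width=10) == ["A test of", "string"]

def test_wrap_three_lines():
    # Less common case: wrapping happens only on spaces, so the final
    # "wrapping..." is kept as a single unit and forces a third line.
    # This is the kind of behaviour that silently breaks in a rewrite.
    assert textwrap.wrap("A test of string wrapping...", width=15,
                         break_long_words=False) == \
        ["A test of", "string", "wrapping..."]
```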
See other examples of this in aa20bce (no tests with more than one consecutive newline, no tests with lines of only non-printable characters), b248b3f (no tests with just dots, no valid cases with more than one consecutive slash, no invalid cases with content other than slashes), 5e771ab (no directories or hidden files), f8ecac5 (invalid hex characters don’t fail, but produce strange behaviour instead; this test actually discovered a bug), 7856643 (broken escaped content) and 87e9f89 (trailing garbage).
Not trying to make the tests fail
This is related to the previous one, but the emphasis is on trying to choose tests that we think will fail (either now or in the future). My impression is that people often fail to do this because they are trying to prove that the code works, which misses the point of testing. The point is trying to prove the code doesn’t work. And hope that you fail at it, if you will.
The only example I could find was in the strcasecmpend tests. Note how there’s a test that checks that the last three characters of the string “abcDEf” (i.e. “DEf”) are less than “deg” when compared case-insensitively. That’s almost pointless, because if we made that same comparison case-sensitively (in other words, if the “case” part of the function broke), the test would still pass! Thus it’s much better to compare the strings “abcdef” and “Deg”.
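In Python terms, the argument looks roughly like this (strcasecmpend is a C helper in Tor; ends_less_than below is a hypothetical stand-in for it):

```python
def ends_less_than(s, ending, n):
    # Hypothetical stand-in for strcasecmpend: is the last-n-characters
    # suffix of `s` less than `ending`, compared case-insensitively?
    return s[-n:].lower() < ending.lower()

def test_weak():
    # Almost pointless: "DEf" < "deg" holds even case-SENSITIVELY
    # ('D' < 'd' in ASCII), so this passes even if the case-folding
    # part of the function breaks.
    assert ends_less_than("abcDEf", "deg", 3)

def test_strong():
    # Better: case-insensitively "def" < "deg", but case-sensitively
    # "def" > "Deg" ('d' > 'D' in ASCII), so this fails the moment
    # the case handling breaks.
    assert ends_less_than("abcdef", "Deg", 3)
```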
Addendum: trying to cover all cases in the tests
There’s another problem I wanted to mention, which I have seen several times before, although not in the Tor tests: making complicated tests that try to cover many or all cases. This seems to stem from the idea that having more test cases is good by itself, when actually more tests are only useful when they increase the chances of catching bugs. For example, if you write tests for a “sum” function and you’re already testing [5, 6, 3, 7], it’s probably pointless to add a test for [1, 4, 6, 5]. A test that would increase the chances of catching bugs would look more like [-4, 0, 4, 5.6] or some similarly unusual input.
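As a sketch of the difference (plain Python, with the built-in sum standing in for the function under test):

```python
def test_sum_edge_cases():
    # Mixed signs, zero and a float exercise behaviour that yet
    # another list of small positive integers would not.
    assert sum([-4, 0, 4, 5.6]) == 5.6
    # The empty input is another classically forgotten case
    # (my example, not from the original tests).
    assert sum([]) == 0
```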
So what’s wrong with having more tests than necessary? They make the test suite slower, harder to understand at a glance and harder to review. If they don’t contribute anything to the chance of catching bugs anyway, why pay that price? But the biggest problem is when we try to cover so many cases that code has to produce the test data. Then we have all the above problems, plus the test suite becomes almost as complex as production code: it becomes much easier to introduce bugs into, harder to follow the flow of, and so on. The tests are our safety net, so we should be fairly sure that they work as expected.
And that’s the end of the tips. I hope they were useful :-)
Feb 14, 2012
When reviewing tests written by other people, I see patterns in the improvements I would make. As I realised that these “mistakes” are also made by experienced hackers, I thought it would be useful to write about them. The extra push to write about this now was having concrete examples from my recent involvement in Tor, which will hopefully illustrate these ideas.
These ideas are presented in no particular order. Each of them has a brief explanation, a concrete example from the Tor tests, and, if applicable, pointers to other commits that illustrate the same idea. Before you read on, let me explicitly acknowledge that (1) I know that many people know these principles, but writing about them is a nice reminder; and (2) I’m fully aware that sometimes I need that reminder, too.
Edit: see the second part of this blog.
Tests as spec
Tests are more useful if they can show how the code is supposed to behave, including safeguarding against future misunderstandings. Thus, it doesn’t matter if you know the current implementation will pass those tests or that those test cases won’t add more or different “edge” cases. If those test cases show better how the code behaves (and/or could catch errors if you rewrite the code from scratch with a different design), they’re good to have around.
I think the clearest example was the tests for the eat_whitespace* functions. Two of those functions end in _no_nl, and they only eat initial whitespace (except newlines). The other two functions eat initial whitespace, including newlines… but also eat comments. The tests from line 2280 on are clearly targeted at the second group, as they don’t really represent an interesting use case for the first. However, without those tests, a future maintainer could have thought that the _no_nl functions were supposed to eat comments too, and broken the code. That produces confusing errors and bugs, which in turn make people fear touching the code.
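A sketch of what such a spec-style test might look like (Python; the real eat_whitespace* functions are C helpers in Tor, and this tiny re-implementation is only an assumption for illustration):

```python
def eat_whitespace_no_nl(s):
    # Hypothetical stand-in: skip leading spaces and tabs,
    # but stop at newlines and leave comments alone.
    return s.lstrip(" \t")

def test_no_nl_leaves_newlines_and_comments():
    # Not an interesting use case by itself, but it pins down the
    # spec: a future maintainer can't quietly "fix" this function
    # into eating newlines or comments without a test complaining.
    assert eat_whitespace_no_nl("  \t\n# comment\nx") == "\n# comment\nx"
```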
See other examples in commits b7b3b99 (escaped ‘%’, negative numbers, %i format string), 618836b (should an empty string be found at the beginning, or not found at all? does “\n” count as the beginning of a line? can “\n” be found by itself? what about a string that spans more than one line? what about a line including the “\n”, with and without the haystack having the “\n” at the end?), 63b018ee (how are errors handled? what happens when a %s gets part of a number?), 2210f18 (is a newline only \r\n or \n, or any combination of \r and \n?) and 46bbf6c (check that all non-printable characters are escaped in octal, even if they were originally in hex; check that characters in octal/hex, when they’re printable, appear directly and not in octal).
Testing boundaries
Boundaries of different kinds are a typical source of bugs, and thus are among the best testing points we have. It’s also good to test both sides of each boundary, both as an example and because bugs can appear on either side (and not necessarily on both at once!).
The best example is the tor_strtok_r_impl tests (a function that is supposed to be compatible with strtok_r; that is, it chops a given string into “tokens” separated by any of the given separator characters). In fact, these extra tests discovered an actual bug in the implementation (i.e. an incompatibility with strtok_r). Those extra tests asked a couple of interesting questions, including “when a string ends in the token separator, is there an empty token at the end?” in the “howdy!” example. This test can also be considered valuable in the “tests as spec” sense, if you consider that the answer to the above question is not obvious and both answers could be considered correct.
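In Python terms, the question reads like this (str.split with an explicit separator stands in for the tokenizer; strtok_r itself gives the opposite answer, swallowing the trailing empty token):

```python
def test_trailing_separator():
    # Does a string ending in the separator produce an empty token
    # at the end? str.split says yes; strtok_r says no. Neither
    # answer is obviously wrong, which is exactly why such a test
    # also works as a spec.
    assert "howdy!".split("!") == ["howdy", ""]
```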
See other examples in commits 5740e0f (checking that tor_snprintf correctly counts the number of bytes, as opposed to characters, when calculating whether something can fit in a string; also note my embarrassing mistake of testing snprintf, and not tor_snprintf, later in the same commit), 46bbf6c (check that character 21 doesn’t make a difference, but 20 does) and 725d6ef (testing 129 is very good, but even better together with 128; or, in this case, 7 and 8).
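The bytes-versus-characters boundary is easy to reproduce in a sketch (Python, assuming UTF-8; fits is a hypothetical helper, not the real tor_snprintf logic):

```python
def fits(s, size):
    # Hypothetical version of the check a snprintf-like function makes:
    # does `s` fit in a buffer of `size` bytes, NUL terminator included?
    return len(s.encode("utf-8")) + 1 <= size

def test_bytes_not_characters():
    s = "añejo"            # 5 characters, but "ñ" is 2 bytes in UTF-8
    assert len(s) == 5
    assert not fits(s, 6)  # 6 bytes of content plus NUL don't fit in 6...
    assert fits(s, 7)      # ...but do fit in 7: test both sides
```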
Testing implementation details
Testing implementation details tends to be a bad idea. You’re usually testing implementation details if you’re not getting the information for your checks from the APIs provided by whatever you’re testing: for example, if you test an API that inserts data into a database by checking the database directly, or if you check that the result of a method call was correct by inspecting the object’s internals or calling protected/private methods. There are two reasons why this is a bad idea: first, the more implementation details your tests depend on, the fewer implementation details you can change without breaking your tests; second, such tests are typically less readable, because they’re cluttered with details instead of meaningful code.
The only example I encountered of this in Tor was in the compression tests. In this case it wasn’t a big deal, really, but I have seen this before in much worse situations, and I feel it illustrates the point well enough. The problem with that deleted line is that its purpose isn’t clear (it needs a comment), plus it uses a magic number, meaning that if someone ever changes that number by mistake, it’s not obvious whether the problem is in the code or in the test. Besides, we are already checking that the magic number is correct by calling detect_compression_method. Thus, the deleted memcmp doesn’t add any value, and it makes our tests harder to read. Verdict: delete!
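To show the difference outside Tor, here’s a sketch with a made-up key-value store (Python; everything here is hypothetical):

```python
class Store:
    """A made-up class; _items is an internal detail."""
    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key):
        return self._items[key]

def test_via_internals():
    # Fragile: depends on _items existing and being a dict. Renaming
    # the field or moving storage elsewhere breaks the test, even
    # though the behaviour is still correct.
    s = Store()
    s.put("a", 1)
    assert s._items == {"a": 1}

def test_via_public_api():
    # Robust and readable: only the public API is used, so the
    # internals can change freely as long as behaviour is preserved.
    s = Store()
    s.put("a", 1)
    assert s.get("a") == 1
```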
I hope you liked the examples so far. My next post will contain the second half of the tips.
Sep 5, 2010
Summary: there’s a simple tool that will tell you which Facebook sharing options are “too open” in your account. I’d like you to help me by trying it out and telling me what you think (if you had problems using it, if you would like extra/other information to be shown, if you found any bugs, etc.). Skip to “how to use it” below if you’re not interested in the details for developers. Thanks!
Weeks passed and the tool didn’t get any updates, so I decided to step in and help the original programmer adapt the tool so it worked again. The ReclaimPrivacy code is on GitHub, so it was pretty easy to make my own fork and start hacking away. It didn’t take me long to adapt the first pieces to the new privacy settings layout, and after some more time I was much more comfortable with the code, had made more things work, added tests and even added new features. Now that it’s getting close to something we could release as the new official ReclaimPrivacy version, I’d like your feedback.
The getInformationDropdownSettings method, renamed to getSettingInformation, is now shorter, more readable, more testable and has more features. The changes are: (1) making it receive an object with the relevant part of the DOM, instead of a window object; (2) supporting, in principle, any kind of setting, not only dropdowns; (3) allowing each setting to have its own idea of what “too open” means (see the settings array); (4) allowing the caller of the method to specify its own list of recognised settings and acceptable privacy levels; (5) passing the number of open and total sections to the handler, instead of just a boolean stating whether or not there’s any “too open” setting.
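The real code is JavaScript, but point (5) boils down to a handler signature change along these lines (a Python sketch; the names are made up):

```python
# Before: the handler only learns WHETHER something is too open.
def on_result(any_too_open):
    print("Warning!" if any_too_open else "All good.")

# After: the handler gets the number of open and total sections, so the
# UI can say "2 of 7 sections are too open" instead of a bare warning.
def on_result_with_counts(open_sections, total_sections):
    print(f"{open_sections} of {total_sections} sections are too open.")
```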
I made the old getUrlForV2Section more testable by extracting the most interesting (read: likely to break or need maintenance) code to its own method, _extractUrlsFromPrivacySettingsPage, and making the new getUrlForV2Section work with both real URLs (checking Facebook with an Ajax call) and fake HTML dumps representing what those URLs would return.
I made the old withFramedPageOnFacebook, a very important method used in several places, more flexible by accepting not just URLs, but also functions or data structures (new withFramedPageOnFacebook).
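The flexibility of the new withFramedPageOnFacebook comes down to dispatching on the type of the argument. A minimal sketch of that pattern (Python; the real method is JavaScript, and these names and behaviours are assumptions):

```python
def with_page(source, handler):
    # Accept a function producing a page, a URL string to fetch,
    # or an already-built data structure (handy as a fake in tests).
    if callable(source):
        page = source()
    elif isinstance(source, str):
        page = fetch(source)  # hypothetical Ajax-like network call
    else:
        page = source
    handler(page)

def fetch(url):
    # Stand-in for the real network request; tests avoid this path
    # by passing a data structure or a function instead of a URL.
    raise NotImplementedError("network access is stubbed out here")

# In a test, no network is needed:
with_page({"settings": []}, lambda page: print(page))
```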
Aug 10, 2010
This post is probably not about what you’re thinking. It’s actually about automated testing.
Different things I’ve been reading or otherwise been exposed to in the last few weeks have led me to a somewhat funny comparison: code is (or can be) like science. You come up with some “theory” (your code) that explains something (solves a problem)… and you make sure you can measure and test it, so that people can believe your theory and build on top of it.
I mean, something claiming to be science that can’t be easily measured, compared or peer-reviewed would be ridiculous. Scientists wouldn’t believe in it and would certainly not build anything on top of it because the foundation is not reliable.
I claim that software should be the same way, and thus it’s ridiculous to trust software that doesn’t have a good test suite, or even worse, that may not even be particularly testable. Trusting software without a test suite is not that different from taking the word of the developer that it “works on my machine”. Scientists would call untested science pseudo-science, so I am tempted to call code without tests pseudo-code.
Don’t get me wrong: sure you can test by hand, and hand-made tests are useful and necessary, but that only proves that the exact code you tested, without any changes, works as expected. But you know what? Software changes all the time, so that’s not a great help. If you don’t have a way to quickly and reliably measure how your code behaves, every time you make a change you are taking a leap of faith. And the more leaps of faith you take, the less credible your code is.
Dec 6, 2009
The other day I was in a conversation with a developer who was complaining about some feature. He claimed that it was too complex and that it had led to tons of bugs. In the middle of the conversation, the developer said that the feature had been so buggy that he had ended up writing a lot of unit tests for it. To be honest, I don’t think there were a lot of bugs after those tests were written, so that made me think:
Maybe the testers in his team are doing too good of a job?
Because, you know, if testers are finding enough of “those bugs” (the ones that should be caught and controlled by unit tests, not found by testers weeks after the code was originally written), maybe some developers just don’t “feel the pressure” and never really get that they should be writing tests for their code. If testers are very good, things just work out fine in the end… sort of. And of course, the problem here is the trailing “sort of”.
I know I’m biased, but in my view there is a ton of bugs that should never be seen by anyone but the developers themselves. Testers should deal with more complex, interesting, user-centred bugs: non-trivial cases, suboptimal UIs, implementation disagreements between developers and stakeholders, that kind of thing. It’s simply a waste of time and resources for testers to deal with silly, easy-to-avoid bugs on a daily basis. Not to mention that teams shouldn’t have to wait days or weeks to find basic bugs via exploratory testing, or that many of those bugs are much, much quicker to check with unit tests than by creating the whole fixture/environment needed to find them through exploratory testing.
My current conclusion is that pushing on the UI/usability side is not only good for the user, but it’s likely to produce better code as it will be, on average, more complex and will have to be better controlled by QA (code review, less ad-hoc design, …) and automated tests. Maybe developers will start hating me for that, but hopefully users will thank me.
Sep 20, 2009
I had said that I was going to publish the slides for a couple of talks I had given over the last couple of months, and I just got around to actually doing it, so here they are:
Software automated testing 123, an entry-level talk about software automated testing. Why you should be doing it (if you’re not already), some advice for test writing, some basic concepts and some basic examples (in Perl, but I trust it shouldn’t be too hard to follow even if you don’t know the language).
Taming the Snake: Python unit tests, another entry-level talk, but this time specifically about Python unit testing. How to write xUnit-style tests with unittest, some advice and conventions, and some notes on how to use an excellent third-party testing tool.
Just a quick note about them: the slides shouldn’t be too hard to understand without me talking, but of course you’ll lose some stuff that isn’t written down: some twists, clarifications of what exactly I mean by different things, and whatnot. In particular, the “They. don’t. make. sense. Don’t. write. them” bit refers to tests that don’t have a reliable/controlled environment to run in. I feel really strongly about those, so I wanted to dedicate a few more seconds to smashing the idea that they’re OK, hence the extra slides :-)
Enjoy them, and please send me any comments you have about them!
Sep 13, 2009
I spent the whole of last week (or this week; after all it’s Sunday… and Sunday is obviously the last day of the week, not the first, right?) in Linköping, Sweden. The idea was to repeat a Debian course I had given here in Oslo, give two more talks about automated testing since I was there anyway, and attend two more talks. It was lots of fun, partly thanks to my “host” (thanks Gerald!), and surprisingly I found a bunch of things that seemed plain weird to me… or at least quite different from Oslo.
The talks themselves went pretty well, I think, although I’d have preferred more people attending. I guess it was normal that there were fewer people than I’m used to, since the Linköping office is much smaller. But anyway. The Debian course went quite well, and some people got started packaging stuff almost right away. The other talks were an introduction to automated testing (advocacy and arguments for it, advice, basic examples and a small rant about a different kind of QA), which went OK, and an entry-level talk about unit testing in Python (thanks Ask and Batiste for the information and for reviewing the slides!), which went very well. I’ll try to get the slides for all the talks available somewhere.
About the city itself, it’s a charming little part of Sweden where:
Restaurants have insanely different prices for food depending on whether it’s lunch or dinner. Typical prices for lunch are around 80 SEK (around 8 EUR), while typical prices for dinner are around 250 SEK for just the main course!
Restaurants usually serve some Swedish dish for lunch… and I mean every restaurant, including all the Greek, Vietnamese, etc. ones. Considering that “real” Swedish restaurants are very expensive, you usually go to those foreign-cuisine places when you actually want to eat Swedish food.
Restaurants typically have some salad (that you have to take yourself) while you wait for the food… and some coffee, tea and cookies (that obviously you have to take yourself) for the end.
Related to this, restaurants are usually very much self-service. I thought service in Norway sucked, but boy was I wrong; in Norway, at least, there is some service. Also: there were typically long but pretty fast-moving queues, and there was one place where you didn’t even get the food at your table after ordering at the bar; instead, you were given a gadget with a wireless receiver, and when your food was ready it’d beep so you knew you had to go to a counter and fetch your food. Is maintaining some gadgets really cheaper than hiring a waiter? I guess so.
The restrictions on the amount of alcohol that can be bought outside the special government booze stores are even stricter than in Norway. You can only buy booze with up to 3.5% alcohol outside “Systembolaget”. Now that is sad. And there I was, complaining about Norway’s 5%.
Partly because of that (I assume/hope), the Swedish “cider” you get in Sweden is even sweeter and worse than the Swedish cider you get in Norway.
We went to this nice student pub… which was literally for students. They actually checked your student ID, but each student could bring one non-student along. Once you were “identified” as a non-student-coming-with-a-student, you’d get a stamp on your hand so you wouldn’t have to bring the student along when you ordered again. Also, the place was so very slow it was almost funny. One of the good sides was that they had what I thought was the only decent Swedish cider… but after checking just now, it seems it’s actually American. Bummer. And its name was funny too: “Hardcore Cider”.
Right before leaving the office on Friday there was a small gathering in the canteen (the “Friday Beer”), where they had a Dreamcast with one of the most awesome games I’ve seen in a long while: The Typing of the Dead, a version of The House of the Dead 2 in which you kill the zombies by typing words that appear on the screen, instead of aiming and shooting with a gun.
Oct 21, 2008
I admit it. I’m a terrible developer. I write code, sometimes even write tests.
But. I. don’t. test. my. programs.
By hand, that is. And sometimes (usually) the coverage is not enough, and I end up making embarrassing mistakes. It usually happens outside of work, although at work I also have my share. The last one was with the Debian package dhelp, where, trying to fix an issue before Lenny is released, I ended up making things even worse. The story goes like this:
There was a problem with the indexing of documents on installation/upgrade (namely, it would take ages for most people upgrading to Lenny, and they would think the upgrade process had hung). So I went and changed the indexing code to ignore documents on installation/upgrade. Also, as someone suggested, I created a small example utility to reindex the documentation for certain packages. I tested installation, upgrades, upgrading the dhelp package itself, the utility, and searching for keywords before and after all that… and everything worked.
Only, I had made a typo. A typo that caused all indexing to be ignored (except in the example utility, because it worked at a slightly lower level). And I didn’t realise, because it “only” broke some cronjob, a completely different part of the package. It happens that the cronjob reindexed everything weekly, to make sure you had reasonably up-to-date search indices. And it also happens that, given that documentation reindexing was being ignored on package installation/upgrade, the weekly full reindex was the only thing that could provide the user with indexed documentation. But I screwed up. Oh well.
Someone filed a bug yesterday, and I fixed it more or less right away. But this time I spent a couple of hours thinking of test paths and ways to make the code fail, and actually doing all that testing. Thanks to that, I found a potential bug in the example utility, which I fixed just in case. So hopefully everything is fine now, if I can convince the Release Masters to accept the new, less broken dhelp update for Lenny.
I think I need personal QA. Anyone up to the task?