Yes, this kind of thing is essential if you want to design a system that stays up.
I have another fun thing along these lines. When a request comes into the server, you assign it an integer counter. Each time the handling of the request crosses a transaction boundary (a transaction on the underlying database management system), you decrement the counter by one and keep passing it along the chain of handling. (The chain might include sending messages from one component to another, getting replies, and so on. For now I'm assuming that request handling uses only one thread.)
Whenever the count reaches zero, the component that is currently handling the request is killed. The idea is to make sure that you test "all the places" where a crash might happen; that is, "all" with respect to the database state.
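Here's a minimal sketch of the idea in Common Lisp. The names *transaction-fuse*, call-with-transaction, and handle-request are invented for illustration, not taken from our actual system:

    ;; Hypothetical sketch: a per-request fuse, decremented at every
    ;; transaction boundary; when it reaches zero, the handler is killed.
    (defvar *transaction-fuse* nil
      "Countdown for the current request; NIL disables the time bomb.")

    (define-condition simulated-crash (error) ())

    (defun call-with-transaction (thunk)
      "Run THUNK in a (stubbed) database transaction, ticking the fuse
    at the boundary and signalling SIMULATED-CRASH when it hits zero."
      (when (and *transaction-fuse* (<= (decf *transaction-fuse*) 0))
        (error 'simulated-crash))
      (funcall thunk))   ; a real version would BEGIN/COMMIT around THUNK

    (defun handle-request (handler fuse)
      "Run HANDLER (a function of no arguments) with the fuse bound for
    the whole chain of handling on this thread."
      (let ((*transaction-fuse* fuse))
        (funcall handler)))

Running the same request repeatedly with the fuse set to 1, 2, 3, ... kills it at each successive transaction boundary, and after each run you can check that the database state is still consistent.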
This tool tests the system under a certain set of assumptions. It assumes that the DBMS does what it's told to do. It tests "stop" failures. It works best when the only side effects are to the database system. If there are other side effects (e.g., components put things into a cache that persists and is used by subsequent requests), then you are less sure that you are exercising all possible paths.
I like to call these things "failure injection" (I didn't invent that term). We have other failure injection stuff all over our system. The "transaction ticking time bomb" one is just my favorite.
Using a random tool might not find all of these states. Random tools are great, but there are other useful test tools, too. In general, trying to "test all possible circumstances" is very, very hard; you never know which particular input values will cause a problem, and then you have to worry about variable A having certain values while variable B has certain other values, leading to a combinatorial explosion of circumstances to test.
In my opinion, one of the best ways to deal with this problem is by having experienced Q/A people who have a knack for guessing what cases ought to be tested.
There's a lot I could say about this, but the main thing is that it's one of the reasons I have doubts about the wisdom of the methodology used at Facebook, where there is no Q/A department and programmers are expected to do their own Q/A. That's just one of the reasons.
It helps that at Facebook you can roll out a new feature to a very small subset of users and withdraw it quickly if it's causing a problem; if it doesn't work, that usually matters little when only a small number of early adopters are affected. Things generally don't work that way with an airline reservation system. The methodologies suited to one situation are not necessarily suitable for another.
-- Dan
Peter Seibel wrote:
Presumably this kind of thing is the reason for the Chaos Monkey:
http://www.readwriteweb.com/cloud/2010/12/chaos-monkey-how-netflix-uses.php
-Peter
On Mon, Dec 20, 2010 at 6:37 PM, Scott L. Burson Scott@sympoiesis.com wrote:
On Fri, Dec 17, 2010 at 11:16 AM, Ryan Davis ryan@acceleration.net wrote:
We do something like this. For the lisp websites my company makes, we have a password-protected admin section with some light UI to help us manage the site (turn logging levels up/down, clear caches, etc.), and one of those tools is an "evaluate this code in the running lisp" textarea, with a dropdown to select which package it runs in. This is used very rarely, to patch the site in emergency situations or for trivial changes where we don't want to bring down the site. It has bitten us a few times, where we fixed a small bug directly in the running lisp and then forgot to publish the new code, and got mystery regressions when the lisp process was restarted.
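A minimal sketch of that kind of eval endpoint, assuming Hunchentoot (the handler name and the /admin/eval URI are invented here; a real one belongs behind the password-protected admin pages):

    ;; Hypothetical sketch, not the actual tool described above.
    ;; Assumes Hunchentoot is loaded, e.g. (ql:quickload "hunchentoot").
    (hunchentoot:define-easy-handler (admin-eval :uri "/admin/eval")
        (code package)
      (if code
          (let* ((*package* (or (and package
                                     (find-package (string-upcase package)))
                                (find-package :cl-user)))
                 ;; Read the submitted form in the chosen package, then eval it
                 ;; in the running image and return the printed result.
                 (result (eval (read-from-string code))))
            (format nil "~S" result))
          "No code supplied."))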
A really funny cautionary tale about this sort of thing:
http://thedailywtf.com/Articles/Designed-For-Reliability.aspx
-- Scott