It was published in the Elsevier journal 'Information Sciences' in 1999.
It appears to be the first, and the first large, collection of statistical hypothesis tests that are both multi-dimensional and distribution-free.
I try to be anonymous here at HN, but I'm willing enough to send a PDF of the paper to anyone who wants a copy. E.g., ask for a copy and leave your e-mail address on your HN profile, at least temporarily.
The main point of the paper is that we do get an hypothesis test. In particular, we get to select false alarm rate and then get that rate essentially exactly in practice.
It's behavioral monitoring -- it assumes that the past and future of healthy performance are, to be simple, statistically the same. So, right, it's for a server farm or network that is statistically relatively stable, that is, statistically unchanging in what it is doing. The site can be wild and crazy, but it has to continue to be wild and crazy in statistically the same way.
In particular, the work is for detecting zero day problems, that is, problems never seen before. Maybe the philosophy here is that when we get a new problem and detect it, then we fix the cause of the problem and never see it again and, then, again are left looking for zero day problems.
Then the work uses past data -- hypothesis tests have done that since Karl Pearson 100+ years ago, and now parts of computer science do something similar and call it training data in unsupervised learning or some such. The approaches of just statistical hypothesis testing make more sense to me.
The key, core mathematical argument is a finite algebraic group of measure preserving transformations on the data. I believe that there are connections with U-statistics, e.g., as in an advanced statistics book by Serfling.
This stuff with groups and measure preserving is a little like some classic arguments in ergodic theory. On the page, the math looks awful, but actually it is conceptually quite simple.
But, you don't need to dig into the math too much.
For the actual calculations, those are based on nearest neighbors (although other options also work with the basic math). At least since the paper, others have thought of using nearest neighbors, but they didn't have an hypothesis test because they didn't know how to calculate and adjust the false alarm rate. So, they have an heuristic instead of an hypothesis test. So, again, the main contribution of the paper is that it really is an hypothesis test, that is, you know and get to adjust the false alarm rate (conditional on the old data, and also true in long-run expectation over that conditioning -- a standard result on conditional expectation from the Radon-Nikodym theorem in measure theory).
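To make the idea concrete, here is a toy sketch in Python -- NOT the group-theoretic construction in the paper, just the standard empirical-quantile idea with a nearest-neighbor statistic, assuming the healthy observations are exchangeable. All the data here are synthetic, made up for the demo:

```python
import random, math

def nn_dist(x, pts):
    """Euclidean distance from x to its nearest neighbor in pts."""
    return min(math.dist(x, p) for p in pts)

def calibrate(train, alpha):
    """Split the training data; return (reference set, threshold) such
    that a fresh healthy point exceeds the threshold with probability
    about alpha (exchangeability assumed)."""
    random.shuffle(train)
    half = len(train) // 2
    ref, cal = train[:half], train[half:]
    scores = sorted(nn_dist(x, ref) for x in cal)
    # (1 - alpha) empirical quantile of the calibration scores
    k = min(len(scores) - 1, math.ceil((1 - alpha) * (len(scores) + 1)) - 1)
    return ref, scores[k]

def is_alarm(x, ref, threshold):
    return nn_dist(x, ref) > threshold

# demo with synthetic 2-D "healthy" data
random.seed(0)
train = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2000)]
ref, thr = calibrate(train, alpha=0.05)
healthy = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
rate = sum(is_alarm(x, ref, thr) for x in healthy) / 1000
print(f"observed false alarm rate ~ {rate:.3f}")  # should be close to alpha
print(is_alarm((6.0, 6.0), ref, thr))             # far-out point: alarm
```

The point of the sketch is the knob: you pick alpha, the threshold follows from the data, and on fresh healthy data you see about that alarm rate, with no distribution assumed.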
For detection rate, there is some good news, not as good as from the classic Neyman-Pearson result (in practice, in this context, we don't have enough data to do much with Neyman-Pearson), but nice: in a useful sense, for the selected false alarm rate, the work gives the highest possible detection rate. Really the mathematical key here is just Fubini's theorem (the measure theory version of interchanging the order of integration). Intuitively, the technique has the largest region where alarms are raised consistent with the selected false alarm rate.
For the practical application, do need some help with some computational geometry. For that, I dreamed up some work. Soon I found that part of what I dreamed up was k-D trees, e.g., as in Sedgewick's book on algorithms. But there is more -- need some cutting planes. I programmed most of it 20+ years ago in PL/I but finally dropped it due to lack of interest.
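For flavor, here is a minimal k-D tree with nearest-neighbor search -- a toy version of the textbook structure, not my old PL/I code. The pruning test is where the "cutting planes" come in: a subtree is skipped unless a closer point could lie on the far side of its splitting plane:

```python
import math

def build(points, depth=0):
    """Build a k-D tree: cycle the split axis with depth, median split."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Return the tree point nearest to query, pruning subtrees whose
    half-space cannot contain a closer point than the current best."""
    if node is None:
        return best
    if best is None or math.dist(query, node["point"]) < math.dist(query, best):
        best = node["point"]
    axis, split = node["axis"], node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if query[axis] < split
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # cross the splitting plane only if a closer point could lie beyond it
    if abs(query[axis] - split) < math.dist(query, best):
        best = nearest(far, query, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
print(nearest(tree, (9, 2)))  # -> (8, 1)
```

With the tree, each nearest-neighbor query on n training points runs in roughly O(log n) time in low dimensions, which is what makes near real-time detection on a big training set plausible.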
I have some ideas for more results of interest and more papers, but after 20+ years of no interest I just gave up.
More can be said, but I stopped the research when I discovered, about 20 years ago, that no one was interested. The paper was published in 1999, and since then interest has been quieter than the tombs of ancient Egypt. So, I'm doing a startup that is quite different.
I dreamed up the work when I saw the need, or at least as I regarded the need, way back in about 1990 when I was in an AI group at the IBM Watson lab doing work on monitoring and management of large server farms and networks. The AI work was trying to build on essentially just threshold detectors. There was no attention to false alarm rate or a best detector -- highest detection rate for given false alarm rate. The classic Neyman-Pearson result was ignored. I was our guy with GM Research, and we gave a paper at the Stanford AAAI IAAI conference. But I was outraged by the lack of concern for false alarm rate, ignoring hypothesis tests and distribution-free hypothesis tests (long common in the social sciences), and with no attention at all to multi-dimensional data.
The real world context is just awash in multi-dimensional data. Treating the data components separately in effect says that the geometrical region of healthy behavior is just a box. Bummer. Box too small -- get false alarm rate too large. Box too big, get too many missed detections. Problem: A box is a poor fit to reality. Simple stuff.
How to see this? Monitor CPU busy and page faults per second and look for anomalies, e.g., thrashing, a program allocating infinite memory, etc. Then the normal behavior is just a 2-D box? I don't think so! But, sure, need to automate picking the shape of the region of healthy behavior.
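Here's a small synthetic demo of the box problem -- all the numbers are made up for illustration. When healthy, CPU busy and paging move together, so a point can sit inside both one-dimensional ranges (the box) while being far from every healthy observation:

```python
import random, math

random.seed(1)
# synthetic healthy samples: CPU busy % and page faults/s move together
healthy = []
for _ in range(1000):
    load = random.uniform(0.0, 1.0)
    cpu = 20 + 70 * load + random.gauss(0, 3)    # roughly 20..90 %
    faults = 5 + 40 * load + random.gauss(0, 2)  # roughly 5..45 /s
    healthy.append((cpu, faults))

# "box" detector: per-component thresholds only
cpu_lo, cpu_hi = min(c for c, _ in healthy), max(c for c, _ in healthy)
f_lo, f_hi = min(f for _, f in healthy), max(f for _, f in healthy)

def box_alarm(x):
    c, f = x
    return not (cpu_lo <= c <= cpu_hi and f_lo <= f <= f_hi)

# joint detector: distance to the nearest healthy observation
def nn_alarm(x, threshold=5.0):  # threshold chosen by eye for the demo
    return min(math.dist(x, h) for h in healthy) > threshold

suspect = (85.0, 8.0)  # high CPU, almost no paging: off the healthy ridge
print(box_alarm(suspect))  # False -- inside the box, missed
print(nn_alarm(suspect))   # True  -- far from every healthy point
```

Each coordinate of the suspect point is individually unremarkable; only the combination is strange, and only the multi-dimensional detector sees it.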
For the distribution-free stuff, that is where we make no assumptions about probability distributions. I got a kick in the back side on that sitting one day in the office of Ulf Grenander, one of the world's best ever statisticians, at Brown (I got accepted to grad school there; was considering going; went elsewhere instead). Grenander had been looking at computer performance data and was shocked at how different it was from the data, e.g., biomedical, he had been used to. So, right, Gaussian assumptions and more go out the window!
So, really, just want to make no assumptions about distributions, want to be distribution-free (a.k.a. non-parametric, although I believe distribution-free is the more appropriate terminology).
For multi-dimensional, at IBM I got a slap in the face: There was a cluster of computers doing transaction processing. There was some front end load leveling that sent the next transaction to the least busy computer in the cluster. Okay. But one day one of the computers got sick, just a little sick in the head, and was doing a very silly thing -- it was throwing all its incoming transaction work into the bit bucket! Thus, this computer looked to the load leveling as not very busy and, thus, was getting nearly all the transactions. Thus, nearly all the transactions for the whole cluster were going into the bit bucket. Bummer.
So, I thought, to detect this anomaly, want somehow to look at all of the computers in the cluster at the same time and compare them with each other, that is, have all the data in some appropriate region in some space of several dimensions, a region that works whether the cluster is busy or not. So, want to be multi-dimensional, that is, don't want just threshold detectors on variables one at a time.
There are more war stories where the importance of being multi-dimensional is crucial. Really, commonly separating multi-dimensional data into its components and treating the components separately can be throwing away a lot of crucial information which stands to give a poor combination of false alarm rate and detection rate.
Heck, in principle the region of healthy performance can be a fractal, say, like the Mandelbrot set, and, so, somehow we need to approximate that. Can we do that? Basically with nearest neighbor, or k-nearest neighbors (which also works), yes.
There is now a good opportunity for my work: My work can use a LOT of training data, and in the near real-time detection work can want to do a lot with that data. So, could use fast access to a lot of data which doesn't change very fast. So, sure, use some big solid state disks (SSDs)! A few of the Samsung 14 TB drives should do wonders for my paper!
My view is, anyone doing monitoring of a large server farm or network and not using what is in my paper is not being fully serious. And, since you get to adjust the false alarm rate, say, to one a month, no one can claim they can't afford the extra false alarms.
Uh, I left out: For each alarm, get told the lowest false alarm rate at which the real-time input data is still an alarm -- so get an indication of alarm seriousness.
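That per-alarm number behaves like an empirical p-value. Here is one standard way to compute such a thing -- a sketch of the idea, not necessarily the paper's exact formula: rank the new observation's test score against the calibration scores.

```python
def severity(score, cal_scores):
    """Lowest false alarm rate at which `score` would still be an alarm:
    the fraction of calibration scores at least as extreme (with the
    usual +1 correction so the rate is never exactly zero)."""
    n = len(cal_scores)
    ge = sum(1 for s in cal_scores if s >= score)
    return (ge + 1) / (n + 1)

cal = [0.4, 0.9, 1.1, 1.3, 2.0, 2.2, 2.5, 3.1, 4.0]
print(severity(3.5, cal))  # 0.2 -- an alarm at any selected rate >= 0.2
print(severity(9.9, cal))  # 0.1 -- more extreme, an alarm even at rate 0.1
```

The smaller the number, the more serious the alarm: it would have fired even under a much stricter false alarm budget.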
More is possible, but at least have to be using what I cooked up in 1990, wrote prototype software for in the early 1990s, and published in 1999.
I did the work a long time ago, guys! And since then, there have been various serious consequences from anomalies, intrusions, etc. Maybe in some of those cases, my work would have done good, early detection. My work looks a heck of a lot better than anything else!
So I think the mathematics to make this work is not the problem. How do you engineer this though?
If I were to try to build a platform that could do this in real time for, let's say, a million metrics per minute, can you engineer something that scales horizontally to do this? Can it be done by cobbling together various open-source tools/libraries currently out there? And how would you present the results so that someone who's not necessarily "mathematically inclined" -- say, your typical operational support person -- could meaningfully interpret whatever your system is spitting out?
That, for me, is the hard part: getting those two components working well. Make it scale, make it idiot friendly. If you can't get those parts right, it doesn't matter what you're trying to do.
I say this because I've spent the last 6 years in the application performance management space, and "the best" way to handle alarms at the moment is to put down a team, literally a team, of people and have them hand-tune thresholds by looking at a combination of history, incidents/outages and root-cause outcomes, and domain-specialist input (from DBAs or application-server specialists, say). Send a false or noisy alarm to an ops guy too many times and they become desensitized. If you don't put enough context in your alarm messages, they won't use them (logging into a tool is asking too much; the email must contain everything they need or they complain).
Any form of dynamic baselining is just too noisy. The simplest example is trying to "baseline" CPU usage. Baselining CPU usage without something trivial like comparing it to the run queue is stupid. It's actually even more stupid than that, because you should be looking at things top-down: so what if the CPU is at 100% and the run queue is 100 -- are any user-facing transactions slowing down? I.e., is there customer impact? It could be some batch job kicking off. In short, anything that looks at a metric in isolation is stupid, and dynamic baselines with time of day, day of month, etc. are all rubbish; you're wasting your time with that approach. This is the sad state that the current "cutting edge" third-generation APM tools offer, though.
Same as for several others here at HN: leave your e-mail address in your HN profile, and I will try to remember to return, use your e-mail, and send you a PDF of the paper (1999, Information Sciences).
Scaling? Can scale about as much as you want. Get some racks of high end servers and a fast server farm LAN, collect the data with whatever instrumentation, and let it run.
For false alarm rate, just shovel in the training data, pick a false alarm rate, say, one a week, one a month, and let it run. Don't hand tune anything. In effect, the hand tuning is replaced by the adjustable false alarm rate and where you know the rate in advance and, with the statistical assumptions, get that rate exactly in practice. Statistical hypothesis testing has had adjustable false alarm rates for 100+ years. Sorry that computer monitoring has been struggling with that.
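Picking the rate is just arithmetic. E.g., for one detector evaluated once a minute with a target of one false alarm a month:

```python
tests_per_month = 60 * 24 * 30  # one test a minute -> 43,200 tests a month
alpha = 1 / tests_per_month     # per-test false alarm probability
# with N calibration scores the smallest exactly achievable rate is
# 1/(N+1), so hitting this alpha needs at least:
n_min = tests_per_month - 1
print(f"alpha = {alpha:.2e}, need >= {n_min} calibration samples")
```

That last line is the practical catch: a very low false alarm rate demands a correspondingly large pile of healthy training data, which is why fast access to a lot of slowly-changing data matters.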
Yes, you will need to make a judgment call about the assumption that the system now is statistically the same as during the past, say, three months of training or historical data of apparently healthy behavior.
For rates of false alarms, rates of missed detections of real problems -- the server farm bridge staff and the network operations center (NOC) can understand those. I was invited for a free lunch and gave a presentation to the operations staff at the main NASDAQ facility in Trumbull, CT, and the operations staff, maybe 30 people, understood the basics right away.
For the math, the only tricky part, and the core of my paper, is how the heck to know and adjust the false alarm rate, but operationally for the staff that is just trivially easy.
Once I was at Morgan Stanley and talking to their main Unix system administrator. He'd just come from a meeting on how the heck to monitor his Unix systems, and I explained my work. He nearly jumped up and exclaimed "We can use this right away!". But they didn't give me an offer, ask me to consult (my paper was not published yet), etc. So, really, they didn't much care.
For how to report the alarms, I'd guess feed into some standard system management infrastructure, consoles, whatever from HP, CA, EMC, Microsoft, etc. There's a way to get a real time, running strip chart that says what the false alarm rate would be for the data just observed to be an alarm. If people want to watch 500 of those, okay by me. But mostly would want a way to display the strip chart for detectors that just gave alarms or, given an alarm for, say, a Cisco switch, display the strip charts for all the monitors of that switch -- for insight and an aid to diagnosis.
> The simplest example is trying to "baseline" CPU usage. CPU usage without something trivial like comparing to run-queue is stupid.
Of course it is. My work would still give the selected rate of false alarms, but the detector would likely also have poor detection rate. I.e., with just CPU usage, the poor detector just doesn't have enough information to do anything very good.
Now THAT'S in part why you want to be multi-dimensional. So, say, feed in PAIRS of CPU busy and run-queue length. Maybe include some more variables, e.g., time of day.
So, here's your first judgment call: what to monitor, and which variables to combine into the multi-dimensional input of a single detector. It's clear you have some insight. Good. In time there will be some good ideas for which variables to use to monitor a Cisco switch, an Oracle database, a Windows Server, etc.
I have in mind some more research to help make such selections, but, again, for 20+ years no one was interested. You described the problems well, and I made good progress on solving them, but, still, no one was interested. No one. Did I mention, no one? The paper was right there in a peer-reviewed journal, and it was treated like a source of leprosy. Not my fault. And this is not nearly the first time I've publicized this on HN.
Really, there's hardly a well known VC firm that hasn't heard from me. And there's hardly a one I ever heard back from. What is it, you can lead a horse to water, but you can't make him drink?
Next issue, you already know about: Given a detection, the staff will want a diagnosis of the cause, then the root cause, and then the fix. Well, right, given some topology or some such of the detectors, could do some root cause analysis. But, in practice, diagnosis can be difficult. To ease the work of diagnosis, in each detector try to use some variables that, given an alarm, do give a hint about the cause and diagnosis. Or, have several detectors monitoring one server and, considering them jointly, that is, which ones just gave an alarm and which ones didn't, get some hints on cause -- right, could do more useful research here (I mentioned that).
The only VC that called me back wanted not just my all nicely automated detection but also nicely clear diagnosis and, no doubt, correction. Maybe he also wanted me to give Godzilla a bath, manicure, and rub down, too -- no problem, guys! Godzilla bath coming right up with release 2.0 and the Gold Enterprise Edition! I told the VC that anyone who promised to do a good job automating diagnosis in a real and large server farm or network was, uh, exaggerating what they could do and to stay far away.
Typically diagnosis takes a lot of information about the system being monitored. To do really well at diagnosis, likely need already to have seen all the causes and their symptoms. While we should collect data and make progress where we can, that is, for the more common problems, in general we can't do diagnosis well easily.