
So I think the mathematics to make this work is not the problem. How do you engineer this though?

If I were to try to build a platform that could do this in real time for, let's say, a million metrics per minute, could you engineer something that would scale horizontally? Could it be done by cobbling together the various open-source tools/libraries currently out there? And then how would you present the results so that someone who's not necessarily "mathematically inclined", say your typical operational support person, could meaningfully interpret whatever your system is spitting out?

That, for me, is the hard part: getting those two components working well. Make it scale, make it idiot-friendly. If you can't get those parts right, it doesn't matter what you're trying to do.

I say this because I've spent the last 6 years in the application performance management space, and "the best" way to handle alarms at the moment is to put down a team, literally a team, of people and have them hand-tune thresholds by looking at a combination of history, incident/outage and root-cause outcomes, and domain-specialist input (from DBAs or application server specialists, say). Send a false or noisy alarm to an ops guy too many times and he becomes desensitized. If you don't put enough context in your alarm messages, they won't get used (logging into a tool is asking too much; the email must contain everything they need, or they complain).

Any form of dynamic baselining is just too noisy. The simplest example is trying to "baseline" CPU usage. Baselining CPU usage without comparing it to something trivial like the run queue is stupid. It's actually even more stupid than that, because you should be looking at things top-down: so what if the CPU is at 100% and the run queue is 100? Are any user-facing transactions slowing down, i.e., is there customer impact? It could just be some batch job kicking off. So in short, anything that looks at a metric in isolation is stupid, and dynamic baselines with time of day, day of month, etc. are all rubbish; you're wasting your time with that approach. Sadly, that's what current "cutting edge" third-generation APM tools offer.



Hello graycat - I would be interested in a copy of your paper and/or the article name/publication date/etc. Regards


The same as for several others here on HN: leave your e-mail address in your HN profile, and I will try to remember to return, use your e-mail, and send you a PDF of the paper (Information Sciences, 1999).


Yup.

Scaling? It can scale about as much as you want. Get some racks of high-end servers and a fast server-farm LAN, collect the data with whatever instrumentation, and let it run.

For the false alarm rate, just shovel in the training data, pick a rate, say one a week or one a month, and let it run. Don't hand-tune anything. In effect, the hand tuning is replaced by the adjustable false alarm rate, where you know the rate in advance and, given the statistical assumptions, get exactly that rate in practice. Statistical hypothesis testing has had adjustable false alarm rates for 100+ years. Sorry that computer monitoring has been struggling with that.
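To make the "pick a rate, shovel in training data" idea concrete, here is a minimal sketch. It is not the method from the paper (which is multi-dimensional and distribution-free in a more careful way); it just sets an alarm threshold at the appropriate empirical quantile of healthy historical data, so that, assuming new data is statistically the same as the history, one observation trips the alarm with roughly the target probability. All names and the simulated metric are illustrative.

```python
import random

def quantile_threshold(history, target_rate):
    """Pick an alarm threshold from healthy training data so that,
    assuming new observations are statistically the same as the
    history, a single observation exceeds it with probability
    roughly target_rate."""
    ordered = sorted(history)
    n = len(ordered)
    # index of the (1 - target_rate) empirical quantile
    k = min(n - 1, int((1.0 - target_rate) * n))
    return ordered[k]

# Simulated "healthy" metric: three months of per-minute samples.
random.seed(0)
history = [random.gauss(50.0, 10.0) for _ in range(3 * 30 * 24 * 60)]

# Aim for about one false alarm per week of per-minute samples.
per_week = 7 * 24 * 60
threshold = quantile_threshold(history, 1.0 / per_week)

# A new healthy week should produce about one alarm.
week = [random.gauss(50.0, 10.0) for _ in range(per_week)]
alarms = sum(1 for x in week if x > threshold)
print(threshold, alarms)
```

The point of the sketch is the knob: operations staff never see a threshold, only "about one false alarm a week", and the threshold falls out of the training data.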

Yes, you will need to make a judgment call about the assumption that the system now is statistically the same as during the past, say, three months of training or historical data of apparently healthy behavior.

For rates of false alarms and rates of missed detections of real problems -- the server-farm bridge staff and the network operations center (NOC) staff can understand those. I was invited for a free lunch and gave a presentation to the operations staff at the main NASDAQ facility in Trumbull, CT, and the staff, maybe 30 people, understood the basics right away.

For the math, the only tricky part, and the core of my paper, is how the heck to know and adjust the false alarm rate, but operationally for the staff that is just trivially easy.

Once I was at Morgan Stanley talking to their main Unix system administrator. He'd just come from a meeting on how the heck to monitor his Unix systems, and I explained my work. He nearly jumped up and exclaimed, "We can use this right away!" But they didn't make me an offer, ask me to consult (my paper was not published yet), etc. So, really, they didn't much care.

For how to report the alarms, I'd guess feed them into some standard system-management infrastructure -- consoles, whatever, from HP, CA, EMC, Microsoft, etc. There's a way to get a real-time, running strip chart showing what the false alarm rate would be if the data just observed were declared an alarm. If people want to watch 500 of those, okay by me. But mostly you'd want a way to display the strip chart for detectors that just gave alarms or, given an alarm for, say, a Cisco switch, the strip charts for all the monitors of that switch -- for insight and as an aid to diagnosis.

> The simplest example is trying to "baseline" CPU usage. CPU usage without something trivial like comparing to run-queue is stupid.

Of course it is. My work would still give the selected rate of false alarms, but the detector would likely also have poor detection rate. I.e., with just CPU usage, the poor detector just doesn't have enough information to do anything very good.

Now THAT'S in part why you want to be multi-dimensional. So, say, feed in PAIRS of CPU busy and run-queue length. Maybe include some more variables, e.g., time of day.

So, here's your first judgment call: what to monitor, and which variables to combine and feed to any one detector. It's clear you have some insight there. Good. In time there will be some good ideas for which variables to use to monitor a Cisco switch, an Oracle database, a Windows Server, etc.
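A toy version of a two-variable detector, to show why pairs beat single metrics. This uses Mahalanobis distance as a stand-in for the paper's actual (distribution-free) test: it collapses the (CPU, run-queue) pair to one score, then thresholds that score at the target false-alarm quantile of the healthy training data. The simulated workload and all numbers are assumptions for illustration.

```python
import random

def fit_detector(pairs, target_rate):
    """Fit a toy two-variable detector on healthy (cpu, runq) pairs.
    Squared Mahalanobis distance reduces each pair to one score; the
    alarm threshold is the empirical (1 - target_rate) quantile of
    that score over the training data."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs) / n
    syy = sum((y - my) ** 2 for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / n
    det = sxx * syy - sxy * sxy

    def score(x, y):
        dx, dy = x - mx, y - my
        # squared Mahalanobis distance via the 2x2 inverse covariance
        return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

    scores = sorted(score(x, y) for x, y in pairs)
    threshold = scores[min(n - 1, int((1.0 - target_rate) * n))]
    return lambda x, y: score(x, y) > threshold

random.seed(1)
# Healthy behavior: run-queue length tracks CPU busy.
healthy = [(c, c / 10.0 + random.gauss(0, 1))
           for c in (random.uniform(20, 95) for _ in range(50000))]
alarm = fit_detector(healthy, 1e-3)

print(alarm(90.0, 9.0))   # busy CPU with a matching run queue: normal
print(alarm(90.0, 60.0))  # busy CPU with a huge run queue: anomalous
```

Note that 90% CPU alone never fires here; only the combination that's off the healthy joint pattern does, which is exactly the point of feeding pairs to one detector.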

I have in mind some more research to help make such selections, but, again, for 20+ years no one was interested. You described the problems well, and I made good progress on solving them, but, still, no one was interested. No one. Did I mention, no one? The paper was right there in a peer-reviewed journal, and it was treated like a source of leprosy. Not my fault. And this is not nearly the first time I've publicized this on HN.

Really, there's hardly a well-known VC firm that hasn't heard from me. And there's hardly one I ever heard back from. What is it -- you can lead a horse to water, but you can't make him drink?

Next issue, which you already know about: given a detection, the staff will want a diagnosis of the cause, then the root cause, and then the fix. Well, right, given some topology or some such of the detectors, one could do some root cause analysis. But in practice diagnosis can be difficult. To ease the work of diagnosis, in each detector try to use some variables that, given an alarm, give a hint about the cause. Or have several detectors monitoring one server and, considering them jointly -- that is, which ones just gave an alarm and which ones didn't -- get some hints about the cause. Right, more useful research could be done here (I mentioned that).
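The "consider the detectors jointly" idea can be sketched as a simple lookup from the set of detectors that fired to a first-guess hint. The detector names and hint strings below are hypothetical; a real table would be built from observed incidents and root-cause outcomes.

```python
# Hypothetical mapping from which detectors on one server fired
# to a first-guess diagnosis hint; names and hints are illustrative.
HINTS = {
    frozenset({"cpu", "runq"}):
        "CPU saturation; check batch jobs",
    frozenset({"cpu", "runq", "latency"}):
        "CPU saturation with customer impact",
    frozenset({"latency"}):
        "slow transactions without CPU pressure; check I/O or locks",
}

def diagnosis_hint(fired):
    """Map the set of fired detectors to a canned hint, if any."""
    return HINTS.get(frozenset(fired),
                     "no canned hint; escalate for manual diagnosis")

print(diagnosis_hint(["cpu", "runq"]))
```

This only covers causes already seen and cataloged, which is the limitation discussed below: the table grows with experience, and novel failures still need a human.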

The only VC who called me back wanted not just my nicely automated detection but also nicely clear diagnosis and, no doubt, correction. Maybe he also wanted me to give Godzilla a bath, manicure, and rub down, too -- no problem, guys! Godzilla bath coming right up with release 2.0 and the Gold Enterprise Edition! I told the VC that anyone who promised to do a good job automating diagnosis in a real and large server farm or network was, uh, exaggerating what they could do, and to stay far away.

Typically, diagnosis takes a lot of information about the system being monitored. To do really well at diagnosis, you likely need to have already seen all the causes and their symptoms. While we should collect data and make progress where we can, that is, on the more common problems, in general we can't do diagnosis well easily.

Any questions?



