First off, I'd like to say that I think this is great and we need more books in this space.
I hope this can be seen as constructive criticism but I do have a few comments on the layout and content.
1. Free chapter - perhaps chapter one or two may have been a better choice. I'd like to know the philosophy behind monitoring expressed in the book before diving into the details.
2. Capacity planning - a chapter on this would have been great. Most teams I've worked with have struggled with sizing, planning the resources required, and archiving strategies for their monitoring solutions.
3. Monitoring strategies for different levels in the stack - where do I start, what should my short term goals be and so on.
4. Naming of some of the chapters is too focused on the technology - for example, the chapter on Logstash could have been renamed to something to do with application logging or log scraping.
5. Visualisation and communication of results - there could have been a chapter on dashboards and reporting. This is a common issue for teams trying to understand how to do this.
This was written in a bit of a hurry but I hope my points came through.
1. I decided not to do this because my experience is that people like to do something practical first. I've had a huge response to that chapter - lots of folks have gotten into Riemann that had previously been stuck. That alone is a solid +++ for me.
2. Each chapter contains some discussion of capacity planning for specific tools, where relevant.
3. The capstone chapters (11-13) discuss this, as do the chapters covering logging and application instrumentation.
4. Thanks - I'll consider that.
5. I discuss visualization in various chapters, but I've found that most folks have very different needs and desires. So I focused on discussing what to show in small segments, along with some visual design discussion, rather than writing a specific chapter on dashboarding/reporting. Hard choice, but a 750-page book needs to stop somewhere. :)
Thanks for taking the time to comment - it's awesome when folks share their thoughts!
As a side recommendation, I'm most of the way through The Practice of Cloud Administration, and it hits on most of the things you've mentioned. It's on SBO, so you likely have a copy through work.
To the "free chapter" idea - I think the website has a pretty good prospectus, and compared to other tech books / publishers, it's certainly well worth taking a flyer on it for a maximum of $20.
The monitoring considered in the OP is for server farms and networks where the main challenges are rates of false alarms and rates of missed detections.
The challenge is to find means of monitoring that will permit selecting the rate of false alarms we are willing to tolerate and then, for that rate, get the lowest possible rate of missed detections.
Thus, we would like to use the Neyman-Pearson result. Usually, however, in this context, we do not have enough data for that. E.g., we are typically quite short on data about the anomalies we are trying to detect, and more short as the systems become more reliable.
From the above, we see that necessarily and inescapably such monitoring amounts to a set of continually applied statistical hypothesis tests.
Apparently, in practice, the false alarm rate is not known, and there is no reasonable way to select or even to adjust it.
Then we see that we need tests that are both multi-dimensional and distribution-free.
A special case of high interest is zero-day problems, that is, detecting problems never seen before. So, this is behavioral monitoring -- any behavior sufficiently unusual is regarded as an anomaly, that is, evidence of something wrong.
From all I can see, so far the monitoring community has yet to take these points to heart.
The OP's remarks on thresholds are on target: Thresholds have been the old, lame, weak workhorse of monitoring far too long.
If anyone is actually seriously interested in this subject, let me know. Some years ago I concluded that no one was interested!
It's a spectrum to me. We're way behind the curve on monitoring, and the "state of the art" anywhere but cutting-edge shops is woeful. I'd love folks to be able to do anomaly detection easily and simply, but the technology and tools aren't quite there yet. I am just hoping to get folks to advance their environments a little way forward.
Yes, the situation is horrible. I'm reluctant to believe that the "cutting edge shops" are doing very well.
For good "tools", I have a good paper on the subject, but from all I can see there is essentially no interest. People would prefer not to be bothered. The attitude seems to be, if there is a problem, then we will detect it, eventually if not soon, and then we will fix it.
It was published in the Elsevier journal 'Information Sciences' in 1999.
It appears to be the first, and the first large, collection of statistical hypothesis tests that are both multi-dimensional and distribution-free.
I try to be anonymous here at HN, but I'm willing enough to send a PDF of the paper to anyone who wants a copy. E.g., ask for a copy and leave your e-mail address on your HN profile, at least temporarily.
The main point of the paper is that we do get an hypothesis test. In particular, we get to select the false alarm rate and then get that rate essentially exactly in practice.
It's behavioral monitoring -- it assumes that the past and future of healthy performance are, to be simple, statistically the same. So, right, it's for a server farm or network that is statistically relatively stable, that is, statistically unchanging in what it is doing. The site can be wild and crazy, but it has to continue to be wild and crazy in statistically the same way.
In particular, the work is for detecting zero day problems, that is, problems never seen before. Maybe the philosophy here is that when we get a new problem and detect it, then we fix the cause of the problem and never see it again and, then, again are left looking for zero day problems.
Then the work uses past data -- hypothesis tests have done that since Karl Pearson 100+ years ago, and now parts of computer science do something similar and call it training data, in unsupervised learning or some such. The approach of plain statistical hypothesis testing makes more sense to me.
The key, core mathematical argument is a finite algebraic group of measure-preserving transformations on the data. I believe that there are connections with U-statistics, e.g., as in an advanced statistics book by Serfling.
This stuff with groups and measure preserving is a little like some classic arguments in ergodic theory. On the page, the math looks awful, but actually it is conceptually quite simple.
But, you don't need to dig into the math too much.
For the actual calculations, those are based on nearest neighbors (although other options also work with the basic math). At least since the paper, others have thought of using nearest neighbors, but they didn't have an hypothesis test because they didn't know how to calculate and adjust the false alarm rate. So, they have an heuristic instead of an hypothesis test. So, again, the main contribution of the paper is that it really is an hypothesis test, that is, we know and get to adjust the false alarm rate (conditioned on the old data, and also true in long-run expectation over the conditioning -- a standard result in conditional expectation from the Radon-Nikodym theorem in measure theory).
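To make the nearest-neighbor idea with an adjustable false alarm rate concrete, here is a toy sketch. This is my own illustrative construction, not the test from the paper: the leave-one-out quantile calibration, the function names, and the Gaussian sample data are all invented for the example.

```python
import random

def nn_sqdist(point, data, skip=None):
    """Squared Euclidean distance from point to its nearest neighbor in data."""
    best = float("inf")
    for i, q in enumerate(data):
        if i == skip:
            continue
        d = sum((a - b) ** 2 for a, b in zip(point, q))
        if d < best:
            best = d
    return best

def calibrate(train, alpha):
    """Threshold = empirical (1 - alpha) quantile of leave-one-out NN distances.

    If future healthy observations are statistically the same as the
    training data, flagging any point whose NN distance exceeds this
    threshold gives a false alarm rate of roughly alpha."""
    scores = sorted(nn_sqdist(p, train, skip=i) for i, p in enumerate(train))
    return scores[min(len(scores) - 1, int((1 - alpha) * len(scores)))]

def is_anomaly(point, train, threshold):
    return nn_sqdist(point, train) > threshold

# Toy training data: 300 "healthy" 2-D observations
# (think pairs like CPU-busy and run-queue length, standardized).
random.seed(0)
train = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(300)]

thr = calibrate(train, alpha=0.01)          # target: roughly 1% false alarms
print(is_anomaly((0.1, -0.2), train, thr))  # deep inside the healthy cloud
print(is_anomaly((8.0, 8.0), train, thr))   # far from anything seen before
```

Note the test is distribution-free: nothing above assumes Gaussian data -- the threshold comes from an empirical quantile of the training data itself, which is the part that turns a nearest-neighbor heuristic into something with a selectable false alarm rate.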
For detection rate, there is some good news, not as good as from the classic Neyman-Pearson result (in practice in this context we don't have enough data to do much with Neyman-Pearson), but nice: in a useful sense, for the selected false alarm rate, the work gives the highest possible detection rate. Really, the mathematical key here is just Fubini's theorem (the measure theory version of interchange of order of integration). Intuitively, the technique has the largest area where alarms are raised consistent with the selected false alarm rate.
For the practical application, we do need some help with some computational geometry. For that, I dreamed up some work. Soon I found that part of what I had dreamed up was k-d trees, e.g., as in Sedgewick's book on algorithms. But there is more -- we need some cutting planes. I programmed most of it 20+ years ago in PL/I but finally dropped it due to lack of interest.
I have some ideas for more results of interest and more papers, but after 20+ years of no interest I just gave up.
More can be said, but I stopped the research when I discovered, about 20 years ago, that no one was interested. The paper was published in 1999, and since then interest has been quieter than the tombs of ancient Egypt. So, I'm doing a startup that is quite different.
I dreamed up the work when I saw the need, or at least as I regarded the need, way back in about 1990 when I was in an AI group at the IBM Watson lab doing work on monitoring and management of large server farms and networks. The AI work was trying to build on essentially just threshold detectors. There was no attention to false alarm rate or a best detector -- highest detection rate for given false alarm rate. The classic Neyman-Pearson result was ignored. I was our guy with GM Research, and we gave a paper at the Stanford AAAI IAAI conference. But I was outraged by the lack of concern for false alarm rate, ignoring hypothesis tests and distribution free hypothesis tests (long common in the social sciences), and with no attention at all to multi-dimensional data.
The real world context is just awash in multi-dimensional data. Treating the data components separately in effect says that the geometrical region of healthy behavior is just a box. Bummer. Box too small -- get false alarm rate too large. Box too big, get too many missed detections. Problem: A box is a poor fit to reality. Simple stuff.
How to see this? Monitor CPU busy and page faults per second and look for anomalies, e.g., thrashing, a program allocating infinite memory, etc. Is the normal behavior really just a 2-D box? I don't think so! But, sure, we need to automate picking the shape of the region of healthy behavior.
For the distribution-free stuff, that is where we make no assumptions about probability distributions. I got a kick in the back side on that sitting one day in the office of Ulf Grenander, one of the world's best ever statisticians, at Brown (I got accepted to grad school there; was considering going; went elsewhere instead). Grenander had been looking at computer performance data and was shocked at how different it was from the data, e.g., biomedical, he had been used to. So, right, Gaussian assumptions and more go out the window!
So, really, we just want to make no assumptions about distributions; we want to be distribution-free (a.k.a. non-parametric, although I believe distribution-free is the more appropriate terminology).
For multi-dimensional, at IBM I got a slap in the face: There was a cluster of computers doing transaction processing. There was some front end load leveling that sent the next transaction to the least busy computer in the cluster. Okay. But one day one of the computers got sick, just a little sick in the head, and was doing a very silly thing -- it was throwing all its incoming transaction work into the bit bucket! Thus, this computer looked to the load leveling as not very busy and, thus, was getting nearly all the transactions. Thus, nearly all the transactions for the whole cluster were going into the bit bucket. Bummer.
So, I thought, to detect this anomaly, want somehow to look at all of the computers in the cluster at the same time and compare them with each other, that is, have all the data in some appropriate region in some space of several dimensions, a region that works whether the cluster is busy or not. So, want to be multi-dimensional, that is, don't want just threshold detectors on variables one at a time.
There are more war stories where the importance of being multi-dimensional is crucial. Really, commonly separating multi-dimensional data into its components and treating the components separately can be throwing away a lot of crucial information which stands to give a poor combination of false alarm rate and detection rate.
Heck, in principle the region of healthy performance can be a fractal, say, like the Mandelbrot set, and, so, somehow we need to approximate that. Can we do that? Basically with nearest neighbor, or k-nearest neighbors (which also works), yes.
There is now a good opportunity for my work: my work can use a LOT of training data, and the near real-time detection work can want to do a lot with that data. So, it could use fast access to a lot of data that doesn't change very fast. So, sure, use some big solid state drives (SSDs)! A few of the Samsung 14 TB drives should do wonders for my paper!
My view is, anyone doing monitoring of a large server farm or network and not using what is in my paper is not being fully serious. And, since you get to adjust the false alarm rate, say, to one a month, you can't say that you can't afford the extra false alarms.
Uh, I left out: for each alarm, you get told the lowest false alarm rate at which the real-time input data would still be an alarm -- so you get an indication of alarm seriousness.
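That "lowest false alarm rate at which the data would still be an alarm" is, in hypothesis-testing terms, a p-value. A minimal empirical version over anomaly scores can be sketched like this -- my own illustration, not the paper's construction; `empirical_p_value` and the sample scores are invented:

```python
def empirical_p_value(score, train_scores):
    """Fraction of historical healthy scores at least as extreme as this one.

    Read as the lowest false alarm rate at which this observation would
    still be flagged: the smaller the value, the more serious the alarm.
    train_scores are anomaly scores (e.g. nearest-neighbor distances)
    computed on historical healthy data."""
    exceed = sum(1 for s in train_scores if s >= score)
    return (exceed + 1) / (len(train_scores) + 1)

# Invented historical scores from healthy operation.
train_scores = [0.2, 0.5, 0.9, 1.1, 1.4, 2.0, 3.5, 4.2, 6.0, 9.5]
print(empirical_p_value(1.0, train_scores))   # middling score: ~0.73
print(empirical_p_value(50.0, train_scores))  # extreme score: ~0.09, floor of 1/(n+1)
```

An operator never sees the math: the alarm just arrives tagged with a number like "would still fire at one false alarm per month", which ranks alarms by seriousness.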
More is possible, but at least have to be using what I cooked up in 1990, wrote prototype software for in the early 1990s, and published in 1999.
I did the work a long time ago, guys! And since then, there have been various serious consequences from anomalies, intrusions, etc. Maybe in some of those cases, my work would have done good, early detection. My work looks a heck of a lot better than anything else!
So I think the mathematics to make this work is not the problem. How do you engineer this though?
If I were to try and build a platform that could do this in real time for, let's say, a million metrics per minute, can you engineer something that would scale horizontally to do this? Can it be done by cobbling together various open-source tools/libraries currently out there? And then how would you present the results so that someone who's not necessarily "mathematically inclined" -- say, your typical operational support person -- could meaningfully interpret whatever your system is spitting out?
That, for me, is the hard part: getting those two components working well. Make it scale, make it idiot-friendly. If you can't get those parts right, it doesn't matter what you're trying to do.
I say this because I've spent the last 6 years in the application performance management space, and "the best" way to handle alarms at the moment is to put together a team, literally a team, of people and have them hand-tune thresholds by looking at a combination of history, incidents/outages and root-cause outcomes, and domain-specialist inputs (like DBAs or application-server specialists). Send out a false or noisy alarm to an ops guy too many times and they become desensitized. If you don't put enough context in your alarm messages, they won't use the tool (logging into a tool is asking too much; the email must contain everything they need or they complain).
Any form of dynamic baselining is just too noisy. The simplest example is trying to "baseline" CPU usage. Baselining CPU usage without something trivial like comparing it to the run queue is stupid. It's actually even more stupid because you should be looking at things top-down, i.e., so what if the CPU is at 100% and the run queue is 100 -- are any user-facing transactions slowing down? I.e., is there customer impact? It could be some batch job kicking off. So in short, anything that looks at a metric in isolation is stupid, and dynamic baselines with time of day, day of month, etc. are all rubbish; you're wasting your time with this approach. This is the sad state that current "cutting edge" third-generation APM tools offer, though.
The same as for several others here at HN: leave your e-mail address in your HN profile, and I will try to remember to return, use your e-mail, and send you a PDF of the paper, 1999 in Information Sciences.
Scaling? You can scale about as much as you want. Get some racks of high-end servers and a fast server farm LAN, collect the data with whatever instrumentation, and let it run.
For the false alarm rate, just shovel in the training data, pick a false alarm rate, say, one a week or one a month, and let it run. Don't hand-tune anything. In effect, the hand tuning is replaced by the adjustable false alarm rate, where you know the rate in advance and, with the statistical assumptions, get that rate exactly in practice. Statistical hypothesis testing has had adjustable false alarm rates for 100+ years. Sorry that computer monitoring has been struggling with that.
Yes, you will need to make a judgment call about the assumption that the system now is statistically the same as during the past, say, three months of training or historical data of apparently healthy behavior.
For rates of false alarms, rates of missed detections of real problems -- the server farm bridge staff and the network operations center (NOC) can understand those. I was invited for a free lunch and gave a presentation to the operations staff at the main NASDAQ facility in Trumbull, CT, and the operations staff, maybe 30 people, understood the basics right away.
For the math, the only tricky part, and the core of my paper, is how the heck to know and adjust the false alarm rate, but operationally for the staff that is just trivially easy.
Once I was at Morgan Stanley and talking to their main Unix system administrator. He'd just come from a meeting on how the heck to monitor his Unix systems, and I explained my work. He nearly jumped up and exclaimed "We can use this right away!". But they didn't give me an offer, ask me to consult (my paper was not published yet), etc. So, really, they didn't much care.
For how to report the alarms, I'd guess feed into some standard system management infrastructure, consoles, whatever from HP, CA, EMC, Microsoft, etc. There's a way to get a real time, running strip chart that says what the false alarm rate would be for the data just observed to be an alarm. If people want to watch 500 of those, okay by me. But mostly would want a way to display the strip chart for detectors that just gave alarms or, given an alarm for, say, a Cisco switch, display the strip charts for all the monitors of that switch -- for insight and an aid to diagnosis.
> The simplest example is trying to "baseline" CPU usage. CPU usage without something trivial like comparing to run-queue is stupid.
Of course it is. My work would still give the selected rate of false alarms, but the detector would likely also have poor detection rate. I.e., with just CPU usage, the poor detector just doesn't have enough information to do anything very good.
Now THAT'S in part why we want to be multi-dimensional. So, say, feed in PAIRS of CPU busy and run-queue length. Maybe include some more variables, e.g., time of day.
So, here's your first judgment call: what to monitor, and which variables to combine into the multi-dimensional input fed to any one detector. It's clear you have some insight. Good. In time there will be some good ideas for what variables to use to monitor a Cisco switch, an Oracle database, a Windows Server, etc.
I have in mind some more research to help make such selections, but, again, for 20+ years no one was interested. You described the problems well, and I made good progress on solving them, but, still, no one was interested. No one. Did I mention, no one? The paper was right there in a peer-reviewed journal, and it was treated like a source of leprosy. Not my fault. And this is not nearly the first time I've publicized this on HN.
Really, there's hardly a well known VC firm that hasn't heard from me. And there's hardly a one I ever heard back from. What is it, you can lead a horse to water, but you can't make him drink?
Next issue, you already know about: Given a detection, the staff will want a diagnosis of the cause, then the root cause, and then the fix. Well, right, given some topology or some such of the detectors, could do some root cause analysis. But, in practice, diagnosis can be difficult. To ease the work of diagnosis, in each detector try to use some variables that, given an alarm, do give a hint about the cause and diagnosis. Or, have several detectors monitoring one server and, considering them jointly, that is, which ones just gave an alarm and which ones didn't, get some hints on cause -- right, could do more useful research here (I mentioned that).
The only VC that called me back wanted not just my all nicely automated detection but also nicely clear diagnosis and, no doubt, correction. Maybe he also wanted me to give Godzilla a bath, manicure, and rub down, too -- no problem, guys! Godzilla bath coming right up with release 2.0 and the Gold Enterprise Edition! I told the VC that anyone who promised to do a good job automating diagnosis in a real and large server farm or network was, uh, exaggerating what they could do and to stay far away.
Typically diagnosis takes a lot of information about the system being monitored. To do really well at diagnosis, likely need already to have seen all the causes and their symptoms. While we should collect data and make progress where we can, that is, for the more common problems, in general we can't do diagnosis well easily.
Temporarily stick your e-mail address in your HN profile and I will send you the PDF. Also stick in a random string, and I will add it to my profile thus confirming my identity.
Another thought. Two really:
- Lead us into more about the why before showing the price. Really tell the story of the pain that so many of us can identify with.
- Use a standard three-tier pricing box style, with the values of each above the price. Something like: https://planscope.io/pricing/
If you haven't read one of the author's books before (he's released titles on both Logstash and Docker), he puts out really quality material, and he seems to update his books when new releases of the subject come out. This looks like another great release. Kudos, James.
Just bought the book, and I didn't even realize it was James Turnbull -- that guy is extremely advanced in monitoring. I've seen many of his talks online, and no one I've seen speaks about monitoring in a more rational way.
That's mostly correct. I have a chapter on adding instrumentation to applications with examples in Ruby/Rails and Clojure that can easily be adapted to other languages and frameworks. I also cover adding structured logging to your applications.
Love your site design. It's clean and easy to read -- great presentation. Do you just whip up a new design from scratch each time you create a book, or did you hire someone or purchase off the shelf? Curious, as these things take me forever...
I would argue that it's not, and that it's a mess, starting with that URL. This is yet another fork of an outdated monitoring system (how many Nagios forks are there?). The architecture diagram made my head hurt -- how many moving parts is that? I counted 8 separate components. There are much more modern monitoring systems these days, such as Prometheus, Bosun and Riemann.
Check_mk is not a Nagios/Icinga fork; it's an addon that works with them. It's a host agent (like NRPE) that bundles all the local checks in a single run and submits them to Nagios as passive checks.
There is also Icinga 2. So how does one decide which one to use? For a mixed environment (Windows, Linux, etc.), which monitoring tool is better when both server and client are considered?
Thanks for the information. Currently we are using the icinga + check_mk combination and are mostly happy with it. However, since icinga and check_mk are each going their own way, it is no longer possible to use them together. Also, I want to check out some modern alternatives like the ones you suggested.
Thanks - I considered that but I don't like to put barriers in people's way - especially for free content. I don't like to be too marketing-esque. Just not in my nature. :) A couple of thousand folks signed up to the mailing list using the current approach, which I'm pretty happy with.