Do not even try to write what you think is effective code before
benchmarking. The VM has many clever optimizations. And even if
you somehow learn all of them, they can still change in the next
release. Instead, write a simple and clean piece of code, make a
benchmark, find the real bottleneck, and rewrite the code in
small parts.
Every now and then I see some strange code at a client I work with, and the justification is "Performance". Somebody thought it might be faster, but they didn't check it back then. And they don't have an automated performance test that continuously validates whether their "dirty but fast" code is still faster than the clean version...
Don't even ask how many hours of developer time I've seen wasted because of "dirty but fast" code, where the code often was not much faster than the clean version. Or it was faster, but speed was not crucial in the area where the code operated.
Also, I guess a lot of the optimizations from old blog posts or the first "Effective Java" will not yield amazing results anymore, as compilers and runtimes are getting better.
I also like this one:
Do not think that programming in C++ or any other lower-level language
is a must for having good performance. Good architecture, benchmarking
and profiling are far more important.
But I don't think it is as clear-cut as the author writes it. For some problems, having total control over your memory layout and over what gets executed when (i.e. C or C++ or ...) can lead to huge benefits. And sometimes, the runtime is just more clever at optimizing the code than you are.
As any follower of @mraleph (http://mrale.ph / https://twitter.com/mraleph) is aware, it's very easy to benchmark the wrong thing (or benchmark more or less nothing), or to write an isolated benchmark which doesn't reflect real-world use.
There's one thing you can do without benchmarking: approximate complexity (O(n)) analysis. You need to be sensible about what N is though: even O(n^3) isn't too bad if it's an operation on a user-displayed list of half a dozen items.
With a bit of thought this even leads to cleaner code: do you really need to pass a large chunk of data around when you only need a particular precalculated quantity?
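As a contrived sketch of that idea (the names here are made up for illustration), compute the quantity once and pass only that number around:

var orders = [{ amount: 10 }, { amount: 25 }, { amount: 7 }];

// Compute the derived quantity once, where the data lives...
var total = orders.reduce(function (sum, o) { return sum + o.amount; }, 0);

// ...and pass just the number to the code that needs it, not the whole array.
function renderSummary(total) {
  return "Total: " + total;
}

console.log(renderSummary(total)); // => "Total: 42"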
But generally benchmarking and applying Amdahl's law is the way to go. First identify which part of the system is slow!
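For a rough sense of Amdahl's law: overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of total time spent in the part you optimize and s is the speedup of that part. If the slow part is only 20% of the runtime, even making it infinitely fast caps the overall gain at 1 / 0.8 = 1.25x, which is why identifying the slow part comes first.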
Things like this make compiler or jit optimizations a little scary to me. It might not matter to a CSS preprocessor, but if you write performance critical code, optimizations become a leaky abstraction. Suddenly you have to understand exactly which circumstances allow your compiler to, for example, use autovectorization and be wary of updating your compiler, lest these criteria change.
I think there is some opportunity here to improve compiler output. I know that GCC can already explain why it doesn't vectorize loops (-ftree-vectorizer-verbose=2), but many other optimizations without such output can still make or break the performance of your program.
I think what we'll end up with is languages that make it easier to be explicit about what you need. We're already seeing this happening with computer graphics, where the likes of Vulkan are in some sense lower level than OpenGL, but make it easier to understand when you're using the fast path or not. I see similar aspects to Rust, and even Haskell's recent introduction of a strict pragma.
I'm pretty sure the halting problem reduces to figuring out code reordering and optimizations like that. We cannot solve it exhaustively, so instead we do the next best thing: approximate and handle the common case. I agree that it's not optimal, but it's the best we can do.
Obviously it's impossible to optimize perfectly every time. I would just like some more diagnostics from the compiler that help me debug issues with wrong or missing optimizations.
gcc has a flag "-Wuninitialized" that warns about possible use of an uninitialized variable. It frequently yields false positives, because it's an approximation; an exact "detect all uninitialized variables" algorithm would solve the halting problem.
Probably (or maybe by setting it to `null`... I'm not sure how engines deal with changing data types). `delete` implies that the underlying object schema has to change. It's a heavy operation.
Yup. Same category of misnomer as, for example, "n-fold times faster" (n > 1), where they mean "n times as fast". It's 2^n vs n: read literally, a 4-fold increase in something (doubling it 4 times) would be a 16-times increase of that something.
> But if V8 thinks that the code is “too tricky”, it keeps it in slow but dynamic form.
The next paragraph shows what caused V8 not to optimize:
> For example, in the first set of changes I’ve used a global variable defined outside of the class, so it had an external state. In the second one, I’ve changed the class structure on the fly. As a result, V8 did not compile it to effective typed native code.
But you're right, saying "it's probably that" is not the same thing as delving into the code.
The short answer is that v8 has several different ways of storing objects and resolving property lookups, and deleting a property causes that object to be handled in the slowest mode (i.e. "dictionary mode", with property data stored in a hash table).
If you want gory details I recommend Vyacheslav Egorov's blog (mrale.ph). Here's an old slide of his that lists all the ways you can force an object into dictionary mode (at least ca. 2011):
Using Object.seal, Object.freeze and Object.defineProperty with writable, enumerable, configurable not set to true does not cause the object's named properties to be converted to dictionary mode. However, they still convert the elements storage (the one containing properties with names "0", "1", etc.) to dictionary mode.
Similarly, accessors don't cause the object's named properties storage to be converted to dictionary mode unless there is a transition clash.
// Run in V8's d8 shell with --allow-natives-syntax; %HasFastProperties is a V8 internal.
var obj = {};
obj.__defineGetter__("foo", function () { return 0; });
print(%HasFastProperties(obj)); // => true

// Defining the same accessor a second time clashes with the map transition
// recorded for the first object, so this one falls back to dictionary mode.
var obj = {};
obj.__defineGetter__("foo", function () { return 0; }); // Transition clash
print(%HasFastProperties(obj)); // => false
Nothing changed with respect to `delete obj.foo`: this always converts the named properties storage to dictionary mode. However `delete arr[index]` doesn't (at least for arrays that were not in slow mode already); the rules for non-array objects depend on the number of holes in the elements part of the object.
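A rough sketch of that distinction (plain JavaScript; whether the fast or slow representation is kept is V8-internal behaviour, as described above):

var obj = { foo: 1, bar: 2 };
delete obj.foo;   // named properties storage drops to dictionary mode

var arr = [10, 20, 30];
delete arr[1];    // leaves a hole ([10, <empty>, 30]); the elements can stay in a fast mode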
I have experience with profiling the `delete` keyword. It slows down post-deletion access to the parent object significantly. Generally, I avoid using it and instead set the value to `null`. Technically, this creates a memory leak because the pointer remains, but depending on the characteristics of your application it may be worth it.
The usual disclaimers of "profile it first" and "YMMV" apply, but it's useful to be aware of potential low hanging fruit when JS performance is critical to your application.
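A minimal sketch of that pattern (the object and property names are made up for illustration): null the slot instead of deleting it, so the object's hidden class stays stable.

var cache = { user: { id: 1 }, token: "abc" };

// delete cache.token;  // would push this object into slow dictionary mode
cache.token = null;     // keeps the fast representation; the key itself sticks around

console.log(cache.token); // => null, which is the memory trade-off mentioned above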
Doesn't V8 have a profiler? I don't use it so forgive the ignorance. Having to bisect your code to find a performance regression seems like a slow way of doing it compared to using a profiler. Could a V8 expert explain?
V8 has a very nice profiler. Functions are the smallest granules it understands, but it's very good for finding slow functions, and (perhaps even more importantly) for finding out which functions v8 has given up on optimizing.
But I'm guessing it didn't help for this particular problem. Deleting properties from an object can hurt performance in v8, but it's not the deletion that's slow, it just makes the later accesses to that object slower. So the line of code that needed changing may not have been anywhere near the code that profiled poorly.
While there's a small bit of truth in that, unfortunately reality is much more nuanced than this -- at least, it is when performance is non-negotiable.
Optimization, when done correctly, is a feedback loop between your assumptions and the profiler. Neither one of these, on their own, is enough -- your assumptions can be wrong, but the profiler has an extremely poor signal to noise ratio. (Note: Profiling wouldn't have caught this issue, since the problem spot was not actually slow, it just made everything else slower)
I work in game development (not primarily HTML5, although I have done HTML5/WebGL projects), and performance is a hard requirement for me. Not making 60fps reliably for the hardware we need to support is a show-stopping bug. The only way to avoid this reliably without needing to rewrite sections is to design with the optimizations you might need to perform in mind. Putting optimization off until the end would be irresponsible.
> PostCSS, libsass and Less are already fast enough for any real-world task. Running benchmarks like these is like listening to audiophiles comparing gold-plated cables for their hobby systems. Don’t use these benchmarks as the main criteria for decision making when you are choosing a tool.