Update on Go 1.1 crypto performance

wereHamster · on April 22, 2013

Not really relevant for hashes, but Go also has native implementation of selected ciphers (eg. AES). Are those implementations constant-time? If not then they are vulnerable to timing attacks. People have put a lot effort into making OpenSSL ciphers constant-time. This effort was pushed by people working at Google. This guy works at Google and regularly posts about crypto stuff they are working on: http://www.imperialviolet.org/.

agl · on April 22, 2013

I'm afraid that the general AES implementation in Go is not constant-time. If you have a machine with AESNI instructions then they will be used and they are constant time.

AES is terribly hard to implement in a performant and constant-time fashion. The OpenSSL work is very impressive but, at least the bits that I'm aware of, are still very platform dependent and sometimes work only for certain modes (i.e. counter mode). Patches to implement Käsper and Schwabe's or Mike Hamburg's AES designs would be welcome, of course!

RC4 isn't constant time, but that's not the biggest problem with RC4. All common hash functions are.

If you need constant-time crypto and aren't tied to implementing an existing protocol then there is an implementation of NaCl's box[1] and secretbox in the go.crypto subrepo.

Otherwise, the modexp isn't constant-time, but RSA secret operations are blinded (hopefully correctly). P-224 is constant time, but P-{256,384,521} are generic code and are not. (I have a Go suitable P-256 design ready if I can find the time.)

In general, Go has not had nearly the review that OpenSSL has had and caution is warranted because of that.

[1] http://godoc.org/code.google.com/p/go.crypto/nacl/box

jgrahamc · on April 22, 2013

The blog you link to belongs to Adam Langley who works on Go's crypto stuff (he's user agl here). Go has a rather nice crypto/subtle package that contains constant-time implementations of primitives.

http://golang.org/pkg/crypto/subtle/

enneff · on April 22, 2013

Adam doesn't just work on the Go crypto packages. He is almost solely responsible for writing them.

jgrahamc · on April 22, 2013

I was channelling Rob Pike when he described Ken Thompson as someone who had "done some Unix stuff".

enneff · on April 22, 2013

I'm sure Adam would appreciate the understatement, but IMO he deserves more kudos than I can possibly give. :-)

Spooky23 · on April 22, 2013

There seem to be alot of corner cases where well-intentioned folks who follow the rules, but don't have encryption subject matter expertise can find themselves with some corner-case problem with encryption.

Oftentimes we rely on third parties to get it right, but who knows whether they have or not?

Are there good best practices or products out there to implement and evaluate encryption implementations? We use FIPS-140 validated modules, but is that enough?

dchest · on April 22, 2013

This post should be called "Go 1.1 old broken crypto performance" ;-) If you're designing a new system, there's no sane reason to use MD5, SHA1, or RC4. These performance improvements, though, are good for the native TLS implementation.

There are some small improvements to SHA-2, though, in Go 1.1.

carbocation · on April 22, 2013

> This post should be called "Go 1.1 old broken crypto performance" ;-) If you're designing a new system, there's no sane reason to use MD5, SHA1, or RC4.*

I think your post is being downvoted because it seems to conflate cryptographic primitives with tools that you can use without modification to safely store a password.

tptacek · on April 23, 2013

A better reason to downvote that comment (which I didn't do) is that the performance of important cryptosystems is limited by MD5, SHA1, and RC4 performance. Cryptosystems like, say, TLS.

Fun fact, by the way: there aren't any practical attacks on HMAC-MD5.

dchest · on April 22, 2013

I ported scrypt to Go (https://code.google.com/p/go/source/browse/scrypt/scrypt.go?...), so I know how to store passwords ;-)

This has nothing to do with passwords. You can read about brokenness of RC4, MD5, or SHA1 in their respective Wikipedia articles.

carbocation · on April 22, 2013

First, thanks for your contribution. Second, I didn't downvote you. Third, the question was never whether you knew how to store a password. There are many reasons to need to be able to use even 'broken' primitives (esp. for compatibility with legacy apps or conversion), which makes them valuable (or at least necessary) for some apps. This is why I was guessing that your post was focused on passwords.

dchest · on April 22, 2013

I understand. I also said that there's no reason to use them for _new_ systems.

BTW, suggested reading: http://cr.yp.to/talks.html#2013.03.12 ("Failures of secret-key cryptography").

carbocation · on April 22, 2013

Gulp. You did; I missed it! My apologies.

And thanks for that link. Slide 313 is terrifying.

kzrdude · on April 22, 2013

No. MD5 and RC4 are legacy crypto. In use, but shouldn't be used for new software. (And none of them were ever viable for your usage example).

mtourne · on April 22, 2013

The crypto scheme used should depend on the importance of what you want to protect. For instance, I think using md5 for a secure token that will expire in a few minutes is reasonable.

dchest · on April 22, 2013

What's reasonable about it? Why not use a non-broken hash algorithm?

mtourne · on April 25, 2013

they're small and easy to compute.

I'm not talking about signing my microsoft update cert with md5() here.

By all means, I do encourage you to get a valid token for the next 20 minutes, find a collision attack and get a secure token that will work for another timestamp "sometime in the future" (without passing the captcha!) on the system I'm building..

dchest · on April 29, 2013

Need small output? Truncate the output of SHA-256. What "easy to compute" means? Performance? Use BLAKE2. Code complexity? sha256() is only 3 letters more than md5() (you're probably not writing your own implementation).

Asking other people (especially, not professional cryptographers) to try to break your system which knowingly uses insecure primitives is an awful way to evaluate system security.

Jabbles · on April 22, 2013

What did the Go team change to do this? Compiler optimizations or code changes? SIMD instructions?

jgrahamc · on April 22, 2013

AMD64 and 386 assembly implementations. My blog post links to the actual changes.

drivebyacct2 · on April 22, 2013

Oh, I assumed some of it was from the GC/scheduler work, or mostly just the asm impls?

dchest · on April 22, 2013

While not related to these ASM implementations, the code generator for Go code has also been improved: e.g. it now can generate rotatation instructions, which helps SHA-2 and some other crypto primitives.

kzrdude · on April 22, 2013

Yup that's absolutely essential for most block ciphers and similar crypto (i.e. hash functions).

minopret · on April 22, 2013

My understanding is that all three of these algorithms MD5, SHA1, and RC4 are inherently difficult to parallelize. The result cannot be computed by splitting the input, processing the pieces separately, and then combining the results. So they would not get the benefit of using "goroutines" rather than threads, a benefit which is important to the performance of many Go programs.

enneff · on April 22, 2013

Goroutines are not primarily about performance, they are for modelling concurrent processes in a natural way. This can have a performance in some cases, but I wouldn't classify goroutines the way you have.

rubinelli · on April 22, 2013

Excuse my ignorance, but could you use goroutines with different algorithms and pick the one that finishes first? Perhaps not in the case of MD5/SHA1, but for problems that have some algorithms that are very fast most of the time, but have a pathological worst case?

jgrahamc · on April 22, 2013

Yes, you could do that. It would be very simple to give the same channel to two goroutines. Each goroutine would transmit the solution on the channel. You would just wait around to get the first item off the channel.

Toy example: http://play.golang.org/p/imM9nOwer_

numbernein · on April 22, 2013

Go has no way to abort a goroutine so if you tried multiple algorithms you would use total CPU time for all algorithms. Unlike actual threads Go gives no assurance that goroutines are preemptively scheduled, so doing so might take worst case time anyway. Even worse, there is no guarantee that new goroutines run next so doing so could introduce a scheduling delay.

So basically it would be a terrible idea to do this in Go. With C/pthreads it might have some use in some really exceptional case.

These are all reasons of course why goroutines are a bad idea. They should have just used threads, but apparently they wanted to support 32-bit architectures or some archaic OSes where threads have a high unit cost.

enneff · on April 22, 2013

Gccgo did use threads for each goroutine for a while, and it was significantly slower than the gc "green thread" style implementation (used by both compilers today). Goroutines weigh in at about 4k each. OS threads are 64k minimum. That allows more than an order of magnitude more goroutines than threads, from a memory perspective. From an execution standpoint, goroutines managed by the Go runtime allow some nice scheduling optimisations, as the runtime is aware of the communication between goroutines. (some of these optimisations are in Go 1.1)

Finally, there is no reason goroutines can't be preemptively scheduled. It is just an implementation detail, and work has already begun on this front.

teenth · on April 22, 2013

Gccgo didn't have a garbage collector for a while, around the same time it was using threads. Was it even using a pool? Did it run goroutines for a short time on the current thread before moving to its own thread? Certainly not.

Thread overhead in Linux is 8k, only slightly more than goroutine. It's 24k for windows. You guys don't even know basic facts that should have guided the design.

enneff · on April 22, 2013

Really? Can a Linux process spawn 1m threads?

I didn't work on the design. If I made a mistake that's my fault. The people that did design Go and implement the runtime know what they're doing. You didn't respond to my other points, either.

teenth · on April 22, 2013

Yes a Linux process can spawn 1m threads, and testlimit64.exe spawned 250k threads both with 64k user stack and 1 MiB user stack... because that is just reserved address space if it isn't actually used. Only the kernel overhead matters to how many threads.

I'm not sure what point you want addressed... scheduling optimization? It's also a deoptimization in other ways, such as predictable latency and fairness.

I agree the designers knew what they were doing: designing a 'modern' language for 32-bit computers. The question is why?

Roboprog · on April 22, 2013

Why 32 bit computers? There are a lot of them out there.

A 32 bit computer is more than enough to run many kinds of embedded or consumer electronic type of equipment. How many decades were little 8 bit or 16 bit microcontrollers of the 1970s still in use inside equipment? I'm not trying to say that "640 K ought to be enough for anyone", just that it's nice to be able to make native code targeted at only what is considered "standard server/desktop" today.

Can't really comment on the thread capacity issue, but having a systems oriented language also target slightly obsolete hardware, such as an Intel Atom, maybe, is a good thing.

enneff · on April 22, 2013

Even given that what you say is true, these are all implementation details and have nothing to do with the design of the language.

teenth · on April 22, 2013

An implementation is what you have to use, you can't use 'the language design'. When you have to use a special build system to generate Go -and- C stubs to call out, can't use a standard linker or partial compilation or shared libraries, have a massively large exe, can't embed in other programs (Go has to be 'main'), can't use with SELinux, has huge unpredictable latency spikes, etc it'll be cold comfort that those problems are 'just implementation details'.

It's too bad they had NIH or thought 32-bit was the future and didn't stop at just creating a language.

cmccabe · on April 22, 2013

I don't understand any of your criticisms. What else would you expect to use to compile Go besides "a special build system"? A not-special build system? It's a compiled language. You can, however, use gccgo to link Go code with C/C++ code.

Both gccgo and cgo already use partial compilation. You can invoke the 6l, 8l, etc. commands directly if you really want to. One of the design goals was build speed. Shared libraries are on the roadmap. See issue https://code.google.com/p/go/issues/detail?id=256.

Go does not "have to be main". You can use gccgo to mix Go code into your existing C or C++ binary. See http://golang.org/doc/install/gccgo

What the hell does "can't use with SELinux" mean? Are you talking about issue 871 that was fixed in 2010?

You keep repeating over and over that Go is optimized for 32-bit systems. Repeating something doesn't make it true. In fact, exactly the opposite is true, however. Go uses a lot of address space, which is great on 64 bit, not so good on 32-bit. This has been discussed a lot years ago: http://lwn.net/Articles/428100/ is a good place to start.

I've seen you repeat over and over in multiple discussion threads that having lots of kernel threads is no big deal. What you don't seem to realize is that in C/C++, thread overhead is somewhere between 1 MB and 2 MB a thread if you want to use pthreads and glibc and avoid random data corruption resulting from thread heap collisions (hint: you do). Linux also doesn't have an O(1) scheduler any more in mainline; it uses the completely fair scheduler (CFS), which has complexity O(log N). There are disadvantages to green threads, but to even start discussing this requires a lot more background than you have.

Again, you can read about any of this on wikipedia, LWN.net, or even the replies that inevitably get made to all of your posts on HN. So learn already.

cmccabe · on April 22, 2013

Thread overhead in Linux is 8k

Those are weasel words.

I assume your 8k figure is 4k kernel stack, and 4k userspace stack. First of all, your kernel may not even be compiled with 4k stacks. 8k kernel stacks are very common. But even if we assume that you do have a 4k kernel stack, you will never be able to do much useful with a 4k userspace stack.

If you want to use pthreads, the default thread stack size is 2 MB, and the minimum is 16kb. Let me Google that for you. http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_att...

So the absolute mimimum for your normal C/C++ programmer, who doesn't want to do things like reimplement pthreads, is 20 kb. Even that figure seems suspicious to me because it doesn't take into account things like thread-local data in glibc and various other system libraries.

You will quickly realize that if you want to do "luxurious" things like use standard glibc functions, call recursive functions, and so forth, you need more than 4k of userspace stack. For historical reasons, only the main thread has a growable stack-- all others have a fixed size stack. If you exceed the stack size on a thread, you will get undefined behavior, possibly a security vulnerability or a crash.

jgrahamc · on April 22, 2013

You can easily abort a goroutine using a channel: http://play.golang.org/p/OYnAdqnDLz

jesstaa · on April 22, 2013

goroutines don't help with parallelization, they only help with concurrency.

pcwalton · on April 22, 2013

Huh? If you run with GOMAXPROCS=(some number greater than 1) goroutines will be multiplexed onto multiple OS threads.

jesstaa · on April 28, 2013

Yep, and threads are what are helping with the parallelism. Goroutines don't make parallelism any faster or easier than threads do. But they make concurrency much simpler.