>We regularly invent things like 4GL, graphical programming and frameworks and engines such as Unity just to enable more people to do programming.
You're right on the money on this.
Earlier this month I visited a company for a complete demo of a prototype full one-to-one train simulator trainer, mostly designed and programmed by a former civil engineer using the Unity engine. According to the company, they could not have done it without Unity (or a similar engine) because it would have been prohibitively expensive to develop.
In related news, Unity recently released an AI ecosystem, namely the Unity AI Suite.
Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].
Given the release timelines, I suspect all the 4.x models after Opus 4 are probably self-distillation based fine-tuned models. The latest paper, by Apple, focuses on code generation using a simple technique, hence the name simple self-distillation (SSD) [4],[5].
I've got a strong feeling that self-distillation is the second-best thing to happen to LLMs after the transformer breakthrough.
So first - these are terrific papers and I'd not seen some of them before.
Having said that, I don't think these are classic student-teacher distillation from random (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper uses exactly what I was talking about: "fine-tune on those samples with standard supervised fine-tuning".
It's worse than useless, it's borderline criminal /s
The fabricated title goes for sensation rather than substance, the typical scenario whenever "All" is in the title, and worst of all when it's the very first word.
Fun fact: the designer and builder of this railway is widely regarded as the Father of Railways [1]. He was also the inventor of an early safety lamp, the Geordie lamp [2].
He invented his safety lamp at about the same time as Sir Humphry Davy invented the Davy lamp, using different principles, well before electric lamps existed. The Geordie lamp is safer than the Davy lamp, but somehow the original verdict on the invention went in favour of Davy, and was only later overturned in favour of Stephenson.
The Geordie lamp was mainly used in the mines around Newcastle and the North East, whereas the Davy lamp was used all over the UK. Interestingly, because of the popularity of the Geordie safety lamp in Newcastle upon Tyne and the wider Tyneside region of North East England, its natives are called Geordies [3].
“The next generation of pathologists must be equipped with the skills and knowledge to effectively navigate AI to their advantage and take ownership and oversight of AI tools.”
Fun fact: this paper is cited by the Simple Self-Distillation (SSD) paper by Apple [1],[2]. I think it's a bad naming scheme, both because SSD is already a very common acronym and because the method really belongs to on-policy self-distillation [3]. But then again, according to the authors, their proposed solution is simple because "SSD uses only temperature-shifted samples from the base model and standard cross-entropy training, without privileged context, feedback-conditioned teachers, or auxiliary supervision."
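For anyone wondering what "temperature-shifted samples" plus "standard cross-entropy training" amounts to mechanically, here is a toy sketch in plain Python. It is not the paper's code: the three-token vocabulary, the temperature of 0.7, the learning rate and all function names are my own assumptions, and a bare logit vector stands in for the model.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; T < 1 sharpens, T > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-shifted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def cross_entropy(logits, target):
    """Standard cross-entropy of the (T = 1) distribution against one target."""
    return -math.log(softmax_with_temperature(logits, 1.0)[target])


def sgd_step(logits, target, lr):
    """One gradient step on cross-entropy; d(loss)/d(logit_i) = p_i - onehot_i."""
    probs = softmax_with_temperature(logits, 1.0)
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


rng = random.Random(0)
base_logits = [2.0, 1.0, 0.1]  # hypothetical "base model" over 3 tokens

# 1) Sample from the base model itself at a shifted temperature (assumed 0.7).
samples = [sample_token(base_logits, 0.7, rng) for _ in range(200)]

# 2) Fine-tune a copy of the same model on its own samples with plain SFT.
student_logits = list(base_logits)
for token in samples:
    student_logits = sgd_step(student_logits, token, lr=0.05)

print(softmax_with_temperature(base_logits, 1.0))
print(softmax_with_temperature(student_logits, 1.0))
```

The point of the toy is only that there is no separate teacher, no reward and no extra context anywhere in the loop: the model samples from itself and is then trained on those samples with the ordinary loss.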
The Apple paper also cites another, very similar self-distillation paper by a UCLA team. Both cited papers, one by the MIT & ETH team and the other by the UCLA team, propose a novel on-policy self-distillation technique. Interestingly, the two teams submitted their papers to arXiv within one day of each other back in January this year [4],[5]. No prize for guessing who actually published the idea first.
IMHO, self-distillation fine-tuning is the future of LLM fine-tuning because it mitigates the forgetting that comes with the SFT approach, which can be cumbersome for lightweight fine-tuning as opposed to full post-training of an LLM.
With the advent and proliferation of a plethora of open source and open weight LLM foundation models, anyone can fine-tune these models for domain specialization or sub-specialization (like medical sub-specialties, law disciplines, branches of architecture practice, etc.) [6]. According to the two papers, this fine-tuning can be done with as little as 8 H200 or even 4 H100 GPUs, respectively [4],[5]. Let's see if we can replicate that with much cheaper setups: a couple of DGX Sparks, or the latest arrangement of eight DGX Spark based nodes with a total of 1 TB RAM (128 GB x 8) [7],[8].
IMHO, if the results are valid, self-distillation could be the second-best thing to happen to LLMs after the transformer.
Did you source this from your Zotero or something like it? I love your format and the resource discipline it requires, and would be interested to know if this is “all human brain” or tool assisted.
Strange, I personally have never heard or read about Accelerate before. I think it has the same main problem as Futhark: a generic name for a language (this very issue was already mentioned in a different concurrent post).
As a modern array language, perhaps Accelerate should look into D4M as a basis; it also started 10+ years ago [1].
Like SQL, D4M is grounded in math, but in associative array algebra rather than relational algebra. It's more generic, since it caters to most modern data abstractions, including spreadsheets, database tables, matrices and graphs [2].
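To make the "one abstraction for tables, matrices and graphs" idea concrete, here is a minimal Python sketch of an associative array keyed by (row, column) strings. It is an illustration only, not D4M's actual API: the `AssocArray` class, its methods and the sample data are all hypothetical.

```python
from collections import defaultdict


class AssocArray:
    """Toy sparse array mapping (row_key, col_key) -> numeric value."""

    def __init__(self, triples=()):
        self.data = {}
        for row, col, val in triples:
            self.data[(row, col)] = self.data.get((row, col), 0) + val

    def __add__(self, other):
        # Addition: union of the keys, summing values where keys collide.
        out = AssocArray()
        out.data = dict(self.data)
        for key, val in other.data.items():
            out.data[key] = out.data.get(key, 0) + val
        return out

    def __matmul__(self, other):
        # Multiplication: correlate our column keys with the other's row keys.
        by_row = defaultdict(list)
        for (row, col), val in other.data.items():
            by_row[row].append((col, val))
        out = AssocArray()
        for (row, mid), val in self.data.items():
            for col, weight in by_row.get(mid, ()):
                out.data[(row, col)] = out.data.get((row, col), 0) + val * weight
        return out


# The same object reads as a table of triples, a sparse matrix, or a graph.
edges = AssocArray([("alice", "bob", 1), ("bob", "carol", 1)])
two_hop = edges @ edges   # paths of length two in the graph view
doubled = edges + edges   # element-wise sum in the matrix view
print(two_hop.data)
print(doubled.data)
```

Because rows and columns are arbitrary strings rather than integer indices, the same two operations cover both a join-like table query and a graph traversal, which is roughly the generality being claimed above.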
D4M with Accumulo achieved 100M database inserts per second more than a decade ago, back in 2014 [3].
[1] D4M: Dynamic Distributed Dimensional Data Model:
I think this is where the D language makes an excellent alternative to Python for AI assisted coding [1].
1) It's a very consistent language, even compared to other popular languages such as Python, Rust, C++ and Go. Try implementing a doubly linked list in each of them and compare [1].
2) It's probably the most "Pythonic" of the compiled languages, according to Walter.
3) It uses GC by default, but you can also manage memory yourself or mix the two.
4) It compiles fast and runs fast; heck, it even has a built-in REPL ecosystem.
5) Regarding the small training set: with the recent self-distillation fine-tuning approach it should be good enough, since D (actually the D2 version) has been around for more than a decade [2].
[1] Looking for a Simple Doubly Linked List Implementation:
>I still don’t understand why we lack a language that will take uncomplicated computation heavy code and turn it into SIMD / multi thread / multiprocessing / GPU code with minimal additional syntax.
It already (partly) exists: it's called the D language. By default it's garbage collected (GC), but you can also program without the GC or mix both. It's modern, backward compatible with C, and included in GCC.
The linear algebra system in D, Mir GLAS, is a standalone BLAS implementation written directly in D [1]. It was shown to be faster than widely used conventional BLAS libraries like OpenBLAS back in 2016, about ten years ago!
The popular OpenBLAS includes the Fortran-based LAPACK (yes, you read that right, Fortran) and is currently used by almost all data processing languages: Matlab, Julia, Rust and also Mojo [2].
Interestingly, a standalone BLAS implementation written directly in Mojo, namely mojoBLAS, similar to Mir GLAS, was started very recently and is at a very early stage [3].
>Surely this is the sort of thing compiler / language design nerds dream about?
You can say that again.
Especially on the GC side of the programming language, since the SIMD / multi thread / multiprocessing / GPU part can be abstracted away.
Actually, someone recently proposed VGC, a virtualized garbage collector for Python in C++, for heterogeneous GC [4],[5]. However, the current evaluation excludes JIT compilation, AOT optimization, SIMD acceleration, and GPU offloading.
I don't think Mojo depends on OpenBLAS or any other BLAS implementation. I remember that in the early days they took a lot of pride in how linalg primitives like matmul, written entirely in Mojo, were faster than MKL, OpenBLAS and other implementations.
[1] Unity AI Suite:
https://unity.com/features/ai