Hacker News
Inversion: Fast, Reliable Structured LLMs (rysana.com)
110 points by swyx on March 18, 2024 | hide | past | favorite | 25 comments


Most likely a small, retrieval-optimized model focused on JSON tokens, combined with the constrained-decoding techniques used in grammar/guidance tools

https://github.com/guidance-ai/guidance

Same principle


I was initially impressed with the landing page, but it does look a bit suspect when things are claimed to be 100x faster without much info on the HW acceleration or the model sizes.

My best guess is that they're using two approaches to get this running faster:

- structured generation techniques from sglang (https://github.com/sgl-project/sglang) that allow them to generate faster JSON (with look-ahead / pre-fill) with strong guarantees on the output (i.e. 100% reliable, without requiring any retries).

- distilling a GPT-3.5-Turbo-esque model from GPT-4 JSON outputs, and using it in conjunction with the above to give additional performance boosts on inference.

It doesn't seem like they're deploying on any custom silicon, nor have they optimized GPU kernels to suggest that the speed ups came there.


I thought they pretty clearly explained where the additional performance came from. If there is only one valid schema-conforming option, you can skip the LLM entirely. If there are only a limited number of possible tokens (e.g. only } or ,), then you run only a smaller subset of the model. Between these two you capture a large fraction of the actual token count of most JSON.
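As a toy sketch of those two shortcuts (all names here are mine, not Rysana's, and the "smaller model" branch is simplified to masked scoring):

```python
def next_token(valid_tokens, score):
    """Pick one token under a structural constraint.

    valid_tokens: schema-allowed continuations at this position.
    score: stand-in for an LLM call that scores a token.
    Returns (token, whether_the_model_was_called).
    """
    if len(valid_tokens) == 1:
        # Only one schema-conforming option (e.g. a forced "," or "}"):
        # emit it directly, no model call needed.
        return next(iter(valid_tokens)), False
    # Several options: ask the model, but only score the valid subset.
    return max(valid_tokens, key=score), True
```

In a real decoder the constraint set would come from a JSON-schema-driven state machine over the tokenizer's vocabulary.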


I've been doing a lot of indie work with structured generation and llama.cpp; you can get extremely fast responses with caching and deterministic token skipping.

When generating JSON, calling the LLM when you already know that after

    { "brand": "Toyota"
comes

     , "year": 
is a massive waste. If the data itself is constrained, you can skip most of that as well! You'll go down from needing 20 calls to the LLM to just three for a simple piece of data like

    { "brand": "Toyota", "year": 1995 }
If they combine these techniques with a model that's specifically trained for structured output, along with the novel inference-time pruning technique they were talking about in the post, I can definitely see them getting these kinds of inference speeds.
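The skipping idea above can be sketched like this (helper names are made up): all the scaffolding between value slots is emitted for free, with one model call per unknown value.

```python
import json

def fill_template(slots, generate):
    """slots: ordered (key, type_hint) pairs from the schema.
    generate: stand-in for an LLM call producing one value."""
    obj = {}
    for key, hint in slots:
        obj[key] = generate(key, hint)  # the only place the model runs
    # Braces, quotes, colons and commas are deterministic: emit them directly.
    return json.dumps(obj)
```

For the Toyota example above, that's a couple of model calls instead of one per token.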

I'm experimenting with a self-hosted API that is fast enough to not even need a GPU for single-user use cases (the latency is good, but there's no batching). Once I'm done with the finishing touches I'll rent a GPU server for actual hosting.


Have you seen any blog posts that have some details on this? I can imagine, roughly, the concept but it sounds interesting and I'd like to get a better understanding.



Thanks!


Yeah, dottxt has a post on this; it's a technique starting with a C, I believe, something like continuum?

The OP also has something about compressing finite state machines; it's basically the same thing with slightly different details in the implementation.


> strong guarantees on the output (i.e. 100% reliable, without requiring any retries).

Has anyone seen a good JSON library that can handle slightly broken JSON? e.g. trailing commas, unescaped newlines, etc.? I have not found a good one.


Entirely broken JSON, no — I would be surprised if one existed. If you want slightly laxer semantics like trailing commas, JSON5 [1] is a pretty good spec and is JSON-compatible. I used to use it for LLMs (while telling them to emit JSON — no need to confuse them by explaining JSON5), in order to handle things like trailing commas, but in my experience LLMs have gotten good enough over the last year I mostly don't even bother anymore.

1: https://json5.org/
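For the trailing-comma case specifically, a crude stdlib-only fallback (my own sketch, far weaker than a real JSON5 parser) is to strip trailing commas before parsing:

```python
import json
import re

def loads_lax(text):
    """Parse JSON while tolerating trailing commas.

    Caution: the regex does not understand strings, so a literal
    ",}" or ",]" inside a string value would be corrupted."""
    return json.loads(re.sub(r",\s*([}\]])", r"\1", text))
```

Good enough for quick experiments; use a proper JSON5 parser for anything serious.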


You can force LLMs to generate valid JSON by using a context-free grammar, FWIW


Please elaborate


Matt Rickard has a good entry-level blog post about it, from the angle of regex constraining [0]. Context-free grammars follow the same principle, except using a pushdown automaton rather than a plain finite state machine to restrict the action space.

[0]: https://matt-rickard.com/rellm
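For intuition, the constraining idea can be sketched with a hand-written DFA (everything below is illustrative; real implementations compile the pattern against the tokenizer's actual vocabulary):

```python
# DFA accepting prefixes of r"[0-9]+\}", e.g. a JSON integer field ending "}".
TRANSITIONS = {
    ("start", "digit"): "digits",
    ("digits", "digit"): "digits",
    ("digits", "}"): "done",
}

def _label(ch):
    return "digit" if ch.isdigit() else ch

def viable_prefix(s):
    """True if s could still be extended into (or already is) a full match."""
    state = "start"
    for ch in s:
        state = TRANSITIONS.get((state, _label(ch)))
        if state is None:
            return False
    return True

def allowed_tokens(prefix, vocab):
    """Logit mask: keep only tokens that keep the output viable."""
    return [t for t in vocab if viable_prefix(prefix + t)]
```

At each decoding step you mask the model's logits down to `allowed_tokens`, so invalid output is impossible by construction.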


Look at llama.cpp grammars, lmql, or guidance-ai


Should be easy to build on top of a lexer as a pre-parsing pass.


Dirtyjson


Despite reading it twice, I couldn't work out why they chose char/s or Hz as an appropriate measure. They also provided no benchmarks or model sizes except for a relative comparison with models 10x or 100x their size, which leads me to assume this is a small model, maybe?


My guess is they're generating the structure of the JSON programmatically (i.e. keys, commas, braces) and doing JSON escaping for the strings programmatically, not handling JSON in the LLM at all. Hence they're comparing char/s: first, it's not just generating tokens; second, it's better for their benchmarks to compare char/s (since they don't hit the LLM for a lot of their characters) rather than LLM tokens/s (which are probably somewhat faster, but not 100x faster).
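A sketch of that guess (function names are hypothetical): structure and escaping come from plain code, so only the raw values count against the model, and char/s gets inflated accordingly.

```python
import json

def emit_object(keys, generate_value):
    """Return (json_text, fraction_of_chars_from_the_model).
    generate_value: stand-in for the model producing raw string values."""
    parts, model_chars = [], 0
    for key in keys:
        value = generate_value(key)      # only these chars cost inference
        model_chars += len(value)
        parts.append(f"{json.dumps(key)}: {json.dumps(value)}")  # free escaping
    text = "{" + ", ".join(parts) + "}"
    return text, model_chars / len(text)
```

Even for a one-field object, well under half the emitted characters come from the model.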


Yeah, it's weird they don't mention parameter size or other reasoning metrics. It's a very cool approach to getting structured output from an LLM, but the benchmarks don't show us the whole picture. I'm wondering if their approach can be used to delegate to different models at each step in a structured output. If it could be run with Mixtral 8x7B and still maintain its performance, then that's awesome.


The last bit is already under way with speculative decoding, but I wonder what exactly they're proposing.


This seems like a very practical and well thought out approach. Turning unstructured data into valid structured data is in my experience one of the most important things for integrating LLMs deeply into a pipeline. Doing that fast and cheap goes a very long way for these use cases. Also, if you need stronger content generation than this, nothing prevents you from generating some higher quality content in another LLM and then passing it through this to structure it.


Currently, LLMs are not state of the art at Named Entity Recognition. They are slower, more expensive, and less accurate than a fine-tuned BERT model.

However, they are way easier to get started with, thanks to in-context learning. Soon they will be cheap enough, and probably fast enough too, that training your own model will be a waste of time for 95% of use cases (probably higher, because they will unlock use cases that wouldn't have broken even with the old NLP approaches from a value perspective).

This is why I am tracking LLM structured outputs here:

https://github.com/imaurer/awesome-llm-json

And created an autocorrecting Pydantic library that could be used for named-entity linking:

https://github.com/genomoncology/FuzzTypes


I was excited about their AI capabilities, but seeing that they built/promote their own UI framework, I couldn't help but think that their focus is not where it should be. Why would you spend the time and resources to build your own UI framework when your main product is an API?


Interesting, but I think there's one comparison missing. When I use GPT-4 with function calling against a real system, a single call usually returns 5-6 responses: the first with content containing the plan/reasoning, followed by multiple function calls (parallel function calling).


"Ability against models with 10x or 100x more parameters"

This is a small model optimized for retrieval and function calling. "Reasoning" makes an appearance in the title but no standard benchmarks of general ability, such as MMLU or HumanEval, are mentioned. No details about the training process and no access to the models other than via API.

Nice marketing, but looks empty. I can also make an LLM that runs 1000x faster than Mistral:

    def complete(prompt): print('As an AI language model...')



