I hate to say it but while this does seem very impressive and a step forward in how we interact with AI, the use-cases they present and the UX both seem unrealistic and/or unhelpful.
With the exception of the real-time translation (which seems like it should be a separate product all by itself), none of the use-cases they presented had much utility. I don't want anything to count the number of animals in my stories or time a trivia quiz for me. The auto-slouch-detector, while the demo was pretty funny, just seems so dystopian and weird. AI interrupting you to scold you about taking elderly parents mountain biking instead of waiting for you to finish to scold you? No thanks.
The UX is also an issue - the model interrupting the user (even when apparently required by these strange use-cases) is jarring and makes one lose their flow. You can even see this in the demo videos that they put out - the employees/actors had to really concentrate to continue speaking as if they weren't being interrupted by a brash robotic machine. A human, when participating in this (rare) "invited interruption", has the ability to speak "under" the main speaker, and I feel it's generally timed with a lot of nuance.
Even in the auto-translation demo, they ducked the human's audio but the AI steamrolled him and it would have been impossible to actually do that demo without either an incredible amount of control over one's speaking, or (more likely) muting the output. A human translator has a way of "pointing" the "output" to the intended speaker.
The very best part of this tech was presented in the first video, where it shows the AI not needlessly interrupting the user. This seems to me more like an important fix for a bug that the current models still (somehow) have.
Maybe a good use-case for this would be counting "um's" and the like while practising public speaking.
An omni model seems very useful for real-time human-computer interaction, off the top of my head:
- Voice assistants
- Customer experience
- Gaming
- Meeting assistants
- Real-time coach or user assistant for using software
- Translation
- Real-time work on a computer controlled by voice (frontend / mobile dev, CAD, 3D modeling, etc)
Traditionally a lot of these use cases with LLM agents are higher latency because the model needs to wait for the speaker to finish, then decide whether to call a tool or respond - if it calls a tool, it needs to process the tool result and decide again whether to call another tool or respond, etc...
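To make the latency point concrete, here is a rough sketch of that turn-based loop. Everything in it is made up for illustration: `Step`, `run_turn`, the `decide` callback standing in for the model, and the toy `weather` tool are assumptions, not any vendor's API. The point is only that nothing starts until the full utterance is in, and each tool call adds another sequential model call before the user hears anything.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    kind: str     # "tool" or "respond"
    payload: str  # "tool_name:argument" for tools, reply text otherwise


def run_turn(
    utterance: str,
    decide: Callable[[list[str]], Step],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> tuple[str, int]:
    """Turn-based loop: runs only once the whole utterance is available.

    Returns the reply and the number of sequential model calls it took.
    """
    context = [utterance]
    model_calls = 0
    for _ in range(max_steps):
        step = decide(context)  # hypothetical stand-in for one model call
        model_calls += 1
        if step.kind == "respond":
            return step.payload, model_calls
        # Tool call: run it, append the result, and loop back to the model.
        name, _, argument = step.payload.partition(":")
        context.append(tools[name](argument))
    return "Sorry, I could not finish that.", model_calls


def fake_decide(context: list[str]) -> Step:
    # Toy policy: call the weather tool once, then answer with its result.
    if len(context) == 1:
        return Step("tool", "weather:Cape Town")
    return Step("respond", f"Forecast: {context[-1]}")


tools = {"weather": lambda city: f"sunny in {city}"}
reply, model_calls = run_turn("What's the weather in Cape Town?", fake_decide, tools)
print(reply, model_calls)
```

Even in this toy case the user waits for two model calls plus a tool call after they stop talking, which is the gap a streaming omni model would be trying to close.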
I'm not saying an omni model isn't useful for HCI - essentially my problem is that these demos seem to be highlighting the model's ability to interrupt the user (which is almost always not a good thing) and its ability to keep time (which should be a non-issue really), and they showcase these using fairly lame use-cases.
Product Consultant / AudioDiary / https://audiodiary.ai / Remote / CONTRACT / flexible hours / Up to $250 daily or performance-based - subject to discussion.
Our app, AudioDiary, has recently been through a period of highly organic growth. We need someone with successful experience in scaling apps to help us grow more intentionally. We're open to any kind of improvement, from product to app store presence and beyond, with the main goal being growth.
AudioDiary is a new and exciting project that's already helping thousands of people. We have big ambitions and we hope that working with us could be the start of something wonderful.
If interested, send us an email that includes your relevant work experience, in particular your success stories with growing new products.
I'm getting it from the fact that you aren't showing people a job posting. Nor are you asking about their experience. Your application process is to send you ideas for growth. That is why it sounds like free consulting.
I know the kind of person you are looking for. I am that type of person. And I'm telling you there is no way I'd download an app and send my ideas in as part of a job application. I'd gladly do so as a small paid gig during the application process, when we have talked and it feels like my experience matches your needs. But not as the initial contact, no way.
I don't understand, what could be built with this platform that wouldn't be made obsolete by conceivable updates to ChatGPT?
Another commenter suggested a hotel search function:
> Find me hotels in Capetown that have a pool by the beach. Should cost between 200 dollars to 800 dollars a night
ChatGPT can already do this. Similarly, their own pizza lookup example seems like it would exist or nearly exist with current functionality. I can't think of a single non-trivial app that could be built on this platform - and if there are any, I can't think of any that would be useful or not in immediate danger of being swallowed by advances to ChatGPT.
ChatGPT can only do this now because the information is essentially freely available. Booking.com etc post their pages on the web to get traffic. In the world OpenAI is imagining, people will rarely if ever interact with the internet directly, it’ll instead all be through intermediary LLMs. In that world, the organisations that own authoritative information about hotel prices and locations will not make that freely available to LLMs, they will sell it. ChatGPT is trying to get ahead by encouraging them to embed themselves directly into their platform so they get first dibs on this kinda stuff before they put up the walls.
> Find me hotels in Capetown that have a pool by the beach. Should cost between 200 dollars to 800 dollars a night
I built this 18 months ago at an OTA platform. We parse the query and identify which terms are locations, which are hotel features, which are room amenities etc. Then we apply those filters (we have thousands of attributes that can be filtered on, but cannot display all of them in the UI) and display the hotel search results in the regular UI. The input query is also through the normal search box.
This does not need to be, and should not be, done in a chatbot UX. All the implementation is on the backend, and the right display is the already existing UI. This is semantic search, and it comes as a standard capability in Elasticsearch, Supabase etc., though we built our own version.
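As a toy illustration of the "parse the query into filters" step described above: the sketch below uses plain phrase lookup and a regex, not real semantic search, and the vocabularies, the `parse_query` name and the filter keys are all invented for the example. The production system described above is not public, so none of this should be read as how it actually works.

```python
import re

# Hypothetical vocabularies; a real system would have thousands of attributes.
LOCATIONS = {"capetown": "Cape Town", "cape town": "Cape Town"}
HOTEL_FEATURES = {"pool": "pool", "by the beach": "beachfront", "beach": "beachfront"}

PRICE_RANGE = re.compile(
    r"between\s+\$?(\d+)\s*(?:dollars)?\s*(?:to|and)\s+\$?(\d+)"
)


def parse_query(query: str) -> dict:
    """Turn a free-text hotel query into structured filters."""
    text = query.lower()
    filters = {"location": None, "features": [], "price_min": None, "price_max": None}

    # First known location phrase found in the query wins.
    for phrase, canonical in LOCATIONS.items():
        if phrase in text:
            filters["location"] = canonical
            break

    # Collect each matching feature once, in vocabulary order.
    for phrase, feature in HOTEL_FEATURES.items():
        if phrase in text and feature not in filters["features"]:
            filters["features"].append(feature)

    # "between 200 dollars to 800 dollars" style price ranges.
    price = PRICE_RANGE.search(text)
    if price:
        filters["price_min"] = int(price.group(1))
        filters["price_max"] = int(price.group(2))
    return filters


filters = parse_query(
    "Find me hotels in Capetown that have a pool by the beach. "
    "Should cost between 200 dollars to 800 dollars a night"
)
print(filters)
```

The resulting dict is the kind of thing that can be handed to the existing search backend and rendered in the normal results UI, which is the argument being made here.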
We built something like this too (in a different field), but it's actually quite hard to deal with all the edge cases that people might want to search for:
e.g. if the user asks "Find hotels in Capetown [...] that have availability for this christmas or new year" and your backend, or the response format that you're forcing the LLM to give, doesn't have the ability to do an OR on the date range, you can't give the results the user wants. The LLM then does the best it can, and the user ends up getting only hotels that are available for both Christmas and New Year (thus missing some that have availability for one or the other), or the LLM does some other unwanted thing. For us, users would even ask for "June or August" and then get July included, because that was the closest thing the backend / UI could do.
So this approach is actually less flexible than a chat interface, where the LLM can figure out "Ah, I need to do two separate hotel search MCP calls, and then merge the results to not show the same hotel twice".
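A minimal sketch of that "two searches, then merge" idea, assuming a hypothetical `search(start, end)` callable that returns hotel dicts with an `id` field. The `fake_search` backend, the hotel data and the dates are invented for the example; a real MCP tool call would sit where `search` is.

```python
from datetime import date


def merge_searches(search, date_ranges):
    """Run one search per date range and merge, keeping each hotel once.

    Each merged hotel records which of the requested ranges it matched.
    """
    merged = {}
    for start, end in date_ranges:
        for hotel in search(start, end):
            entry = merged.setdefault(hotel["id"], {**hotel, "available": []})
            entry["available"].append((start, end))
    return list(merged.values())


# Hypothetical in-memory backend that only supports a single date range.
CHRISTMAS = (date(2025, 12, 24), date(2025, 12, 26))
NEW_YEAR = (date(2025, 12, 31), date(2026, 1, 2))
AVAILABILITY = {
    "A": [CHRISTMAS],
    "B": [CHRISTMAS, NEW_YEAR],
    "C": [NEW_YEAR],
}


def fake_search(start, end):
    return [
        {"id": hotel_id, "name": f"Hotel {hotel_id}"}
        for hotel_id, ranges in AVAILABILITY.items()
        if (start, end) in ranges
    ]


results = merge_searches(fake_search, [CHRISTMAS, NEW_YEAR])
for hotel in results:
    print(hotel["name"], len(hotel["available"]))
```

Hotels A and C, which a single AND-style query would drop, show up here, and B appears once rather than twice.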
We didn't support the time dimensions, but I think it could be done without too much issue. You suggest displaying search results in a chat interface but that doesn't work because there are easily hundreds of hotel results for most searches. The user would need to click on a thumbnail in chat into the list of search results on the OTA.
You want it in a chat with other tools and intelligence so that you can give softer preferences and for it to judge reviews and the like. Perhaps even look at the room layout and photos to see if it is something you would like. There are good reasons to surround the tool you describe with AI.
I don't think such massive amounts of text should be parsed at runtime. Hotels can have 100s or 1000s of reviews. We batch created attributes for hotels based on reviews, and when a semantic search was run, those attributes were matched.
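Roughly what that split between batch time and search time could look like. This is a sketch only: keyword matching stands in for whatever extraction was actually used (not stated above), and `ATTRIBUTE_KEYWORDS`, `build_attribute_index`, `search` and the `min_mentions` threshold are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical attribute vocabulary; keyword matching is a stand-in for
# whatever extraction method the batch job really uses.
ATTRIBUTE_KEYWORDS = {
    "quiet": ["quiet", "peaceful"],
    "clean": ["clean", "spotless"],
    "family_friendly": ["kids", "family"],
}


def build_attribute_index(reviews_by_hotel, min_mentions=2):
    """Offline batch step: derive a set of attributes per hotel from reviews."""
    index = {}
    for hotel_id, reviews in reviews_by_hotel.items():
        counts = Counter()
        for review in reviews:
            text = review.lower()
            for attribute, keywords in ATTRIBUTE_KEYWORDS.items():
                if any(keyword in text for keyword in keywords):
                    counts[attribute] += 1
        # Keep only attributes that enough separate reviews mention.
        index[hotel_id] = {a for a, n in counts.items() if n >= min_mentions}
    return index


def search(index, wanted):
    """Search-time step: set lookup only, no review text is read."""
    return sorted(h for h, attributes in index.items() if wanted <= attributes)


reviews_by_hotel = {
    "h1": ["Very quiet and clean", "Peaceful stay, spotless room", "Great for kids"],
    "h2": ["Clean room", "Noisy street"],
}
index = build_attribute_index(reviews_by_hotel)
print(search(index, {"quiet", "clean"}))
```

The expensive pass over hundreds or thousands of reviews happens once per hotel in the batch job, so the query path stays a cheap set comparison.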
There are multiple branches they are exploring. This is a more structured one. But they also work on agents that load the website and produce clicks to do the task. Also, this requires hand design, but they also work on generating the GUI just-in-time, based on context.
They also have this new design GUI for visual programming of agents, with boxes and arrows.
It's going to be a hybrid of all these. Obviously the more explicit work done for interoperability, the easier it is, but the gaps can be bridged with the common sense of the AI at the expense of more time and compute. It's like, a self driving car can detect red lights and speed limit signs via cameras but if there are structured signals in smart infrastructure, then it's simpler and better.
But it's always interesting to see this dance between unstructured and structured. Apparently any time one gets big, the other is needed. When there's tons of structured code, we want AI common sense to cut through it because even if it's structured, it's messy and too complicated. So we generate the code. Now if we have natural language code generators, we want to impose structure onto how they work, which we express in markup languages, then small scripts, then large scripts that are too complex and have too much boilerplate, so we need AI to generate it from natural language, etc.
There’s an incredibly long tail of profitable software businesses that would like to have a dynamic presence on ChatGPT and that OpenAI would never have any interest in stealing. OpenAI wants to be the entry point to the internet, much like Google has been for the last couple of decades.
ChatGPT’s generic search will not be that good compared to apps specialized in this.
I tried buying a special kind of lamp this weekend; all LLMs and Google sucked at this. The conversation did not help in finding more fine-grained results.
I'm really not advocating for people to push out reams of AI drivel and not learn anything while doing it, but of these three groups which ones are likely to be the most effective?
The ability to easily edit in word processors surely atrophied people's ability to really reason out what they wanted to write before committing it to paper. Is it sad that these traits are less readily available in the human populace? Sure. Do we still use word processors anyway because of the tremendous benefits they have? Of course. Similar could be said for spellcheckers, tractors, calculators, power tools, etc.
With LLMs, it's so much quicker to access a tremendous breadth of information, as well as drill down and get a pretty good depth on a lot of things too. We lose some things by doing it this way, and it can certainly be very misused (usually in a fairly embarrassing way). We need to keep it human, but AI is here to stay and I think the benefits far exceed the "cognitive decline" as mentioned in this journal.