Not sure what “your own” in the title is supposed to mean if you are running a model that you didn’t train using a framework that you didn’t write on a server that you don’t own.
I think in this case "your own" means under your control, rather than a service or license you pay for. "your own" as in ownership of artefacts, not as in being the creator.
Consider the source of the idiom: rolling your own cigarettes.
Which involves taking some rolling papers, a pouch of loose tobacco (or whatever), and perhaps a little device if you're rich. As opposed to manufactured cigarettes, you're just doing some manual assembly for the end-product.
You don't need to cultivate the plants or pulp any trees to roll your own.
Slammed an A380 in my old server that doesn't even have a GPU power connector & it works pretty well for stuff that will fit on it. They're only like, $150 brand new nowadays; could be a decent option.
Not sure what "baking your own bread" means if you are using wheat grown by someone else in an oven that you didn't build, run with electricity you didn't create from your muscles' force. You haven't even contributed to the nuclear fusion which created the oxygen for the water molecules you've been using! How dare you, standing on the shoulders of giants!
Is it "building your own oven" if you go to Lowe's, buy an oven, and install it yourself? You've done some work, but you're integrating a pre-built appliance into your kitchen, not building your own oven.
Wouldn't "Serverless OCR" mean something like running tesseract locally on your computer, rather than creating an AI framework and running it on a server?
You might be conflating "cloud" with serverless. Serverless is where developers can focus on code, with little care of the infrastructure it runs on, and is pay-as-you-go.
> You might be conflating "cloud" with serverless. Serverless is where developers can focus on code, with little care of the infrastructure it runs on, and is pay-as-you-go.
That's not what serverless means at all. Most function-as-a-service offerings still require developers to bother with infrastructure aspects, such as runtimes and even the underlying OS.
They just don't bother about managing it. They deploy their code on their choice of infrastructure, and go on with their lives.
A runtime is notably NOT infrastructure; had you said "instruction set" you might have landed closer to a compelling argument. The whole point is that AWS (and other providers) abstract away the underlying infrastructure and allow developers, as I said, to have "little care of the infrastructure it runs on". There is often advanced networking that CAN be configured, as well as other infrastructure components developers can choose to configure.
Unless the engineer takes steps to spin down EC2 infrastructure after execution, it is absolutely persistent compute that you're billed for whether you are doing actual processing or not, whereas Lambda and similar services are billed only for execution time.
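The billing difference above is easy to see with a back-of-the-envelope calculation. The sketch below uses hypothetical round-number prices, not current AWS rates:

```python
# Illustrative comparison of always-on EC2 vs pay-per-use Lambda billing.
# Both prices are made-up round numbers for illustration only.

EC2_HOURLY = 0.10             # $/hour for an instance left running
LAMBDA_GB_SECOND = 0.0000167  # $/GB-second of Lambda execution

def ec2_monthly_cost(hours_on=730):
    """An idle-but-running instance bills for every hour it exists."""
    return EC2_HOURLY * hours_on

def lambda_monthly_cost(invocations, seconds_each, memory_gb):
    """Lambda bills only for actual execution time."""
    return invocations * seconds_each * memory_gb * LAMBDA_GB_SECOND

always_on = ec2_monthly_cost()                      # billed 24/7
on_demand = lambda_monthly_cost(100_000, 0.5, 1.0)  # billed per call

print(f"EC2 always-on:                ${always_on:.2f}/month")
print(f"Lambda, 100k half-second calls: ${on_demand:.2f}/month")
```

Even with these rough numbers, a bursty workload that runs for a few thousand seconds a month costs cents on Lambda versus a fixed monthly bill for a forgotten instance.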
You can still be excited! Recently, GLM-OCR was released, which is a relatively small OCR model (2.5 GB unquantized) that can run on CPU with good quality. I've been using it to digitize various hand-written notes and all my shopping receipts this week.
(Shameless plug: I also maintain a simplified version of GLM-OCR without dependency on the transformers library, which makes it much easier to install: https://github.com/99991/Simple-GLM-OCR/)
When people mention the number of lines of code, I've started to become suspicious. More often than not it's X lines of code calling a massive library that loads a large model, either locally or remotely. Next we'll be spinning up your entire company infrastructure in two lines of code, only to discover it's a wrapper around a Terraform shell script.
I do agree with the use of "serverless" though. I feel like we agreed long ago that serverless just means you're not spinning up a physical or virtual server yourself, but simply asking some cloud infrastructure to run your code, without having to care about how it's run.
> When people mention the number of lines of code, I've started to become suspicious.
Low LoC count is a telltale sign that the project adds little to no value. It's a claim that the project integrates third party services and/or modules, and does a little plumbing to tie things together.
hi. i run "ocr" with dmenu on linux, that triggers maim where i make a visual selection. a push notification shows the body (nice indicator of a whiff), but also it's on my clipboard
I am working on a client project, originally built using Google Vision APIs, and then I realized Tesseract is so good. Like really good. Also, if PDF text is available, then pdftotext tools are awesome.
My client's use case was specific to scanning medical reports, but since there are thousands of labs in India with slightly different formats, I built an LLM agent which runs only after the PDF/image-to-text step, to double-check the medical terminology. And even then, only if our code cannot already process each text line through simple string/regex matches.
There are perhaps extremely efficient tools for much of the work we throw at LLMs.
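The "regex first, LLM only as fallback" pattern described above can be sketched roughly as follows. The line format and field names here are hypothetical, not the commenter's actual code:

```python
import re

# Try to parse each report line cheaply with a regex, and collect only
# the lines that simple matching cannot handle for a later LLM pass.
# Example line format (an assumption): "Hemoglobin 13.5 g/dL (12.0 - 15.0)"
LINE = re.compile(
    r"^(?P<test>[A-Za-z][A-Za-z ()/-]+?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s+"
    r"(?P<unit>[A-Za-z/%µ]+)"
)

def parse_report(lines):
    parsed, needs_llm = [], []
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            parsed.append((m["test"], float(m["value"]), m["unit"]))
        else:
            needs_llm.append(line)  # hand off to the LLM agent later
    return parsed, needs_llm

parsed, needs_llm = parse_report([
    "Hemoglobin 13.5 g/dL (12.0 - 15.0)",
    "TLC 8,400 cells/cumm",  # comma-grouped number: regex whiffs, LLM takes over
])
```

The point of the pattern is that the expensive model only ever sees the residue that the cheap path couldn't handle.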
HathiTrust (https://en.wikipedia.org/wiki/HathiTrust) has 6.7 million volumes in the public domain, in PDF from what I understand. That would be around 1.3 billion pages, if we consider a volume is ~200 pages, so roughly 6,700 days to go through it all with an A100-40G at 200k pages a day. That is one way to interpret what they say as being legal. I don't have any information on what happens at DeepSeek so I can't say if it's true or not.
Tried adding a receipt itemization feature to an app using OpenAI. It gets 95% right, but the remaining 5% are a mess. Mostly it mixes up prices between items (Olive oil 0.99 while Banana 7.99). Is there some lightweight open-source lib that can do this better?
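One way to avoid the price-swapping failure mode is to not let a model do the pairing at all: extract (item, price) per physical line, so a price can never "jump" to a neighbouring item. A minimal sketch, assuming prices appear at the end of each receipt line:

```python
import re

# Match "<item text>  <price>" where the price is the last token on the line.
PRICE_AT_EOL = re.compile(r"^(?P<item>.+?)\s+(?P<price>\d+[.,]\d{2})\s*$")

def itemize(receipt_text):
    items = []
    for line in receipt_text.splitlines():
        m = PRICE_AT_EOL.match(line.strip())
        if m:
            # Normalize European decimal commas before parsing.
            price = float(m["price"].replace(",", "."))
            items.append((m["item"], price))
    return items

items = itemize("Olive oil  7.99\nBanana  0.99\nTOTAL  8.98")
```

Note that summary lines like "TOTAL" also match and would need to be filtered out separately; an OCR or LLM step can still be used upstream to get the raw text, with this doing the pairing deterministically.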
So I'm trying to OCR thousands of pages of old French dictionaries from the 1700s. Has anything popped up that doesn't cost an arm and a leg and works pretty decently?
I use Gemini for that. Split the PDF into 50-page chunks, throw it into AI Studio, and ask it to convert it. A couple thousand pages can be done with the free tier.
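The chunking step is just splitting the page range into fixed-size windows. A small sketch (the actual page extraction could be done with any PDF tool, e.g. pypdf or qpdf):

```python
def chunk_ranges(total_pages, chunk_size=50):
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive."""
    for start in range(1, total_pages + 1, chunk_size):
        yield (start, min(start + chunk_size - 1, total_pages))

# e.g. a 1234-page dictionary becomes 25 chunks, the last being pages 1201-1234
ranges = list(chunk_ranges(1234))
```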
> The author spent $2 on A100 GPU time for 600 pages
Hopefully that cost would come down quite a bit, because that doesn't compete with most offerings right now IMO. I haven't tested it, but I can use models that have vision as an input modality for much cheaper, closer to 25k images per $1.
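Putting the two quoted price points side by side (both numbers are the ones claimed above, not measured):

```python
# $2 of A100 time for 600 pages, per the article
a100_pages_per_dollar = 600 / 2            # 300 pages per dollar

# "closer to 25k images per $1" for hosted vision models, per the comment
hosted_pages_per_dollar = 25_000

ratio = hosted_pages_per_dollar / a100_pages_per_dollar
print(f"Hosted vision models would be ~{ratio:.0f}x cheaper per page")
```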
Question for the crowd -- with autoscaling, when a new pod is created it will still download the model from Hugging Face, right?
I like to push everything into the image as much as I can. So in the image build, I run a command to trigger downloading the model, then in the app just point to the locally downloaded model. Bigger image, but no need to redownload on startup.
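A hypothetical sketch of that approach as a Dockerfile (the model name and paths are placeholders, not from the thread):

```dockerfile
FROM python:3.11-slim
RUN pip install --no-cache-dir huggingface_hub

# Download once, at build time, into a fixed path baked into the image,
# so new pods start without pulling from Hugging Face.
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('your-org/your-ocr-model', local_dir='/models/ocr')"

# The app reads the model from this local path instead of the hub.
ENV MODEL_PATH=/models/ocr
```

The trade-off is exactly as described: larger images and slower builds/pushes, but fast, network-independent cold starts when autoscaling adds a pod.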