> Based on the shown graph, this is misleading at best, essentially false. After 120K writes/s, p50 spikes from 10ms to 1s (1 second for a write!). That's a two-orders-of-magnitude latency spike, and an unacceptable one for an OLTP workload. It clearly shows the server is completely saturated, which is a non-operational regime. Quoting 144K is equivalent to quoting the throughput of a highway at the moment traffic comes to a standstill.
> Based on this graph, the highest number I'd quote is 120K. And you'd probably want to keep operating the server within a safe margin below peak, but since this is a benchmark, let's call 120K the peak. Actually, p50 isn't even the right metric: it should be a higher percentile (say p95) at which latency is within reasonable bounds. But for the sake of not overcomplicating things, it can be taken as a reference.
You definitely don't want to run a production system at saturation! But it's worthwhile to measure a complex system like Postgres at saturation, see when it gets there and how it behaves there, and then run at a slightly lower throughput.
> Therefore, you are not measuring Postgres peak performance, but rather Postgres performance under the IO constraints of this particular system. Certainly, 120K IOPS is the maximum that this particular instance can deliver. But it doesn't show whether Postgres could do better on a more performant disk. Actually, a good test would have been to try the next instance size up (db.m7i.48xlarge) with 240K IOPS and see whether performance doubles (within the same envelope of p50 latency) or not. And afterwards to test on an instance with local NVMe (you won't find this in RDS).
I've done some testing (not in the blog post)--doubling instance size/IOPS doesn't improve performance significantly because it doesn't affect the WAL bottleneck. Local NVMe should have a significant impact in theory, but I haven't tested this myself.
> 300 seconds test duration?? This is not operational. You are not accounting for checkpoints, the background writer, and especially autovacuum. Given that the workload pattern includes UPDATEs, you must validate bloat generation (or, equivalently, bloat removal) by a) observing much longer periods of time (e.g. 1h) and b) making sure the autovacuum configuration (and/or individual table vacuum configuration if required) keeps bloat contained in a stable way. Otherwise, the shown performance numbers will degrade over time, making them unrealistic.
Those are usage examples (notice the 1000 rps)--the actual benchmarks were run, and were stable, at much longer durations.
> You definitely don't want to run a production system at saturation! But it's worthwhile to measure a complex system like Postgres at saturation, see when it gets there and how it behaves there, and then run at a slightly lower throughput.
I disagree. A number at saturation is worthless, because "a slightly lower throughput" is at best unqualified hand-waving. Real numbers can be quite far from that saturation point.
Quote real production numbers instead. You can define them clearly; it's not that hard. E.g.: p95 latency below 10ms. That's it. Measure and report that number.
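For instance, a rough sketch of extracting that number, assuming pgbench's per-transaction logging (-l), whose documented log format puts each transaction's elapsed time in microseconds in the third field (file names here are illustrative):

    import glob

    def p95_latency_ms(log_glob: str = "pgbench_log.*") -> float:
        """Compute the p95 transaction latency in ms from pgbench -l logs."""
        latencies_us = []
        for path in glob.glob(log_glob):
            with open(path) as f:
                for line in f:
                    fields = line.split()
                    # Third field: transaction elapsed time in microseconds.
                    latencies_us.append(int(fields[2]))
        if not latencies_us:
            raise SystemExit("no pgbench log files found")
        latencies_us.sort()
        idx = int(0.95 * (len(latencies_us) - 1))
        return latencies_us[idx] / 1000.0

    if __name__ == "__main__":
        print(f"p95 latency: {p95_latency_ms():.2f} ms")

Then report the highest throughput at which that p95 stays under 10ms, rather than the throughput at saturation.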
> I've done some testing (not in the blog post)--doubling instance size/IOPS doesn't improve performance significantly because it doesn't affect the WAL bottleneck. Local NVMe should have a significant impact in theory, but I haven't tested this myself.
But those would be interesting numbers to share! "Doesn't improve performance significantly"--sorry, I'm not a big fan of unqualified data points. Is it 10%, 20%, 50%? And sure, when measured at saturation you won't see improvements. But measured at an operational regime, you should probably see notable improvements (unless other scaling factors start to dominate, in which case your benchmark becomes much more meaningful, because then you are finding Postgres's scaling limits and not just the limits of the disk it runs on). That changes the picture dramatically.
> Those are usage examples (notice the 1000 rps)--actual benchmarks were run at and were stable at much longer duration.
Sorry, but using that as an example gives me little confidence about the real intent. But glad to hear you ran at longer durations--add that information to the post! Still, that's not enough. Show the bloat and demonstrate how stable it is, given the tuning required to keep it contained, of course. Also show tps over time--I'm sure it drops notably in the presence of checkpoints--and then the "under 10ms latency at p95" will become dominated by write performance during checkpoints.
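A sketch of what "show the bloat" could look like over a long run, sampling dead-tuple counts and autovacuum activity (connection string, interval, and duration are illustrative; requires psycopg 3):

    import time
    import psycopg  # assumption: psycopg 3 is installed

    DSN = "postgresql://localhost/benchdb"  # illustrative connection string

    BLOAT_SQL = """
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
          FROM pg_stat_user_tables
         ORDER BY n_dead_tup DESC
         LIMIT 5
    """

    def watch_bloat(interval_s: int = 60, duration_s: int = 3600) -> None:
        """Sample dead-tuple counts during a long benchmark run to check
        that autovacuum keeps bloat contained rather than growing."""
        deadline = time.monotonic() + duration_s
        # autocommit so each sample sees a fresh statistics snapshot
        with psycopg.connect(DSN, autocommit=True) as conn:
            while time.monotonic() < deadline:
                for rel, live, dead, last_av in conn.execute(BLOAT_SQL):
                    print(f"{rel}: live={live} dead={dead} last_autovacuum={last_av}")
                time.sleep(interval_s)

    if __name__ == "__main__":
        watch_bloat()

If n_dead_tup keeps climbing across the run instead of sawtoothing back down after autovacuum passes, the reported throughput won't hold over time.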
Because you determine your SLOs not on the happy path, but on the opposite. And saying "Postgres can do 144K writes/sec on this machine" is beyond the happy path, so it's not meaningful to me.
It's documented behavior for the low-level API (e.g. asyncio.call_soon https://docs.python.org/3/library/asyncio-eventloop.html#asy...). More broadly, this has been a stable behavior of the Python standard library for almost a decade now. If it does change, that would be a huge behavioral change that would come with plenty of warning and time for adjustment.
In my experience, developers who rely on precise and relatively obscure corner cases tend to assume that they are more stable than they later prove to be. I've been that developer, and I've been burned because of it.
Even more painfully, I've been the maintenance programmer who was burned because some OTHER programmer trusted such a feature. And then it was my job to figure out the hidden assumption after it broke, long after the original programmer was gone. You know the old saying that you have to be twice as clever to debug code as you were to write it? Debugging another person's clever and poorly commented tricks is no fun!
I'd therefore trust this feature a lot less than you appear to. I'd be tempted to instead wrap the existing loop with a new loop to which I can add instrumentation etc. It's more work. But if it breaks, it will be clear why it broke.
It's been a stable (and documented) behavior of the Python standard library for almost a decade now. It's possible it may change--nothing is ever set in stone--but that would be a large change in Python that would come with plenty of warning and time for adjustment.
And then one day, Astral creates a new Python implementation in Rust or something that is way faster and all the rage, but does this particular thing differently than CPython. Whoops, you can’t use that runtime, because you now have cursed parts in your codebase that produce nondeterministic behaviour you can’t really find a reason for.
and then all the serverless platforms will start using Astral's new rust-based runtime to reduce cold starts, and in theory it's identical, except half of packages now don't work and it's very hard to anticipate which ones will and will not and behold! You have achieved Deno
Well, in my early days programming Python I wrote a lot(!!) of code assuming non-concurrent execution, but some of that code will break in the future with GIL removal. Hopefully the Python devs keep these important changes as opt-ins.
That's the cool thing about this behavior--it doesn't matter how complex your program is, your async functions start in the same order they're called (though after that, they may interleave and finish in any order).
Only for tasks that are created in synchronous code. If you start two tasks that each make a web request and then start a new task with the result of that request, you will immediately lose ordering.
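A minimal sketch of both points (names are illustrative): the "started" lines print in creation order because create_task schedules each task via call_soon, while finish order and tasks created later from inside running coroutines are not ordered.

    import asyncio

    async def worker(name: str, delay: float) -> None:
        # Each task prints as soon as the event loop first runs it.
        print(f"{name} started")
        await asyncio.sleep(delay)
        print(f"{name} finished")

    async def main() -> None:
        # Created back-to-back in synchronous code, so they start in this order.
        tasks = [
            asyncio.create_task(worker("first", 0.2)),
            asyncio.create_task(worker("second", 0.1)),
            asyncio.create_task(worker("third", 0.0)),
        ]
        await asyncio.gather(*tasks)
        # "started" prints first/second/third; "finished" prints third/second/first.
        # Tasks created later from inside a running coroutine (e.g. after a web
        # request completes) start in whatever order those coroutines resume,
        # so the original call order is lost.

    asyncio.run(main())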
The Restate model depends on a long-running external orchestrator to do the "pushing". However, that comes with downsides--you have to operate that orchestrator and its data store in production (and it's a single point of failure) and you have to rearchitect your application around it.
DBOS implements a simpler library-based architecture where each of your processes independently "pulls" from a database queue. To make this work in a serverless setting, we recommend using a cron to periodically launch serverless workers that run for as long as there are workflows in the queue. If a worker times out, the next will automatically resume its workflows from their last completed step. This GitHub discussion has more details: https://github.com/dbos-inc/dbos-transact-ts/issues/1115
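This is not DBOS's implementation, just a minimal sketch of the "pull from a database queue" idea under illustrative assumptions: a hypothetical task_queue table, FOR UPDATE SKIP LOCKED so concurrent workers can claim rows safely, and an idle timeout so a cron-launched worker exits once the queue stays empty (requires psycopg 3).

    import time
    import psycopg  # assumption: psycopg 3 is installed

    DSN = "postgresql://localhost/appdb"  # illustrative connection string

    # Claim one pending task; SKIP LOCKED lets concurrent workers pull
    # without blocking on or double-claiming each other's rows.
    CLAIM_SQL = """
        UPDATE task_queue
           SET status = 'running'
         WHERE id = (SELECT id
                       FROM task_queue
                      WHERE status = 'pending'
                      ORDER BY id
                      LIMIT 1
                        FOR UPDATE SKIP LOCKED)
        RETURNING id, payload
    """

    def process(task_id: int, payload: str) -> None:
        # Placeholder for actually running the workflow step(s).
        print(f"processing task {task_id}: {payload}")

    def run_worker(idle_exit_after_s: float = 60.0) -> None:
        """Pull and run tasks until the queue has been empty for a while."""
        idle_since = None
        with psycopg.connect(DSN) as conn:
            while True:
                with conn.transaction():
                    row = conn.execute(CLAIM_SQL).fetchone()
                if row is None:
                    idle_since = idle_since or time.monotonic()
                    if time.monotonic() - idle_since > idle_exit_after_s:
                        return  # let the next cron-launched worker take over
                    time.sleep(1.0)
                    continue
                idle_since = None
                task_id, payload = row
                process(task_id, payload)
                with conn.transaction():
                    conn.execute(
                        "UPDATE task_queue SET status = 'done' WHERE id = %s",
                        (task_id,),
                    )

    if __name__ == "__main__":
        run_worker()

A real implementation would also need to reclaim tasks whose worker died mid-run (e.g. via a lease timeout), which is omitted here.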
1. Yes, currently versioning is either automatically derived from a bytecode hash or manually set. The intended upgrade mechanism is blue-green deployments where old workflows drain on old code versions while new workflows start on new code versions (you can also manually fork workflows to new code versions if they're compatible). Docs: https://docs.dbos.dev/java/tutorials/workflow-tutorial#workf...
That said, we're working on improvements here--we want it to be easier to upgrade your workflows without blue-green deployments to simplify operating really long-running (weeks-months) workflows.
2. Could I ask what the intended use case is? DBOS workflow creation is idempotent and workflows always recover from the last completed step, so a workflow that goes "database transaction" -> "enqueue child workflow" offers the same atomicity guarantees as a workflow that does "enqueue child workflow as part of database transaction", but the former is (as you say) much simpler to implement.
Here's an example of a common long-running workflow: SaaS trials. Upon trial start, create a workflow to send the customer onboarding messages, possibly inspecting the account state to influence what is sent, and finally close the account if it hasn't converted. This will usually take 14-30 days, but could go on for months if manually extended (as many organisations are very slow to move).
I think an "escape hatch" would be useful here. On version mismatch, invoke a special function that is given access to the workflow state and let it update the state to align it with the most recent code version.
> transactional enqueuing
A workflow that goes "database transaction" -> "enqueue child workflow" is not safe if the connection with DBOS is lost before the second step completes safely. This would make DBOS unreliable. Doing it the other way round can work provided each workflow checks that the "object" to which it's connected exists. That would deal with the situation where a workflow is created, but the transaction rolls back.
If both the app and DBOS work off the same database connection, you can offer exactly-once guarantees. Otherwise, the guarantees will depend on who commits first.
Personally, I would prefer all work to be done from the same connection so that I can have the benefit of transactions. To me, that's the main draw of DBOS :)
Yes, something like that is needed, we're working on building a good interface for it.
> transactional enqueueing
But it is safe as long as it's done inside a DBOS workflow. If the connection is lost (process crashes, etc.) after the transaction but before the child is enqueued, the workflow will recover from the last completed step (the transaction) and then proceed to enqueue the child. That's part of the power of workflows--they provide atomicity across transactions.
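A minimal sketch of that pattern, using hypothetical no-op @workflow/@step decorators as stand-ins for the durable-workflow machinery (the comments describe the recovery semantics being claimed; the stand-ins themselves don't implement them):

    from typing import Callable, TypeVar

    F = TypeVar("F", bound=Callable)

    # Hypothetical stand-ins: a real durable-workflow library records each
    # step's completion so a recovered workflow skips steps that already ran.
    def workflow(fn: F) -> F:
        return fn

    def step(fn: F) -> F:
        return fn

    @step
    def create_account_txn(user_id: str) -> None:
        # Step 1: the database transaction (e.g. INSERT the account row).
        print(f"created account for {user_id}")

    @step
    def enqueue_onboarding(user_id: str) -> None:
        # Step 2: durably enqueue the long-running onboarding child workflow.
        print(f"enqueued onboarding for {user_id}")

    @workflow
    def signup(user_id: str) -> None:
        create_account_txn(user_id)
        enqueue_onboarding(user_id)
        # If the process crashes between the two steps, recovery resumes the
        # workflow after its last completed step: the transaction is not
        # rerun, and the child workflow still gets enqueued.

    if __name__ == "__main__":
        signup("user-123")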
> > transactional enqueueing
> But it is safe as long as it's done inside a DBOS workflow.
Yes, but I was talking about the point at which a new workflow is created. If my transaction completes but DBOS disappears before the necessary workflows are created, I'll have a problem.
Taking my trial example, the onboarding workflow won't have been created and then perhaps the account will continue to run indefinitely, free of charge to the user.
Looking at the trial example, the way I would solve it is that when the user clicks "start trial", that starts a small synchronous workflow that first creates a database entry, then enqueues the onboarding workflow, then returns. DBOS workflows are fast enough to use interactively for critical tasks like this.
The main difference is that this is a library you can install and use in any application anywhere, while Durable Functions is (as I understand it) primarily for orchestrating serverless functions in Azure.
(disclaimer: I work at Microsoft, but am not directly involved with Durable Functions)
Being a library is a pretty interesting feature! Correct, Durable Functions allows you to write task-parallel orchestrations of task-parallel 'activities' (which are stateless functions), and these orchestrations are fully persistent and resilient, like DBOS executions. It also has the concept of 'Entities', which are named objects (of a type you define) that "live forever", and serialize all method invocations, which are the only way to change their private state. These are also persistent. The Netherite paper [1], section 2, describes this model well.
So, there seems to be a pretty close correspondence between DBOS steps and DF activities, and between workflows and orchestrations. I don't know what the correspondence to DF entities is in the DBOS model.
Yes, agree the correspondence is close, the primary difference is in form factor and not in underlying guarantees (but form factor matters! Building this as a library was technically tricky, but unlocks a lot of use cases). Reading the Orleans and early durable functions papers in grad school (and many of your papers) was definitely helpful in our journey.
Haha yes, one thing you can use this for is "long waits" or "long sleeps" where a program waits hours or days or weeks for a notification (potentially through server restarts, etc) then wakes up as soon as the notification arrives or a timeout is reached. More info in the docs: https://docs.dbos.dev/java/tutorials/workflow-communication
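A plain-asyncio analogue of the wait-for-notification-or-timeout semantics (not DBOS's API; the durable version checkpoints the wait so it survives process restarts, which this in-memory sketch does not):

    import asyncio

    async def wait_for_signal(notification: asyncio.Event, timeout_s: float) -> str:
        # Block until a notification arrives or the timeout elapses.
        try:
            await asyncio.wait_for(notification.wait(), timeout=timeout_s)
            return "notification received"
        except asyncio.TimeoutError:
            return "timed out"

    async def main() -> None:
        event = asyncio.Event()
        # In a durable workflow the timeout could be days or weeks; a short
        # value is used here so the example finishes quickly.
        waiter = asyncio.create_task(wait_for_signal(event, timeout_s=0.1))
        print(await waiter)

    asyncio.run(main())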