
Indeed, the data challenges when designing products are of a different nature: making sure data are collected (choice of sensors) and possibly also processed locally (Edge AI), as in the CERN case I mentioned, where Blue Yonder's detectors were used to filter the 600M events received by the detectors' CMOS arrays down to 100k.
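A minimal Python sketch of that edge-filtering idea, with a made-up event format and trigger threshold (the real selection logic in a detector trigger is far more involved):

    # Toy edge filter: only rare "interesting" events leave the node,
    # so ~100k out of ~600M ever reach central storage.
    # The event shape and threshold are illustrative, not CERN's.
    def edge_filter(events, threshold=0.999):
        for event in events:
            if event["score"] > threshold:
                yield event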


We are working on making WarpScript (http://www.warp10.io) able to fetch data from M3, so you can benefit from 850+ functions for analyzing your time series data.
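If you want to try it from Python, Warp 10 exposes an HTTP endpoint for executing WarpScript; a minimal sketch (host, token, and class name are placeholders -- check the Warp 10 docs for the exact endpoint and FETCH parameters):

    # Run a WarpScript FETCH through Warp 10's HTTP exec endpoint.
    # 'READ_TOKEN' and 'sensor.temperature' are placeholders.
    import requests

    script = "[ 'READ_TOKEN' 'sensor.temperature' {} NOW 1 h ] FETCH"
    resp = requests.post("http://localhost:8080/api/v0/exec", data=script)
    resp.raise_for_status()
    print(resp.json())  # the Geo Time Series returned by FETCH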


The problem is the definition of high cardinality. The Gorilla paper talks about 2B metrics at FB, but the hardware sizing page for InfluxDB shows a scary graph that basically kills any series cardinality above a million:

https://docs.influxdata.com/influxdb/v1.1/guides/hardware_si...

Quite a difference!
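For a feel of why a million is easy to hit: series cardinality is roughly the product of distinct values per tag, so with made-up but modest numbers:

    # Series cardinality = product of distinct tag values (made-up numbers).
    hosts, endpoints, customers = 50, 200, 500
    print(hosts * endpoints * customers)  # 5,000,000 series -- well past
                                          # the sizing guide's cliff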


InfluxDB is a single-server app with zero failover and zero sharding.

There are no "billions" of metrics going in there. The architecture (or lack thereof) is not intended for that.

You can try vertical scaling (if you're not in the cloud); you'll squeeze in more metrics until you overflow your network cards, your hard drives, or your CPU.


Performance usually varies greatly with your setup. I have not tried the 1.1 release, but earlier releases suffered very badly when authentication was turned on and when data arrived out of order. Nobody cares about the ideal case; it's never encountered.


I doubt you record all positions of a sensor updated several times per second in Postgres.

And in case you do store all your time series data in Postgres, WarpScript can still be used to retrieve data from it.
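For the record, the naive Postgres layout looks something like this sketch (table and column names are made up; at several inserts per second per sensor you'd want time partitioning and batched writes):

    # Naive sensor-position table in Postgres via psycopg2.
    # All names are illustrative.
    import psycopg2

    conn = psycopg2.connect("dbname=telemetry")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS sensor_position (
                sensor_id text        NOT NULL,
                ts        timestamptz NOT NULL,
                lat       double precision,
                lon       double precision,
                PRIMARY KEY (sensor_id, ts)
            )""")
        cur.execute(
            "INSERT INTO sensor_position VALUES (%s, now(), %s, %s)",
            ("sensor-42", 48.8566, 2.3522))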


Come on, stop your Hadoop hate speech and simply bring SpaceCurve onto the bandwagon against this patent.


What kind of project is that? Is it a side project or one that will give birth to a company? Requirements are rather different depending on the importance you give to your (or your customers') data.


It's for work. If it were a side project I would probably have just grabbed InfluxDB and run with it, since it looks the most fun, but since it's a core part of the whole system, the risk of project abandonment is a bit high.


TempoDB has renamed itself TempoIQ and no longer offers its storage service. I've heard some angry comments from customers who recently received an email telling them the storage service they were using was to be shut down at the end of October!


I work at TempoIQ, and we still offer our storage service. We've launched a new product (as TempoIQ) that is hosted in a private environment and offers storage, historical analysis, and real-time monitoring.

As for the customers on TempoDB, we are working with them to transition to TempoIQ if the switch makes sense, or offering to guide them in a transition to another time-series database like InfluxDB.


Does Treasure Data have a dedicated storage engine for time series? This kind of data has specific needs that are not met by general-purpose storage layers.


To an extent, yes. We wrote our time-partitioned columnar storage from scratch: it has row-based storage for more recent data and column-based storage for historical data, and the data is merged from row-based to column-based periodically for performance. We realized from day one that much of "big data" is log/timestamped data, so our query execution engines are optimized for time-windowed queries.
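A toy Python sketch of that hybrid layout (names and thresholds are mine, not Treasure Data's): recent writes land in a row buffer, and a periodic merge rewrites them column-wise for time-windowed scans:

    # Toy row-buffer + columnar store; illustrative only.
    class HybridStore:
        def __init__(self, flush_threshold=1000):
            self.rows = []                       # recent, row-oriented
            self.cols = {"ts": [], "value": []}  # historical, columnar
            self.flush_threshold = flush_threshold

        def append(self, ts, value):
            self.rows.append((ts, value))
            if len(self.rows) >= self.flush_threshold:
                self.merge()

        def merge(self):
            # Periodic row -> column rewrite, as described above.
            for ts, value in sorted(self.rows):
                self.cols["ts"].append(ts)
                self.cols["value"].append(value)
            self.rows.clear()

        def window(self, start, end):
            # Time-windowed query over both segments.
            hist = [(t, v) for t, v in zip(self.cols["ts"], self.cols["value"])
                    if start <= t < end]
            return hist + [(t, v) for t, v in self.rows if start <= t < end]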


When it comes to time series, reasoning in terms of byte size does not really make sense; it's better to state how many datapoints you need to handle and across how many distinct time series they are distributed.
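A back-of-the-envelope example of what I mean, with made-up numbers:

    # State the workload in datapoints and series, not bytes.
    series = 10_000                       # distinct time series
    points_per_day = 24 * 60 * 60 // 10   # one sample every 10 s
    print(series * points_per_day * 365)  # ~31.5B datapoints/year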


8-16ish datapoints per sample, and they'll be distributed more or less evenly during the day and then pretty much go dead at night. There may or may not be a value for every data point at every sample.


There's good news and bad news. The good news is that storing this much data isn't hard; plenty of people have done it, and many systems will scale enough.

The bad news is that picking a system means understanding access patterns -- reading, not writing. Do you only need to look within a single user? That's much easier. If you have to query across users, or do aggregations (I have no idea what your problem domain is, but if it's utility usage, things like average usage by zip or block; if it's wearables, activity by city, etc.), things get much harder.

How granular do you need to be able to query, and how far back? What is the SLA on a query: are results calculated in batch mode or on demand for a website? You often have to duplicate data in order to optimize one copy for throughput and the other for minimal random query time. Can you get away with logarithmic granularity for queries, i.e. every sample available for 1 month, every 3rd for the next month, every 10th for a couple of months after that, etc.? (See the sketch below.) What windowing functions do you need to run, and how frequently do they need to be updated? What is the ratio of writes to reads? If you have to access random data quickly, e.g. for a site, can you calculate everything > 1 day back in batch mode, cache those results, and add the last 24h of data at runtime? Etc. etc. etc.

You need to have some conversations with the data consumers.

Edit: and I've assumed these data are read-only; if you can update them, things get far more difficult.
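To make the logarithmic-granularity idea concrete, a sketch with made-up tier boundaries:

    # Age-based retention: every sample for a month, every 3rd for the
    # next month, every 10th beyond that. Tiers are illustrative.
    from datetime import datetime, timedelta

    def keep(sample_ts, index, now=None):
        age = (now or datetime.utcnow()) - sample_ts
        if age <= timedelta(days=30):
            return True                # full resolution
        if age <= timedelta(days=60):
            return index % 3 == 0      # every 3rd sample
        return index % 10 == 0         # every 10th sample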


There should be no updates, but there is a possibility that records can be added out of order. I've seen that this is a problem for some systems and not for others.


My guess would be that you want Cassandra, specifically to incur less overhead for empty values. I haven't built finance backtesting/monitoring infrastructure - which sounds exactly like what you're building - but in this case I think you'll get real value from triggers, even if they're only supported experimentally right now.
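A minimal schema sketch via the DataStax Python driver (keyspace and table names are placeholders): in Cassandra, absent values are simply never written, which is what keeps sparse samples cheap:

    # Sparse time series in Cassandra; all names are placeholders.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("metrics")
    session.execute("""
        CREATE TABLE IF NOT EXISTS ticks (
            series text,
            ts     timestamp,
            value  double,
            PRIMARY KEY (series, ts)
        ) WITH CLUSTERING ORDER BY (ts DESC)""")
    session.execute(
        "INSERT INTO ticks (series, ts, value) VALUES (%s, toTimestamp(now()), %s)",
        ("AAPL.bid", 184.25))  # missing fields just aren't inserted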


What will the sampling frequency be? How many samples per sampling interval?

