Instead of doing 4 comparisons against each of the characters `\n`, `\r`, `;` and `"` followed by 3 OR operations, a common trick is to do 1 shuffle, 1 comparison and 0 OR operations. I blogged about this trick: https://stoppels.ch/2022/11/30/io-is-no-longer-the-bottlenec... (Trick 2)
Edit: they do make use of ternary logic to avoid one OR operation, which is nice. Basically (a | b | c) | d is computed using `vpternlogd` and `vpor` respectively.
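For anyone who wants to see the shape of it, here's a minimal sketch of that table-lookup classification using .NET's x86 intrinsics (my own names, and 128-bit registers for brevity; Sep's real code is wider and more involved). The same table trick scales to 256/512-bit registers, since vpshufb shuffles within each 16-byte lane.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class DelimiterClassifier
{
    // Lookup table indexed by the low nibble of each input byte.
    // The slots for the low nibbles of '"' (0x22), '\n' (0x0A), ';' (0x3B) and '\r' (0x0D)
    // hold the full delimiter byte; every other slot holds 0xFF, which can never
    // equal an input byte that maps to it.
    static readonly Vector128<byte> Lut = Vector128.Create(
        (byte)0xFF, 0xFF, 0x22, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0x0A, 0x3B, 0xFF, 0x0D, 0xFF, 0xFF);

    // Returns a 16-bit mask with bit i set when input[i] is '\n', '\r', ';' or '"'.
    // Assumes Ssse3.IsSupported; guard accordingly in real code.
    static int Classify(Vector128<byte> input)
    {
        // PSHUFB indexes Lut by each input byte's low nibble (bytes >= 0x80 yield 0,
        // which also never matches), so one shuffle + one compare covers all four delimiters.
        Vector128<byte> looked = Ssse3.Shuffle(Lut, input);
        return Sse2.MoveMask(Sse2.CompareEqual(looked, input));
    }
}
```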
Take that, Intel and your "let's remove AVX-512 from every consumer CPU because we want to put slow cores on every single one of them and also not consider multi-pumping it"
A lot of this stems from the 10nm hole they had to dig themselves out from. Yields are bad, so costs are high, so let's cut the die as much as possible, ship Atom-derived cores and market it as an energy-saving measure. The expensive parts can be bigger and we'll cut the margins on those to retain the server/cloud sector. Also our earnings go into the shitter and we lose market share anyway, but at least we tried.
This issue is less about Intel's fab failures and more about their inability to decouple their architecture update cadence from their fab progress. They stopped iterating on their CPU designs while waiting for 10nm to get fixed. That left them with an oversized P core and an outdated E core, and all they could do for Alder Lake was slap them onto one die and ship it, with no ability to produce a well-matched pair of core designs in any reasonable time frame. We're still seeing weird consequences of their inability to port CPU designs between processes and fabs: this year's laptop processors have HyperThreading only in the lowest-cost parts—those that still have the CPU chiplet fabbed at Intel while the higher core count parts are made by TSMC.
They claim a 3GB/s improvement versus previous version of sep on equal hardware — and unlike “marketing” benchmarks, include the actual speed achieved and the hardware used.
Do note that this speed, even before the 3GB/s improvement, exceeds the bandwidth of most disks, so the bottleneck is loading the data into memory. I don't know of many applications where CSV is produced and consumed in memory, so I wonder what the use is.
Slower than network! In-memory processing of OLAP tables, streaming splitters, large data set division… but also the faster the parser, the less time you spend parsing and the more you spend doing actual work
This is honestly something that caught me off-guard a bit. If you have good internal network connectivity, small queries and your relational database has the data in memory, it can be faster to fetch data from the DB via the network than reading it from disk.
Like, sure, I can give you an application server with faster disks and more memory, and you or I are certainly capable of implementing an application server that could load the data from disk faster than all of that. And then we build caching to keep the hot data in memory, because that's faster.
But then we've spent very advanced development resources to build a relational database with some application code at the edge.
This can make sense in some high frequency trading situations, but in many more mundane web-backends, a chunky database and someone capable of optimizing stupid queries enable and simplify the work of a much bigger number of developers.
You can also get this with Infiniband, although it is less surprising, and basically what you’d expect to see.
I did once use a system where the network bandwidth was in the same ballpark as the memory bandwidth, which might not be surprising for some of the real HPC-heads here but it surprised me!
Perhaps, but I think we are well past the Moore's law era where a 3x speed-up is to be expected just from hardware. It's still a pretty impressive feat in the modern era.
> You can't claim this when you also do a huge hardware jump
Well, they did. Personally, I find it an interesting way of looking at it: it's a lens for the "real performance" one could get using this software year over year. (Not saying it isn't a misleading or fallacious claim, though.)
Agreed. How hard is it to keep hardware fixed, load the data into memory, and use a single core for your benchmarks? When I see a chart like that I think, "What else are they hiding?"
I mean... A single 9950x core is going to struggle to do more than 16 GB/second of direct mem copy bandwidth. So being within an order of magnitude of that seems reasonable
If we are lucky we will see Arthur Whitney get triggered and post either a one liner beating this or a shakti engine update and a one liner beating this. Progress!
I have. I think it's a pretty easy situation for certain kinds of startups to find themselves in:
- Someone decides on CSV because it's easy to produce and you don't have that much data. Plus it's easier for the <non-software people> to read so they quit asking you to give them Excel sheets. Here <non-software people> is anyone who has a legit need to see your data and knows Excel really well. It can range from business types to lab scientists.
- Your internal processes start to consume CSV because it's what you produce. You build out key pipelines where one or more steps consume CSV.
- Suddenly your data increases by 10x or 100x or more because something started working: you got some customers, your sensor throughput improved, the science part started working, etc.
Then it starts to make sense to optimize ingesting millions or billions of lines of CSV. It buys you time so you can start moving your internal processes (and maybe some other teams' stuff) to a format more suited for this kind of data.
It's become a very common interchange format, even internally; it's also easy to deflate. I have had to work on codebases where CSV was being pumped out at basically the speed of the NIC (its origin was Netflow, which was then aggregated and otherwise processed, with the results sent via CSV to a master for further aggregation and analysis).
I really don't get, though, why people can't just use protocol buffers instead. Is protobuf really that hard?
protobuf is more friction, and actually slow to write and read.
For better or worse, CSV is easy to produce via printf, and easy to read by breaking lines and splitting on the delimiter. Escaping delimiters that appear in the content is not hard, though it is often added as an afterthought.
Protobuf requires installing a library, understanding how it works, writing a schema file, and sharing that schema with others. The API is cumbersome.
Finally, to offer this mutable-struct-via-getters-and-setters abstraction, with variable-length-encoded numbers, variable-length strings, etc., the library ends up quite slow.
In my experience protobuf is slow and memory hungry. The generated code is also quite bloated, which is not helping.
> For better or worse, CSV is easy to produce via printf, and easy to read by breaking lines and splitting on the delimiter. Escaping delimiters that appear in the content is not hard, though it is often added as an afterthought.
Based on the amount of software I've seen that either produces broken CSV or can't parse (more-or-less) valid CSV, I don't think that is true.
It seems easy, because it's just printf("%s,%d,%d\n", ...), but it is full of edge cases most programmers don't think about.
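To make those edge cases concrete, here is a small C# sketch (my own helper names, not from any particular library) of the naive join versus RFC 4180-style quoting:

```csharp
using System;

static class CsvSketch
{
    // Naive: breaks as soon as a field contains a comma, quote or newline.
    static string NaiveRow(string[] fields) => string.Join(",", fields);

    // RFC 4180-style: quote a field when needed and double any embedded quotes.
    static string QuotedRow(string[] fields) =>
        string.Join(",", Array.ConvertAll(fields, f =>
            f.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0
                ? "\"" + f.Replace("\"", "\"\"") + "\""
                : f));

    static void Main()
    {
        var fields = new[] { "ACME, Inc.", "said \"hi\"", "42" };
        Console.WriteLine(NaiveRow(fields));  // ACME, Inc.,said "hi",42   -> a naive splitter sees 4 columns
        Console.WriteLine(QuotedRow(fields)); // "ACME, Inc.","said ""hi""",42 -> 3 columns
    }
}
```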
Not an issue when you control both ends of the pipe. CSV is a great interchange format for tabular data, especially so if it's only/mostly numeric. If you need to pass tabular data from internal service X to internal service Y it's great. And it's really fast.
Mostly because dependencies are hard, and doubly so when you need another team, using a different language, to also support the same format.
I’d love to pass parquet data around, or SQLite dbs, or something else, but that requires dedicated support from other teams upstream/downstream.
Everyone and everything supports CSV, and when they don’t they can hack a simple parser quickly. I know that getting a CSV parser right for all the edge cases is very hard, but they don’t need to. They just need to support the features we use. That’s simple and quick and everyone quickly moves on to the actual work of processing the data.
Yeah there's no format to support that way. Maybe I'm more biased towards numeric data (sensor readings, etc), but I never have to worry about libraries and dependencies to say
data = (uint32_t *)read(f);
Or
data = struct.unpack...
Sounds like you're dealing with more heavily formatted or variably formatted data that benefits from more structure to it
Thanks to everyone above for some great responses. Cap'n Proto seems to do exactly what you're describing (the in-memory representation is identical to what's on the wire, and then getter/setter methods are generated which look at that).
Yep, we use it a lot for internal stuff and I can't recall the last time we had an issue with parsing or using it. It just works for us as a data interchange file format for tabular data. Of course, our character set is basically just ASCII letters and numbers; we don't even need commas or quotation marks.
Extremely hard to tell an HR person, "Right-click on here in your Workday/Zendesk/Salesforce/etc UI and export a protobuf". Most of these folks in the business world LIVE in Excel/Spreadsheet land so a CSV feels very native. We can agree all day long that for actual data TRANSFER, CSV is riddled with edge cases. But it's what the customers are using.
Kind of; there isn't a 1:1 mapping of protobuf wire types to schema types, so you need to either package the protobuf schema with the data and compile it to parse the data, or decide on the schema beforehand. So now you need to decide on a file format to bundle the schema and the data.
I'm not the biggest fan of Protobuf, mostly around the 'perhaps-too-minimal' typing of the system and the performance differentials present on certain languages in the library.
E.g., I know that in the .NET space MessagePack is usually faster than proto, and I think similar is true for the JVM. The main disadvantage is that there's no good schema-based tooling around it.
I shudder to think of what it means to be storing the _results_ of processing 21 GB/s of CSV. Hopefully some useful kind of aggregation, but if this was powering some kind of search over structured data then it has to be stored somewhere...
Just because you’re processing 21GB/s of CSV doesn’t mean you need all of it.
If your data is coming from a source you don’t own, it’s likely to include data you don’t need. Maybe there’s 30 columns and you only need 3 - or 200 columns and you only need 1.
Erm, maybe file-based? JSON is the king if you count exchanges worldwide per second. Maybe No. 2 is form-data, which is basically email multipart, and of course there's email as a format. Very common =)
JSON tabular data only adds a couple of brackets per line and at the start/end of the file vs CSV. In exchange for these bits (that basically disappear when compressed), you get a guaranteed standard formatting. Seems like a decent tradeoff to me.
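For example (made-up rows), the same table as CSV and as JSON Lines with one array per row:

```
id,name,price
1,widget,2.50
2,gadget,13.00

["id","name","price"]
[1,"widget",2.50]
[2,"gadget",13.00]
```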
Humans generate decisions / text information at rates of ~bytes per second at most. There are barely enough humans around to generate 21GB/s of information even if all they did was make financial decisions!
So 21 GB/s would be solely algos talking to algos... Given all the investment in the algos, surely they don't need to be exchanging CSV around?
The only real example I can think of is the US options market feed. It is up to something like 50 GiB/s now, and is open 6.5 hours per day. Even a small subset of the feed that someone may be working on for data analysis could be huge. I agree CSV shouldn't even be used here but I am sure it is.
CSV is a questionable choice for a dataset that size. It's not very efficient in terms of size (real numbers take more bytes to store as text than as binary), it's not the fastest to parse (due to escaping) and a single delimiter or escape out of place corrupts everything afterwards. That not to mention all the issues around encoding, different delimiters etc.
It's great for when people need to be in the loop, looking at the data, maybe loading it in Excel, etc. (I use it myself...). But there are not enough humans around for 21 GB/s.
> (real numbers take more bytes to store as text than as binary)
Depends on the distribution of numbers in the dataset. It's quite common to have small numbers, and for those, text is a more efficient representation than binary, especially compared to 64-bit or larger binary encodings.
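A quick worked comparison, assuming the binary alternative is a fixed 64-bit encoding (which seems to be the comparison being made here):

```csharp
using System;
using System.Text;

static class TextVsBinarySize
{
    static void Main()
    {
        // Small integer: text wins against a fixed 64-bit binary field.
        long small = 7;
        Console.WriteLine(
            $"'{small}' is {Encoding.UTF8.GetByteCount(small.ToString())} byte as text (+1 delimiter) vs {sizeof(long)} bytes as int64");

        // Full-precision double: binary wins.
        double pi = Math.PI;
        string piText = pi.ToString("R"); // "3.141592653589793"
        Console.WriteLine(
            $"'{piText}' is {Encoding.UTF8.GetByteCount(piText)} bytes as text vs {sizeof(double)} bytes as float64");
    }
}
```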
Standards (whether official or de facto) often aren't the best in isolation, but they're the best in reality because they're widely used.
Imagine you want to replace CSV for this purpose. From a purely technical view, this makes total sense. So you investigate, come up with a better standard, make sure it has all the capabilities everyone needs from the existing stuff, write a reference implementation, and go off to get it adopted.
First place you talk to asks you two questions: "Which of my partner institutions accept this?" "What are the practical benefits of switching to this?"
Your answer to the first is going to be "none of them" and the answer to the second is going to be vague hand-wavey stuff around maintainability and making programmers happier, with maybe a little bit of "this properly handles it when your clients' names have accent marks."
Next place asks the same questions, and since the first place wasn't interested, you have the same answers....
Replacing existing standards that are Good Enough is really, really hard.
> Humans generate decisions / text information at rates of ~bytes per second at most
Yes, but the consequences of these decisions are worth much more. You attach an ID to the user, and an ID to the transaction. You store the location and time where it was made. Etc.
Why are you theorizing? I can tell you from out there that it's used massively, and it's not going away; on the contrary. Even rather small banks can end up generating various reports etc. which can easily become huge.
The speed of human decision-making plays basically no role here, just as it doesn't with messaging generally; there is far more to companies than a direct keyboard-to-output link.
There’s a calculation for ns/row in the article that is never translated into rows per second, but at about 27 ns/row it works out to roughly 37 million rows per second. Which means these rows are about 570 bytes apiece if that’s 21 GB/s. Which seems like an awfully cooked benchmark.
So ~570 bytes per line. Still seems a bit contrived. I’d expect a SIMD version to still work line by line, but I don’t know that I would try to shove that much into a line if I wanted to read it really fucking fast.
In my experience I've found it difficult to get substantial gains with custom SIMD code compared to modern compiler auto-vectorization, but to be fair that was with more vector-friendly code than JSON parsing.
Sounds interesting, I'll give it a look. I'm unfortunately limited to CSV, XML, or XLS from the source system, then am transforming it and loading it into another DB.
Considering the non-standard nature of CSV, quoting throughput numbers in bytes is meaningless. It makes sense for JSON, since you know what the output is going to be (e.g. floats, integers, strings, hashmaps, etc).
With CSV you only get strings for each column, so 21 GB/s of comma splitting would be the pinnacle of meaninglessness. Like, okay, but I still have to parse the stringy data, so what gives? Yeah, the blog post does reference float parsing, but a single float per line would count as "CSV".
Now someone might counter and say that I should just read the README.MD, but then that suspicion simply turns out to be true: They don't actually do any escaping or quoting by default, making the quoted numbers an example of heavily misleading advertising.
CSV is standardized in RFC 4180 (well, as standardized as most of what we consider an internet "standard").
Otherwise, agreed: if you don't do escaping (a.k.a. "quoting", the same thing for CSV), you are not implementing it correctly. For example, if a quoted field contains a line break, per RFC 4180 that line break is part of the quoted string; if you don't need to handle that, you can implement CSV parsing much faster (properly handling line breaks inside quoted strings requires a 2-pass approach if you are going to use many cores, while not handling it at all can be done in 1 pass). I discussed this detail in https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...
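A tiny sketch of why that is: with quoted line breaks allowed, you cannot even find record boundaries by splitting on newlines without tracking quote state, which is what breaks naive chunking for parallel parsing.

```csharp
using System;

static class QuotedNewline
{
    static void Main()
    {
        // Two logical records per RFC 4180, because the second line break is inside quotes...
        string csv = "id,comment\n1,\"first line\nsecond line\"\n";

        // ...but a naive split on '\n' sees three "rows".
        string[] naiveRows = csv.TrimEnd('\n').Split('\n');
        Console.WriteLine(naiveRows.Length); // 3
    }
}
```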
Side note: RFCs are great standards, as they are readable.
As an example of how not to do it:
XML can be considered a standard, but I cannot afford to read it. DIN/ISO is great for manufacturing in theory, but bad for fields like IT where the initial investment is supposed to be zero-cost.
I have been privileged in my career to never need to parse Excel output but occasionally feed it input. Especially before Grafana was a household name.
Putting something out so manager stops asking you 20 questions about the data is a double edged sword though. Those people can hallucinate more than a pre-Covid AI engine. Grafana is just weird enough that people would rather consume a chart than try to make one, then you have some control over the acid trip.
It feels crazy to me that Intel spent years dedicating die space on consumer SKUs to "make fetch happen" with AVX-512, and now that more and more libraries are finally using it, now that Intel's goal has been achieved, they have removed AVX-512 from their consumer SKUs.
It isn't that AMD has better AVX-512 support, which would be an impressive upset on its own. Instead, it is simply that AMD has AVX-512 on consumer CPUs at all, because Intel walked away from their own investment.
That is what Intel does: they build up a market (Optane) and then do a rug pull (depth cameras). They continue to do this thing where they do a huge push into a new technology, then don't see the uptake and let it die, instead of building slowly and then, at the right time, doing a big push. Optane support was just getting mature in the Linux kernel when they pulled it. And they focused on some weird cost-cutting move when marketing it as a RAM replacement for semi-idle VMs, ok.
They keep repeating the same mistakes all the way back to https://en.wikipedia.org/wiki/Intel_iAPX_432
The rug pull on Optane was incredibly frustrating. Intel developed a technology which made really meaningful improvements to workloads in an industry that is full of sticky late adopters (RDBMSes). They kept investing right up until they had unequivocally made their point and the late adopters were just about getting it... and then killed it!
It's hard to understand how they could have played that particular hand more badly. Even a few years on, I'm missing Optane drives because there is still no functional alternative. If they just held out a bit longer, they would have created a set of enterprise customers who would still be buying the things in 2040.
It’s especially hard to understand because so much of their management had a degree which conferred a mastery of business administration. I mean, it’s almost like you could take any tenured software engineer at the company and they would have been better positioned to manage the company effectively. That’s very surprising, and might suggest that people with MBAs are total idiots who understand everything through GRE-friendly analogy rather than, well, actually understanding anything.
They are a weird company. Their marketing people showed up and invested a significant amount into a buy of Optane gear with our OEM a few months before they killed the product. They pulled the rug out from under themselves in addition to their customers.
Optane was incredible. It's insane that Intel dropped this.
Executives. The ones everyone on here claims fairly earn their multi-million-dollar salaries.
One could see the death of Optane coming from a mile away. It was only kept afloat by Intel, and its main issue was that, while it was really cool tech, it was a solution looking for a problem.
You need scratch space that's resilient to a power outage? An NVDIMM is faster and cheaper. You need fast storage? Flash keeps getting faster and cheaper. Optane was squeezed from both sides and could never hope to generate the volume needed to cut costs.
So now imagine that you are at Intel deciding what initiatives to fund. The company is in trouble and needs to show some movement out of the red, preferably quickly. It also lost momentum and lost ground to competitors, so it needs to focus. What do you do? You kill all the side projects that will never make much money. And of course you kill a lot of innovation in the process, but how would you justify the alternative?
Aren’t NVMe disks basically the same value as Optane? The comments saying Optane was amazing don’t make sense if NVMe is basically as good and there are other NVMe disk manufacturers.
They are not really the same.
First, NVMe is a protocol to access block storage. It can be used to access any kind of block device, Optane, SSD, NVDIMM, virtual storage on EC2, etc. So it's true that the protocol is the same (well, not quite - more on this in a bit), but that's like saying a server is the same as an iPhone because they can both speak TCP/IP.
What was the "more in a bit" bit? Persistent memory (PMEM) devices like NVDIMMs and Optane can usually speak two protocols. They can either act as storage, or as memory expansion. But this memory also happens to be non-volatile.
This was sold as a revolution, but it turned out that it's not easy for current operating systems and applications to deal with memory with vastly different latencies. Also it turns out that software is buggy, and being able to lose state by rebooting is useful. And so Optane in memory mode never really caught on, and these devices were mostly used as a storage tier. However: look up MemVerge.
So you are right that it turned out to be a faster SSD, but the original promise was a lot more. And here comes the big problem: because Optane was envisioned as a separate kind of product between RAM and SSD, the big price differential could be justified. If it's just a faster SSD - well, the market has spoken.
I made the mistake early in our startup of spending several months and quite a bit of cash building our first IoT product on the Intel Edison platform, only to get zero support on the bugs in the SPI chip and the non-existent (but advertised) microcontroller. We finally gave up and made our own boards based on another SOM (and eventually stopped building boards entirely), and they rather unceremoniously cancelled the Edison in 2017. I guess nobody else was surprised, but I had naively thought the platform had potential and that a huge company like Intel would support the things they sold.
> this thing where they do a huge push into a new technology, then don't see the uptake and let it die.
Do we need a second "killed by google"?
To companies like Intel or Google anything below a few hundred million users is a failure. Had these projects been in a smaller company, or been spun out, they'd still be successful and would've created a whole new market.
Maybe I'm biased — a significant part of my career has been working for German Mittelstand "Hidden Champions" — but I believe you don't need a billion customers to change the world.
Intel's 5G radio department was formed in 2011 by buying another firm, and then it was bought by Apple in 2019. Apple announced a 5G modem this year (the C1). It took 14 years to get a viable 5G wireless modem, and it still doesn't have feature parity with the cellular modems in Apple's other iPhones. So this happens pretty often with Intel.
To this day, I miss Optane — I work for a timeseries database company focused on finance, and the number of use cases I have that scream “faster than NVMe, slower than RAM” is insane. And these companies have money to throw at these problems.
Which begs the question, why isn’t anyone else stepping into this gap? Is the technology heavily patented?
Yes, and Intel got caught skirting them.
Indeed. Octane/3dxpoint was mind blowing futuristic stuff but it was just gone after 5 years? On the market? Talk about short sighted.
They got caught is what happened.
Caught doing what? Can you provide some context or links to search?
When Energy Conversion Devices went bankrupt, it appears Intel pirated the technology, and never bothered to pay the royalties for the PCM memory in Optane.
Case No. 12-43166 is what killed Optane.
Or, in a manner of speaking, Intel being Intel killed Optane.
The legal risks were at most the last straw. If Optane had a promising future, Intel could have made the investments necessary to make the legal issues go away. If Optane had a promising future, Micron would have helped Intel secure that future. The long-term value of a persistent memory technology capable of taking a big chunk out of both the DRAM and NAND flash markets is huge.
Optane did not have a promising future. The $/GB gap between 3D XPoint memory and 3D NAND flash memory was only going to keep growing. Optane was doomed to only be appealing to the niche of workloads where flash memory is too slow and DRAM is too expensive. But even DRAM was increasing in density faster than 3D XPoint, and flash (especially the latency-optimized variants that are still cheaper than 3D XPoint) is fast enough for a lot of workloads. Optane needed a breakthrough improvement to secure a permanent place in the memory hierarchy, and Intel couldn't come up with one.
> They continue to do this thing where they do a huge push into a new technology, then don't see the uptake and let it die.
Except Intel deliberately made AVX-512 a feature exclusively available on Xeon and other enterprise processors in subsequent generations. This backward step artificially limits its availability, forcing enterprises to invest in more expensive hardware.
I wonder if Intel has taken a similar approach with Arc GPUs, which lack support for GPU virtualization (SR-IOV). They did somewhat add vGPU support for the integrated GPUs in 12th-14th Gen chips through the i915 driver on Linux. It’s a pleasure to have graphics acceleration in multiple VMs simultaneously, through the same GPU.
They go out of their way to segment their markets: ECC, AVX, Optane support (only on specific server-class SKUs). I hate it. I hate it as a home PC user, I hate it as an enterprise customer, and I hate it as a shareholder.
Every company does this. If your grandma only uses a web browser, a word processor, and Excel, does she really want to spend an additional $50 on a feature she won't use? Same with NPUs. Different consumers want different features at different prices.
Except it hinders adoption, because not having a feature in entry-level products will mean less incentive (and ability) for software developers to use it. Compatibility is so valuable it makes everyone converge on the least common denominator, so when you price-gouge on a software-exposed feature, you might as well bury this feature altogether.
Three fallacies and you are OUT!
Well, Itanium might be a counterexample; they probably tried to make that work for far too long...
Itanium was more of an HP product than an Intel one.
Itanium worked as intended.
In so far as it killed HP PA-RISC, SGI MIPS, and DEC Alpha, and seriously hurt the chances of SPARC and POWER being adopted outside of their respective parents (did I miss any)?
Thing is, they could have killed it by 1998, without ever releasing anything, and it would still have killed the other architectures it was trying to compete with. Instead, they waited until 2020 to end support.
What the VLIW of Itanium needed and never really got was proper compiler support. Nvidia has this in spades with CUDA. It's easy to port to Nvidia where you do get serious speedups. AVX-512 never offered enough of a speedup from what I could tell, even though it was well supported by at least ICC (and numpy/scipy when properly compiled)
> What the VLIW of Itanium needed and never really got was proper compiler support.
This is kinda under-selling it. The fundamental problem with statically-scheduled VLIW machines like Itanium is it puts all of the complexity in the compiler. Unfortunately it turns out it's just really hard to make a good static scheduler!
In contrast, dynamically-scheduled out-of-order superscalar machines work great but put all the complexity in silicon. The transistor overhead was expensive back in the day, so statically-scheduled VLIWs seemed like a good idea.
What happened was that static scheduling stayed really hard while the transistor overhead for dynamic scheduling became irrelevantly cheap. "Throw more hardware at it" won handily over "Make better software".
No, VLIW is even worse than this. Describing it as a compiler problem undersells the issue. VLIW is not tractable for a multitasking / multi tenant system due to cache residency issues. The compiler cannot efficiently schedule instructions without knowing what is in cache. But, it can’t know what’s going to be in cache if it doesn’t know what’s occupying the adjacent task time slices. Add virtualization and it’s a disaster.
It only works for fixed workloads, like accelerators, with no dynamic sharing.
Yeah, VLIW is still used for stuff like DSP and GPUs, but it doesn't make sense for general computing.
GPUs have long since moved away from VLIW as well
> What happened was that static scheduling stayed really hard while the transistor overhead for dynamic scheduling became irrelevantly cheap
Is the latter part true? AFAIK most of modern CPU die area and power consumption goes towards overhead as opposed to the actual ALU operations.
If it's pure TFLOPs you're after, you do want a more or less statically scheduled GPU. But for CPU workloads, even the low-power efficiency cores in phones these days are out of order, and the size of reorder buffers in high-performance CPU cores keeps growing. If you try to run a CPU workload on GPU-like hardware, you'll just get pitifully low utilization.
So it's clearly true that the transistor overhead of dynamic scheduling is cheap compared to the (as-yet unsurmounted) cost of doing static scheduling for software that doesn't lend itself to that approach. But it's probably also true that dynamic scheduling is expensive compared to ALUs, or else we'd see more GPU-like architectures using dynamic scheduling to broaden the range of workloads they can run with competitive performance. Instead, it appears the most successful GPU company largely just keeps throwing ALUs at the problem.
I think OP meant "transistor count overhead" and that's true. There are bazillions of transistors available now. It does take a lot of power, and returns are diminishing, but there are still returns, even more so than just increasing core count. Overall what matters is performance per watt, and that's still going up.
"they could have killed it by 1998, without ever releasing anything"
perhaps Intel really wanted it to work and killing other architectures was only a side effect?
> In so far as it killed HP PA-RISC, SGI MIPS, and DEC Alpha, and seriously hurt the chances of SPARC and POWER being adopted outside of their respective parents (did I miss any)?
I would argue that it was bound to happen one way or another eventually, and Itanium just happened to be a catalyst for the extinction of nearly all alternatives.
High to very high performance CPU manufacturing (NB: the emphasis is on the manufacturing) is a very expensive business, and back in the 1990s no one was able (or willing) to invest in the manufacturing and commit to the continuous investment required to keep CPU manufacturing facilities up to date. For HP, SGI, Digital Equipment, Sun, and IBM, a high-performance RISC CPU was the single most significant enabler, yet not their core business. It was a truly odd situation where they all had a critical dependency on CPUs, yet none of them could manufacture them themselves and all were reliant on a third party[0].
Even Motorola, which was in some very serious semiconductor business, could not meet market demand[1].
Look at how much it costs Apple to get what they want out of TSMC – it is tens of billions of dollars almost yearly, if not yearly. We can see very well today how expensive it is to manufacture a bleeding-edge, high-performing CPU – look no further than Samsung, GlobalFoundries, the beloved Intel, and many others. Remember the days when Texas Instruments used to make CPUs? Nope, they don't make them anymore.
[0] Yes, HP and IBM used to produce their own CPUs in-house for a while, but then that ceased as well.
[1] The actual reason why Motorola could not meet market demand was, of course, an entirely different one – the company management did not consider CPUs to be their core business, as they primarily focused on other semiconductor products and on defence, which left CPU production in an underinvested state. Motorola could have become a TSMC if they could have seen the future through a silicon dust shroud.
Bad habits are hard to break!
Optane was cancelled because the manufacturer sold the fab.
Oh? Complete coincidence they got caught not paying ECDL royalties?
?
wdym
When Energy Conversion Devices went bankrupt, it appears Intel pirated the technology, and never bothered to pay the royalties for the PCM memory in Optane.
Case No. 12-43166 is what finally killed Optane.
Being right at the wrong time is the same as being wrong.
I am very disappointed about Optane drives. Perfect case for superfast vertically scalable database. I was going to build a solution based on this but suddenly it is gone for all practical intents and purposes.
In this article, they saw the following speeds:
Original: 18 GB/s
AVX2: 20 GB/s
AVX512: 21 GB/s
This is an AMD CPU, but it's clear that the AVX512 benefits are marginal over the AVX2 version. Note that Intel's consumer chips do support AVX2, even on the E-cores.
But there's more to the story: This is a single-threaded benchmark. Intel gave up AVX512 to free up die space for more cores. Intel's top of the line consumer part has 24 cores as a result, whereas AMD's top consumer part has 16. We'd have to look at actual Intel benchmarks to see, but if the AVX2 to AVX512 improvements are marginal, a multithreaded AVX2 version across more cores would likely outperform a multithreaded AVX512 version across fewer cores. Note that Intel's E-cores run AVX2 instructions slower than the P-cores, but again the AVX boost is marginal in this benchmark anyway.
I know people like to get angry at Intel for taking a feature away, but the real-world benefit of having AVX512 instead of only AVX2 is very minimal. In most cases, it's probably offset by having extra cores working on the problem. There are very specific workloads, often single-threaded, that benefit from AVX-512, but on a blended mix of applications and benchmarks I suspect Intel made an informed decision to do what they did.
> We'd have to look at actual Intel benchmarks to see, but if the AVX2 to AVX512 improvements are marginal, a multithreaded AVX2 version across more cores would likely outperform a multithreaded AVX512 version across fewer cores.
Look at any existing heavily multithreaded benchmark like Blender rendering. The E-cores are so weak that it just about takes 2 of them to match the performance of an AMD core. If the only difference was AVX512 support then yeah, 24 AVX2 cores would beat 16 AVX-512 cores. But that's not the only difference, not even close.
That's not to say a 24 core Core 9 Ultra Whatever would be slower than a 16 core 9950X in this workload. Just that the E-cores are kinda shit, especially in the wonky counts Intel is using (too many to just be about power efficiency, too few to really offset how slow they are)
> The E-cores are so weak that it just about takes 2 of them to match the performance of an AMD core.
That's not "weak". If you look at available die-shot analyses, the E-cores are tiny compared to the P-cores, they take up a lot less than half in area and even less in power. P-cores are really only useful for the rare pure single-threaded workload, but E-cores will win otherwise.
We're not comparing to Intel's P cores but to AMD's cores. 8 of AMD's cores fit in 70.6 mm² on a high-performance process, and take up a fraction of that space on a high-density process (see the 192-core Zen 5c chips).
AVX2 vs AVX512 in this case may be somewhat misleading. In .NET, even if you use 256-bit-wide vectors, it will still take advantage of AVX512VL whenever available to fuse chained operations into masked ops, vpternlogd's, etc.[0] (plus standard operations like stack zeroing, struct copying, string comparison, element search, and others can use the full width)[1]
So to force true AVX2 the benchmark would have to be run with `DOTNET_EnableAVX512F=0`, which I assume is not the case here (a quick way to check is sketched below).
[0]: https://devblogs.microsoft.com/dotnet/performance-improvemen...
[1]: https://devblogs.microsoft.com/dotnet/performance-improvemen...
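For anyone who wants to check what their runtime will actually use, here's a minimal sketch (mine); run it once normally and once with the `DOTNET_EnableAVX512F=0` switch mentioned above (exact behavior can vary by runtime version):

```csharp
// Minimal sketch: report which x86 vector ISAs the current .NET runtime sees.
// Compare the output with and without DOTNET_EnableAVX512F=0.
using System;
using System.Runtime.Intrinsics.X86;

class IsaCheck
{
    static void Main()
    {
        Console.WriteLine($"AVX2:      {Avx2.IsSupported}");
        Console.WriteLine($"AVX-512F:  {Avx512F.IsSupported}");
        Console.WriteLine($"AVX-512VL: {Avx512F.VL.IsSupported}");
    }
}
```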
[dead]
Isn't AVX-10 on the horizon, which will have most of the goodies that AVX-512 had? (I'm actually not even sure what the difference is supposed to be between them.)
AVX-10 originally had an AVX-10/256 variant that was essentially AVX-512 without the 512-bit registers, but that was dropped recently. So now AVX-10 is just a bundle of most of the AVX-512 extensions, and the stated goal is that each future version is guaranteed to be a superset of the previous one (as opposed to AVX-512's many independent extensions).
AVX-10 was mostly just a way for Intel to provide an excuse for why they're still a few years out from having AVX-512 in their E cores: they're targeting a standard that's not here yet. But the excuse doesn't really work now that AVX-10 requires doing a full AVX-512 implementation. We're back to Intel just dragging their heels on implementing the AVX-512 support that they were obviously going to need all along.
I mean, the most interesting part of the article for me:
> A bit surprisingly the AVX2 parser on 9950X hit ~20GB/s! That is, it was better than the AVX-512 based parser by ~10%, which is pretty significant for Sep.
They fixed it, that's the whole point, but I think there's evidence that AVX-512 doesn't actually benefit consumers that much. I would be willing to settle for a laptop that can only parse 20GB/s and not 21GB/s of CSV. I think vector assembly nerds care about support much more than users.
That probably just means it's a memory bandwidth bound problem. It's going to be a different story for tasks that require more computation.
You can still saturate an ultrawide vector unit with narrower instructions if you have wide enough dispatch
AVX512 is not just about width. It ships with a lot of very useful instructions available for narrower vectors with AVX512VL. It also improves throughput per instruction. You're usually not hand-writing intrinsified code, yet compilers, especially JIT ones, can make use of it for all sorts of common operations that become several times faster. In .NET, having AVX512 will speed up linear search, memory copying, and string comparison, which are straightforward, but it will also affect its Regex performance, which uses SearchValues<T>, which under the hood is able to perform complex shuffles and vector lookups on larger vectors with much better throughput. AVX512 lends itself to more compact codegen (although .NET is not perfect in that regard; I think it sometimes regresses vs AVX2 with its instruction choices, but it's a matter of iterative improvement).
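To make that concrete, here's a tiny sketch (mine, not Sep's internals) of the kind of API where this kicks in without any user-written intrinsics; SearchValues<T> in .NET 8+ picks the widest vector path the CPU offers, AVX-512 included:

```csharp
// Minimal sketch: multi-delimiter search via SearchValues<T> (.NET 8+).
// The runtime vectorizes IndexOfAny with the best ISA available.
using System;
using System.Buffers;

class DelimiterScan
{
    // Illustrative delimiter set, roughly what a CSV scanner cares about.
    private static readonly SearchValues<char> Delims = SearchValues.Create("\n\r;\"");

    static void Main()
    {
        ReadOnlySpan<char> line = "a;b;\"c;d\"";
        Console.WriteLine(line.IndexOfAny(Delims)); // 1, the first ';'
    }
}
```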
It’s wild seeing how stupid Intel is being.
If it's any consolation, Sep will happily use AVX-512 whenever available, without having to opt into that explicitly, including the server parts, as it will most likely run under a JIT runtime (although it's NAOT-compatible). So you're not missing out by being forced to target the lowest common denominator.
Intel is horrible with software. My laptop has a pretty good iGPU, but it's not properly supported by PyTorch or most other software. Vulkan inference with llama.cpp does wonders, and it makes me sad that most software other than llama.cpp does not take advantage of it.
Sounds like something to try. Do I just need to compile Vulkan support to use the igpu?
Instead of doing 4 comparisons, one against each of the characters `\n`, `\r`, `;` and `"`, followed by 3 OR operations, a common trick is to do 1 shuffle, 1 comparison and 0 OR operations. I blogged about this trick: https://stoppels.ch/2022/11/30/io-is-no-longer-the-bottlenec... (Trick 2)
Edit: they do make use of ternary logic to avoid one OR operation, which is nice. Basically (a | b | c) | d is computed using `vpternlogd` and `vpor` respectively.
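Here's a rough C# sketch of that trick (my own illustration, not the Sep code; assumes SSSE3 and ASCII input): build a 16-byte table indexed by the low nibble of each byte, then one shuffle plus one compare tells you whether each byte is one of the four target characters.

```csharp
// Sketch of the shuffle trick: table[b & 0xF] == b  <=>  b is '\n', '\r', ';' or '"'.
// Works because those four characters have distinct low nibbles (A, D, B, 2).
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class ShuffleMatch
{
    static void Main()
    {
        ReadOnlySpan<byte> tbl = new byte[16]
        {
            0, 0, (byte)'"', 0,  0, 0, 0, 0,
            0, 0, (byte)'\n', (byte)';',  0, (byte)'\r', 0, 0
        };
        Vector128<byte> table = Vector128.Create(tbl);
        Vector128<byte> input = Vector128.Create("ab;cd\"ef\nghijkl\r"u8);

        // One pshufb + one pcmpeqb instead of four compares and three ORs.
        Vector128<byte> matches = Sse2.CompareEqual(Ssse3.Shuffle(table, input), input);
        Console.WriteLine(matches); // all-ones lanes mark the delimiters
    }
}
```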
really cool thanks
Take that, Intel and your "let's remove AVX-512 from every consumer CPU because we want to put slow cores on every single one of them and also not consider multi-pumping it"
A lot of this stems from the 10nm hole they had to dig themselves out from. Yields are bad, so costs are high, so let's cut the die as much as possible, ship Atom-derived cores and market it as an energy-saving measure. The expensive parts can be bigger and we'll cut the margins on those to retain the server/cloud sector. Also our earnings go into the shitter and we lose market share anyway, but at least we tried.
This issue is less about Intel's fab failures and more about their inability to decouple their architecture update cadence from their fab progress. They stopped iterating on their CPU designs while waiting for 10nm to get fixed. That left them with an oversized P core and an outdated E core, and all they could do for Alder Lake was slap them onto one die and ship it, with no ability to produce a well-matched pair of core designs in any reasonable time frame. We're still seeing weird consequences of their inability to port CPU designs between processes and fabs: this year's laptop processors have HyperThreading only in the lowest-cost parts—those that still have the CPU chiplet fabbed at Intel while the higher core count parts are made by TSMC.
This is a staggering ~3x improvement in just under 2 years since Sep was introduced in June 2023.
You can't claim this when you also do a huge hardware jump
They also included 0.9.0 vs 0.10.0 on the new hardware (21385 vs 18203), so the jump due to software alone is ~17%.
Then if we take 0.9.0 on previous hardware (13088) and add the 17%, it's 15375. Version 0.1.0 was 7335.
So... 15375/7335 -> a staggering 2.1x improvement in just under 2 years
They claim a 3 GB/s improvement versus the previous version of Sep on equal hardware — and unlike “marketing” benchmarks, they include the actual speed achieved and the hardware used.
Do note that this speed, even before the 3 GB/s improvement, exceeds the bandwidth of most disks, so the bottleneck is loading data into memory. I don't know of many applications where CSV is produced and consumed in memory, so I wonder what the use is.
"We can parse at x GB/s" is more or less the reciprocal of "we need y% of your CPU capacity to saturate I/O".
Higher x -> lower y -> more CPU for my actual workload.
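Back-of-envelope version of that, as a sketch with an assumed drive speed (the 21 GB/s is the parser figure from the article):

```csharp
// Back-of-envelope: fraction of CPU needed so parsing keeps up with I/O.
using System;

class ParseBudget
{
    static void Main()
    {
        double ioGBps = 7.0;     // assumed NVMe sequential read speed
        double parseGBps = 21.0; // parser throughput from the article
        Console.WriteLine($"CPU share to saturate I/O: {ioGBps / parseGBps:P0}"); // ~33%
    }
}
```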
Slower than network! In-memory processing of OLAP tables, streaming splitters, large data set division… but also the faster the parser, the less time you spend parsing and the more you spend doing actual work
This is honestly something that caught me off-guard a bit. If you have good internal network connectivity, small queries and your relational database has the data in memory, it can be faster to fetch data from the DB via the network than reading it from disk.
Like, sure, I can give you an application server with faster disks and more memory and you or me are certainly capable of implementing an application server that could load the data from disk faster than all of that. And then we build caching to keep the hot data in memory, because that's faster.
But then we've spent very advanced development resources to build a relational database with some application code at the edge.
This can make sense in some high frequency trading situations, but in many more mundane web-backends, a chunky database and someone capable of optimizing stupid queries enable and simplify the work of a much bigger number of developers.
You can also get this with Infiniband, although it is less surprising, and basically what you’d expect to see.
I did once use a system where the network bandwidth was in the same ballpark as the memory bandwidth, which might not be surprising for some of the real HPC-heads here but it surprised me!
Decompression is your friend. Usually CSV compresses really well.
Multiple cores decompressing LZ4 compressed data can achieve crazy bandwidth. More than 5 GB/s per core.
Perhaps, but I think we are well past the Moore's law era where a 3x speed-up is to be expected just from hardware. It's still a pretty impressive feat in the modern era.
> You can't claim this when you also do a huge hardware jump
Well, they did. Personally, I find it an interesting way of looking at it, it's a lens for the "real performance" one could get using this software year over year. (Not saying it isn't a misleading or fallacious claim though.)
Yea wtf is that chart, it literally skips 4 cpu generations where it shows “massive performance gain”.
Straight to the trash with this post.
But it repeats the 0.9.0 test on the new hardware. So the first big jump is a hardware change, but the second jump is the software changes.
It also appears to be reporting whole-CPU rather than single-thread numbers; 1.3 GB/s is not impressive for single-thread perf
Agreed. How hard is it to keep hardware fixed, load the data into memory, and use a single core for your benchmarks? When I see a chart like that I think, "What else are they hiding?"
Folks should check out https://github.com/dathere/qsv if they need an actually fast CSV parser.
I mean... A single 9950x core is going to struggle to do more than 16 GB/second of direct mem copy bandwidth. So being within an order of magnitude of that seems reasonable
4 generations?
5950x is Zen 3
9950x is Zen 5
Since Zen 2 (3000), the mobile CPUs have been numbered a thousand higher than their desktop counterparts. edit: Or N×2000, where N is the Zen generation.
And even with 2, CPU generations aren't what they used to be back when a candy bar cost less than a dollar.
If we are lucky we will see Arthur Whitney get triggered and post either a one liner beating this or a shakti engine update and a one liner beating this. Progress!
I shudder to think who needs to process a million lines of csv that fast...
I have. I think it's a pretty easy situation for certain kinds of startups to find themselves in:
- Someone decides on CSV because it's easy to produce and you don't have that much data. Plus it's easier for the <non-software people> to read so they quit asking you to give them Excel sheets. Here <non-software people> is anyone who has a legit need to see your data and knows Excel really well. It can range from business types to lab scientists.
- Your internal processes start to consume CSV because it's what you produce. You build out key pipelines where one or more steps consume CSV.
- Suddenly your data increases by 10x or 100x or more because something started working: you got some customers, your sensor throughput improved, the science part started working, etc.
Then it starts to make sense to optimize ingesting millions or billions of lines of CSV. It buys you time so you can start moving your internal processes (and maybe some other teams' stuff) to a format more suited for this kind of data.
It's become a very common interchange format, even internally; it's also easy to deflate. I have had to work on codebases where CSV was being pumped out at basically the speed of a NIC card (its origin was Netflow, and then aggregated and otherwise processed, and the results sent via CSV to a master for further aggregation and analysis).
I really don't get, though, why people can't just use protocol buffers instead. Is protobuf really that hard?
protobuf is more friction, and actually slow to write and read.
For better or worse, CSV is easy to produce via printf. Easy to read by breaking lines and splitting on the delimiter. Escaping delimiters that are part of the content is not hard, though it's often added as an afterthought.
Protobuf requires installing a library. Understanding how it works. Writing a schema file. Sharing the schema with others. The API is cumbersome.
Finally, to offer this mutable-struct abstraction via setters and getters, with variable-length encoded numbers, variable-length strings, etc., the library ends up quite slow.
In my experience protobuf is slow and memory hungry. The generated code is also quite bloated, which is not helping.
See https://capnproto.org/ for details from the original creator of protobuf.
Is CSV faster than protobuf? I don't know, and I haven't tested. But I wouldn't be surprised if it is.
> For better or worse, CSV is easy to produce via printf. Easy to read by breaking lines and splitting on the delimiter. Escaping delimiters that are part of the content is not hard, though it's often added as an afterthought.
Based on the amount of software I've seen that produces broken CSV or can't parse (more-or-less) valid CSV, I don't think that is true.
It seems easy, because it's just printf("%s,%d,%d\n", ...), but it is full of edge cases most programmers don't think about.
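A minimal sketch (mine) of the naive printf/split style being discussed, including the kind of edge case it silently gets wrong:

```csharp
// Naive CSV handling: fine for simple data, wrong for RFC 4180 quoted fields.
using System;

class NaiveCsv
{
    static void Main()
    {
        string simple = "alice,42,3.14";
        Console.WriteLine(simple.Split(',').Length);  // 3, as expected

        string quoted = "\"Doe, Jane\",42,3.14";      // quoted comma in a field
        Console.WriteLine(quoted.Split(',').Length);  // 4 -- the edge case bites
    }
}
```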
Not an issue when you control both ends of the pipe. CSV is a great interchange format for tabular data, especially so if it's only/mostly numeric. If you need to pass tabular data from internal service X to internal service Y it's great. And it's really fast.
Hmmm if they're just internal tools, why not just an array of structs? No parsing needed. Can have optionals. Can't go faster than nothing.
Mostly because dependencies are hard, and extra so when you need another team using a different language to also support the same format.
I’d love to pass parquet data around, or SQLite dbs, or something else, but that requires dedicated support from other teams upstream/downstream.
Everyone and everything supports CSV, and when they don’t they can hack a simple parser quickly. I know that getting a CSV parser right for all the edge cases is very hard, but they don’t need to. They just need to support the features we use. That’s simple and quick and everyone quickly moves on to the actual work of processing the data.
Yeah there's no format to support that way. Maybe I'm more biased towards numeric data (sensor readings, etc), but I never have to worry about libraries and dependencies to say
data = (uint32_t *)read(f);
Or
data = struct.unpack...
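For what it's worth, the same zero-parse idea works in managed code too; a sketch (the file name and uint32 layout are made up for illustration):

```csharp
// Reinterpret a raw byte buffer as uint32 readings -- no parsing at all.
using System;
using System.IO;
using System.Runtime.InteropServices;

class RawReadings
{
    static void Main()
    {
        byte[] raw = File.ReadAllBytes("sensor.bin");            // hypothetical file
        ReadOnlySpan<uint> data = MemoryMarshal.Cast<byte, uint>(raw);
        Console.WriteLine($"{data.Length} uint32 readings");
    }
}
```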
Sounds like you're dealing with more heavily formatted or variably formatted data that benefits from more structure to it
Thanks to everyone above for some great responses. Cap'n Proto seems to do exactly what you're describing (the in-memory representation is identical to what's on the wire, and then getter/setter methods are generated which look at that).
yep, use it a lot for internal stuff and I can't recall the last time we had an issue with parsing or using it. It just works for us as a data interchange file format for tabular data. Of course our character set is basically just ASCII letters and numbers; we don't even need commas or quotation marks.
Precisely. And if things get a bit more complicated slap a ‘|’ as a separator and you are almost guaranteed to never need to quote anything.
Extremely hard to tell an HR person, "Right-click on here in your Workday/Zendesk/Salesforce/etc UI and export a protobuf". Most of these folks in the business world LIVE in Excel/Spreadsheet land so a CSV feels very native. We can agree all day long that for actual data TRANSFER, CSV is riddled with edge cases. But it's what the customers are using.
It's extremely unlikely they need to load spreadsheets large enough for a 21 GB/s parsing speed to matter
You’d be surprised. Big telcos use CSV and SFTP for CDR data, and there’s a lot of it.
Oh absolutely! I'm just mentioning why CSV is chosen over Protobufs.
Kind of. There isn't a 1:1 mapping of protobuf wire types to schema types, so you need to package the protobuf schema with the data and compile it to parse the data, or decide on the schema beforehand. So now you need to decide on a file format to bundle the schema and the data.
I'm not the biggest fan of Protobuf, mostly around the 'perhaps-too-minimal' type system and the performance differences across the library's implementations in certain languages.
e.g. I know that in the .NET space MessagePack is usually faster than proto, and I think similar is true for the JVM. The main disadvantage is there's no good schema-based tooling around it.
I shudder to think of what it means to be storing the _results_ of processing 21 GB/s of CSV. Hopefully some useful kind of aggregation, but if this was powering some kind of search over structured data then it has to be stored somewhere...
Just because you’re processing 21GB/s of CSV doesn’t mean you need all of it.
If your data is coming from a source you don’t own, it’s likely to include data you don’t need. Maybe there’s 30 columns and you only need 3 - or 200 columns and you only need 1.
Enterprise ETL is full of such cases.
For all its many weaknesses, I believe CSV is still the most common data interchange format.
Erm, maybe file-based? JSON is the king if you count exchanges worldwide per second. Maybe no. 2 is form-data, which is basically email multipart, and of course there's email as a format. Very common =)
I meant file-based.
I honestly wonder if JSON is king. I used to think so until I started working in fintech. XML is unfortunately everywhere.
JSON: because XML is too hard.
Developers: hey, let's hack everything XML had back onto JSON except worse and non-standardized. Because it turns out you need those things sometimes!
JSON isn't great for tabular data. And an awful lot of data is tabular.
JSON tabular data only adds a couple of brackets per line and at the start/end of the file vs CSV. In exchange for these bits (that basically disappear when compressed), you get a guaranteed standard formatting. Seems like a decent tradeoff to me.
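For a concrete sense of the per-row overhead being compared, a toy example (my own data; culture-dependent number formatting aside):

```csharp
// One row rendered as CSV vs. as a JSON array line (JSON Lines style).
using System;
using System.Text.Json;

class RowFormats
{
    static void Main()
    {
        object[] row = { "alice", 42, 3.14 };
        Console.WriteLine(string.Join(",", row));          // alice,42,3.14
        Console.WriteLine(JsonSerializer.Serialize(row));  // ["alice",42,3.14]
    }
}
```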
Yeah, I don’t like parsing XML, but I’d rather do that than deal with the Lovecraftian API design that comes with complex JSON representations.
Lots of folks in finance; you can share CSV with any finance company and they can process it. It's text.
Humans generate decisions / text information at rates of ~bytes per second at most. There are barely enough humans around to generate 21 GB/s of information even if all they did was make financial decisions!
So 21 GB/s would be solely algos talking to algos... Given all the investment in the algos, surely they don't need to be exchanging CSV around?
The only real example I can think of is the US options market feed. It is up to something like 50 GiB/s now, and is open 6.5 hours per day. Even a small subset of the feed that someone may be working on for data analysis could be huge. I agree CSV shouldn't even be used here but I am sure it is.
CSV is a questionable choice for a dataset that size. It's not very efficient in terms of size (real numbers take more bytes to store as text than as binary), it's not the fastest to parse (due to escaping), and a single delimiter or escape out of place corrupts everything afterwards. That's not to mention all the issues around encoding, different delimiters, etc.
It's great for when people need to be in the loop, looking at the data, maybe loading it in Excel, etc. (I use it myself...). But there aren't enough humans around for 21 GB/s
> (real numbers take more bytes to store as text than as binary)
Depends on the distribution of numbers in the dataset. It's quite common to have small numbers, and for these text is a more efficient representation than binary, especially compared to 64-bit or larger binary encodings ('7' is one byte as text versus eight as an int64).
Standards (whether official or de facto) often aren't the best in isolation, but they're the best in reality because they're widely used.
Imagine you want to replace CSV for this purpose. From a purely technical view, this makes total sense. So you investigate, come up with a better standard, make sure it has all the capabilities everyone needs from the existing stuff, write a reference implementation, and go off to get it adopted.
First place you talk to asks you two questions: "Which of my partner institutions accept this?" "What are the practical benefits of switching to this?"
Your answer to the first is going to be "none of them" and the answer to the second is going to be vague hand-wavey stuff around maintainability and making programmers happier, with maybe a little bit of "this properly handles it when your clients' names have accent marks."
Next place asks the same questions, and since the first place wasn't interested, you have the same answers....
Replacing existing standards that are Good Enough is really, really hard.
You might have accumulated some decades of data in that format and now want to ingest it into a database.
Yes, but if you have decades of data, what difference does it make whether you wait a minute or 10 minutes to convert it?
> Humans generate decisions / text information at rates of ~bytes per second at most
Yes, but the consequences of these decisions are worth much more. You attach an ID to the user, and an ID to the transaction. You store the location and time where it was made. Etc.
I think these would add only a small amount of information (and in a DB would be modelled as joins). It only adds lots of data if done very inefficiently.
Why are you theorizing? I can tell you from out there that it's used massively, and it's not going away; on the contrary. Even rather small banks can end up generating various reports etc. which can easily become huge.
The speed of human decisions plays basically no role here, just as it doesn't with messaging generally; there is way more to companies than a direct keyboard-to-output link.
You seem to not realize that most humans are not coders.
And non-coders use proprietary software, which usually has an export to CSV or XLS to be compatible with Microsoft Office.
That cartesian product file accounting sends you at year end?
Ugh.....I do unfortunately.
In basically every situation it is inferior to HDF5.
I do not think there is an actual explanation besides ignorance, laziness or "it works".
I was expecting to see assembly language and was pleasantly surprised to see C#. Very impressive.
Nice work!
Modern .NET has the deepest SIMD and vector-intrinsics integration of any language most people would consider "high-level".
https://learn.microsoft.com/en-us/dotnet/standard/simd
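As a small taste of what that looks like in practice, a sketch of mine (not from the docs above): one portable code path, and the JIT picks SSE2/AVX2/AVX-512 widths at run time.

```csharp
// Count newlines in a buffer with the portable Vector<T> API.
using System;
using System.Numerics;

class NewlineCount
{
    static int Count(ReadOnlySpan<byte> data)
    {
        var nl = new Vector<byte>((byte)'\n');
        int count = 0, i = 0;
        for (; i + Vector<byte>.Count <= data.Length; i += Vector<byte>.Count)
        {
            // Equal lanes are all-ones (0xFF); as sbyte that is -1 per match.
            var eq = Vector.Equals(new Vector<byte>(data.Slice(i)), nl);
            count -= Vector.Sum(Vector.AsVectorSByte(eq));
        }
        for (; i < data.Length; i++)
            if (data[i] == (byte)'\n') count++;
        return count;
    }

    static void Main() => Console.WriteLine(Count("a,b\nc,d\ne,f\n"u8)); // 3
}
```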
Tanner Gooding at Microsoft is responsible for a lot of the developments in this area and has some decent blogposts on it, e.g.
https://devblogs.microsoft.com/dotnet/dotnet-8-hardware-intr...
The article doesn't clearly define what this 21 GB/s code is doing.
- What format exactly is it parsing? (e.g. does the dialect of CSV support quoted commas, or is the parser merely looking for commas and newlines?)
- What is the parser doing with the result (ie. populating a data structure, etc)?
There’s a calculation for ns/row in the article that is never translated into rows per second but is about 27 ns/row, which is about 37,000 per second. Which means these rows are 570k apiece if that’s 21GB. Which seems like an awfully cooked benchmark.
That would be 37,000,000, not 37,000.
So ~570 bytes per line. Still seems a bit contrived. I'd expect a SIMD version to still work line by line, but I don't know that I would try to shove that much into a line if I wanted to read it really fucking fast.
In my experience I've found it difficult to get substantial gains with custom SIMD code compared to modern compiler auto-vectorization, but to be fair that was with more vector-friendly code than JSON parsing.
I need this. I just finished 300 GB of CSV extracts, and manipulating them, data integrity checks, and so on take longer than they should.
Why wouldn't you use a data format meant to store floating point numbers?
HDF5 gives you a great way to store such data.
Sounds interesting, I'll give it a look. I'm unfortunately limited to CSV, XML, or XLS from the source system, then am transforming it and loading it into another DB.
tbh the way intel keeps killing cool tech gets on my nerves - wish they'd just stick it out for once
> Net 9.0
heh, do it again with mawk.
There are very good alternatives to csv for storing and exchanging floating point/other data.
The HDF5 format is very good and allows far more structure in your files, as well as metadata and different types of lossless and lossy compression.
Considering the non-standard nature of CSV, quoting throughput numbers in bytes is meaningless. It makes sense for JSON, since you know what the output is going to be (e.g. floats, integers, strings, hashmaps, etc). With CSV you only get strings for each column, so 21 GB/s of comma splitting would be the pinnacle of meaninglessness. Like, okay, but I still have to parse the stringy data, so what gives? Yeah, the blog post does reference float parsing, but a single float per line would count as "CSV".
Now someone might counter and say that I should just read the README.MD, but then that suspicion simply turns out to be true: They don't actually do any escaping or quoting by default, making the quoted numbers an example of heavily misleading advertising.
CSV is standardized in RFC 4180 (well, as standardized as most of what we consider internet "standards").
Otherwise agreed: if you don't do escaping (a.k.a. "quoting", the same thing for CSV), you are not implementing it correctly. For example, if you quote a line break, per RFC 4180 that line break is part of the quoted string; if you don't need to handle that, you can implement CSV parsing much faster (properly handling line breaks inside quoted strings requires a 2-pass approach if you are going to use many cores, while not handling them at all can be done in 1 pass). I discussed this detail in https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...
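To make the quoted-line-break case concrete, a tiny illustration (mine):

```csharp
// Per RFC 4180 this is a header plus ONE record, even though it spans three
// physical lines -- which is why you can't blindly split on line breaks first.
using System;

class QuotedNewline
{
    static void Main()
    {
        string csv = "id,comment\r\n1,\"first line\r\nstill record 1\"\r\n";
        Console.WriteLine(csv);
    }
}
```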
Side note: RFCs are great standards, as they are readable.
As an example of how not to do it: XML can be considered a standard, but I cannot afford to read it. DIN/ISO is great for manufacturing in theory, but bad for a field with zero-cost initial investment like IT.
[flagged]
Then show us your elixir implementation?
Why not use Parquet?
Because that would be too logical. :)
It is an interesting benchmark anyway.
Excel does not output Parquet.
True. But also Excel probably collapses into a black hole going straight to hell trying to handle 21GB of data.
Excel .xlsx files are limited to 1,048,576 rows and 16,384 columns.
Excel .xls files are limited to 65,536 rows and 256 columns.
21GB/s, not 21GB ...
mawk (or maybe one true awk) would handle a 21 GB csv fast enough.
Excel often outputs broken csv :)
I have been privileged in my career to never need to parse Excel output but occasionally feed it input. Especially before Grafana was a household name.
Putting something out so a manager stops asking you 20 questions about the data is a double-edged sword though. Those people can hallucinate more than a pre-Covid AI engine. Grafana is just weird enough that people would rather consume a chart than try to make one, so then you have some control over the acid trip.
Or HDF5 or any other format which is actually meant to store large amounts of floating point data.