2024-02-06

The recent explosion of interest in large-scale data processing is highlighting the limitations of tools built for an era when data volumes were large but nowhere near as vast as they are today. Data comes from millions, even billions, of sources, flowing through a vast infrastructure of complex systems built by different vendors, all of them desperate to claim their share of the giant AI pie. Without deep technical standards, it could all collapse.

Apache Arrow, an open source project, has emerged as the de facto mechanism used by most of the industry for data interoperability. It is maintained by a company called Voltron Data, which follows the now familiar open core approach, where the core product is free and open source, subsidized by enterprise support and the sale of complementary products.

Arrow fills a need that was previously met by sending CSV or JSON structured data back and forth, or making direct data connections between systems via JDBC or ODBC. None of these approaches could keep up with the volume and velocity required for data science in recent years, but attempts at proprietary solutions never achieved widespread industry support.
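
To make the contrast concrete, here is a minimal sketch in Python using only the standard library. It is not Arrow itself, and the column name and values are invented for illustration. It sends the same column of numbers two ways: as CSV text, which the receiver must parse value by value, and as a single typed binary buffer, which the receiver can load directly. Arrow's actual columnar format is considerably more elaborate, so treat this only as an illustration of the general idea.

```python
import array
import csv
import io

# Hypothetical sensor readings; the values are invented for illustration.
readings = [20.5, 21.0, 19.75, 22.25]


def send_as_csv(values):
    """Serialize a column as CSV text, one value per row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["reading"])  # header row
    for v in values:
        writer.writerow([v])
    return buf.getvalue()


def receive_csv(text):
    """Parse CSV text back into floats, converting each string in turn."""
    rows = csv.reader(io.StringIO(text))
    next(rows)  # skip the header row
    return [float(row[0]) for row in rows]


def send_as_column(values):
    """Pack a column into one contiguous buffer of 64-bit floats.

    Assumption: sender and receiver share the same byte order, since
    array uses the machine's native layout.
    """
    return array.array("d", values).tobytes()


def receive_column(payload):
    """Load the buffer directly; no per-value text parsing is needed."""
    col = array.array("d")
    col.frombytes(payload)
    return col.tolist()


csv_roundtrip = receive_csv(send_as_csv(readings))
column_roundtrip = receive_column(send_as_column(readings))
```

Both round trips return the original values, but the text path converts every number to a string and back, while the binary path moves a fixed eight bytes per value with no parsing. That difference grows with data volume, which is the pressure the paragraph above describes.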

“Data systems have a lot of modular, composable parts that must be standardized,” said Josh Patterson, CEO of Voltron Data, adding that people should build these parts once and adopt them instead of repeating the same thing over and over again in slightly different ways. “This allows us to build really, really large-scale systems that are far more elegant and efficient than otherwise.”

While this is largely true, getting competitors to agree on a common standard is extremely difficult. Arrow’s success is all the more remarkable given the competitive pressure in the field. Traditionally, competitors would try to outdo each other rather than cooperate.

“In the early days, even in open source, everyone was like ‘I’ll come up with a better CSV parser than you,’” Patterson said. “‘We’re not going to share the CSV parser, I’m going to rewrite my own parser!’ You know? ‘My version of joining will be better than your version of joining!’”

Patterson observed that collaborating on a common set of fundamental primitives and an interoperable, open standard has proven extremely beneficial. “This de-composable stack is relatively new, and I think it’s an exciting thing.”

The open core model is being questioned by many companies that once firmly believed in it. Priced and packaged well, an open core model can sustain a large business – Red Hat is the most famous example – although it is not without its challenges. Companies like MongoDB, Elastic and, more recently, HashiCorp have opted to move away from a completely open core approach, citing business imperatives. Although they are in the minority, the merits of alternative approaches are still the subject of lively discussion in the tech industry.

“I personally believe we’re moving a little bit away from open source,” Patterson said. “We’re seeing this evolution of open source, but I think open standards are going to become more common.”

In the AI field, we are seeing some signs of this with so-called open models, although they resemble the commercial licensing approaches of MongoDB, Elastic, or HashiCorp more than the open source traditions of the Apache, MIT, or GPL licenses. An open standard has yet to emerge and instead we are seeing a general battle for the dominant position as the proprietary controller of the de facto standard.

It is not clear where this struggle for position will take us. OpenStack remains a cautionary tale for many, while Kubernetes and Arrow show us a different path. Or we could end up in a market dominated by a single company that controls the de facto standard, as Microsoft once did with Windows, and as AWS now does with S3.

The tech industry loves to democratize things, so maybe this is another opportunity to do so?