Databricks has announced its acquisition of MosaicML, a San Francisco-based generative artificial intelligence startup, in a deal worth approximately $1.3 billion. The move comes as businesses increasingly seek to develop their own ChatGPT-like tools, and Databricks aims to meet the growing demand.
Databricks, a data storage and management startup, plans to combine its AI-ready data management technology with MosaicML’s language-model platform. The integration is intended to let businesses build cost-effective language models on their own proprietary data; today, many companies instead rely on third-party language models trained on publicly available data gathered from the internet.
MosaicML, launched in 2021 and set to operate as an independent service under Databricks, has been dedicated to reducing the cost of generative AI. According to co-founder and CEO Naveen Rao, its efforts have driven a significant decrease in expenses, from millions of dollars per model to hundreds of thousands. MosaicML employs 62 people and has raised $64 million in funding to date.
The completion of the deal is anticipated to take place before the end of Databricks’s second quarter, which concludes on July 31.
Privacy and Security Concerns in Generative AI
Generative AI applications are designed to generate original text, images, and computer code from natural-language prompts provided by users. Interest in the technology has surged since November 2022, when the AI startup OpenAI introduced ChatGPT, an online generative AI chatbot.
Anthropic and OpenAI both license language models to companies, which can then build generative AI applications on top of them. Commercial demand for these models has been robust, substantially expanding the generative AI market. That growth has created opportunities for startups such as MosaicML that claim to provide similar AI models at a lower cost, tailored to a company’s specific data.
Databricks CEO Ali Ghodsi emphasized that when building a model from scratch, you have control over the data you feed into it. According to Ghodsi, off-the-shelf models, which come pre-trained on internet data, often contain irrelevant information that can distort the outcomes. He further noted that many companies have concerns about privacy and security when sharing their data with external vendors who develop the models.
According to Sreekar Krishna, the U.S. artificial intelligence leader at KPMG, some machine-learning experts and AI vendors argue that large language models, such as the one driving ChatGPT, surpass smaller models in computational and synthesis ability, even though the smaller models can be strong within narrow domains. Companies also continue to struggle with managing their data and determining which models are most appropriate for particular applications.
“Data has always been the crucial factor for achieving success,” Krishna emphasized, adding that this need has become even more pronounced with the emergence of large language models.
Databricks: Empowering Data Preparation for AI Applications
Technology leaders in the corporate sector face increasing pressure to prepare their data for AI models; data is the fundamental basis for all algorithms, enabling them to identify patterns and make predictions. Replit, a provider of programming tools, is one company that has used Databricks for its data pipeline and has leveraged that data with MosaicML to train a code-generation model.
Databricks is known for its “lakehouse” technology, which is designed to help businesses prepare and manage their data for AI applications by integrating data, analytics, and AI programming tools into a unified system. Databricks generates revenue by offering cloud-based software, including analytics and AI capabilities, that harnesses AI-ready data (what CEO Ghodsi calls the “picks and shovels”) to build enterprise tech systems. Last year, Databricks reported annualized revenue exceeding $1 billion.
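As a rough, hypothetical illustration of that workflow (a sketch, not Databricks’s actual product code), the snippet below shows how business data already sitting in a lakehouse table might be pulled into Python for a downstream model-training step; the table name and column are invented for the example.

    from pyspark.sql import SparkSession

    # Illustrative sketch only: read "AI-ready" data from a lakehouse (Delta) table
    # so it can feed a model-training step downstream. The table name
    # "main.support.tickets" and column "ticket_text" are hypothetical.
    spark = SparkSession.builder.getOrCreate()
    tickets = spark.table("main.support.tickets")
    sample = tickets.select("ticket_text").limit(1000)  # small slice for illustration
    texts = [row["ticket_text"] for row in sample.collect()]
    print(f"Collected {len(texts)} documents for downstream fine-tuning.")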
According to market analytics firm PitchBook Data, the global generative AI market is anticipated to reach $42.6 billion by the end of this year, with a projected compound annual growth rate of 32% to reach $98.1 billion by 2026. Venture funding for generative AI startups has risen to $12.7 billion in the first five months of 2023, compared to $4.8 billion in 2022.
Databricks, founded in Berkeley, California, ten years ago by a group of data scientists, currently holds a private-market valuation of $38 billion after a fundraising round that raised $1.6 billion in August 2021. Notable investors in Databricks include Morgan Stanley’s Counterpoint Global, Andreessen Horowitz, Baillie Gifford, UC Investments, and ClearBridge Investments.
The Value of Domain-Specific Models in AI Applications
Larry Pickett, Chief Information and Digital Officer of Syneos Health, a biopharmaceutical services company, acknowledges that training specialized health models on domain-specific data can be costly, estimated at between $1 million and $2 million. Analysts suggest that such “domain-specific” models may provide more value to companies than ChatGPT, as they capture industry-specific terminology and expertise.
However, Pickett believes Syneos Health can achieve significant cost savings by using smaller, pre-trained models instead of relying on the entire OpenAI dataset. Such models, he notes, are available in open-source libraries like those provided by machine-learning startup Hugging Face.
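As a minimal sketch of the approach Pickett describes (assuming the openly available Hugging Face transformers library and a small open model such as distilgpt2, neither of which is named by Syneos Health), loading a compact pre-trained model locally might look like this:

    from transformers import pipeline

    # Illustrative sketch only: load a small, openly available pre-trained model
    # from the Hugging Face Hub instead of calling a large proprietary model.
    # "distilgpt2" is just one example of a compact open model.
    generator = pipeline("text-generation", model="distilgpt2")
    result = generator("Clinical trial enrollment is", max_new_tokens=30)
    print(result[0]["generated_text"])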
“Not everybody, every application, requires a GPT-4,” Krishna noted, referring to OpenAI’s most advanced language model. Large language models are now being tailored to particular purposes, he explained, and once they reach that level of specialization they become small enough to be embedded in any cellphone.