Nov 1 2024
13 min read
1. The debate over defining open-source AI
- It’s a strange time in the world of open-source software. Arguably, open source has never been hotter, having risen in importance over the past year and a half as a counterpoint to proprietary AI models epitomized by OpenAI (a name some now view as ironic). At the same time, what gets called open source is increasingly viewed as muddled, polluted, or nothing of the sort – with many pointing to the models Meta has released as exemplifying this dynamic. The Open Source Initiative (OSI)’s release of its official definition of open-source AI this week is spurring further debate rather than clarifying the issue.
- Let’s rewind to the early days of open-source software. In the early 90s, Linus Torvalds combined his Linux kernel with the GNU Project’s components into an operating system known as GNU/Linux – a free alternative to the proprietary UNIX. GNU/Linux was placed under the GNU General Public License (GPL), which allowed others to freely use, modify, and distribute it. The open license drew in contributions from programmers around the world, resulting in a rapid pace of development that allowed Linux to become a full-fledged operating system.
- The success of Linux, as well as the open-source Apache web server and Netscape’s notable decision to open-source its browser, paved the path for the founding of the Open Source Initiative in Feb 1998. The OSI popularized the term “open source” (such software had previously been described as “free”) and became the recognized steward of the Open Source Definition (OSD). The OSD is used to evaluate whether a given license is indeed open-source, and the OSI maintains the list of approved licenses.
- The OSD establishes criteria for open source, including free redistribution, availability of source code at no more than a reasonable cost, permission to make modifications and derived works, no discrimination against users or fields of use, and technology neutrality, among others. Some OSI-approved licenses are “copyleft” (requiring reciprocal openness so others can use derived works), while others are “permissive” (with very few obligations attached). Examples of popular OSI-approved licenses include the MIT License (permissive), Apache 2.0 (permissive), BSD (permissive), and GPL 3.0 (copyleft).
- In 2022, the OSI began developing the Open Source AI Definition (OSAID) in collaboration with industry stakeholders; version 1.0 was released this past week. (Meta is a backer of OSI and participated in discussions.) Under the new definition, an open-source AI system – including its model, weights, parameters, and other structural elements – must allow users to: (1) “Use the system for any purpose and without having to ask for permission”; (2) “Study how the system works and inspect its components”; (3) “Modify the system for any purpose, including to change its output”; and (4) “Share the system for others to use with or without modifications, for any purpose.”
- More specifically, users must be given access to the “preferred form” to make modifications. This includes “[s]ufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system,” as well as “[t]he complete source code used to train and run the system” and the model parameters (e.g. weights).
- On the data front, this must include: “(1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.”
- The OSI’s definition throws a wrench into the world of LLMs (large language models), where being “open” has not necessarily meant that the LLM could be inspected and built upon, or that it was made available under an established open-source license. Being open has often just meant releasing the model weights and some starter code.
- Notably, the OSI’s definition excludes Meta’s Llama models, which are among the most performant models widely available for use. (Models from Mistral, Stability AI, Microsoft, and X are also excluded based on their license terms.) Meta has long been criticized for using bespoke licenses that are not very permissive and for engaging in “open-washing.” While Meta has gradually made its licenses more permissive, the Llama 3.2 license still requires licensees with more than 700M monthly active users to request a license from Meta. Furthermore, Meta doesn’t provide detailed information about the training data or the complete source code used to train the system.
- Meta, which has planted its flag on being a champion of open source, has publicly disagreed with the OSI’s definition. In its words, “Existing open-source definitions for software do not encompass the complexities of today’s rapidly advancing AI models. We are committed to keep working with the industry on new definitions to serve everyone safely and responsibly within the AI community.”
- OSI has noted that Google and Microsoft have stopped using the term “open source” for models that the OSI doesn’t consider open, whereas Meta continues to use the term. AI startup Mistral has started to use the term “open weight” instead of “open source.” A key difference with “open-weight” models is that developers cannot see how they were built or easily make significant modifications (see the illustrative sketch after this list).
- Some believe the OSI’s new definition does not go far enough to protect the essential freedoms represented in the original OSD. Under the new definition, an open-source model could still withhold training data (e.g. for confidentiality reasons or to shield players from copyright concerns). Some are advocating for a signed declaration reverting the definition of open source to the original OSD, which would effectively void the new definition for AI.
- Some “do not believe the term open source can or should be extended into the AI world,” and are advocating for a new bespoke name. It’s not clear whether a term of art originally used to describe source code can be stretched to cover the very different space occupied by AI. There’s also a debate as to whether reproducibility is even relevant in the realm of AI.
- On the surface, this debate about the definition of open-source AI may seem academic. Meta and other players may continue to call their models “open-source” in contravention of the OSI’s definition. (The OSI doesn’t hold the trademark.) However, part of the rationale for Meta’s championing of open source is to capitalize on the learnings and emergent capabilities of a community attracted to its openly available models. That rationale may be eroded if Meta’s position comes to be viewed as morally tainted. (Meta has experience being on the wrong side of a social debate.)
- Still, Meta’s disputed use of “open-source” will likely be outweighed by the value that it is offering by making a near state-of-the-art LLM widely available, at least in the near term. Meta continues to release new useful models – most recently, a set of fast quantized Llama models that can run on mobile devices, and NotebookLlama providing a more open version of Google’s popular NotebookLM. There are a relatively limited number of players operating at this level and even fewer willing to undertake the disclosures and risks associated with abiding by the OSI’s new definition.
- It may be that the only AI systems that end up meeting the definition are those trained on freely available public data. The broader industry, however, seems to be heading in a different direction, towards using more proprietary data in training. That raises the question of how useful a definition is if few actually adopt it.
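For readers less familiar with the distinction, below is a minimal, illustrative Python sketch (not from the brief) of what “open weight” typically means in practice: the weights can be downloaded and run for inference, but the release includes no equivalent artifact for reconstructing how the model was trained. The Hugging Face `transformers` calls are real, but the specific model id is an assumption chosen for illustration and requires accepting Meta’s license terms.

```python
# Illustrative sketch of the "open weight" pattern, assuming the `transformers`
# library is installed and access to the (gated) meta-llama repo has been granted.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # hypothetical choice for illustration

# The weights and tokenizer can be downloaded and run locally...
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Open weights let you run the model, but", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# ...but nothing in the release exposes the training data, filtering steps, or
# full training code a developer would need to rebuild or substantially modify
# the system – which is what the OSAID asks for.
```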
Related Content:
- Aug 2 2024 (3 Shifts): Open AI models are here to stay
- Apr 26 2024 (3 Shifts): Llama 3, Quest’s OS, and Meta's open-source strategy
Disclosure: Contributors have financial interests in Meta, Microsoft, Alphabet, OpenAI, and Rocket Lab. Google and OpenAI are vendors of 6Pages.
Have a comment about this brief or a topic you'd like to see us cover? Send us a note at tips@6pages.com.