Appendix A: State of the art in Generative AI
Back to the Report: Generative Artificial Intelligence for Education and Pedagogy
This appendix gives a layperson’s description of how Generative AI models are trained for various domains like language, coding, images, and other modalities, and how these models are deployed to users in current systems.
Section A1.1: Language
Generative AI models of natural language are known as Large Language Models (LLMs) or Generative Pre-trained Transformers (GPT, from OpenAI's branding). It is worth breaking these terms down to understand the current state of this approach. “Large” refers to the capacity of the underlying system, which is scaled up together with the amount of data it is trained on. Current open large models are trained on more than 1 trillion word tokens, whereas proprietary models are likely trained on orders of magnitude more. “Transformer” is the name of the model design used for language applications and is critical to these models' generative ability. The current generation of LLMs for human language is created with a two-step procedure. First, the models are pretrained on a large set of text collected from articles, published books, research papers, social media, and other sources. Then a second stage, instruction training, is applied to train the models to be helpful and responsive to common human instructions. The data for this second stage comes from human interaction with the published system, as well as from large teams of human annotators who write answers to challenging questions and provide feedback on model answers. This stage gives companies more control over system behavior.
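To make the idea of pretraining and word-by-word generation concrete, the following is a minimal Python sketch. It is not how a Transformer works internally: the model here is only a table of word-pair counts, and the three-sentence corpus and the function names `pretrain` and `generate` are invented for illustration. What it shares with a real LLM is the overall loop: learn from text which word tends to come next, then produce output one token at a time.

```python
import random
from collections import Counter, defaultdict


def pretrain(corpus):
    """Count which words follow each word (a toy stand-in for pretraining)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def generate(counts, start, max_tokens=10, seed=0):
    """Generate text one token at a time, sampling each next word by frequency."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(max_tokens - 1):
        followers = counts.get(output[-1])
        if not followers:
            break  # the last word was never followed by anything in the corpus
        words = list(followers)
        weights = [followers[word] for word in words]
        output.append(rng.choices(words, weights=weights, k=1)[0])
    return " ".join(output)


# Hypothetical miniature "training set"; real models see trillions of tokens.
corpus = [
    "the model reads text",
    "the model writes text",
    "the student reads books",
]
counts = pretrain(corpus)
print(generate(counts, "the"))
```

A real LLM replaces the count table with a Transformer that considers the entire preceding text rather than only the last word, and the instruction training stage described above further adjusts which continuations the model prefers.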
There are several major commercial chat LLM interfaces, of which notable choices include ChatGPT (from OpenAI, based on GPT-3.5 / GPT-4), Bard (from Google, based on PaLM), and Claude (from Anthropic). These present a relatively similar interface where users can chat with the models and receive textual responses. Evaluation of these models is still challenging, but ChatGPT is currently considered more advanced than the others, particularly in terms of reasoning and factuality. For this reason, ChatGPT has become somewhat synonymous with Generative AI, and will act as the representative model for this category throughout the report. In addition to chat, LLM providers offer models to third-party services, which use them in their own products. Notable examples include Duolingo, which offers a language learning bot; Grammarly, which uses LLMs as part of its pipeline; and Khan Academy.
There have been major efforts to produce high-quality, openly available LLMs. Most notable has been LLaMA (from Meta), a large language model that is available for extension and development. This is a first-stage LLM, pretrained on a large amount of available text data. The release of this and related models has led to several community projects to produce instruction-tuned LLM variants, including OpenAssistant, GPT4All, and Alpaca. These models can be used through HuggingFace Chat. Currently, commercial versions are free to use, but given the cost of serving these models, it might be necessary in the future for universities to run open variants of the systems. These models are also critical for supporting domain-specific language models in areas such as medicine, science, and mathematics.
Section A1.2: Code
While technically similar to LLMs for language, generative AI models for code are impactful enough to be considered a separate modality. For code LLMs, the pretraining data is supplemented with large amounts of code. Specifically, these models use code scraped from sites like GitHub. As with language LLMs, it is not public knowledge which code bases proprietary models have seen, which makes it challenging to know whether specific problems are in the training data. There do exist open replications of this data collection process, e.g., The Stack, which expose this information for open models. Even without ever seeing the code being run, these models learn both to generate working code and to generate code directed by natural language (either from code comments or from language data). Code models can also be used to explain complex code inputs. Models can further be instruction trained to act like responsive agents in a chat environment, even updating code conversationally.
In terms of code model usage, LLMs such as ChatGPT and Bard are the primary proprietary models. Both generate code and amend it based on users' corrections and updates. As with language, ChatGPT is currently better at producing precise code output, although Bard has added features for code, such as citations to original sources. Another popular tool is GitHub Copilot, based on OpenAI’s models, which integrates into a user’s IDE and provides both powerful autocomplete and integrated question answering. Researchers are also developing open-source coding LLMs such as StarCoder, which are more transparent about their training data and may allow for customized or constrained tools.
Section A1.3: Images
Generative models for images are also produced using a large number of training instances, in this case text / image pairs from the web. Notably, though, the technology for image generation uses a different method, known as diffusion. Unlike LLMs, which generate text one word at a time, image models synthesize the whole image at once, starting from a rough version and adding detail step by step. To handle text prompts, these models are paired with a text model. One important consequence of this approach is that generative models of images do not need to be as large or trained on as many instances to perform well. As such, there has been more progress on the development of open-source systems for generative image models.
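The step-by-step refinement described above can be illustrated with a deliberately simplified Python sketch. This is not an actual diffusion model: the "image" is a short list of brightness values, and the trained network is replaced by a hypothetical `toy_denoiser` function that already knows the target image, which a real model of course does not (it must predict a cleaner image from the noisy one and the text prompt). The sketch only shows the shape of the loop: start from random noise and repeatedly nudge every pixel toward the current estimate of the clean image.

```python
import random


def toy_denoiser(noisy_image, target):
    """Hypothetical stand-in for a trained network.

    A real model would estimate the clean image from `noisy_image` and a text
    prompt; this toy version simply returns the known target.
    """
    return target


def generate_image(target, steps=10, seed=0):
    """Refine all pixels together over several steps, starting from noise."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise
    history = [image]
    for step in range(steps):
        estimate = toy_denoiser(image, target)
        blend = 1.0 / (steps - step)  # trust the estimate more at each step
        image = [(1 - blend) * px + blend * est for px, est in zip(image, estimate)]
        history.append(image)
    return history


def total_error(image, target):
    """Sum of absolute pixel differences between an image and the target."""
    return sum(abs(px - t) for px, t in zip(image, target))


# Hypothetical 4-pixel "image" used only for illustration.
target = [0.0, 0.5, 1.0, 0.5]
history = generate_image(target)
for step, image in enumerate(history):
    print(step, round(total_error(image, target), 3))
```

Each printed line shows how far the current image is from the target; the distance shrinks at every step. This contrasts with the language case, where the output is extended one token at a time rather than refined as a whole.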
The most commonly used systems for image generation are Stable Diffusion (originally an academic system, connected to Stability AI), DALL-E (from OpenAI), and Midjourney (a commercial system). These systems allow the user to enter a prompt and produce a relevant image based on the content and style requested. Image generation systems also allow for more complex edits, such as combining images, in-painting content, and tuning to specific domains.
Section A1.4: Other modalities
Generative AI is also being used in other modalities. Recently, many demos of video generation have shown that it is possible to generate short clips based on textual prompts, most notably the Gen-1 model from Runway. There are also speech-generation models, e.g., VALL-E (from Microsoft), that can mimic a speaker's voice from a short example. Using similar techniques, companies have released music generation systems that can generate music from descriptions or from demonstrations, e.g., MusicLM (from Google).
Section A1.5: Generative AI in software tools
Current user-facing generative AI apps are primarily technical demos of the underlying technologies. Companies also provide or sell interfaces to the models and inference code to facilitate application development. In the coming years, user interaction will likely happen through customized applications. One example mentioned above is GitHub Copilot, a tool that uses OpenAI LLMs and is integrated directly into VS Code, a common programming tool. Another example is recent versions of Adobe Photoshop, which include Generative Fill to allow for automatic generation within image editing environments. Related tools like Canva have incorporated AI methods into design and editing processes in addition to full image generation. New user-facing tools are offering AI-first interfaces that directly incorporate AI systems into their structure. For example, Sudowrite is an online writing assistant that uses LLMs to complete and expand upon story ideas.