Before Artificial General Intelligence (AGI) takes off, before we are all fooming at the mouth, and certainly before the AGI doomers run out of field to move their goalposts, it is already a foregone conclusion that we have arrived at the age of Artificial Sufficient Intelligence (ASI).
The energy has become undeniably infectious over the past few weeks. Attend a house party in Hayes Valley, eavesdrop at Haus, or stop by any live-work in SOMA; it becomes increasingly apparent that two things are true:
People are using LLMs to solve real-world problems
We do not yet know the best way to build with these things
How did we get here? The AGI foundries (OpenAI, Anthropic, Cohere, etc.) have made several high-conviction bets:
Natural language is a viable universal interface for controlling artificial intelligence
Natural language understanding via foundational LLMs will continue to improve
Steadily decreasing loss on a massive unsupervised dataset equates to increased general intelligence
Alignment unlocks capabilities, and thus productivity
Programming with English
The universality of this interface comes from the power of in-context learning. Although foundational LLMs are trained only to predict the most likely next token, they are capable of performing very specific tasks when given a few examples at inference time. Beyond being a curious scientific mystery, this has tremendous implications for approximating intelligence on the cheap.
Armed with some context and a prompt, any product engineer can inject their app with intelligence. The workflow is exactly like training an overseas personal assistant hired off UpWork or ScaleAI (a minimal sketch follows this list):
Provide examples of the desired work: given some input, expect some output.
Ask them to apply this process to new work.
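Concretely, that workflow is just a few-shot prompt. A minimal sketch, assuming the pre-1.0 `openai` Python client and an `OPENAI_API_KEY` in the environment; the ticket-classification task and its examples are invented for illustration:

```python
import openai  # assumes the pre-1.0 client: pip install "openai<1"

# Step 1: provide examples of the desired work (input -> output).
few_shot_prompt = """Classify the support ticket as 'billing', 'bug', or 'other'.

Ticket: I was charged twice this month.
Label: billing

Ticket: The app crashes when I upload a photo.
Label: bug

Ticket: Can I change my username?
Label: other

Ticket: My invoice shows the wrong plan.
Label:"""

# Step 2: ask the model to apply the demonstrated process to new work.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=few_shot_prompt,
    max_tokens=5,
    temperature=0,  # keep the output deterministic for a classification task
)
print(response["choices"][0]["text"].strip())  # expected: billing
```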
How did we go from a next-token prediction model to this? This is where instruction tuning and RLHF come into play. OpenAI has:
Finetuned a foundational LLM (GPT) to follow instructions and answer questions.
Built a reward model for scoring responses using human feedback, then further finetuned the model to align with what humans expect a conversation to look like.
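Schematically, the two stages consume different kinds of data. The records below are a hypothetical sketch of that data's shape, not OpenAI's actual datasets:

```python
# Stage 1: instruction tuning (supervised finetuning).
# Each record pairs an instruction with a human-written demonstration.
sft_example = {
    "prompt": "Summarize: Mitochondria are organelles that ...",
    "completion": "Mitochondria generate most of the cell's chemical energy.",
}

# Stage 2: RLHF. Humans rank candidate responses; the rankings train a
# reward model, and the LLM is then finetuned (e.g. with PPO) to produce
# responses that score highly under that learned reward.
preference_example = {
    "prompt": "Explain photosynthesis to a ten-year-old.",
    "chosen": "Plants catch sunlight and use it to turn air and water into food.",
    "rejected": "Photosynthesis comprises light-dependent reactions and ...",
}
```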
Contrast the following responses. The text-davinci-003 flavor of GPT-3 is trained not only on next-token prediction but also to follow commands well (instruction tuned) and is optimized via human feedback (RLHF). It is succinct and does as told. Compare this against ChatGPT, which is also instruction-tuned and RLHF'd but optimized for being conversational: it not only fulfills the demand but does so more thoroughly and conversationally.
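The interface difference shows up directly in the API. A rough illustration with the pre-1.0 `openai` client (the prompt is arbitrary):

```python
import openai

prompt = "Give me three names for a pet hedgehog."

# text-davinci-003: a bare completion endpoint. Instruction tuned and
# RLHF'd to do as told, so it answers succinctly.
davinci = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=64,
)
print(davinci["choices"][0]["text"])

# ChatGPT (gpt-3.5-turbo): the same ingredients, but optimized for dialogue,
# so requests and responses are structured as chat messages.
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(chat["choices"][0]["message"]["content"])
```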
RLHF is not only useful for aligning the model toward task fulfillment or being a conversational partner; it can also prevent hallucinations. Because the tendency to "must give next token" is so strong, these models can often behave like a new graduate, overeager to answer, responding with a forced narration of half-truths and full lies.
It is clear that alignment is as important for capability as it is for safety. The dual-wielding tiger mom of post-training unlocks real productivity capabilities that were not possible with GPT alone. Instruction tuning makes shareholders ecstatic and RLHF keeps HR happy.
So what’s the next missing piece for capability? Where would any knowledge worker be without their working memory and access to a knowledge base? Enter the space of Language Model Programs (LMPs), Toolformer, and vector databases. A design pattern mixing calls to an LLM, APIs for other services, and semantic retrieval emulates working memory to automate knowledge work.
As an example, we emulate looking up the relevant iFixit documentation for a query about how to remove an iPhone screen. By passing in the relevant knowledge, the LLM is able to answer the given question, as sketched below.
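A minimal sketch of the pattern, again assuming the pre-1.0 `openai` client. The two document snippets stand in for a real iFixit index, and a production system would swap the brute-force cosine similarity for a vector database:

```python
import numpy as np
import openai

# Stand-in knowledge base; a real system would index thousands of docs.
docs = [
    "iPhone screen removal: heat the lower edge to soften the adhesive, "
    "then lift the display with a suction cup and pry gently.",
    "MacBook battery replacement: remove the pentalobe screws on the "
    "bottom case before disconnecting the battery.",
]

def embed(texts):
    """Embed a list of strings into vectors for semantic search."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

doc_vecs = embed(docs)
query = "How do I remove an iPhone screen?"
query_vec = embed([query])[0]

# Semantic retrieval: cosine similarity, keep the best-matching document.
scores = (doc_vecs @ query_vec) / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
context = docs[int(np.argmax(scores))]

# Pass the retrieved knowledge to the LLM alongside the question.
answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Answer using this documentation:\n{context}\n\nQuestion: {query}",
    }],
)
print(answer["choices"][0]["message"]["content"])
```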
Approaches like this have spawned tools ranging from digital study companions to text-to-SQL query writers and enterprise partnerships with OpenAI. Imagine how many more unannounced projects are applying this to personal coaching, customer support, language tutoring, Bible study, and data analysis, to name a few.
Acceleration
Here’s a timeline of how quickly things are accelerating:
February 24: Meta releases LLaMA.
March 1: OpenAI releases a 10x cheaper version of GPT-3.5.
March 3: Google releases Flan-UL2.
March 12: Stanford releases Alpaca.
March 14: Google announces enterprise tools powered by PaLM.
March 14: Anthropic opens the waitlist for Claude.
March 14: OpenAI releases GPT-4.
March 14: Duolingo launches a GPT-powered tutor.
March 14: Khan Academy launches a GPT-powered tutor.
March 14: Microsoft reveals Bing was using GPT-4.
The gold rush is well underway.
Tell the VCs to ring the NYSE bell and put in a down payment for yacht #2; tech returns are back. Once OpenAI trains an intelligence of the caliber of Terence Tao, we have the playbook for charging Allstate $5/hour for the most effective, tireless Insurance Claims Agent at a cost basis of $0.000002 per token, letting us reap a nice margin of 99.9999%.
Should we just sit back and wait? It would probably take Terence Tao a different amount of time and a different curriculum to ramp up on a math-adjacent field like deep learning versus law. What about becoming a stellar insurance claims agent? If GPT-4 is already performing better on standardized tests (which are highly predictive of knowledge-work capability in humans), why are we waiting? Two things are clear:
If AGI is possible and we are just waiting for Terence Tao, there is work to be done to onboard him to endless industries.
Many industries can already benefit from the existing sufficient intelligences.
Grab your pickaxes.
Help LLMs “learn on the job”
Instruction tune on workflows that only exist in specific industries (see the sketch after this list).
Grant retrieval access to proprietary codebases, private medical records, and corporate databases.
Build a better retrieval system that is not just nearest-neighbor search but learns the entire index.
Increase and enrich the possible input modalities.
Cheaper finetuning (compute) and RLHF (feedback collection).
Explore ways to reduce the human in the loop, e.g. applying RL to entire workflows.
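As a concrete starting point for the first two items, "learning on the job" can begin as a finetuning job over industry-specific workflow demonstrations. A sketch with the pre-1.0 `openai` client; the file name and its contents are placeholder assumptions:

```python
import openai

# claims_workflows.jsonl holds {"prompt": ..., "completion": ...} records
# capturing a workflow that only exists inside one industry.
f = openai.File.create(
    file=open("claims_workflows.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off the finetune so the model picks up the in-house workflow.
job = openai.FineTune.create(training_file=f["id"], model="davinci")
print(job["id"])
```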