OpenAI, Reinforcement Learning, the Rights of Robots, and… Aliens? AI’s Cambrian Explosion

AI technology is experiencing an evolutionary explosion, particularly in the areas of reinforcement learning and autonomous systems. This will bring us both the benefits of autonomous robots and, importantly, societal questions about how we treat this new intelligence soon to be in our midst.


January 26, 2024


A sophisticated and imaginative visual representation of the article “OpenAI, Reinforcement Learning, the Rights of Robots, and… Aliens AI’s Cambrian Explosion” via DALL-E / GPT-4.

OpenAI Whiplash

One day, there’ll be a movie, or at least a Harvard Business Review analysis, about the board vs. CEO drama that unfolded at OpenAI in the autumn of 2023.  Perhaps the conflict was simply a matter of office politics.  Or, as has been more darkly hinted, perhaps the matter was due to the development of technology that is computer science’s equivalent of a bioweapon: an artificial general intelligence (AGI) breakthrough that ultimately threatens humanity.  

At this point we don’t know the reason behind what happened at OpenAI, and we may never know.  On the AGI possibility, details about OpenAI’s technology have been scant beyond mention of a mysterious “Q*”, whose burgeoning mastery of basic math is purportedly its key advancement.  Independent of any office politics at OpenAI, it’s still crucially important that we consider the Q* possibility.  This is not because there’s a sinister AGI lurking out there now, but because the discussion around Q* helps illuminate the current state of AI and the next breakthroughs we can expect to see in the field.

The Q* & A* Bricolage

So, what’s Q*, and is it somehow related to Q, whatever that is, and perhaps also to A*, whatever that is?  Why should we care about Q and A*, and even more pointedly, why is an ability to do basic math so interesting in the field of AI?  Most important of all, are there long-term societal implications here that go beyond mere breakthroughs in underlying AI reinforcement learning technology?  Before we get to discussing robots and alien life, let’s start with a quick review of the most interesting parts of AI today.

Natural intelligence gives us the ability to understand, reason and plan, enabling our autonomous behavior in complex and highly dynamic contexts, even for things as quotidian as planning our day.  If artificially intelligent systems are to take the next big step and operate autonomously within the complexity of real life, they too need to be able to understand, reason and plan sequenced actions.  An eventual AGI will need to integrate and synthesize knowledge from various domains, combining things like mathematical reasoning with a broader understanding of the world.  Cognitive abilities such as natural language understanding, sensory perception, social intelligence, and common-sense reasoning will all be vital components in the development of a truly intelligent system.  As of this writing, AGI remains a distant goal.

As powerful as large language models (LLMs) may seem to be, they’re in the end transformer-driven, token-producing statistical machines for predicting the next word.  There’s an aura of intelligence that LLMs bring, but they’re unfortunately incapable of intelligent reasoning and planning.  The abilities to reason and to plan are hallmarks of natural intelligence, hence these abilities are sought after within AI.  Being able to plan enables systems to set goals, create strategies, and make decisions in complex and dynamic environments.  Math can provide a nice, simple proxy for logical, multi-step reasoning abilities.  Perhaps it’s here that Q, A*, the ability to do basic math, and the mythic Q* all come into play.

Pavlov’s Dog

We’ve become very familiar recently with discriminative AI and generative AI.  Reinforcement learning (RL) is the AI technology that will become increasingly familiar next.  You might have taught your dog to stay off the couch via a scheme of reward (doggie treat) or penalty (“bad dog!”).  That’s the essence of reinforcement learning: a sequence of rewards or punishments that help you map the optimal, multi-step path to a goal.  It’s through reinforcement learning that we can imbue AI with the desired ability to learn sequential decision-making within complex environments, and it’s this ability that unlocks the possibilities of autonomous AI.

The now classic algorithm in the field of RL, Q-learning, was introduced in Christopher Watkins’ 1989 Ph.D. thesis, “Learning from Delayed Rewards”, which included rich references to classically-conditioned learning in animals. In short (the thesis runs to well over 200 pages), the Q-learning algorithm Watkins defined is a reinforcement learning approach that aims to learn an action-value function Q(s, a), from which a policy can be derived.  Q(s, a) estimates the expected cumulative (numeric) reward for taking action “a” in state “s”, and choosing the highest-scoring action in each state can guide a system toward a desired goal.
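
To make Q(s, a) concrete, here is a minimal sketch of tabular Q-learning in Python.  It is not taken from Watkins’ thesis or from any OpenAI system.  The corridor environment, the function names (train_q_table, greedy_policy) and the parameter values are all assumptions chosen purely for illustration.  An agent starts at the left end of a short corridor and earns a reward of 1 only on reaching the right end.

```python
import random


def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                  epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor (illustrative only).

    States are 0..n_states-1. Action 0 moves left, action 1 moves right.
    Reaching the rightmost state yields reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[s][a] approximates Q(s, a)
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy choice: usually exploit the best-known action,
            # sometimes explore; ties are also broken at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            reward = 1.0 if s_next == goal else 0.0
            # Q-learning update: nudge Q(s, a) toward the reward plus the
            # discounted value of the best action in the next state.
            # The goal row is never updated, so its future value stays 0.
            target = reward + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q


def greedy_policy(q):
    """Pick the highest-scoring action in each state."""
    return [max(range(2), key=lambda a: row[a]) for row in q]


q_table = train_q_table()
print(greedy_policy(q_table)[:-1])  # learned action for each non-goal state
```

The reward arrives only at the end, yet the update rule propagates it backward one state at a time, which is the “delayed rewards” idea in the thesis title.  The discount factor gamma is what makes shorter paths score higher than longer ones.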

Q-learning is particularly useful in problems where the environment is not fully known in advance: it is “model-free”, so the AI agent must learn through trial and error. It has been successfully applied in multiple domains, including game playing, robotics, and other autonomous systems.  Google DeepMind addressed the problem space of Atari video games using a variant of Q-learning known as a Deep Q Network (DQN; Mnih et al. 2013).  DQN combined the Q-learning algorithm with deep learning techniques, particularly convolutional neural networks, to approximate and learn the optimal action-value function in a high-dimensional state space. DQN was designed to handle complex and high-dimensional input data, making it well-suited for tasks like playing video games or robotic control in dynamic environments.  Indeed, DQN was found to outperform a human expert across three separate Atari games.

There also exist “model-based” approaches to reinforcement learning, with the best known one being DeepMind’s AlphaZero (Silver et al. 2017), which convincingly defeated world-champion programs in chess, Go and shogi.  Both model-free and model-based approaches to reinforcement learning have their own benefits and drawbacks (and might even be combined). Model-free approaches are generally easier to implement and learn directly from experience, but may be less sample-efficient, requiring more data to reach good performance.  Model-based approaches can be more efficient and require less data, but may be more difficult to implement and less robust against environmental changes.

Have you seen those cool videos of Google DeepMind’s soccer-playing robots (Liu et al. 2019, Liu et al. 2021, Haarnoja et al. 2023) or MIT’s soccer ball-kicking robot, DribbleBot (Ji et al. 2023)?  The robot AI in these cases was built using reinforcement learning.

Might Q* have had something to do with Q-learning, and with endowing OpenAI’s technology with the RL-driven ability to learn from and make autonomous decisions within dynamic environments?  Possibly.

Wish Upon A*

Speaking of “possibly”, is there a possible association of A* with Q*?  A* (Hart, Nilsson & Raphael 1968) is a heuristic-guided graph search algorithm that, when presented with a weighted graph, computes the shortest path between a specified source node and a specified destination node, and does so optimally provided its heuristic never overestimates the remaining cost.  Combining Q-learning with A* might therefore yield an optimal set of steps to plan and complete an autonomous action.  This would be practical: it is the sort of efficient, complex, goal-oriented thinking at which natural systems excel.  But might OpenAI’s Q* notation indicate an AI synthesis of Q with A*?  Who knows.
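
For readers who want to see A* in action, here is a minimal Python sketch.  It implements the textbook algorithm only and implies nothing about what Q* is or how OpenAI might use A*.  The function name a_star, the four-node graph and the heuristic values are all made up for illustration.  The heuristic is an estimate of the remaining cost from each node to the goal, and here it never overestimates.

```python
import heapq


def a_star(graph, heuristic, start, goal):
    """Return (total_cost, path) for the cheapest route, or None if no route.

    graph:     {node: [(neighbor, edge_cost), ...]}
    heuristic: {node: estimated remaining cost to goal}, never overestimating
    """
    # Heap entries: (cost so far + heuristic, cost so far, node, path taken).
    frontier = [(heuristic[start], 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for neighbor, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                priority = new_cost + heuristic[neighbor]
                heapq.heappush(
                    frontier, (priority, new_cost, neighbor, path + [neighbor])
                )
    return None


# Hypothetical weighted graph and heuristic, for illustration only.
example_graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
example_heuristic = {"A": 3, "B": 2, "C": 1, "D": 0}

print(a_star(example_graph, example_heuristic, "A", "D"))
# → (4, ['A', 'B', 'C', 'D'])
```

The heuristic is what distinguishes A* from a plain shortest-path search: it steers exploration toward the goal, so fewer nodes need to be examined.  One speculative way to connect the two ideas would be to use a learned value estimate, such as a Q-function, in the role of the heuristic.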

AI’s Cambrian Explosion

Whether Q* exists or not, and whether it’s some early form of super-intelligence, is in the end irrelevant.  The mythic Q* is important principally because it’s symbolic of the state of current AI technology, which is undergoing a Cambrian Explosion of evolution along every possible axis.  Complementing ongoing advances in hardware technology (Cai et al. 2023, Shainline 2021, John et al. 2020, Christensen et al. 2022) is a dizzying array of advances in AI software (Henaff et al. 2017, Bi & D’Andrea 2023, Hafner et al. 2018, Reed et al. 2022, Kwiatkowski & Lipson 2019), all of which are combining to intrude ever more upon what previously seemed the exclusive realm of natural intelligence.  AI’s Cambrian Explosion is allowing it to now definitively escape the data center, and increasingly make its appearance in our physical spaces.

When it comes to the big, missing pieces of artificial intelligence – planning and reasoning in complex environments – the Cambrian Explosion is in full spate, with a virtually endless stream of breakthroughs leveraging underlying LLMs (Dagan et al. 2023, Dagan et al. 2023, Imani et al. 2023, Liu et al. 2023, Silver et al. 2023, Lewkowycz et al. 2022, Zhao et al. 2023, Murthy et al. 2023, Romera-Paredes et al. 2023, Trinh et al. 2024).  For complex reasoning, there’s also been the Chain-of-Thought prompting of LLMs (Wei et al. 2022), the derivative Tree-of-Thoughts (Yao et al. 2023), and now even Graph-of-Thoughts (Besta et al. 2023).

Where else is the Cambrian Explosion in AI currently manifest?  How about multi-player gaming versus human competition, long a benchmark for AI, with noteworthy results achieved in the game of Diplomacy (Bakhtin et al. 2022) and elsewhere (Schmid et al. 2023, Hafner et al. 2023, Brown & Sandholm 2019)?  In the former work, an artificial agent named Cicero was applied to a game “involving both cooperation and competition” with “negotiation and coordination between seven players”.  “Cicero [integrated] a language model with planning and reinforcement learning algorithms by inferring players’ beliefs and intentions from its conversations and generating dialogue in pursuit of its plans.  Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants”.

Is there any evidence of a Q* super-intelligence within this Cambrian Explosion?  OpenAI has published work both in multi-step mathematical reasoning (Cobbe et al. 2021) and process supervision (Lightman et al. 2023).  And as Anton Chekhov almost said, “One must never place super-intelligence on the stage if it isn’t going to go off”.  For what it’s worth, Ilya Sutskever, OpenAI’s co-founder and chief scientist, has now put his focus on solving the problem of superalignment, i.e., aligning AI to stay within humans’ intended goals when that AI is deemed super-intelligent.

Brave New World

(With an intentional Aldous Huxley reference).  So what does it all mean?  Are there important societal implications contained within AI’s Cambrian Explosion?  Given all of the breakneck advances in machine intelligence described above, one of the biggest societal questions we face will be what happens when the disembodied voice of AI, now simply instructing you on how to reach your driving destination, becomes the embodied voice, passing you in the hallway in robotic form, engaged in some RL-driven autonomous task.

We humans have hierarchies for everything.  For example, when we assign rights, humans stand at the apex of the rights hierarchy, with other mammals below us, and fish, insects and plants arrayed beneath.  These rights are roughly apportioned via classification of intelligence.  While much discussion today has focused on whether or when AI achieves sentience and AGI, the bigger, more immediate question is already here: where should human society insert artificially intelligent “life” which is “human enough” into our moral hierarchy?  Full AGI is many years away, but AI’s Cambrian Explosion brings us systems that we find ever more engaging in ever more places with every passing day.  “Intelligent (human) enough” AI robots, enabled by reinforcement learning, are near at hand, and these raise a key issue of anticipatory technology governance.  What rights should robots have?

We’re already very familiar with purpose-built robots such as the iRobot Roomba.  We can expect similar task-specific autonomous robots to continue their encroachment (Shafiullah et al. 2023, Fu et al. 2024, Chi et al. 2023) into applications in home automation, agriculture, manufacturing, warehousing, retail and logistics.  Given that our built environment has evolved to serve humans – bipedal, bimanual, upright-standing creatures of a certain height – we can anticipate that, as they advance, autonomous robots will increasingly resemble Star Wars’ C-3PO and not just the saga’s R2-D2 or BB-8.

Generative AI chatbot systems have shown a remarkable ability to connect with their human users (Skjuve 2021, Tidy 2024).  (Dismayingly, chatbots can be so humanized that they can even be taught to lie: Hubinger et al. 2024).  We will see a day when these human-chatbot connections are embodied within quite intelligent, autonomous robots.  As these robots become increasingly humanoid (see the DARPA Robotics Challenge entrants, or Enchanted Tools’ Mirokai or Engineered Arts’ Ameca; Merel et al. 2018) and increasingly ubiquitous (Tesla Bot is slated to cost less than a car), how should we place moral value on an autonomous robot?

Though a theoretical question at present, the moral rights of robots will soon need to enter society’s discussion.  We’ve recently begun to consider the implications of virtual crimes committed against physical persons.  We will also have to grapple with the societally corrosive effects of physical crimes committed against “virtual persons”, a.k.a. robots.

Put more bluntly, will it be an equal offense to take a sledgehammer to your coworker’s PC as it is to do the same to a humanoid robot which is able to remonstrate, using spoken natural language pleas, with its attacker?  Are autonomous robots – possessing symbolic qualities “just human enough”, with no sentience or AGI required – to be treated merely as capital-asset machines or do they possess rights beyond that, meriting a higher place in human society’s hierarchy of rights?  AI’s Cambrian Explosion is accelerating our need to confront this very question.

Where Is Everybody?

Beyond autonomous robots, where else might AI’s Cambrian Explosion lead?  We can absolutely anticipate the integration of human brains with artificial ones.  This prospect is no longer mere science fiction.  Advances in neuroprosthesis (Metzger et al. 2023, Moses et al. 2021) and brain-machine interface (BMI) technologies have demonstrated the ability to perform just these sorts of integrations.  Companies such as MindPortal promise to deliver “seamless telepathic communication between humans and AI”, while Neuralink has won FDA approval for the human study of brain implants.  How might this ramify into our next societal question?  Well, if it’s illegal for chemically-enhanced athletes to compete with the unenhanced, should electronically-enhanced humans be allowed to do the same?  Utilize a BMI and sit for the paramedic’s exam?  Defend a Ph.D. thesis?  Argue a case in court?  This question too will need to be confronted one day.

Where might the AI Cambrian Explosion eventually culminate?  Consider that we carbon-based, biomass-consuming beings have now invented silicon-based, electricity-consuming beings, beings whose intelligence will one day surpass our own.  Is this perhaps the fate of every advanced civilization in the Universe?  The triumph of a few score years of technology evolution over a few billion years of natural evolution?  Physicist Enrico Fermi famously posed what became known as the Fermi Paradox, asking “Where is everybody?” when referring to the lack of direct evidence of alien life.  Maybe advanced alien civilizations everywhere are fated to suffer the same outcome, with machine-based intelligence supplanting natural intelligence.  Unlike natural beings, artificial ones have no biological need to conquer, exploit, and spread their genes.  Alien AI may just cast a curious electronic eye at our small planet, with its primitive technologies, but have no need to traverse deep space and greet us with ray guns.  AI’s Cambrian Explosion answer to the Fermi Paradox?  “Everybody” is a computer.